Can't set threshold=1 when using MinHashLSH? #268
Is it not possible to set threshold=1 when using MinHashLSH? I get an error when I do. Here is the code:

from datasketch import MinHash, MinHashLSH

set1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'datasets'])
set2 = set(['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'documents'])
set3 = set(['minhash', 'is', 'probability', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'documents'])

m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
for d in set2:
    m2.update(d.encode('utf8'))
for d in set3:
    m3.update(d.encode('utf8'))

# Create LSH index
lsh = MinHashLSH(threshold=1, num_perm=128)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours with Jaccard similarity > 0.5", result)

and I got an error:
Replies: 1 comment
There's no need to use LSH for threshold=1.0 since that would only match items with the exact same MinHash. Instead, you can store items in a dict keyed by the hash of the entire MinHash, or even skip using MinHash and just use a single hash of the entire item.
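
As a rough illustration of the second suggestion (one hash of the entire item, no MinHash), here is a minimal sketch using only the standard library. The names `set_key`, `insert`, `query`, and `index` are hypothetical helpers made up for this example, not part of datasketch, and SHA-1 is just one possible choice of digest. The same dict could be keyed by a hash of the MinHash signature instead.

```python
import hashlib
from collections import defaultdict


def set_key(items):
    """Return a digest that is identical for sets with exactly the same elements."""
    h = hashlib.sha1()
    # Sort so that the digest does not depend on set iteration order.
    for token in sorted(items):
        data = token.encode("utf8")
        # Length-prefix each token so element boundaries are unambiguous.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


# Hypothetical exact-match index: digest -> list of keys inserted under it.
index = defaultdict(list)


def insert(name, items):
    index[set_key(items)].append(name)


def query(items):
    # .get() avoids creating an empty entry for unseen digests.
    return index.get(set_key(items), [])


set1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets'}
set2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents'}
set3 = {'minhash', 'is', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents'}

insert("m2", set2)
insert("m3", set3)

print(query(set1))  # [] because set1 is not identical to set2 or set3
print(query(set2))  # ['m2']
```

Unlike an LSH lookup, this only ever returns items whose element sets are exactly equal, which is what a Jaccard threshold of 1.0 asks for.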