MinHash estimates the resemblance between sets of data, measured as their Jaccard similarity. MinHash was created by Andrei Z. Broder [paper].
Outline
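The Jaccard similarity coefficient measures the similarity between two sets: it is the size of their intersection divided by the size of their union. In the following example we take two sentences, split each into a set of words, and compare the MinHash estimate with the exact Jaccard value: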
from datasketch import MinHash
import sys

str1 = "hello how are you"
str2 = "i think you are lucky"

# Split each sentence into a list of words
data1 = str1.split(" ")
data2 = str2.split(" ")

print("String 1:\t", str1)
print("String 2:\t", str2)
print("===================")

# Build one MinHash sketch per sentence, feeding in each word as bytes
m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))

print("Estimated Jaccard:\t", m1.jaccard(m2))

# Exact Jaccard similarity: |intersection| / |union| of the two word sets
s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2))) / float(len(s1.union(s2)))
print("Actual Jaccard:\t\t", actual_jaccard)
Sample runs for several different pairs of input strings:
String 1:           this is the first string
String 2:           this is the first string
===================
Estimated Jaccard:  1.0
Actual Jaccard:     1.0

String 1:           this is the first string
String 2:           this is the string first
===================
Estimated Jaccard:  1.0
Actual Jaccard:     1.0

String 1:           this is the first string
String 2:           this is the string help
===================
Estimated Jaccard:  0.65625
Actual Jaccard:     0.666666666667

String 1:           this is the first string
String 2:           this keep the string help
===================
Estimated Jaccard:  0.4453125
Actual Jaccard:     0.428571428571

String 1:           this is the first string
String 2:           a totally different sentence
===================
Estimated Jaccard:  0.0
Actual Jaccard:     0.0
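To see where estimates such as 0.65625 come from, it can help to sketch the MinHash idea without the library. The code below is only an illustration, not how datasketch is implemented internally: the function names (minhash_signature, estimate_jaccard, exact_jaccard) are hypothetical, and using salted BLAKE2b digests as the family of hash functions is an assumption made for simplicity. For each of k hash functions we keep the smallest hash value seen over a set; the fraction of positions where two such signatures agree estimates the Jaccard similarity.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """Return one minimum hash value per salted hash function.

    Assumes `tokens` is non-empty. Each distinct salt stands in for a
    different hash function (an illustrative choice, not datasketch's).
    """
    unique_tokens = set(tokens)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # BLAKE2b accepts salts up to 16 bytes
        min_value = min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in unique_tokens
        )
        signature.append(min_value)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions at which the two minima agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity: |intersection| / |union|."""
    s1, s2 = set(tokens_a), set(tokens_b)
    return len(s1 & s2) / len(s1 | s2)


words1 = "this is the first string".split(" ")
words2 = "this is the string help".split(" ")

sig1 = minhash_signature(words1)
sig2 = minhash_signature(words2)

print("Estimated Jaccard:\t", estimate_jaccard(sig1, sig2))
print("Actual Jaccard:\t\t", exact_jaccard(words1, words2))
```

With k hash functions the estimate is always a multiple of 1/k. That matches the sample runs above, assuming the library's default of 128 permutations: 0.65625 is 84/128 and 0.4453125 is 57/128. Using more hash functions narrows the gap to the exact value at the cost of more computation per set.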