LSH allows you to precompute a hash code that is then quickly and easily compared to another precomputed LSH hash code to determine if two objects should be compared in more detail or quickly discarded.

We don't actually calculate an LSH hash code as such, but the idea of boiling down a complex object (like our collection of minhash codes) into something that is quickly and easily compared with other complex objects is still applicable.

Instead we break a document down into what are known as shingles.

Each shingle contains a set number of words, and a document is broken down into (total words - shingle length + 1) shingles.
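As a rough illustration of the shingling step, here is a minimal Java sketch. The class name Shingler and the method shingles are made up for this example, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Shingler {
    // Hypothetical helper: returns every run of shingleLength consecutive words.
    // Words are assumed to be separated by whitespace.
    static List<String> shingles(String text, int shingleLength) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // A document of n words yields n - shingleLength + 1 shingles,
        // and none at all if it has fewer than shingleLength words.
        for (int i = 0; i + shingleLength <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + shingleLength)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the shingles of a short sample sentence, one per line.
        for (String shingle : shingles("The quick brown fox jumps over the lazy dog", 5)) {
            System.out.println(shingle);
        }
    }
}
```

A nine word sentence with a shingle length of 5 gives 9 - 5 + 1 = 5 shingles, which matches the formula above.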

If you know the math behind this, please leave a comment.

At this point instead of 200 randomly selected shingles, we have 200 integer hash values.

This tells us that these two documents should be compared for their similarity.

This is useful when you have a document, and you want to know which other documents to compare to it for similarity.
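The text here does not spell out how the minhash values are grouped for this quick check. One common approach, assumed for this sketch, is to split the 200 values into fixed-size bands and treat two documents as candidates if any band matches exactly. The names BandSketch, bandKeys and isCandidatePair are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BandSketch {
    // Collapses each run of rowsPerBand minhash values into a single int key.
    // Any values left over at the end that do not fill a band are ignored.
    static List<Integer> bandKeys(int[] signature, int rowsPerBand) {
        List<Integer> keys = new ArrayList<>();
        for (int start = 0; start + rowsPerBand <= signature.length; start += rowsPerBand) {
            int[] band = Arrays.copyOfRange(signature, start, start + rowsPerBand);
            keys.add(Arrays.hashCode(band));
        }
        return keys;
    }

    // Two documents are candidates for a detailed comparison when a band
    // at the same position has the same key in both of them.
    static boolean isCandidatePair(List<Integer> a, List<Integer> b) {
        int bands = Math.min(a.size(), b.size());
        for (int i = 0; i < bands; i++) {
            if (a.get(i).equals(b.get(i))) {
                return true;
            }
        }
        return false;
    }
}
```

The band keys can be precomputed and stored for each document, so finding candidates for a new document only needs a comparison of a handful of integers per document rather than a full similarity calculation.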

This means when I use terms like Set, I am referring to the group of classes that implement a Set in Java.

The short answer is that you XOR the value returned by Code() with 199 random numbers to generate the 199 other hash code values. Just make sure that you are using the same 199 random numbers across all the documents.

So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5 word long shingles:

The quick brown fox jumps
quick brown fox jumps over
brown fox jumps over the
fox jumps over the lazy
jumps over the lazy dog

So we now have a way to compare two documents for similarity, but it is not an efficient process. To find similar documents to document A in a directory of 10000 documents, we need to compare each pair individually. What we can do to reduce some cycles is compare sets of randomly selected shingles from two documents.
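The XOR trick described above can be sketched in Java roughly as follows. This is an illustration under a few assumptions: that the hash being XORed is the one returned by String.hashCode(), that the 199 random numbers come from a java.util.Random with a fixed seed so every document sees the same values, and that the minimum value per hash function is kept, as in the usual MinHash scheme. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class MinHashSketch {
    static final int NUM_HASHES = 200;

    // The 199 shared random numbers. A fixed seed keeps them identical
    // for every document, which the comparison relies on.
    static int[] randomMasks(long seed) {
        Random random = new Random(seed);
        int[] masks = new int[NUM_HASHES - 1];
        for (int i = 0; i < masks.length; i++) {
            masks[i] = random.nextInt();
        }
        return masks;
    }

    // Position 0 holds the smallest hashCode() over all shingles; position k
    // holds the smallest value of hashCode() XOR masks[k - 1].
    static int[] signature(Set<String> shingles, int[] masks) {
        int[] sig = new int[masks.length + 1];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String shingle : shingles) {
            int base = shingle.hashCode();
            sig[0] = Math.min(sig[0], base);
            for (int k = 0; k < masks.length; k++) {
                sig[k + 1] = Math.min(sig[k + 1], base ^ masks[k]);
            }
        }
        return sig;
    }

    // Fraction of positions where two equal-length signatures agree,
    // used as an estimate of how similar the two shingle sets are.
    static double estimatedSimilarity(int[] a, int[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                matches++;
            }
        }
        return (double) matches / a.length;
    }

    public static void main(String[] args) {
        int[] masks = randomMasks(42L);
        Set<String> docA = new HashSet<>(Arrays.asList(
                "The quick brown fox jumps", "quick brown fox jumps over"));
        Set<String> docB = new HashSet<>(Arrays.asList(
                "quick brown fox jumps over", "brown fox jumps over the"));
        System.out.println(estimatedSimilarity(signature(docA, masks), signature(docB, masks)));
    }
}
```

Each document's 200 values only need to be computed once, so comparing two documents afterwards means comparing two int arrays instead of two full sets of shingles.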

