In this project we develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while having a retrieval performance comparable to a standard term query. Our approach provides a powerful technology for retrieval tasks such as plagiarism analysis, near-duplicate detection, and high similarity search. To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 1.5 million English Wikipedia articles and 0.5 million German articles. This index - along with a search interface and a servlet engine - fits on a conventional CD (0.7 gigabyte). Download the index along with the search interface as a bootable CD image. [demo]
The ingredients of our indexing technology are similarity hashing and minimal perfect hashing. With respect to the former we have developed a new kind of tolerant ("fuzzy") hashcodes for larger portions of text (50-1000 words). Fuzziness is introduced by generalizing divergence properties of a document's terms as a small set of parameters, which in turn are encoded as - what is here called - fingerprint. The probability that the fingerprints of two documents are equal is close to one for documents with a high cosine similarity.
Fuzzy hashcodes of indexed documents are mapped to memory positions, which point to a postlist referencing all documents associated with the hashcode in question. An optimal solution for this problem is a hash function which (i) maps all keys without hash collisions to the available memory positions, and (ii) has an upper bound for the average number of additional positions needed to ensure the first property. A hash function with the first property is called perfect; moreover, it is called minimal if the second property's upper bound is a small constant. For a previously known set of keys, minimal perfect hash functions can be constructed in linear time and space. This prerequisite is fulfilled for Wikipedia in the Pocket because all documents to be indexed are known in advance. We call this a half-open retrieval situation.
Benno Stein. Principles of Hash-based Text Retrieval. In Charles Clarke et al, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 527-534, July2007. ACM. ISBN 987-1-59593-597-7. [doi] [paper] [bib]
Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In M. Spiliopoulou et al, editors, From Data and Information Analysis to Knowledge Engineering. Selected papers from the 29th Annual Conference of the German Classification Society (GFKL 05), Studies in Classification, Data Analysis, and Knowledge Organization, pages 430-437, Berlin Heidelberg New York, 2006. Springer. ISBN 1431-8814. [doi] [paper] [bib]