In this project we develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while having a retrieval performance comparable to a standard term query. Our approach provides a powerful technology for retrieval tasks such as plagiarism analysis, near-duplicate detection, and high similarity search. To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 1.5 million English Wikipedia articles and 0.5 million German articles. This index - along with a search interface and a servlet engine - fits on a conventional CD (0.7 gigabyte). Download the index along with the search interface as a bootable CD image.


The ingredients of our indexing technology are similarity hashing and minimal perfect hashing. With respect to the former we have developed a new kind of tolerant ("fuzzy") hashcodes for larger portions of text (50-1000 words). Fuzziness is introduced by generalizing divergence properties of a document's terms as a small set of parameters, which in turn are encoded as - what is here called - fingerprint. The probability that the fingerprints of two documents are equal is close to one for documents with a high cosine similarity.

Fuzzy hashcodes of indexed documents are mapped to memory positions, which point to a postlist referencing all documents associated with the hashcode in question. An optimal solution for this problem is a hash function which (i) maps all keys without hash collisions to the available memory positions, and (ii) has an upper bound for the average number of additional positions needed to ensure the first property. A hash function with the first property is called perfect; moreover, it is called minimal if the second property's upper bound is a small constant. For a previously known set of keys, minimal perfect hash functions can be constructed in linear time and space. This prerequisite is fulfilled for Wikipedia in the Pocket because all documents to be indexed are known in advance. We call this a half-open retrieval situation.


Students: Dennis Hoppe, Martin Trenkmann


Martin Potthast. Wikipedia in the Pocket - Indexing Technology for Near-duplicate Detection and High Similarity Search. In Charles Clarke et al, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 909, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
Benno Stein. Principles of Hash-based Text Retrieval. In Charles Clarke et al, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 527-534, July 2007. ACM. ISBN 987-1-59593-597-7. [doi] [paper] [bib]
Benno Stein and Martin Potthast. Applying Hash-based Indexing in Text-Based Information Retrieval. In Marie-Francine Moens, Tinne Tuytelaars, and Arjen P. de Vries, editors, 7th Dutch-Belgian Information Retrieval Workshop (DIR 07), pages 29-35, Leuven, Belgium, March 2007. Faculty of Engineering, Universiteit Leuven. ISBN 978-90-5682-771-7. [paper] [bib] [slides]
Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In M. Spiliopoulou et al, editors, From Data and Information Analysis to Knowledge Engineering. Selected papers from the 29th Annual Conference of the German Classification Society (GFKL 05), Studies in Classification, Data Analysis, and Knowledge Organization, pages 430-437, Berlin Heidelberg New York, 2006. Springer. ISBN 1431-8814. [doi] [paper] [bib]
Benno Stein and Martin Potthast. Hashing-basierte Indizierung: Anwendungsszenarien, Theorie und Methoden. In Norbert Fuhr, Sebastian Goeser, and Thomas Mandl, editors, Workshop Special Interest Group Information Retrieval (FGIR 06), Hildesheimer Informatikberichte, pages 159-166, October 2006. University of Hildesheim, Germany. ISSN 0941-3014. [publisher] [paper] [bib] [slides]
Benno Stein. Fuzzy-Fingerprints for Text-Based Information Retrieval. In Klaus Tochtermann and Hermann Maurer, editors, 5th International Conference on Knowledge Management (I-KNOW 05), Journal of Universal Computer Science, pages 572-579, Graz, Austria, July 2005. Know-Center. ISSN 0948-695x. [paper] [bib]