Page heading
Languages and Services
  
    You are here menu
    Subpage heading
    Web Technology · Information Systems · Prof. Dr. Benno Stein
    Navigation
    Additional Content
    Main Content

    Wikipedia in the Pocket

    Synopsis

    In this project we develop and implement a new indexing technology which allows one to use complete (and possibly very large) documents as queries, while having a retrieval performance comparable to a standard term query. Our approach provides a powerful technology for retrieval tasks such as plagiarism analysis, near-duplicate detection, and high similarity search.

    To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 1.5 million English Wikipedia articles and 0.5 million German articles. This index - along with a search interface and a servlet engine - fits on a conventional CD (0.7 gigabyte).

    Demo

    You can download our index along with the search interface as a bootable CD image (version 1.1).

     

    Project Outline

    The ingredients of our indexing technology are similarity hashing and minimal perfect hashing. With respect to the former we have developed a new kind of tolerant ("fuzzy") hashcodes for larger portions of text (50-1000 words). Fuzziness is introduced by generalizing divergence properties of a document's terms as a small set of parameters, which in turn are encoded as - what is here called - fingerprint. The probability that the fingerprints of two documents are equal is close to one for documents with a high cosine similarity.

    Fuzzy hashcodes of indexed documents are mapped to memory positions, which point to a postlist referencing all documents associated with the hashcode in question. An optimal solution for this problem is a hash function which (i) maps all keys without hash collisions to the available memory positions, and (ii) has an upper bound for the average number of additional positions needed to ensure the first property. A hash function with the first property is called perfect; moreover, it is called minimal if the second property's upper bound is a small constant. For a previously known set of keys, minimal perfect hash functions can be constructed in linear time and space. This prerequisite is fulfilled for Wikipedia in the Pocket because all documents to be indexed are known in advance. We call this a half-open retrieval situation.

    People

    Students: Dennis Hoppe, Martin Trenkmann

    Related Publications

    Martin Potthast. Wikipedia in the Pocket - Indexing Technology for Near-duplicate Detection and High Similarity Search. In Charles Clarke, Norbert Fuhr, Noriko Kando, Wessel Kraaij, and Arjen P. de Vries, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 909, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
    Benno Stein. Principles of Hash-based Text Retrieval. In Charles Clarke, Norbert Fuhr, Noriko Kando, Wessel Kraaij, and Arjen P. de Vries, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 527-534, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
    Benno Stein and Martin Potthast. Applying Hash-based Indexing in Text-Based Information Retrieval. In Moens, Tuytelaars, and de Vries, editors, 7th Dutch-Belgian Information Retrieval Workshop (DIR 07), pages 29-35, March 2007. Faculty of Engineering, Universiteit Leuven. ISBN 978-90-5682-771-7. [paper] [bib] [slides]
    Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering. Selected papers from the 29th Annual Conference of the German Classification Society (GfKl 05), Magdeburg, Germany, Studies in Classification, Data Analysis, and Knowledge Organization, pages 430-437, 2006. Springer. ISBN 978-3-540-31313-7. [doi] [paper] [bib]
    Benno Stein and Martin Potthast. Hashing-basierte Indizierung: Anwendungsszenarien, Theorie und Methoden. In Norbert Fuhr and Sebastian Goeser and Thomas Mandl, editors, Workshop Special Interest Group Information Retrieval (FGIR 06), Hildesheimer Informatikberichte, pages 159-166, October 2006. Universität Hildesheim. ISSN 0941-3014. [publisher] [paper] [bib] [slides]
    Benno Stein. Fuzzy-Fingerprints for Text-Based Information Retrieval. In Klaus Tochtermann and Hermann Maurer, editors, 5th International Conference on Knowledge Management (I-KNOW 05), Graz, Austria, Journal of Universal Computer Science, pages 572-579, July 2005. Know-Center. ISSN 0948-695x. [paper] [bib]

    Content signature