Retrieval & Mining

The retrieval and mining lab is a dedicated place for scientists who want to do state-of-the-art research in data analytics and, in particular, who are keen to take the next steps beyond it. In this regard, we work on well-known challenges of the information society such as Web search, information overload, and information personalization. However, our true mission is to identify new research directions and exciting questions by combining technical means such as cluster computing, human-authored text and artificial data at petabyte scale, and advanced computer science methods from Information Retrieval, Machine Learning, Natural Language Processing, and Artificial Intelligence. To be more specific, we have raised (and seriously think about) the following and other questions:

  • Who wrote the Web?
  • How can the usefulness of product reviews be quantified?
  • To what extent can human debating skills be automated?
  • Is it possible to interpret simulation data better than engineers can?
  • Which elements of the Wikipedia encyclopedia can be algorithmically improved?

To answer these and related questions, at least partially, we follow several lines of research. Among others, we scale up Natural Language Processing towards Big Data, render Machine Learning technology more robust with respect to domains and noise, analyze and improve unsupervised Data Mining methods, develop tailored retrieval technology for social media and specific genres, and integrate semantic concepts into text analytics algorithms.

The computing hardware of the retrieval and mining lab has been designed to support the research outlined above. It comprises 135 rack servers with 1,620 cores, 8.6 TB of main memory, about 2.5 petabytes of disk space, and a 10 Gb/s network that allows nearly arbitrary switching configurations at a blocking factor of one. Moreover, we host a large number of important corpora and, with »Common Crawl«, the currently largest available Web corpus. The retrieval and mining cluster became operational in spring 2015.

Selected Projects

Large Scale Text Analytics (Big Data Information Retrieval)

This research area exploits the Big Data computing facilities in combination with the corpus resources that have been built up in the Webis research group. Our achievements in this research area can be summarized as follows.

Dynamic taxonomies in digital libraries. We have proposed the so-called "Keyqueries Paradigm", a concept that lays the grounds for future taxonomy systems. Given a document, a keyquery is a small set of keywords for which the document achieves a high relevance score. Keyqueries can hence be viewed as general and concise descriptions of the documents they return. Our keyquery framework addresses important problems of static classification systems such as overlarge classes and overly complex taxonomy structures. Since queries are familiar to library users from their daily Web search experience, keyqueries are a promising concept to increase a taxonomy's structural complexity in a transparent way.
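The keyquery test can be sketched as follows. The toy in-memory index, the term-overlap ranking, and the minimality filter are illustrative assumptions standing in for a full retrieval system; this is not the published framework:

```python
from itertools import combinations

# Toy in-memory index standing in for a real search engine (assumption:
# the actual framework queries a full-scale retrieval system).
INDEX = {
    "doc1": {"taxonomy", "library", "digital"},
    "doc2": {"taxonomy", "search"},
    "doc3": {"library", "search", "digital"},
}

def search(query):
    """Rank documents by how many query terms they contain."""
    scored = [(len(set(query) & terms), doc) for doc, terms in INDEX.items()]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def is_keyquery(query, doc_id, k=1):
    """A query is a keyquery for a document if the document appears
    among the top-k results retrieved for that query."""
    return doc_id in search(query)[:k]

def keyqueries(doc_id, max_len=2, k=1):
    """Enumerate keyword subsets of the document's own vocabulary and
    keep the minimal ones that pass the keyquery test."""
    terms = sorted(INDEX[doc_id])
    hits = []
    for n in range(1, max_len + 1):
        for q in combinations(terms, n):
            # skip supersets of keyqueries already found (minimality)
            if any(set(p) <= set(q) for p in hits):
                continue
            if is_keyquery(q, doc_id, k):
                hits.append(q)
    return hits
```

Here, doc1 has no single keyword that retrieves it at rank one, but the pairs ("digital", "taxonomy") and ("library", "taxonomy") do, so those become its keyqueries.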

Query understanding. We have developed the currently most effective algorithms for query segmentation and session detection. Query segmentation is the problem of identifying those keywords in a query that together form compound concepts or phrases. Such segments can help a search engine to better interpret a user's intent and to tailor the search results more appropriately. Our contributions to this problem include large-scale corpora, more robust evaluation measures, and highly effective segmentation strategies. Query session detection aims at identifying consecutive queries that a user submits for the same information need. Detecting such search sessions is of major interest since it opens up the possibility to analyze techniques for supporting users stuck in longer sessions, to learn from users' query reformulation patterns, or to understand how users behave when their initial queries were not satisfactory. Our session detection methods involve different steps that form a cascade in the sense that computationally costly and hence time-consuming features are applied only after cheap features have failed. This approach is superior to previous session detection methods.
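The cascade idea can be illustrated with a minimal sketch: a cheap time-gap feature decides the clear cases, and a costlier query-comparison feature is consulted only when the gap is inconclusive. The thresholds and features below are illustrative assumptions, not the published method:

```python
def same_session(q1, t1, q2, t2, gap_cut=1800.0, overlap_cut=0.2):
    """Cascade: decide with the cheap time-gap feature when it is
    conclusive; fall back to the costlier term-overlap feature only
    otherwise. Thresholds are illustrative, not the published ones."""
    gap = t2 - t1
    if gap > gap_cut:          # long pause: surely a new session
        return False
    if gap < 60.0:             # rapid follow-up: surely the same session
        return True
    # inconclusive gap: compare the queries themselves (costlier step)
    a, b = set(q1.lower().split()), set(q2.lower().split())
    jaccard = len(a & b) / len(a | b)
    return jaccard >= overlap_cut

def split_sessions(log):
    """Group a time-ordered query log [(query, timestamp), ...] into
    sessions using the cascade above."""
    sessions = []
    for query, ts in log:
        if sessions and same_session(sessions[-1][-1][0], sessions[-1][-1][1],
                                     query, ts):
            sessions[-1].append((query, ts))
        else:
            sessions.append([(query, ts)])
    return sessions
```

For example, three reformulations about flights within a few minutes end up in one session, while an unrelated query an hour later starts a new one.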

Clustering and labeling large document sets. Clustering is a key technology to support users when exploring (browsing) large document collections, and hence a lot of research has been and still is being done in this direction. We have contributed original ideas and technologies that focus on the search result presentation of Web search engines: a competence partitioning strategy that combines the best of two worlds, namely the ranking expertise of the big search engines and the best-performing text clustering algorithms. Our technology improves the access to documents in the typically long tail of the result list in two ways: by avoiding the unwanted repetition of query aspects, an effect we call shadowing, and by avoiding extreme clusterings through a cluster labeling that considers the topic diversity found within the top-ranked results.
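A much simplified sketch of cluster labeling in this spirit: each cluster is labeled with a term that is frequent inside it but rare in the other clusters. This is a stand-in illustration under that assumption, not the published labeling algorithm:

```python
from collections import Counter

def label_clusters(clusters):
    """Label each cluster with its most distinctive frequent term:
    a term common inside the cluster but rare in the other clusters
    (a simplified stand-in for topic-diversity-aware labeling).
    Each cluster is a list of documents; each document a list of terms."""
    totals = Counter(t for docs in clusters for doc in docs for t in set(doc))
    labels = []
    for docs in clusters:
        local = Counter(t for doc in docs for t in set(doc))
        # score: in-cluster document frequency minus out-of-cluster frequency
        term, _ = max(local.items(),
                      key=lambda kv: kv[1] - (totals[kv[0]] - kv[1]))
        labels.append(term)
    return labels
```

Scoring terms against the other clusters is what keeps two clusters from receiving the same dominant label, which is one simple way to reflect the topic diversity of the result list.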

Digital Text Forensics

In times of ubiquitous information access and information generation, this research field is becoming both exciting and challenging. The Webis research group contributes original ideas and algorithms to a number of text forensics tasks: text reuse, author identification, author profiling, Wikipedia vandalism analysis, Wikipedia quality analysis, and Wikipedia edit wars. In this regard, the most successful cross-language plagiarism retrieval model, called CL-ESA, has been developed in our group. Other contributions relate to effective Wikipedia retrieval models and analysis technology such as PU learning, and, as a kind of community service, the construction of more than 15 research corpora that are used by expert groups all over the world. Our internationally most visible activity is the PAN evaluation lab series, which we initiated in 2007, and which we co-chair, scientifically and technically supervise, and host on our compute clusters.

Closely related to the text forensics activities is our Science 2.0 initiative: we are engaged in the ongoing discussion on how to make research results reproducible for the public and algorithms comparable. For this purpose we have been developing the experiment execution platform TIRA for five years now. TIRA provides strong features for the organization of shared tasks, such as the remote and completely sandboxed execution of submitted code. Meanwhile, TIRA is used by several research groups at universities and research labs.
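The core idea of sandboxed execution can be sketched as follows: untrusted submission code runs in an isolated subprocess with a fresh working directory, a wall-clock timeout, and a memory limit. This is a Unix-only toy illustration of the principle, not TIRA's actual architecture:

```python
import subprocess
import resource
import tempfile

def run_submission(cmd, timeout=60, mem_bytes=512 * 1024 * 1024):
    """Run a participant's command in a fresh working directory with a
    wall-clock timeout and an address-space limit (Unix only). A toy
    stand-in for TIRA's fully sandboxed remote execution."""
    def limit():
        # applied in the child process before exec
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            cmd, cwd=workdir, capture_output=True, text=True,
            timeout=timeout, preexec_fn=limit,
        )
    return proc.returncode, proc.stdout, proc.stderr

rc, out, err = run_submission(["echo", "hello shared task"])
```

Isolating each run in its own throwaway directory also makes the experiment repeatable: nothing a submission writes can leak into the next run.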

Argumentation and Language Technology

We believe that future information systems require the convergence of Information Retrieval (IR), Information Extraction (IE), Natural Language Processing (NLP), and selected fields of Artificial Intelligence (AI). This will become manifest, among other things, in the form of intelligent systems that automate basic capabilities for argumentation and debating. The Webis group has significantly intensified its research activities in this area and approaches this development from three perspectives: firstly, through our paraphrasing research, including corpus construction and provision, machine learning analyses, and the development of highly efficient rephrasing technology such as Netspeak; secondly, by developing effective, lightweight, and robust argumentation models that can be applied in the wild; thirdly, by combining heuristic search technology with natural language expertise. In addition, we are involved in the organization of the Dagstuhl Seminar 15512 on Debating Technologies.
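Netspeak answers wildcard queries over phrases. The matching semantics can be sketched as follows, where "?" stands for exactly one word and "*" for one or more words; this is a simplification for illustration (Netspeak's "*" also matches zero words) and does not reproduce Netspeak's actual query language or API:

```python
import re

def wildcard_regex(pattern):
    """Compile a Netspeak-style wildcard pattern: '?' matches exactly
    one word, '*' one or more words (simplified). Illustration only."""
    parts = []
    for tok in pattern.split():
        if tok == "?":
            parts.append(r"\S+")                 # exactly one word
        elif tok == "*":
            parts.append(r"\S+(?: \S+)*")        # one or more words
        else:
            parts.append(re.escape(tok))         # literal word
    return re.compile("^" + " ".join(parts) + "$")

def wildcard_search(pattern, phrases):
    """Return the phrases that match the wildcard pattern."""
    rx = wildcard_regex(pattern)
    return [p for p in phrases if rx.match(p)]
```

For instance, "waiting * response" matches both "waiting for response" and "waiting for your response", whereas "waiting ? response" admits only the former.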

Digital Engineering

Under this umbrella term we subsume those research activities of the Webis group where advanced information technologies are used to tackle complex engineering tasks. Examples include the model-based diagnosis of large machines, interactive design exploration and expert critiquing in civil engineering, as well as tools for the exploration and automated analysis of huge amounts of simulation data. Most of these research activities draw on the Big Data analytics expertise of our group. Aside from new algorithms and paradigms, patents for simulation and diagnosis technology have also been granted in recent years.


Tim Gollub, Michael Völske, Matthias Hagen, and Benno Stein. Dynamic Taxonomy Composition via Keyqueries. In Digital Libraries 2014: 14th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014), 18th International Conference on Theory and Practice of Digital Libraries (TPDL 2014), September 2014. ACM/IEEE.

Tim Gollub, Matthias Hagen, Maximilian Michel, and Benno Stein. From Keywords to Keyqueries: Content Descriptors for the Web. In Cathal Gurrin et al., editors, 36th International ACM Conference on Research and Development in Information Retrieval (SIGIR 13), pages 981-984, July 2013. ACM.

Matthias Hagen, Martin Potthast, Anna Beyer, and Benno Stein. Towards Optimum Query Segmentation: In Doubt Without. In Xuewen Chen, Guy Lebanon, Haixun Wang, and Mohammed J. Zaki, editors, 21st ACM International Conference on Information and Knowledge Management (CIKM 12), pages 1015-1024, October 2012. ACM. ISBN 978-1-4503-1156-4.

Matthias Hagen, Benno Stein, and Tino Rüb. Query Session Detection as a Cascade. In Bettina Berendt et al., editors, 20th ACM International Conference on Information and Knowledge Management (CIKM 11), pages 147-152, October 2011. ACM. ISBN 978-1-4503-0717-8.

Benno Stein, Tim Gollub, and Dennis Hoppe. Search Result Presentation Based on Faceted Clustering. In Xuewen Chen, Guy Lebanon, Haixun Wang, and Mohammed J. Zaki, editors, 21st ACM International Conference on Information and Knowledge Management (CIKM 12), pages 1940-1944, October 2012. ACM. ISBN 978-1-4503-1156-4.

Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent Trends in Digital Text Forensics and its Evaluation. In Pamela Forner et al., editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative (CLEF 13), pages 282-302, Berlin Heidelberg New York, September 2013. Springer. ISBN 978-3-642-40801-4. ISSN 0302-9743.

Martin Potthast, Matthias Hagen, Michael Völske, and Benno Stein. Crowdsourcing Interaction Logs to Understand Text Reuse from the Web. In Pascale Fung and Massimo Poesio, editors, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 13), pages 1212-1221, August 2013. ACL.

Maik Anderka, Benno Stein, and Nedim Lipka. Predicting Quality Flaws in User-generated Content: The Case of Wikipedia. In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pages 981-990, August 2012. ACM. ISBN 978-1-4503-1472-5.

Martin Potthast, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In Evangelos Kanoulas et al., editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pages 268-299, Berlin Heidelberg New York, September 2014. Springer. ISBN 978-3-319-11381-4.

Tim Gollub, Benno Stein, and Steven Burrows. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pages 1125-1126, August 2012. ACM. ISBN 978-1-4503-1472-5.

Steven Burrows, Iryna Gurevych, and Benno Stein. The Eras and Trends of Automatic Short Answer Grading. Artificial Intelligence in Education, 25 (1) : 60-117, March 2015.

Benno Stein, Matthias Hagen, and Christof Bräutigam. Generating Acrostics via Paraphrasing and Heuristic Search. In Junichi Tsujii and Jan Hajic, editors, 25th International Conference on Computational Linguistics (COLING 14), pages 2018-2029, August 2014. Association for Computational Linguistics.

Henning Wachsmuth, Martin Trenkmann, Benno Stein, and Gregor Engels. Modeling Review Argumentation for Robust Sentiment Analysis. In Junichi Tsujii and Jan Hajic, editors, 25th International Conference on Computational Linguistics (COLING 14), pages 553-564, August 2014. Association for Computational Linguistics.

Steven Burrows, Martin Potthast, and Benno Stein. Paraphrase Acquisition via Crowdsourcing and Machine Learning. Transactions on Intelligent Systems and Technology (ACM TIST), 4 (3) : 43:1-43:21, June 2013.

Patrick Riehmann, Henning Gruendl, Martin Potthast, Martin Trenkmann, Benno Stein, and Bernd Froehlich. WORDGRAPH: Keyword-in-Context Visualization for NETSPEAK's Wildcard Search. IEEE Transactions on Visualization and Computer Graphics, 18 (9) : 1411-1423, September 2012.

Oliver Niggemann, Stefan Windmann, Sören Volgmann, Andreas Bunte, and Benno Stein. Using Learned Models for the Root Cause Analysis of Cyber-Physical Production Systems. In 25th International Workshop on Principles of Diagnosis (DX 2014), September 2014.

Steven Burrows, Jörg Frochte, Michael Völske, Ana Belén Martinez Torres, and Benno Stein. Learning Overlap Optimization for Domain Decomposition Methods. In Jian Pei et al., editors, 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 13), pages 438-449, Berlin Heidelberg New York, April 2013. Springer.

Oliver Niggemann, Benno Stein, Asmir Vodencarevic, Alexander Maier, and Hans Kleine Büning. Learning Behavior Models for Hybrid Timed Systems. In Jörg Hoffmann and Bart Selman, editors, 26th International Conference on Artificial Intelligence (AAAI 12), pages 1083-1090, Palo Alto, California, July 2012. AAAI. ISBN 978-1-57735-568-7.

Steven Burrows, Benno Stein, Jörg Frochte, David Wiesner, and Katja Müller. Simulation Data Mining for Supporting Bridge Design. In Peter Christen et al., editors, 9th Australasian Data Mining Conference (AusDM 2011), volume 121 of CRPIT, pages 163-170, New York, December 2011. ACM. ISBN 978-1-921770-02-9.