The retrieval and mining lab is a dedicated place for scientists who want to do state-of-the-art research in data analytics and, in particular, who are keen to go beyond it. In this regard we work on the well-known challenges and problems of the information society such as Web search, information overload, and information personalization. However, our true mission is to identify new research directions and exciting questions by combining technical means such as cluster computing, human-authored text and artificial data at petabyte scale, and advanced computer science methods from Information Retrieval, Machine Learning, Natural Language Processing, and Artificial Intelligence. To be more specific, we have raised (and are seriously thinking about) the following questions, among others:
- Who wrote the Web?
- How to quantify the usefulness of product reviews?
- Up to which level can human debating skills be automated?
- Is it possible to interpret simulation data better than engineers can do?
- Which elements of the Wikipedia encyclopedia can be algorithmically improved?
To (partially) answer these and related questions we follow several lines of research. Among others, we scale up Natural Language Processing towards Big Data, render Machine Learning technology more robust in terms of domains and noise, analyze and improve unsupervised Data Mining methods, develop tailored retrieval technology for social media and specific genres, and integrate semantic concepts into text analytics algorithms.
The computing hardware of the retrieval and mining lab has been designed to support the research outlined above. It comprises 135 rack servers with 1,620 cores, 8.6 TB of main memory, about 2.5 petabytes of disk space, and a 10 Gb network allowing nearly arbitrary switching configurations at a blocking factor of one. Moreover, we host a large number of important corpora and, with »common crawl«, currently the largest available Web corpus. The retrieval and mining cluster became operational in spring 2015.
Large Scale Text Analytics (Big Data Information Retrieval)
This research area exploits the Big Data computing facilities in combination with the corpus resources that have been built up in the Webis research group. Our achievements in this research area can be summarized as follows.
Dynamic taxonomies in digital libraries. We have proposed the so-called "Keyqueries Paradigm", a concept to lay the grounds for future taxonomy systems. Given a document, a keyquery is a set of few keywords for which the document achieves a high relevance score. Keyqueries can hence be viewed as a general and concise description of the returned retrieval results. Our keyquery framework addresses important problems of static classification systems such as overlarge classes and overly complex taxonomy structures. Since queries are well known to library users from their daily web search experience, keyqueries are a promising concept to increase the structural complexity in a transparent way.
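The keyquery notion described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the framework's actual implementation: the toy conjunctive term-count score, the reading of "high relevance score" as "ranked within the top `top_k` results", the minimality filter, and the names `score`, `rank_of`, and `keyqueries`.

```python
from itertools import combinations

def score(query, doc_tokens):
    # Toy relevance score (assumption): total number of query-term
    # occurrences, but zero unless every query term occurs in the document.
    if not all(term in doc_tokens for term in query):
        return 0
    return sum(doc_tokens.count(term) for term in query)

def rank_of(doc_id, query, corpus):
    # Pessimistic rank: ties count against the document.
    own = score(query, corpus[doc_id])
    if own == 0:
        return None  # the document is not retrieved at all
    better_or_equal = sum(
        1 for other, tokens in corpus.items()
        if other != doc_id and score(query, tokens) >= own
    )
    return 1 + better_or_equal

def keyqueries(doc_id, corpus, top_k=1, max_len=2):
    # Enumerate small keyword sets drawn from the document's vocabulary and
    # keep those for which the document ranks within the top_k results.
    vocab = sorted(set(corpus[doc_id]))
    found = []
    for n in range(1, max_len + 1):
        for query in combinations(vocab, n):
            # Keep only minimal queries: skip supersets of earlier hits.
            if any(set(q) <= set(query) for q in found):
                continue
            rank = rank_of(doc_id, query, corpus)
            if rank is not None and rank <= top_k:
                found.append(query)
    return found

# Hypothetical three-document collection.
corpus = {
    "d1": "web search query segmentation query".split(),
    "d2": "web search session detection session".split(),
    "d3": "query session logs session".split(),
}

print(keyqueries("d2", corpus))
# [('detection',), ('search', 'session'), ('session', 'web')]
```

In this toy collection neither "search" nor "session" alone singles out d2, but the two together do, which is the sense in which a keyquery acts as a concise description of the results it returns.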
Query understanding. We have developed the currently most effective algorithms for query segmentation and session detection. Query segmentation is the problem of identifying those keywords in a query that together form compound concepts or phrases. Such segments can help a search engine to better interpret a user's intent and to tailor the search results more appropriately. Our contributions to this problem include large-scale corpora, more robust evaluation measures, and highly effective segmentation strategies. Query session detection aims at identifying consecutive queries that a user submits for the same information need. Detecting such search sessions is of major interest since they offer the possibility to analyze potential techniques for supporting users stuck in longer sessions, to learn from the users' query reformulation patterns, or to obtain knowledge on how users behave when their initial queries were not satisfactory. Our session detection methods involve different steps that form a cascade in the sense that computationally costly and hence time-consuming features are applied only after cheap features have failed. This approach is superior to previous session detection methods.
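The cascade idea for session detection can be sketched as follows. This is a hedged illustration only: the three steps, their thresholds (a 30-minute time gap, a character 3-gram Jaccard similarity of 0.1), and all names are assumptions chosen for the example, and the n-gram comparison merely stands in for whatever costly features a real system would use.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    timestamp: float  # seconds since the start of the log

def step_time_gap(prev, cur, max_gap=1800.0):
    # Cheapest check (assumed threshold): a long pause ends the session.
    if cur.timestamp - prev.timestamp > max_gap:
        return False
    return None  # undecided, pass on to the next step

def step_term_overlap(prev, cur):
    # Cheap lexical check: a shared keyword suggests the same information need.
    if set(prev.text.lower().split()) & set(cur.text.lower().split()):
        return True
    return None  # undecided, pass on to the next step

def step_char_ngrams(prev, cur, n=3, threshold=0.1):
    # Stand-in for a costly feature: Jaccard similarity of character n-grams.
    def grams(s):
        s = s.lower()
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    a, b = grams(prev.text), grams(cur.text)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

# Steps are ordered from cheap to costly.
CASCADE = [step_time_gap, step_term_overlap, step_char_ngrams]

def same_session(prev, cur):
    # A later step runs only if all earlier steps remained undecided.
    for step in CASCADE:
        decision = step(prev, cur)
        if decision is not None:
            return decision
    return False

def split_sessions(queries):
    # Group a chronologically ordered query log into sessions.
    sessions = []
    for q in queries:
        if sessions and same_session(sessions[-1][-1], q):
            sessions[-1].append(q)
        else:
            sessions.append([q])
    return sessions

# Hypothetical query log.
log = [
    Query("weimar hotels", 0),
    Query("cheap hotels weimar", 60),
    Query("hotel booking", 120),
    Query("python dataclass", 200),
    Query("weimar weather", 10000),
]

print([[q.text for q in s] for s in split_sessions(log)])
```

In this log, the second query is attached to the first by the cheap term-overlap step, while "hotel booking" shares no whole keyword with its predecessor and therefore reaches the n-gram step. The last query is cut off by the time-gap step alone, so no costly computation is spent on it.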
Clustering and labeling large document sets. Clustering is a key technology to support users when exploring (browsing) large document collections, and hence a lot of research has been done and is still being done in this direction. We have contributed original ideas and technologies that focus on the search result presentation of Web search engines: a competence partitioning strategy where we combine the best of two worlds, namely the ranking expertise of the big search engines along with the best-performing text clustering algorithms. Our technology improves the access to documents in the (typically long) result list tail in two ways: by avoiding the unwanted effect of query aspect repetition, which we call shadowing, and by avoiding extreme clusterings due to a cluster labeling that considers the topic diversity found within the top-ranked results.
Digital Text Forensics
In times of ubiquitous information access and information generation this research field is becoming both exciting and challenging. The Webis research group contributes original ideas and algorithms to a number of text forensics tasks: text reuse, author identification, author profiling, Wikipedia vandalism analysis, Wikipedia quality analysis, and Wikipedia edit wars. In this regard, the most successful cross-language plagiarism retrieval model, called CL-ESA, has been developed in our group. Other contributions relate to effective Wikipedia retrieval models and analysis technology such as PU learning, and, as a kind of community service, the construction of more than 15 research corpora that are used by expert groups all over the world. Our internationally most visible activity is the PAN evaluation lab series, which we initiated in 2007, and which we co-chair, scientifically and technically supervise, and host on our compute clusters.
Closely related to the text forensics activities is our Science 2.0 initiative: we are engaged in the ongoing discussion to render research results reproducible for the public and to render algorithms comparable. For this purpose we have been developing the experiment execution platform TIRA for five years. TIRA provides strong features for the organization of shared tasks, such as a remote and completely sandboxed execution of submitted code. TIRA is now used by several research groups in universities and research labs.
Argumentation and Language Technology
We believe that future information systems require the growing together of Information Retrieval (IR), Information Extraction (IE), Natural Language Processing (NLP), and selected fields from Artificial Intelligence (AI). This will become manifest, among other things, in the form of intelligent systems that automate basic capabilities for argumentation and debating. The Webis group has significantly intensified its research activities in this area and approaches this development from three perspectives. Firstly, by our paraphrasing research, including corpus construction and provision, machine learning analyses, and the development of highly efficient rephrasing technology such as Netspeak. Secondly, by developing effective, lightweight, and robust argumentation models that can be applied in the wild. Thirdly, by combining heuristic search technology with natural language expertise. Finally, we are also involved in the organization of the Dagstuhl Seminar 15512 on Debating Technologies.
Under this umbrella term we subsume those research activities of the Webis group where advanced information technologies are used to tackle complex engineering tasks. Examples include model-based diagnosis of large machines, interactive design exploration and expert critiquing in civil engineering, as well as tools for the exploration and automated analysis of huge amounts of simulation data. Most of these research activities require the Big Data analytics expertise of our group. Aside from new algorithms and paradigms, patents for simulation and diagnosis technology have also been granted in recent years.