Synopsis

CAIR is a cooperative research project between the Information Engineering Group (Universität Duisburg-Essen) and our webis group. Cluster analysis combines an object model, a similarity measure, and a merging strategy. Though a good deal of existing research focuses on merging, it is clear that successful cluster analysis requires the integration of knowledge about the domain, the task, and the users. This understanding of a "semantic cluster analysis" can produce solutions for relevant information retrieval (IR) tasks that are more effective than existing approaches. The objective of CAIR is the theoretical, methodological, and experimental study of cluster analysis in information retrieval, where semantics is investigated in four respects: (1) in the form of specialized retrieval models that consider knowledge of the IR task, (2) for multi-objective and interactive analyses that employ an explicit user model, (3) within hybrid merging strategies that combine algorithms, and (4) for improved cluster labeling. [demo]

The project is funded by the German Research Foundation (DFG).

Research

One of the project outcomes is the concept of "keyqueries" as document descriptors. Representing documents in terms of the search queries for which they are most relevant has natural applications in cluster analysis. Given a document collection, it allows the automatic generation of a hierarchical taxonomy with good cluster labels.

As part of our project, we organized the following events:

The following projects are related to our project:

The following corpora were developed in our project:

Further information can be found on the project page of the Information Engineering Group.

People

Students: Johannes Kiesel

Publications

Michael Völske, Tim Gollub, Matthias Hagen, and Benno Stein. A Keyquery-Based Classification System for CORE. In Laurence Lannom, editor, 3rd International Workshop on Mining Scientific Publications (WOSP 2014), volume 20, September 2014. Corporation for National Research Initiatives (CNRI). ISSN 1082-9873. [doi] [paper] [bib]
Tim Gollub, Michael Völske, Matthias Hagen, and Benno Stein. Dynamic Taxonomy Composition via Keyqueries. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 14), pages 39-48, September 2014. ACM/IEEE. ISBN 978-1-4799-5569-5. [publisher] [paper] [bib]
Benno Stein, Dennis Hoppe, and Tim Gollub. The Impact of Spelling Errors on Patent Search. In Walter Daelemans, editor, 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 12), pages 570-579, April 2012. Association for Computational Linguistics. ISBN 978-1-937284-19-0. [publisher] [paper] [bib]
Thomas Gottron, Maik Anderka, and Benno Stein. Insights into Explicit Semantic Analysis. In Bettina Berendt et al., editors, 20th ACM International Conference on Information and Knowledge Management (CIKM 11), pages 1961-1964, October 2011. ACM. ISBN 978-1-4503-0717-8. [doi] [paper] [bib] [wikipedia]
Benno Stein, Tim Gollub, and Dennis Hoppe. Beyond Precision@10: Clustering the Long Tail of Web Search Results. In Bettina Berendt et al., editors, 20th ACM International Conference on Information and Knowledge Management (CIKM 11), pages 2141-2144, October 2011. ACM. ISBN 978-1-4503-0717-8. [doi] [paper] [bib]
Nedim Lipka and Benno Stein. Robust Models in Information Retrieval. In A Min Tjoa and Roland Wagner, editors, 8th International Workshop on Text-Based Information Retrieval (TIR 11) at DEXA, pages 185-189, Los Alamitos, California, September 2011. IEEE. ISBN 978-0-7695-4486-1. ISSN 1529-4188. [doi] [paper] [bib] [slides]
Matthias Hagen and Benno Stein. Applying the User-over-Ranking Hypothesis to Query Formulation. In Advances in Information Retrieval Theory. 3rd International Conference on the Theory of Information Retrieval (ICTIR 11), volume 6931 of Lecture Notes in Computer Science, pages 225-237, Berlin Heidelberg New York, September 2011. Springer. [doi] [paper] [bib] [slides]
Hamish Cunningham, Norbert Fuhr, and Benno Stein. Challenges in Document Mining (Dagstuhl Seminar 11171). Dagstuhl Reports, 1 (4) : 65-99, August 2011. [doi] [article] [bib]
Benno Stein and Matthias Hagen. Introducing the User-over-Ranking Hypothesis. In Advances in Information Retrieval. 33rd European Conference on IR Research (ECIR 11), volume 6611 of Lecture Notes in Computer Science, pages 503-509, Berlin Heidelberg New York, April 2011. Springer. [doi] [paper] [bib] [slides]
Norbert Fuhr, Marc Lechtenfeld, Benno Stein, and Tim Gollub. The Optimum Clustering Framework: Implementing the Cluster Hypothesis. Information Retrieval, 15 (2) : 93-115, July 2011/2012 online/print. [doi] [article] [bib]
Matthias Hagen, Martin Potthast, Benno Stein, and Christof Bräutigam. Query Segmentation Revisited. In Sadagopan Srinivasan et al., editors, 20th International Conference on World Wide Web (WWW 11), pages 97-106, March 2011. ACM. [doi] [paper] [bib] [slides]
Tim Gollub and Benno Stein. Unsupervised Sparsification of Similarity Graphs. In Hermann Locarek-Junge and Claus Weihs, editors, Classification as a Tool for Research. Selected papers from the 11th IFCS Biennial Conference and 33rd Annual Conference of the German Classification Society (GFKL), Studies in Classification, Data Analysis, and Knowledge Organization, pages 71-79, Berlin Heidelberg New York, 2010. Springer. ISBN 978-3-642-10744-3. [doi] [paper] [bib]
Matthias Hagen and Benno Stein. Capacity-Constrained Query Formulation. In Mounia Lalmas et al., editors, Research and Advanced Technology for Digital Libraries. 14th European Conference on Digital Libraries (ECDL 10), volume 6273 of Lecture Notes in Computer Science, pages 384-388, Berlin Heidelberg New York, September 2010. Springer. ISBN 978-3-642-15463-8. [doi] [paper] [bib]
Benno Stein and Maik Anderka. Collection-Relative Representations: A Unifying View to Retrieval Models. In A Min Tjoa and Roland Wagner, editors, 6th International Workshop on Text-Based Information Retrieval (TIR 09) at DEXA, pages 383-387, September 2009. IEEE. ISBN 978-0-7695-3763-4. ISSN 1529-4188. [doi] [paper] [bib]