WEGA (Web Genre Analysis) is a technology for enriching Internet search results with genre information. For each snippet in a result list, WEGA analyzes the type, purpose, or target group (= genre) of the underlying document, and labels the snippet as <discussion page>, <article>, <online shop>, <download site>, <private homepage>, <commercial homepage>, or <help site>. Since genre information is generally accepted as a positive or negative filtering criterion, it simplifies finding the most relevant results. [video]


Result list with genre information.
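To illustrate how a genre label can act as a positive or negative filtering criterion, here is a minimal sketch in JavaScript. The result objects, their field names, and the function `filterByGenre` are assumptions made for this example and are not part of WEGA itself.

```javascript
// Hypothetical helper: keep results whose genre is wanted (positive filter)
// and drop results whose genre is unwanted (negative filter).
function filterByGenre(results, { include = [], exclude = [] } = {}) {
  const includeSet = new Set(include);
  const excludeSet = new Set(exclude);
  return results.filter((result) => {
    if (excludeSet.has(result.genre)) return false;
    // An empty include list means "no positive restriction".
    return includeSet.size === 0 || includeSet.has(result.genre);
  });
}

// Made-up labeled result list, for illustration only.
const labeledResults = [
  { title: "Camera forum thread", genre: "discussion page" },
  { title: "Buy cameras online", genre: "online shop" },
  { title: "Camera firmware updates", genre: "download site" },
];

// Negative filtering: hide all shops.
const withoutShops = filterByGenre(labeledResults, { exclude: ["online shop"] });
console.log(withoutShops.map((r) => r.title));
```

Passing `include: ["discussion page"]` instead would act as a positive filter and keep only the forum thread.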

The WEGA project addresses the following challenges:

Conceptual Genre Palette. WEGA aims at helping information seekers, and hence the genre palette should address a "typical" user's information needs. Based on a user study, we chose to support the following genres: <discussion page>, <online shop>, <download site>, <private homepage>, <commercial homepage>, and <help site>.

Novel Retrieval Models for Genre Classification. Retrieval models for Web genre classification have been proposed since the year 2000. These models are based on HTML tag statistics, linguistic analyses, simple text statistics, and manually compiled word lists. However, the linguistic statistics are often expensive to compute, and hypotheses learned from HTML tags do not generalize well, since the diversity of the Web is difficult to capture in a training corpus. WEGA addresses these issues with a new retrieval model that is based on the analysis of core vocabulary distributions. Our genre retrieval model allows for efficient feature computation while still providing acceptable classification performance.
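The following sketch shows one way features based on core vocabulary distributions might be computed. It is an illustration under stated assumptions, not the WEGA model itself: the word lists in `CORE_VOCABULARIES`, the tokenizer, and the choice of "share of tokens that fall into a genre's core vocabulary" as the feature are all hypothetical. The actual model is described in the publications listed on this page.

```javascript
// Hypothetical core vocabularies; the word lists used by WEGA are not given here.
const CORE_VOCABULARIES = {
  "online shop": ["price", "cart", "shipping", "order"],
  "discussion page": ["reply", "posted", "thread", "forum"],
  "download site": ["download", "version", "install", "release"],
};

// Simplistic tokenizer (assumption): lowercase runs of ASCII letters.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// One feature per genre: the fraction of document tokens that belong to
// that genre's core vocabulary. The cost is linear in the document length.
function coreVocabularyFeatures(text, vocabularies = CORE_VOCABULARIES) {
  const tokens = tokenize(text);
  const features = {};
  for (const [genre, words] of Object.entries(vocabularies)) {
    const vocabulary = new Set(words);
    const hits = tokens.filter((token) => vocabulary.has(token)).length;
    features[genre] = tokens.length === 0 ? 0 : hits / tokens.length;
  }
  return features;
}

console.log(coreVocabularyFeatures("Add to cart. Price includes shipping."));
```

A feature vector of this kind could then be passed to any standard classifier. The point of the sketch is only that such features need a single pass over the text and no linguistic parsing.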

Development of a Firefox Add-On. The current WEGA prototype is implemented as an Add-On for the popular Firefox browser and labels Google search result lists. Unlike previous versions, it needs no additional server technology: each document in a result list is loaded in the browser and analyzed in the background with JavaScript. Document download, analysis, and labeling happen asynchronously and do not hamper the user.
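The asynchronous download, analysis, and labeling pipeline can be sketched as follows. This is a simplified illustration and not the add-on's actual code: `labelResults`, `fetchDocument`, `classify`, and `onLabel` are hypothetical names, and the download and classification steps are passed in as functions so that the sketch does not depend on any particular browser or add-on API.

```javascript
// Label every result independently. A slow or failing download only delays
// or blanks its own label and never blocks the other results.
function labelResults(urls, fetchDocument, classify, onLabel) {
  const tasks = urls.map(async (url) => {
    let genre = null; // null means "no label available"
    try {
      const text = await fetchDocument(url); // download in the background
      genre = classify(text);                // analyze the document text
    } catch {
      genre = null; // download or analysis failed; leave the snippet unlabeled
    }
    onLabel(url, genre); // attach the label as soon as it is known
  });
  // Resolves once every result has been handled.
  return Promise.all(tasks);
}
```

In a browser, `fetchDocument` could for instance wrap the standard `fetch` function and return the response text, and `onLabel` could update the corresponding snippet in the result page.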

Development of a Genre Acquisition Tool


Students: Martin Kausche, Hagen-Christian Tönnies, and David Wiesner


Nedim Lipka. Modeling Non-Standard Text Classification Tasks. Dissertation, Bauhaus-Universität Weimar, March 2013. [publisher] [paper] [bib]
Sven Meyer zu Eißen. On Information Need and Categorizing Search. Dissertation, University of Paderborn, February 2007. [publisher] [paper] [bib]
Benno Stein and Sven Meyer zu Eißen. Distinguishing Topic from Genre. In Klaus Tochtermann and Hermann Maurer, editors, 6th International Conference on Knowledge Management (I-KNOW 06), Journal of Universal Computer Science, pages 449-456, Berlin Heidelberg New York, September 2006. Springer. ISSN 0948-695x. [paper] [bib]
Benno Stein and Sven Meyer zu Eißen. Is Web Genre Identification Feasible? In Gerhard Brewka, Silvia Coradeschi, Anna Perini, and Paolo Traverso, editors, 17th European Conference on Artificial Intelligence (ECAI 06), pages 815-816, Amsterdam, Berlin, August 2006. IOS Press. ISBN 1-58603-642-4. ISSN 0922-6389. [paper] [bib]
Sven Meyer zu Eißen and Benno Stein. Genre Classification of Web Pages: User Study and Feasibility Analysis. In Susanne Biundo, Thom Frühwirth, and Günther Palm, editors, Advances in Artificial Intelligence. 27th Annual German Conference on AI (KI 04), volume 3228 of Lecture Notes in Artificial Intelligence, pages 256-269, Berlin Heidelberg New York, September 2004. Springer. ISSN 0302-9743. [doi] [paper] [bib]