WEGA
Synopsis
WEGA (Web Genre Analysis) is a technology for enriching Internet search results with genre information. For each snippet in a result list, WEGA analyzes the genre of the underlying document, that is, its type, purpose, or target group, and labels the snippet as <discussion page>, <article>, <online shop>, <download site>, <private homepage>, <commercial homepage>, or <help site>. Since genre is widely accepted as a positive or negative filtering criterion, these labels make it easier to find the most relevant results.
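To make the idea of genre as a positive or negative filtering criterion concrete, here is a minimal Python sketch. It is an illustration only, not part of WEGA: the `Snippet` structure, the `filter_by_genre` function, and the example URLs are assumptions; only the seven genre names come from the description above.

```python
from dataclasses import dataclass

# The seven genre labels named in the synopsis.
GENRES = {
    "discussion page", "article", "online shop", "download site",
    "private homepage", "commercial homepage", "help site",
}


@dataclass(frozen=True)
class Snippet:
    """A search result snippet with a genre label (hypothetical structure)."""
    title: str
    url: str
    genre: str


def filter_by_genre(snippets, include=None, exclude=None):
    """Keep snippets whose genre is in `include` (if given) and not in `exclude`."""
    include = set(include) if include is not None else set(GENRES)
    exclude = set(exclude) if exclude is not None else set()
    unknown = (include | exclude) - GENRES
    if unknown:
        raise ValueError(f"unknown genres: {sorted(unknown)}")
    return [s for s in snippets if s.genre in include and s.genre not in exclude]


# Made-up result list for illustration.
results = [
    Snippet("Buy a used camera", "https://shop.example/camera", "online shop"),
    Snippet("Camera sensor sizes explained", "https://mag.example/sensors", "article"),
    Snippet("Which camera for travel?", "https://forum.example/t/42", "discussion page"),
]

# Negative filter: hide online shops.
no_shops = filter_by_genre(results, exclude={"online shop"})

# Positive filter: show discussion pages only.
only_discussions = filter_by_genre(results, include={"discussion page"})
```

A user looking for opinions would apply the positive filter, while a user who is tired of shop pages would apply the negative one.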
[Screenshot: WEGA in action]

Demo
Watch the WEGA demo video.
Project Outline
The WEGA project addresses the following challenges:
Conceptual Genre Palette. WEGA aims at helping information seekers, and hence the genre palette should address a "typical" user's information needs. Based on a user study, we chose to support the following genres: <discussion page>, <article>, <online shop>, <download site>, <private homepage>, <commercial homepage>, and <help site>.
Novel Retrieval Models for Genre Classification. Retrieval models for Web genre classification have been proposed since 2000. These models are based on HTML tag statistics, linguistic analyses, simple text statistics, and manually compiled word lists. However, the linguistic analyses are often expensive to compute, and hypotheses learned from HTML tags do not generalize well, since a training corpus can hardly reflect the diversity of the Web. WEGA addresses these issues with a new retrieval model based on the analysis of core vocabulary distributions. This model allows for efficient feature computation while still providing acceptable classification performance.
Development of a Firefox Add-On. The current WEGA prototype is implemented as an Add-On for the popular Firefox browser and labels Google search result lists. Unlike previous versions, it needs no additional server technology: each document in a result list is loaded in the browser and analyzed in the background with JavaScript. Document download, analysis, and labeling happen asynchronously and do not interrupt the user.
Development of a Genre Acquisition Tool. Training and evaluation require large Web document corpora whose documents have been classified manually. For this purpose we developed a multi-user acquisition tool that allows for intuitive, distributed corpus compilation. Our corpora will be published for the research community on our Web site soon.
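The retrieval model and the asynchronous labeling described above can be sketched together in a few lines. This page does not specify how WEGA computes its core vocabulary features, so the following Python sketch shows just one plausible reading: for each genre, the feature is the share of a document's tokens that belong to that genre's core vocabulary. The word lists, function names, and example pages are all assumptions made for illustration, and Python's `asyncio` stands in for the JavaScript used by the actual Add-On.

```python
import asyncio
import re
from collections import Counter

# Hypothetical, hand-picked core vocabularies for three of the seven genres.
# A real system would derive them from a manually classified corpus.
CORE_VOCABULARY = {
    "online shop": {"cart", "price", "shipping", "buy", "order"},
    "discussion page": {"reply", "posted", "thread", "quote", "forum"},
    "download site": {"download", "version", "release", "install", "mirror"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def genre_features(text):
    """Share of the document's tokens that fall into each genre's core vocabulary."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {genre: 0.0 for genre in CORE_VOCABULARY}
    return {
        genre: sum(counts[word] for word in vocabulary) / total
        for genre, vocabulary in CORE_VOCABULARY.items()
    }


def classify(text):
    """Pick the genre with the largest share, or None if no core word occurs."""
    features = genre_features(text)
    genre, score = max(features.items(), key=lambda item: item[1])
    return genre if score > 0 else None


async def fetch(url, pages):
    """Stand-in for a document download; `pages` maps each URL to its text."""
    await asyncio.sleep(0)  # yield control, as a real network request would
    return pages[url]


async def label_one(url, pages):
    """Download one document and return its URL with the predicted genre."""
    text = await fetch(url, pages)
    return url, classify(text)


async def label_results(urls, pages):
    """Download and classify all results concurrently."""
    pairs = await asyncio.gather(*(label_one(url, pages) for url in urls))
    return dict(pairs)


# Made-up documents that replace real Web pages in this sketch.
PAGES = {
    "https://shop.example/": "Add to cart. Price includes shipping. Buy now.",
    "https://forum.example/": "Posted in thread: reply with quote.",
    "https://blog.example/": "Yesterday we went hiking.",
}

labels = asyncio.run(label_results(list(PAGES), PAGES))
```

Features of this kind need only a single pass over the tokens, which matches the stated goal of efficient feature computation. Running the downloads concurrently means that one slow document does not delay the labels for the others.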
People
Students: Martin Kausche, Hagen-Christian Tönnies, and David Wiesner
© Fakultät Medien, 03.08.2012


