Novel Retrieval Models for Genre Classification
Retrieval models for Web genre classification have been proposed since 2000. These models are based on HTML tag statistics, linguistic analyses, simple text statistics, and manually compiled word lists. However, the linguistic analyses are often expensive to compute, and hypotheses learned from HTML tags do not generalize well because a training corpus can hardly reflect the diversity of the Web. WEGA addresses these issues with a new retrieval model based on the analysis of core vocabulary distributions. Our genre retrieval model allows for efficient feature computation while still providing acceptable classification performance.
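To make the idea of vocabulary-based features more concrete, the following minimal sketch computes, for one document, the share of tokens that fall into each genre's core vocabulary. It is an illustration under stated assumptions, not WEGA's actual retrieval model: the genre names, the word sets in `CORE_VOCABULARY`, the tokenizer, and the function `core_vocabulary_features` are all hypothetical and chosen only for this example.

```python
import re
from collections import Counter

# Hypothetical core vocabularies, invented for illustration only.
# They are not taken from WEGA or from any published word list.
CORE_VOCABULARY = {
    "shop": {"price", "cart", "order", "shipping"},
    "help": {"faq", "question", "answer", "support"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def core_vocabulary_features(text, vocabularies=CORE_VOCABULARY):
    """Return, per genre, the fraction of tokens found in its core vocabulary."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    features = {}
    for genre, vocabulary in vocabularies.items():
        hits = sum(counts[word] for word in vocabulary)
        # Guard against empty documents to avoid division by zero.
        features[genre] = hits / total if total else 0.0
    return features


features = core_vocabulary_features(
    "Add to cart: price includes shipping. Order now!"
)
```

Features of this kind need only one pass over the tokens and a few set lookups, which is one plausible reading of why vocabulary-based features are cheap compared with deeper linguistic analyses. The resulting feature vector could then be passed to any standard classifier.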
Development of a Genre Acquisition Tool