    Web Technology · Information Systems · Prof. Dr. Benno Stein

    Webis-CLS-10

    Synopsis

    The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese.

    For more information on the construction of the dataset see (Prettenhofer and Stein, 2010) or the enclosed readme files. If you have a question after reading the paper and the readme files, please contact Peter Prettenhofer.

    Download

    We provide the dataset in two formats: (1) a processed format, which corresponds to the preprocessing (tokenization, etc.) described in (Prettenhofer and Stein, 2010); and (2) an unprocessed format, which contains the full text of the reviews (e.g., for machine translation or feature engineering).

    A note: if you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus as follows:

    Peter Prettenhofer and Benno Stein. Cross-Language Text Classification using Structural Correspondence Learning. In 48th Annual Meeting of the Association for Computational Linguistics (ACL 10), 1118-1127, July 2010. Association for Computational Linguistics.

    Corpus Outline

    The dataset was first used by (Prettenhofer and Stein, 2010). It consists of Amazon product reviews for three product categories (books, DVDs, and music) written in four languages: English, German, French, and Japanese. The German, French, and Japanese reviews were crawled from Amazon in November 2009. The English reviews were sampled from the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). For each language-category pair there are three sets: training documents, test documents, and unlabeled documents. The training and test sets comprise 2,000 documents each, whereas the number of unlabeled documents varies from 9,000 to 170,000.
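
The layout described above can be sketched as a short enumeration of the language-category pairs, with the number of labeled documents they imply. This is only an illustration: the short language and category codes below are hypothetical labels chosen for this sketch, not necessarily the names used in the downloadable archives, and the unlabeled sets are left out because their sizes vary per pair.

```python
from itertools import product

# Hypothetical short codes for illustration; the archive's actual
# directory and file names may differ (see the enclosed readme files).
LANGUAGES = ["en", "de", "fr", "jp"]
CATEGORIES = ["books", "dvd", "music"]

# Training and test sets comprise 2,000 documents each per pair.
LABELED_PER_SPLIT = 2000
LABELED_SPLITS = ["train", "test"]


def language_category_pairs():
    """Return every (language, category) combination in the corpus."""
    return list(product(LANGUAGES, CATEGORIES))


def labeled_document_count():
    """Total number of labeled (training plus test) documents."""
    return len(language_category_pairs()) * len(LABELED_SPLITS) * LABELED_PER_SPLIT


if __name__ == "__main__":
    for language, category in language_category_pairs():
        print(language, category)
    print("labeled documents:", labeled_document_count())
```

Under these figures there are 12 language-category pairs and 48,000 labeled documents in total; the remainder of the roughly 800,000 reviews is unlabeled.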

    For more information on the construction of the dataset see (Prettenhofer and Stein, 2010) and the enclosed readme files.

    People

    We kindly thank Mark Dredze and John Blitzer for permission to include a sample of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007) in our dataset.

    Related Publications

    Peter Prettenhofer and Benno Stein. Cross-Language Text Classification using Structural Correspondence Learning. In 48th Annual Meeting of the Association for Computational Linguistics (ACL 10), 1118-1127, July 2010. Association for Computational Linguistics.
