    Web Technology · Information Systems · Prof. Dr. Benno Stein

    PAN-WVC-11

    Synopsis

    The PAN Wikipedia vandalism corpus 2011 (PAN-WVC-11) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. For research purposes the corpus can be used free of charge.

    This corpus supplements the PAN-WVC-10, which features only English edits. Both corpora should be used to get more representative results.

    Download

    To download the corpus use the following link:

    Note: If you use the corpus in your research, please send us a copy of your publication. We kindly ask you to cite the corpus as follows:

    Martin Potthast. Crowdsourcing a Wikipedia Vandalism Corpus. In Hsin-Hsi Chen, Efthimis N. Efthimiadis, Jacques Savoy, Fabio Crestani, and Stéphane Marchand-Maillet, editors, 33rd International ACM Conference on Research and Development in Information Retrieval (SIGIR 10), pages 789-790, July 2010. ACM. ISBN 978-1-4503-0153-4. [doi] [paper] [bib] [poster]

    Corpus Outline

    As part of our research on automatic vandalism detection, we have compiled a corpus of vandalism cases found in Wikipedia. The corpus comprises 29949 edits on 24351 Wikipedia articles, among which 2813 vandalism edits have been identified. It features 9985 English edits, 9990 German edits, and 9974 Spanish edits. To annotate the corpus, we used Amazon's Mechanical Turk: each edit was presented to a number of annotators, who were asked to decide whether it is vandalism or regular, and the annotators' agreement was analyzed in order to label the edit.
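    The figures above can be cross-checked, and the labeling step illustrated, with a minimal Python sketch. The counts are taken from the outline; the function label_by_agreement and its two-thirds threshold are hypothetical, since the page does not state which agreement rule was actually applied to the annotators' votes.

```python
from collections import Counter

# Edit counts as reported in the corpus outline.
EDITS_PER_LANGUAGE = {"en": 9985, "de": 9990, "es": 9974}
TOTAL_EDITS = 29949
VANDALISM_EDITS = 2813


def vandalism_ratio(vandalism: int, total: int) -> float:
    """Fraction of edits in the corpus that are labeled as vandalism."""
    return vandalism / total


def label_by_agreement(votes, min_agreement=2 / 3):
    """Hypothetical labeling rule (an assumption, not the documented one).

    Returns the most frequent vote if its share of all votes reaches
    min_agreement; otherwise returns None to signal that the edit would
    need further annotation.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None


if __name__ == "__main__":
    print(sum(EDITS_PER_LANGUAGE.values()) == TOTAL_EDITS)
    print(f"{vandalism_ratio(VANDALISM_EDITS, TOTAL_EDITS):.1%}")
    print(label_by_agreement(["vandalism", "vandalism", "regular"]))
```

    Under these assumptions, an edit with two "vandalism" votes out of three would be labeled vandalism, while an evenly split vote would remain unlabeled.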

    The corpus has been successfully employed in the 2nd International Competition on Wikipedia Vandalism Detection, PAN'11, which was held in conjunction with the CLEF'11 conference.

    Previous Corpus Versions. This corpus supplements the PAN-WVC-10 corpus, which consists of more than 30000 English article edits. Although the edits for both corpora were chosen from the same time frame, they do not intersect, so the two corpora can be combined for a more representative evaluation.
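    Combining the two corpora can be sketched as follows. This is an illustration only: it assumes each corpus has been loaded into a dictionary that maps a unique edit identifier to its label, which is a hypothetical in-memory representation and not the corpus file format. The check guards the property stated above, namely that the two sets of edits do not intersect.

```python
def combine_corpora(first: dict, second: dict) -> dict:
    """Merge two edit-id -> label mappings, refusing to merge on overlap.

    The dictionaries are a hypothetical representation of the corpora;
    the identifiers used as keys are assumed to be unique per edit.
    """
    overlap = first.keys() & second.keys()
    if overlap:
        raise ValueError(f"corpora share {len(overlap)} edit(s): {sorted(overlap)}")
    return {**first, **second}


if __name__ == "__main__":
    # Toy data with made-up identifiers, not real corpus entries.
    wvc10 = {101: "regular", 102: "vandalism"}
    wvc11 = {201: "regular", 202: "regular"}
    combined = combine_corpora(wvc10, wvc11)
    print(len(combined))
```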

    People

    Students: Teresa Holfeld

    Related Publications

    Martin Potthast and Teresa Holfeld. Overview of the 2nd International Competition on Wikipedia Vandalism Detection. In Vivien Petras, Pamela Forner, and Paul D. Clough, editors, Notebook Papers of CLEF 11 Labs and Workshops, September 2011. ISBN 978-88-904810-1-7. [publisher] [paper] [bib]
    Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. Fourth International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 10). SIGIR Forum, 45(1):45-48, June 2011. ACM. ISSN 0163-5840. [doi] [paper] [bib]
    Martin Potthast, Benno Stein, and Teresa Holfeld. Overview of the 1st International Competition on Wikipedia Vandalism Detection. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, Notebook Papers of CLEF 10 Labs and Workshops, September 2010. ISBN 978-88-904810-2-4. [publisher] [paper] [bib] [slides]
    Martin Potthast. Crowdsourcing a Wikipedia Vandalism Corpus. In Hsin-Hsi Chen, Efthimis N. Efthimiadis, Jacques Savoy, Fabio Crestani, and Stéphane Marchand-Maillet, editors, 33rd International ACM Conference on Research and Development in Information Retrieval (SIGIR 10), pages 789-790, July 2010. ACM. ISBN 978-1-4503-0153-4. [doi] [paper] [bib] [poster]
    Martin Potthast, Benno Stein, and Robert Gerling. Automatic Vandalism Detection in Wikipedia. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval. 30th European Conference on IR Research (ECIR 08), volume 4956 of Lecture Notes in Computer Science, pages 663-668, 2008. Springer. ISBN 978-3-540-78645-0. [doi] [paper] [bib] [poster]
    Martin Potthast and Robert Gerling. Wikipedia Vandalism Corpus Webis-WVC-07. http://www.uni-weimar.de/medien/webis/research/corpora, 2007. [corpus] [bib]
