    Web Technology · Information Systems · Prof. Dr. Benno Stein

    PAN-PC-09

    Synopsis

    This corpus is outdated. Please use its successor PAN-PC-11.

    The PAN plagiarism corpus 2009 (PAN-PC-09) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

    Download

    To download the corpus, use the following links
    (consider using a download manager):

    All parts are required. Extract only the first part; your archiver will then extract the other two parts automatically.

    A note: if you use the corpus in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus as follows:

    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In 23rd International Conference on Computational Linguistics (COLING 10), August 2010. Association for Computational Linguistics. [paper] [bib] [poster]

    You might also be interested in the following items:

    Corpus Outline

    The PAN-PC-09 can be used to evaluate two retrieval tasks pertaining to automatic plagiarism detection:

    • External Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and their respective source sections in the source documents.
    • Intrinsic Plagiarism Detection. Given only a set of suspicious documents, the task is to identify all plagiarized sections, e.g., by detecting writing style breaches. The comparison of a suspicious document with other documents is not allowed in this task.
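
    To make the external detection task concrete, the following is a minimal sketch of a naive baseline that reports which source documents share verbatim word n-grams with a suspicious document. It is not the method used to build or evaluate the corpus; the function names, the n-gram length of 5, and the dictionary-based document representation are assumptions made for illustration. A real detector would also have to locate the matching sections and cope with obfuscated passages, which exact n-gram matching does not do.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(suspicious, sources, n=5):
    """Map each source id to the word n-grams it shares with `suspicious`.

    `sources` is a hypothetical dict of {source_id: source_text}.
    Only sources with at least one shared n-gram are returned.
    """
    suspicious_grams = word_ngrams(suspicious, n)
    hits = {}
    for source_id, source_text in sources.items():
        common = suspicious_grams & word_ngrams(source_text, n)
        if common:
            hits[source_id] = common
    return hits
```

    Intrinsic detection cannot use such a comparison, since only the suspicious document itself may be inspected.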

    The PAN-PC-09 contains documents in which artificial plagiarism has been inserted automatically. The plagiarism cases have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of random variables. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation per plagiarized section.
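
    The idea of a random plagiarist can be sketched as follows. This is a simplified illustration, not the program used to build the PAN-PC-09: it covers only two of the variables named above (section length and degree of obfuscation), it assumes word shuffling as the obfuscation strategy, and all function names, parameters, and annotation fields are hypothetical.

```python
import random


def obfuscate(words, degree, rng):
    """Permute a fraction `degree` (0.0 to 1.0) of the word positions.

    A degree of 0.0 yields a verbatim copy. Word shuffling is an
    assumed obfuscation strategy chosen for illustration only.
    """
    words = list(words)
    k = int(round(degree * len(words)))
    positions = rng.sample(range(len(words)), k)
    targets = positions[:]
    rng.shuffle(targets)
    result = list(words)
    for src, dst in zip(positions, targets):
        result[dst] = words[src]
    return result


def insert_plagiarism(suspicious, source, length, degree, rng):
    """Copy `length` words from `source` into `suspicious` at a random position.

    Returns the modified text and an annotation (word offsets) that
    records where the plagiarized section and its source are located.
    """
    suspicious_words = suspicious.split()
    source_words = source.split()
    length = min(length, len(source_words))
    source_start = rng.randrange(len(source_words) - length + 1)
    passage = source_words[source_start:source_start + length]
    passage = obfuscate(passage, degree, rng)
    insert_at = rng.randrange(len(suspicious_words) + 1)
    new_words = suspicious_words[:insert_at] + passage + suspicious_words[insert_at:]
    annotation = {
        "suspicious_word_offset": insert_at,
        "source_word_offset": source_start,
        "length_in_words": length,
        "obfuscation_degree": degree,
    }
    return " ".join(new_words), annotation
```

    Passing a seeded generator such as random.Random(0) makes a run reproducible. The returned annotation plays the role of the ground truth against which a detection algorithm could be scored.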

    A detailed description of the corpus construction can be found in the corpus readme file and in the related publications.

    Previous Corpus Versions. There have been two corpus versions prior to this one. The first version was the Webis-PC-08, in which we experimented for the first time with generating plagiarism semi-automatically. The second version was developed for the 1st International Competition on Plagiarism Detection at the PAN'09 workshop and was released in two steps, as a training corpus and a test corpus. Both versions are still available upon request, but we recommend using the current version in your research.

    People

    • Martin Potthast
    • Benno Stein
    • Alberto Barrón-Cedeño (NLEL at Universidad Politécnica de Valencia)
    • Paolo Rosso (NLEL at Universidad Politécnica de Valencia)

    Students: Andreas Eiselt

    Related Publications

    Martin Potthast, Andreas Eiselt, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. Overview of the 3rd International Competition on Plagiarism Detection. In Vivien Petras and Paul Clough, editors, Notebook Papers of CLEF 11 Labs and Workshops, September 2011. ISBN 978-88-904810-1-7. [paper] [bib] [slides]
    Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. Fourth International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 10). SIGIR Forum, 45 (1) : 45-48, June 2011. ACM. ISSN 0163-5840. [doi] [paper] [bib]
    Martin Potthast, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, and Paolo Rosso. Overview of the 2nd International Competition on Plagiarism Detection. In Martin Braschler and Donna Harman, editors, Notebook Papers of CLEF 10 Labs and Workshops, September 2010. ISBN 978-88-904810-0-0. [paper] [bib] [slides]
    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In 23rd International Conference on Computational Linguistics (COLING 10), August 2010. Association for Computational Linguistics. [paper] [bib] [poster]
    Alberto Barrón-Cedeño, Martin Potthast, Paolo Rosso, Benno Stein, and Andreas Eiselt. Corpus and Evaluation Measures for Automatic Plagiarism Detection. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias, editors, 7th Conference on International Language Resources and Evaluation (LREC 10), May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. [doi] [paper] [bib] [slides]
    Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. PAN Plagiarism Corpus PAN-PC-09. http://www.uni-weimar.de/medien/webis/research/corpora, 2009. [corpus] [bib]
    Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, and Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. In Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors, SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), pages 1-9, September 2009. CEUR-WS.org. ISSN 1613-0073. [publisher] [paper] [bib] [slides]
    Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Webis Plagiarism Corpus Webis-PC-08. http://www.uni-weimar.de/medien/webis/research/corpora, 2008. [corpus] [bib]
    Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Plagiarism Detection without Reference Collections. In Reinhold Decker and Hans J. Lenz, editors, Advances in Data Analysis. Selected papers from the 30th Annual Conference of the German Classification Society (GfKl 06), Studies in Classification, Data Analysis, and Knowledge Organization, pages 359-366, 2007. Springer. ISBN 978-3-540-70980-0. [doi] [paper] [bib]
    Sven Meyer zu Eißen and Benno Stein. Intrinsic Plagiarism Detection. In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, editors, Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 06), London, UK, volume 3936 of Lecture Notes in Computer Science, pages 565-569, 2006. Springer. ISBN 3-540-33347-9. [doi] [paper] [bib]
