PAN-PC-09

Synopsis

This corpus is outdated. Please use its successor PAN-PC-11.

The PAN plagiarism corpus 2009 (PAN-PC-09) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

Download

To download the corpus, use the following links (consider using a download manager):

All parts are required. Inflate only the first part; your archiver will inflate the other two parts automatically.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].

You might also be interested in the following items:

Research

The PAN-PC-09 can be used to evaluate two retrieval tasks pertaining to automatic plagiarism detection:

  • External Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and their respective source sections in the source documents.
  • Intrinsic Plagiarism Detection. Given only a set of suspicious documents, the task is to identify all plagiarized sections, e.g., by detecting writing style breaches. The comparison of a suspicious document with other documents is not allowed in this task.
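To make the input and output of the external task concrete, here is a minimal sketch of a naive baseline that reports shared word n-grams between a suspicious document and a set of source documents. It is not a method from the corpus authors. The function names, the n-gram length of 4, and the tuple layout of the result are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def word_ngrams(text, n):
    """Return (ngram, char_offset) pairs for every run of n consecutive words."""
    tokens = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(word for word, _ in tokens[i:i + n])
        grams.append((gram, tokens[i][1]))
    return grams


def find_shared_ngrams(suspicious, sources, n=4):
    """Report every word n-gram of the suspicious text that occurs in a source.

    `sources` maps a (hypothetical) document name to its text. Each match is
    a tuple (suspicious_offset, source_name, source_offset) in characters.
    """
    # Index all source n-grams once so each suspicious n-gram is a dict lookup.
    index = defaultdict(list)
    for name, text in sources.items():
        for gram, offset in word_ngrams(text, n):
            index[gram].append((name, offset))

    matches = []
    for gram, susp_offset in word_ngrams(suspicious, n):
        for name, src_offset in index.get(gram, []):
            matches.append((susp_offset, name, src_offset))
    return matches


sources = {"a.txt": "the quick brown fox jumps over the lazy dog"}
suspicious = "I saw that the quick brown fox jumps high"
matches = find_shared_ngrams(suspicious, sources)
```

Exact n-gram matching only finds verbatim copies. A real detector would also merge adjacent matches into sections and cope with obfuscated passages, which this sketch does not attempt.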

The PAN-PC-09 contains documents in which artificial plagiarism has been inserted automatically. The plagiarism cases have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of random variables. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, the length of a single plagiarized section, and the degree of obfuscation per plagiarized section.
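The idea of a random plagiarist can be illustrated with a small sketch that covers two of the variables named above: the length of a plagiarized section and its degree of obfuscation. This is not the program used to build the corpus. The function names, the word-shuffling obfuscation, and the annotation fields are assumptions made for this example, and the corpus-level and document-level plagiarism percentages are left out.

```python
import random


def obfuscate(passage, degree, rng):
    """Permute a fraction `degree` of the word positions (assumed obfuscation)."""
    words = passage.split()
    k = int(round(degree * len(words)))
    positions = rng.sample(range(len(words)), k)
    targets = positions[:]
    rng.shuffle(targets)
    result = words[:]
    # Move the word at each selected position to its shuffled target position.
    for src, dst in zip(positions, targets):
        result[dst] = words[src]
    return " ".join(result)


def insert_plagiarism(suspicious, source, section_words, degree, rng):
    """Copy a random source section into a random place in the suspicious text.

    Returns the new text and an annotation with the character offset and
    length of the inserted section. Whitespace is normalized to single spaces.
    """
    src_words = source.split()
    section_words = min(section_words, len(src_words))
    start = rng.randrange(len(src_words) - section_words + 1)
    passage = " ".join(src_words[start:start + section_words])
    passage = obfuscate(passage, degree, rng)

    susp_words = suspicious.split()
    cut = rng.randrange(len(susp_words) + 1)
    before = " ".join(susp_words[:cut])
    after = " ".join(susp_words[cut:])
    offset = len(before) + 1 if before else 0
    text = " ".join(part for part in (before, passage, after) if part)
    annotation = {"offset": offset, "length": len(passage), "obfuscation": degree}
    return text, annotation


source = "one two three four five six"
text, ann = insert_plagiarism("alpha beta gamma delta", source, 3, 0.0, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the generated cases reproducible, which matters when the same corpus has to be rebuilt for later evaluation runs.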

A detailed description of the corpus construction can be found in the corpus readme file and in the Publications.

Previous Corpus Versions. There have been two corpus versions prior to this one. The first version was the Webis-PC-08, in which we experimented for the first time with generating plagiarism semi-automatically. The second version was developed for the 1st International Competition on Plagiarism Detection at the PAN'09 workshop and was released in two steps, as a training corpus and a test corpus. Both versions are still available upon request, but we recommend using the current version in your research.

People

Students: Andreas Eiselt

Publications

Martin Potthast, Tim Gollub, Matthias Hagen, Martin Tippmann, Johannes Kiesel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Overview of the 5th International Competition on Plagiarism Detection. In Pamela Forner, Roberto Navigli, and Dan Tufis, editors, Working Notes Papers of the CLEF 2013 Evaluation Labs, September 2013. ISBN 978-88-904810-3-1. ISSN 2038-4963. [publisher] [paper] [bib] [slides]
Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent Trends in Digital Text Forensics and its Evaluation. In Pamela Forner et al., editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative (CLEF 13), pages 282-302, Berlin Heidelberg New York, September 2013. Springer. ISBN 978-3-642-40801-4. ISSN 0302-9743. [doi] [paper] [bib] [slides]
Martin Potthast. Technologien zur Wiederverwendung von Texten aus dem Web. In Steffen Hölldobler et al., editors, Ausgezeichnete Informatikdissertationen 2011, volume D-12 LNI of Lecture Notes in Informatics, pages 141-150, December 2012. Gesellschaft für Informatik. ISBN 978-3-88579-416-5. [publisher] [paper] [bib] [slides]
Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. Overview of the 4th International Competition on Plagiarism Detection. In Pamela Forner, Jussi Karlgren, and Christa Womser-Hacker, editors, Working Notes Papers of the CLEF 2012 Evaluation Labs, September 2012. ISBN 978-88-904810-3-1. ISSN 2038-4963. [publisher] [paper] [bib] [slides]
Martin Potthast, Matthias Hagen, Benno Stein, Jan Graßegger, Maximilian Michel, Martin Tippmann, and Clement Welsch. ChatNoir: A Search Engine for the ClueWeb09 Corpus. In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), page 1004, August 2012. ACM. ISBN 978-1-4503-1472-5. [doi] [paper] [bib]
Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December 2011. [publisher] [paper] [bib] [video] [slides]
Martin Potthast, Andreas Eiselt, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. Overview of the 3rd International Competition on Plagiarism Detection. In Vivien Petras, Pamela Forner, and Paul D. Clough, editors, Working Notes Papers of the CLEF 2011 Evaluation Labs, September 2011. ISBN 978-88-904810-1-7. ISSN 2038-4963. [publisher] [paper] [bib] [slides]
Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. 4th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 10). SIGIR Forum, 45(1):45-48, June 2011. [doi] [article] [bib]
Martin Potthast, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, and Paolo Rosso. Overview of the 2nd International Competition on Plagiarism Detection. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, Working Notes Papers of the CLEF 2010 Evaluation Labs, September 2010. ISBN 978-88-904810-2-4. ISSN 2038-4963. [publisher] [paper] [bib] [slides]
Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In Chu-Ren Huang and Dan Jurafsky, editors, 23rd International Conference on Computational Linguistics (COLING 10), pages 997-1005, Stroudsburg, Pennsylvania, August 2010. Association for Computational Linguistics. [paper] [bib] [poster]
Alberto Barrón-Cedeño, Martin Potthast, Paolo Rosso, Benno Stein, and Andreas Eiselt. Corpus and Evaluation Measures for Automatic Plagiarism Detection. In Nicoletta Calzolari et al., editors, 7th Conference on International Language Resources and Evaluation (LREC 10), May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. [paper] [bib] [slides]
Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. PAN Plagiarism Corpus PAN-PC-09. http://www.uni-weimar.de/medien/webis/corpora, 2009. [corpus] [bib]
Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, and Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. In Benno Stein et al., editors, SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), pages 1-9, September 2009. CEUR-WS.org. ISSN 1613-0073. [publisher] [paper] [bib] [slides]
Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Webis Plagiarism Corpus Webis-PC-08. http://www.uni-weimar.de/medien/webis/research/corpora, 2008. [corpus] [bib]
Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Plagiarism Detection without Reference Collections. In Reinhold Decker and Hans J. Lenz, editors, Advances in Data Analysis. Selected papers from the 30th Annual Conference of the German Classification Society (GFKL 06), Studies in Classification, Data Analysis, and Knowledge Organization, pages 359-366, Berlin Heidelberg New York, 2007. Springer. ISBN 978-3-540-70980-0. ISSN 1431-8814. [doi] [paper] [bib]
Sven Meyer zu Eißen and Benno Stein. Intrinsic Plagiarism Detection. In Mounia Lalmas et al., editors, Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 06), volume 3936 of Lecture Notes in Computer Science, pages 565-569, Berlin Heidelberg New York, 2006. Springer. ISBN 3-540-33347-9. ISSN 0302-9743. [doi] [paper] [bib]