This corpus is outdated. Please use its successor PAN-PC-11.
The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.
To download the corpus, use the following links (consider using a download manager):
(1 GB, MD5 sum: 66e4f2801f097da2c1537453d6edf4ee), and
(667.8 MB, MD5 sum: 629861d970aeda647ff7b7c4c1cc70f4).
All parts are required. Inflate only the first part; the other two parts will be inflated automatically by your archiver.
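Before inflating, each downloaded part can be checked against the MD5 sum listed with its link. The following is a minimal sketch in Python using only the standard library; the function names `md5_of_file` and `verify_download` are illustrative and not part of the corpus distribution, and the path of the downloaded part is whatever you saved it as.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that a part of about 1 GB does not
    have to fit into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest equals the published sum.

    Surrounding whitespace and letter case in `expected_md5` are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download(local_path, "629861d970aeda647ff7b7c4c1cc70f4")` should return `True` for an intact copy of the 667.8 MB part, where `local_path` is the location of your download. A mismatch usually indicates an incomplete or corrupted transfer, in which case the part should be downloaded again.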
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
You might also be interested in the following items:
- The corpus readme file: pan-pc-10-readme.txt.
- The results of the 1st International Competition on Plagiarism Detection.
- The results of the 2nd International Competition on Plagiarism Detection.
- The reference implementation of the plagiarism detection performance measures used in the above competitions.
The PAN-PC-10 can be used to evaluate the following retrieval task:
- Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
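To make the input and output of this task concrete, the following is a deliberately naive sketch in Python: it reports every window of `n` consecutive words in a suspicious document that also occurs verbatim in a source document. This is not the method used by any competition participant, nor the annotation format of the corpus (see the readme file for that); the `Detection` fields, the function names, and the window size are assumptions made for illustration only.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    """One reported section pair (hypothetical field names, character offsets)."""
    suspicious_offset: int
    suspicious_length: int
    source_name: str
    source_offset: int
    source_length: int


def tokenize(text):
    """Return (lowercased word, start offset, end offset) for each whitespace-delimited token."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def detect(suspicious, sources, n=5):
    """Report n-word windows of `suspicious` that occur verbatim in a source.

    `sources` maps a source document name to its text. Only exact,
    case-insensitive word matches are found; obfuscated plagiarism is missed.
    """
    # Index every n-word window of every source by its word sequence,
    # keeping the first occurrence of each window.
    index = {}
    for name, text in sources.items():
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            key = tuple(token[0] for token in tokens[i:i + n])
            index.setdefault(key, (name, tokens[i][1], tokens[i + n - 1][2]))

    # Look up every n-word window of the suspicious document in the index.
    detections = []
    tokens = tokenize(suspicious)
    for i in range(len(tokens) - n + 1):
        key = tuple(token[0] for token in tokens[i:i + n])
        if key in index:
            name, src_start, src_end = index[key]
            start, end = tokens[i][1], tokens[i + n - 1][2]
            detections.append(Detection(start, end - start, name, src_start, src_end - src_start))
    return detections
```

Each `Detection` pairs a section of the suspicious document with a section of one source document. A practical detector would additionally merge overlapping windows into longer sections and handle paraphrased text, which exact matching cannot find.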
The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
A detailed description of the corpus construction can be found in the associated publication.
- Martin Potthast
- Benno Stein
- Alberto Barrón-Cedeño (NLEL at Universidad Politécnica de Valencia)
- Paolo Rosso (NLEL at Universidad Politécnica de Valencia)
Students: Andreas Eiselt