PAN Plagiarism Corpus 2009

Copyright 2009 Bauhaus-Universität Weimar & Universidad Politécnica de Valencia
All rights reserved.

Table of Contents:
1. Introduction
2. Plagiarism Construction
3. Corpus Statistics
4. Corpus Format
5. Acknowledging the Corpus
6. Contact Information

1. Introduction

This corpus contains documents into which artificial plagiarism has been inserted automatically. We believe these documents will be useful for evaluating automatic plagiarism detection algorithms.

1.1 Source Documents

The documents in the corpus are based on books from Project Gutenberg (www.gutenberg.org). In total, the corpus is based on 22,135 English books, 527 German books, and 211 Spanish books.

1.2 Licence of the Corpus

All of the texts contained in this corpus are, to the best of our knowledge, public domain. The corpus can therefore be used free of charge and without any liabilities.

1.3 Plagiarism in the Corpus

As far as we know, this corpus does not contain any real plagiarism cases. Instead, all of the annotated plagiarism cases are completely artificial, i.e., generated by a computer program. We emphasize that we do not claim that any author whose work is contained in this corpus has actually committed plagiarism. Any resemblance to real plagiarism cases is coincidental.

1.4 Algorithms to be Evaluated with the Corpus

The corpus can be used to evaluate two kinds of plagiarism detection tasks:

(i) Intrinsic plagiarism detection: Given a set of suspicious documents, the task is to identify all plagiarized text passages, e.g., by detecting writing style breaches. Comparing a suspicious document with other documents is not allowed in this task.

(ii) External plagiarism detection: Given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents which have been plagiarized, together with the corresponding text passages in the source documents.

2. Plagiarism Construction

The plagiarism cases in this corpus have been generated automatically, which is why we call them "artificial plagiarism cases". Other kinds of plagiarism cases are simulated plagiarism cases and real plagiarism cases. The former are cases which have been purposefully constructed by a human, and the latter are cases where a real plagiarist was at work. This corpus does not contain any simulated or real plagiarism cases.

2.1 The Random Plagiarist

In order to create plagiarism cases we have set up a computer program which chooses text passages from a given set of source documents at random and inserts them into another set of documents. This so-called random plagiarist constructs the plagiarism cases according to a number of random variables whose distributions can be chosen by its operator. The variables include the percentage of plagiarism in the whole corpus, the percentage of plagiarism per document, and the length of a single plagiarized passage.

Another important variable is the degree of obfuscation of a single plagiarized passage. Real plagiarists often try to obfuscate their plagiarism by rephrasing the passages they copy. The random plagiarist tries to simulate this behavior by combining a number of different automatic obfuscation strategies.

Note that the obfuscated plagiarism cases produced by the random plagiarist are not necessarily human-readable, since generating text automatically is still an unsolved problem. However, with respect to standard retrieval models for text similarity, such as the vector space model, the random plagiarist creates cases where the obfuscated passage does not remotely resemble the original but whose measured similarity is still high.

Note also that in the part of the corpus intended for intrinsic plagiarism detection evaluation the plagiarism cases have not been obfuscated.
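The insertion step of such a random plagiarist can be sketched as follows. This is a minimal illustration in Python, not the program that was used to build the corpus: the function name insert_random_passage, its parameters, and the character-offset annotation dictionary are assumptions made for this example, and obfuscation is left out entirely.

```python
import random


def insert_random_passage(source, suspicious, length, rng):
    """Copy a random passage of `length` characters from `source` into `suspicious`.

    Returns the modified text and a (hypothetical) annotation of the inserted case.
    """
    if not 0 < length <= len(source):
        raise ValueError("length must be between 1 and len(source)")
    # Choose at random where the passage starts in the source document.
    source_offset = rng.randrange(len(source) - length + 1)
    passage = source[source_offset:source_offset + length]
    # Choose at random where the passage is inserted into the other document.
    this_offset = rng.randrange(len(suspicious) + 1)
    plagiarized = suspicious[:this_offset] + passage + suspicious[this_offset:]
    # Hypothetical annotation format: character offsets and passage length.
    annotation = {
        "this_offset": this_offset,
        "source_offset": source_offset,
        "length": length,
    }
    return plagiarized, annotation


# Usage: a seeded generator makes the random choices reproducible.
rng = random.Random(2009)
text, case = insert_random_passage(
    "It was the best of times, it was the worst of times.",
    "Call me Ishmael. Some years ago I went to sea.",
    length=12,
    rng=rng,
)
print(case)
```

In this sketch the passage length is passed in directly; a fuller version would draw it, and the number of insertions per document, from the operator-chosen distributions described above.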
2.2 Plagiarism Obfuscation Strategies

The random plagiarist employs random combinations of the following strategies, each applied with varying strength.

2.2.1 Paraphrasing

Given a sequence of tokens from a passage of text, each token is replaced by one of its synonyms, antonyms, hyponyms, or hypernyms, chosen at random. If none of these is available for a given token, the token is retained.

2.2.2 Parts-of-Speech Reordering

Given a sequence of tokens from a passage of text, its sequence of parts of speech is determined. Then the tokens from the text are reordered at random while their original sequence of parts of speech is maintained.

2.2.3 Random Text Operations

Given a sequence of tokens from a passage of text, words or short phrases are shuffled, removed, inserted, or replaced at random until a halting criterion is reached. Insertions and replacements may come from the new context in which the obfuscated passage will be inserted, or from other sources.

3. Corpus Statistics

Corpus size: 20,611 suspicious documents, 20,612 source documents.
Document lengths: small (up to paper size), medium, large (up to book size).
Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
Plagiarized passage length: short (few sentences), medium, long (many pages).
Plagiarism types: monolingual (obfuscation degrees none, low, and high), and multilingual (automatic translation).

4. Corpus Format

4.1 Contents of the Top-level Directory

Directories "external-analysis-corpus" and "intrinsic-analysis-corpus": these directories contain the parts of the corpus intended for evaluating external and intrinsic plagiarism detection algorithms, respectively.

File "document.xsd": XML schema for the XML files found in the corpus.
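A minimal sketch of how the suspicious documents of one part of the corpus could be enumerated, pairing each text file with its XML annotation file by name. It relies only on the directory and file naming described in this section (4.1 to 4.3). The function name iter_suspicious_documents and the parameters corpus_root and subcorpus are assumptions made for this example, and the XML files are deliberately not parsed, because their structure is defined in "document.xsd" and not described in this file.

```python
from pathlib import Path


def iter_suspicious_documents(corpus_root, subcorpus="external-analysis-corpus"):
    """Yield (text_path, xml_path) pairs for the suspicious documents of one subcorpus."""
    base = Path(corpus_root) / subcorpus / "suspicious-documents"
    # Documents are spread over the directories "part1", "part2", and so on.
    for txt_path in sorted(base.glob("part*/suspicious-document*.txt")):
        # The annotation file has the same name with the extension ".xml".
        xml_path = txt_path.with_suffix(".xml")
        if xml_path.exists():
            yield txt_path, xml_path
```

The text files are UTF-8 encoded, so a document could then be read with txt_path.read_text(encoding="utf-8"). Passing subcorpus="intrinsic-analysis-corpus" would enumerate the intrinsic part in the same way.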
4.2 Contents of the "external-analysis-corpus" directory

Directories "source-documents" and "suspicious-documents": documents which may have been used as a source for plagiarism and documents which may contain plagiarism, respectively.

File "retrieval-task.txt": description of the plagiarism detection retrieval task to be evaluated on the basis of this corpus.

4.2.1 Contents of the "source-documents" directory

Directories "partX", where X ranges from 1 to 8, which contain the following kinds of files:

File "source-documentKKKKKK.txt": a text file encoded in UTF-8 which may have been used as a source for plagiarism cases.

File "source-documentKKKKKK.xml": an XML file which gives meta information about the corresponding text file.

4.2.2 Contents of the "suspicious-documents" directory

Directories "partX", where X ranges from 1 to 8, which contain the following kinds of files:

File "suspicious-documentKKKKKK.txt": a text file encoded in UTF-8 in which plagiarism may have been inserted.

File "suspicious-documentKKKKKK.xml": an XML file which annotates the plagiarism inserted in the corresponding text file. Note that a text file may contain no plagiarism at all.

4.3 Contents of the "intrinsic-analysis-corpus" directory

Directory "suspicious-documents": documents which may contain plagiarism.

File "retrieval-task.txt": description of the plagiarism detection retrieval task(s) to be evaluated on the basis of this corpus.

4.3.1 Contents of the "suspicious-documents" directory

Directories "partX", where X ranges from 1 to 4, which contain the following kinds of files:

File "suspicious-documentKKKKKK.txt": a text file encoded in UTF-8 in which plagiarism may have been inserted.

File "suspicious-documentKKKKKK.xml": an XML file which annotates the plagiarism inserted in the corresponding text file. Note that a text file may contain no plagiarism at all.

5. Acknowledging the Corpus

We are very pleased to release this corpus, and it is our hope that it will foster the development of new plagiarism detection approaches. We would be happy to hear from you about how and with what success you used the corpus. If you use the corpus, we kindly ask you to refer to it in your publications as follows:

Citation template:

Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia.
Plagiarism Corpus PAN-PC-09.
http://www.webis.de/research/corpora, 2009.
Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso (editors).

BibTeX:

@MISC{webis:2009,
  AUTHOR = {{Webis at Bauhaus-Universität Weimar} and {NLEL at Universidad Politécnica de Valencia}},
  HOWPUBLISHED = {{http://www.webis.de/research/corpora}},
  TITLE = {{PAN Plagiarism Corpus 2009 (PAN-PC-09)}},
  YEAR = {2009},
  NOTE = {{Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso (editors)}}
}

6. Contact Information

If you have comments, suggestions, or questions about the corpus, or any other feedback, don't hesitate to send mail to corpora@webis.de.

Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, Paolo Rosso
Bauhaus-Universität Weimar and Universidad Politécnica de Valencia
July 9, 2009