Task 1 Plagiarism Detection
Detecting plagiarism by hand is a laborious retrieval task that can be aided or automated. This evaluation task is intended to foster the development of new solutions in this respect.
- Evaluation Task
- Evaluation Corpus
- Resources
- Submission of Detection Results
- Performance Measures
- Evaluation Results
Evaluation Task
There are two basic paradigms for detecting plagiarism automatically. External plagiarism detection refers to detecting plagiarized sections in a suspicious document together with the corresponding source sections in a given set of source documents. Intrinsic plagiarism detection refers to detecting plagiarized sections without comparing the suspicious document to any other documents, e.g., by detecting changes in writing style.
You will get the opportunity to apply one or both of these paradigms when solving the following task:
Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
Evaluation Corpus
We have set up a large-scale corpus of artificial plagiarism for this task. The corpus contains primarily English documents in which all types of plagiarism cases can be found, namely monolingual plagiarism with varying degrees of obfuscation, and translation plagiarism from Spanish or German source documents.
You can learn more about how the corpus has been constructed in this paper. The following gives a brief breakdown of the corpus statistics:
- Corpus size: 20 611 suspicious documents, 20 612 source documents.
- Document lengths: small (up to paper size), medium, large (up to book size).
- Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
- Plagiarized passage length: short (few sentences), medium, long (many pages).
- Plagiarism types: monolingual (no, low, and high obfuscation), and multilingual (automatic translation).
In the corpus you will find plain text files encoded in UTF-8 and, alongside each text file, an XML file with meta information. The documents are divided into two folders, one with the suspicious documents and one with the source documents. Details about the available meta information can be found within the corpus.
Training Collection
In order to prepare and develop your detection software, we provide a training collection which includes annotations for all plagiarism cases we inserted. The collection was used in last year's competition on plagiarism detection.
You can download the collection from here (PAN-PC-09).
Test Collection
The test collection is a revised version of the PAN-PC-09 corpus, in which a number of novel kinds of plagiarism cases have been inserted, namely, artificial plagiarism generated using a new strategy as well as a large number of simulated, man-made plagiarism cases. We will measure the performance of your detection software in detecting the plagiarism hidden in this collection. The test collection does not come with annotations. However, we will release it after the evaluation campaign has finished as a replacement for PAN-PC-09.
You can download the collection from here:
- pan10-plagiarism-test-collection-2010-05-17.part1.rar [1 GB, MD5 sum 27e18ea748c20ed0f4141b0b1fae09a5]
- pan10-plagiarism-test-collection-2010-05-17.part2.rar [0.7 GB, MD5 sum e7f4a6d368c8521464a98329c67b50ff]
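The MD5 sums listed above can be used to check that a download is complete. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are illustrative, and it assumes the archives were saved in the current directory under the file names shown in the list.

```python
import hashlib
import os

# File names and MD5 sums as given in the download list above.
EXPECTED_MD5 = {
    "pan10-plagiarism-test-collection-2010-05-17.part1.rar":
        "27e18ea748c20ed0f4141b0b1fae09a5",
    "pan10-plagiarism-test-collection-2010-05-17.part2.rar":
        "e7f4a6d368c8521464a98329c67b50ff",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives need not fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest equals `expected_md5`."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, expected in EXPECTED_MD5.items():
        if not os.path.exists(name):
            print(f"{name}: not found")
        elif verify_download(name, expected):
            print(f"{name}: checksum matches")
        else:
            print(f"{name}: checksum MISMATCH, download again")
```

A mismatch usually indicates a truncated or corrupted download; in that case the affected part should be fetched again before unpacking.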
Test Collection Annotations
The following archive, released after the result submission deadline, reveals the plagiarism cases contained in the test collection.
You can download the annotations from here:
- pan10-plagiarism-test-collection-annotations.rar [12 MB, MD5 sum f12387441ed4df24cd3b323e6d7866a4]
Submission of Detection Results
The results of your plagiarism detection software are required to be formatted in XML:
    <document reference="...">             <!-- file name of the suspicious document -->
      <feature name="detected-plagiarism"  <!-- type of the plagiarism annotation -->
               this_offset="5"             <!-- char offset within the suspicious document -->
               this_length="1000"          <!-- number of chars beginning at the offset -->
               <!-- the following attributes may be omitted if no source has been found -->
               source_reference="..."      <!-- file name of the source document -->
               source_offset="100"         <!-- char offset within the source document -->
               source_length="1000"        <!-- number of chars beginning at the offset -->
      />
      ...                                  <!-- more detections in this suspicious document -->
    </document>
The result document must be valid with respect to the XML schema found here.
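A result document in this format can be produced with Python's standard `xml.etree.ElementTree` module. The sketch below is only an illustration: the function name `build_result_document`, the dictionary keys, and the file names in the usage example are assumptions, and the output should still be validated against the official XML schema.

```python
import xml.etree.ElementTree as ET


def build_result_document(suspicious_name, detections):
    """Serialize the detections for one suspicious document as XML.

    `detections` is a list of dicts with the keys `this_offset` and
    `this_length`, and optionally `source_reference`, `source_offset`,
    and `source_length` (left out when no source has been found).
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for det in detections:
        attrs = {
            "name": "detected-plagiarism",
            "this_offset": str(det["this_offset"]),
            "this_length": str(det["this_length"]),
        }
        # The source attributes may be omitted if no source has been found.
        if det.get("source_reference") is not None:
            attrs["source_reference"] = det["source_reference"]
            attrs["source_offset"] = str(det["source_offset"])
            attrs["source_length"] = str(det["source_length"])
        ET.SubElement(root, "feature", attrs)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file names and offsets, for illustration only.
    xml_text = build_result_document(
        "suspicious-document00001.txt",
        [
            {"this_offset": 5, "this_length": 1000,
             "source_reference": "source-document00042.txt",
             "source_offset": 100, "source_length": 1000},
            {"this_offset": 4000, "this_length": 250},  # no source found
        ],
    )
    print(xml_text)
```

Building the element tree instead of concatenating strings ensures that special characters in file names are escaped correctly.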
In order to upload your results, please follow this tutorial.
Performance Measures
The success of a plagiarism detection software will be measured in terms of its precision, recall, and granularity on detecting the plagiarized passages in the corpus. Let s denote a plagiarized passage from the set S of all plagiarized passages. Let r denote a detection from the set R of all detections and let S_R be the subset of S for which detections exist in R. Let |s|, |r| denote the char lengths of s, r and let |S|, |R|, |S_R| be the sizes of the respective sets. The formulas compute as follows:
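Written out with the symbols defined above, the three measures take the following form. This is a reconstruction from those definitions and from the way the measures are described in the remarks, so the exact notation should be treated as an assumption:

```latex
\begin{align*}
\mathrm{recall}(S, R) &= \frac{1}{|S|} \sum_{s \in S}
    \frac{\#\,\text{detected chars of } s}{|s|}, \\[4pt]
\mathrm{precision}(S, R) &= \frac{1}{|R|} \sum_{r \in R}
    \frac{\#\,\text{plagiarized chars of } r}{|r|}, \\[4pt]
\mathrm{granularity}(S, R) &= \frac{1}{|S_R|} \sum_{s \in S_R}
    \#\,\text{detections of } s \text{ in } R.
\end{align*}
```

Recall and precision are thus averaged over passages and detections, respectively, and a granularity of 1 means that every detected passage is reported by exactly one detection.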
Remarks.
- We use character counts in the formulas for precision and recall instead of, for instance, word counts, because we cannot know what kind of tokenization approach you will be using. Counting the characters which overlap with plagiarized passages is therefore the safest way to compute these values.
- Recall and precision are well-known measures to assess retrieval performance, but granularity is not. We have added this performance measure to determine whether your plagiarism detection algorithm reports a plagiarized passage as a whole, or rather divided into many small and/or overlapping pieces. The former is preferable since it makes your tool more usable.
- External plagiarism cases and external detections comprise the chars of both the plagiarized passage and the source passage.
- An external detection r must overlap by at least one char with both the plagiarized passage and the source passage of the corresponding s, otherwise it will not contribute to the recall of s and the precision of r will be set to 0.
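To make the character-based counting concrete, here is a simplified sketch of the three measures for a single suspicious document. It is not the reference implementation: passages and detections are reduced to hypothetical `(offset, length)` pairs on the suspicious side only, so the source-side conditions from the last two remarks are ignored, and returning a granularity of 1.0 when nothing is detected is an assumed convention.

```python
def chars(passage):
    """Return the set of character positions covered by (offset, length)."""
    offset, length = passage
    return set(range(offset, offset + length))


def recall(S, R):
    """Average, over all plagiarized passages, of the fraction detected."""
    detected = set().union(*(chars(r) for r in R))
    return sum(len(chars(s) & detected) / len(chars(s)) for s in S) / len(S)


def precision(S, R):
    """Average, over all detections, of the fraction that is plagiarized."""
    if not R:
        return 0.0
    plagiarized = set().union(*(chars(s) for s in S))
    return sum(len(chars(r) & plagiarized) / len(chars(r)) for r in R) / len(R)


def granularity(S, R):
    """Average number of detections per detected plagiarized passage."""
    counts = [sum(1 for r in R if chars(s) & chars(r)) for s in S]
    detected_counts = [c for c in counts if c > 0]  # corresponds to S_R
    if not detected_counts:
        return 1.0  # assumed convention when nothing has been detected
    return sum(detected_counts) / len(detected_counts)


if __name__ == "__main__":
    # Two plagiarized passages; the first is found in two pieces, the
    # second is missed, and the third detection is a false positive.
    S = [(0, 100), (200, 50)]
    R = [(0, 50), (50, 25), (300, 10)]
    print(recall(S, R), precision(S, R), granularity(S, R))
```

In the usage example the first passage is reported in two pieces, which leaves precision and recall unaffected but raises the granularity to 2; this is the behaviour the granularity measure is meant to expose.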
Reference Implementation
A reference implementation of the aforementioned performance measures in Python can be found here. The functions to be used are macro_avg_recall_and_precision, granularity, and overall_score.
Evaluation Results
Out of 38 registered groups, 18 submitted results for this task. The plagiarism detection performances of all groups are listed in the table below, ranked by decreasing overall score. A more detailed analysis of the participants' performances will be given in the upcoming paper that overviews this task as well as in each participant's lab report.
How to interpret the results? Take the first row of the table as an example: in this case the participant's precision is 0.94, which means that 94.05% of their detections are correct, i.e., 5.95% of their detections are incorrect. The recall, on the other hand, is 0.6915, which means that the participant detected 69.15% of the plagiarism which is actually in the test collection, and 30.85% of the plagiarism has gone unnoticed. The granularity value is about 1.0, which, roughly speaking, means that one can expect the participant's algorithm to detect each plagiarism case at most once.
The overall score is a combination of recall, precision, and granularity, such that values close to 1 indicate good performance. These values cannot be interpreted as percentages. We computed them to allow for an absolute ranking among the participants, which would not have been possible based on precision, recall, and granularity alone. The latter, however, are what counts.
Plagiarism Detection Performance

| Rank | Overall | Recall | Precision | Granularity | Participant | Affiliation |
|---|---|---|---|---|---|---|
| 1 | 0.7971 | 0.6917 | 0.9414 | 1.0006 | J. Kasprzak and M. Brandejs | Masaryk University, Czech Republic |
| 2 | 0.7090 | 0.6299 | 0.9055 | 1.0675 | D. Zou, W. Long, and Z. Ling | South China University of Technology, China |
| 3 | 0.6948 | 0.7057 | 0.8417 | 1.1508 | M. Muhr, R. Kern, M. Zechner, and M. Granitzer | Know-Center Graz, Austria |
| 4 | 0.6209 | 0.4808 | 0.9085 | 1.0177 | C. Grozea* and M. Popescu° | *Fraunhofer FIRST, Germany; °University of Bucharest, Romania |
| 5 | 0.6066 | 0.4768 | 0.8479 | 1.0086 | G. Oberreuter, G. L'Huillier, S.A. Ríos, and J.D. Velásquez | University of Chile, Chile |
| 6 | 0.5851 | 0.4481 | 0.8507 | 1.0044 | D.A.R. Torrejón*,° and J.M.M. Ramos° | *IES "José Caballero", Spain; °Universidad de Huelva, Spain |
| 7 | 0.5191 | 0.4059 | 0.7256 | 1.0039 | R.C. Pereira, V.P. Moreira, and R. Galante | Universidade Federal do Rio Grande do Sul, Brazil |
| 8 | 0.5093 | 0.3856 | 0.7817 | 1.0195 | Y. Palkovskii, A. Belov, and I. Muzika | Zhytomyr State University and SkyLine, Inc. Ukraine |
| 9 | 0.4378 | 0.2868 | 0.9561 | 1.0108 | Sobha L., Pattabhi R.K R., Vijay S.R., A. Akilandeswari | MIT Campus of Anna University Chennai, India |
| 10 | 0.2564 | 0.3175 | 0.5062 | 1.8720 | T. Gottron | Universität Koblenz-Landau, Germany |
| 11 | 0.2222 | 0.2357 | 0.9308 | 2.2332 | D. Micol, Ó. Ferrández, and R. Muñoz | University of Alicante, Spain |
| 12 | 0.2148 | 0.3025 | 0.1787 | 1.0652 | M.R. Costa-jussà, R.E. Banchs, J. Grivolla, and J. Codina | Barcelona Media Research Center, Spain |
| 13 | 0.2053 | 0.1656 | 0.4047 | 1.2119 | R.M.A. Nawab, M. Stevenson, and P. Clough | University of Sheffield, UK |
| 14 | 0.2034 | 0.1446 | 0.4983 | 1.1465 | P. Gupta and S. Rao | DA-IICT, India |
| 15 | 0.1375 | 0.2620 | 0.9114 | 6.7764 | C. Vania and M. Adriani | Universitas Indonesia, Indonesia |
| 16 | 0.0558 | 0.0729 | 0.1348 | 2.2376 | P. Suárez*, J.C. González*,°, and J. Villena-Román*,^ | *Daedalus - Data, Decisions and Language, Spain; °Universidad Politécnica de Madrid, Spain; ^Universidad Carlos III de Madrid, Spain |
| 17 | 0.0195 | 0.0464 | 0.3460 | 17.3057 | S. Alzahrani* and N. Salim° | *Taif University, Saudi Arabia; °Universiti Teknologi Malaysia, Malaysia |
| 18 | 0.0008 | 0.0013 | 0.6035 | 8.6827 | A. Iftene et al. | University of Iasi, Romania |
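
The score by which the table is ranked can be recomputed from the recall, precision, and granularity values in each row. The sketch below assumes that the score is the harmonic mean of recall and precision divided by the base-2 logarithm of one plus the granularity. That formula is not stated on this page, and the function name `combined_score` is illustrative rather than the name used in the reference implementation.

```python
import math


def combined_score(recall, precision, granularity):
    """Combine recall, precision, and granularity into a single score.

    Assumed formula: F1(recall, precision) / log2(1 + granularity).
    With a granularity of exactly 1 the score equals the F1 measure;
    larger granularity values reduce it.
    """
    if recall + precision == 0:
        return 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    # Recall, precision, and granularity taken from rows 1, 10, and 17.
    rows = {
        1: (0.6917, 0.9414, 1.0006),
        10: (0.3175, 0.5062, 1.8720),
        17: (0.0464, 0.3460, 17.3057),
    }
    for rank, (rec, prec, gran) in rows.items():
        print(rank, round(combined_score(rec, prec, gran), 4))
```

Under this assumption the recomputed values agree with the scores listed for these rows up to rounding, which also shows how strongly a high granularity (as in row 17) lowers the score.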

