Task 1   Plagiarism Detection

Detecting plagiarism by hand is a laborious retrieval task, one that can be aided or automated. This evaluation task aims to foster the development of new solutions for it.

Evaluation Task

There are two basic paradigms for detecting plagiarism automatically. External plagiarism detection refers to detecting plagiarized sections in a suspicious document together with the corresponding source sections in a given set of source documents. Intrinsic plagiarism detection refers to detecting plagiarized sections without comparing the suspicious document to any other documents, e.g., by detecting changes in writing style.

You will get the opportunity to apply one or both of these paradigms when solving the following task:

Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.

Evaluation Corpus

We have set up a large-scale corpus of artificial plagiarism for this task. The corpus contains primarily English documents in which all types of plagiarism cases can be found, namely monolingual plagiarism with varying degrees of obfuscation, and translation plagiarism from Spanish or German source documents.

You can learn more about how the corpus has been constructed in this paper. The following gives a brief breakdown of the corpus statistics:

  • Corpus size: 20 611 suspicious documents, 20 612 source documents.
  • Document lengths: small (up to paper size), medium, large (up to book size).
  • Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
  • Plagiarized passage length: short (few sentences), medium, long (many pages).
  • Plagiarism types: monolingual (no, low, and high obfuscation), and multilingual (automatic translation).

In the corpus you will find plain text files encoded in UTF-8 and, alongside each text file, an XML file with meta information. The documents are divided into two folders, one with the suspicious documents and the other with the source documents. Details about the available meta information can be found within the corpus.

Training Collection

To help you prepare and develop your detection software, we provide a training collection that includes annotations for all plagiarism cases we inserted. The collection was used in last year's competition on plagiarism detection.

You can download the collection from here (PAN-PC-09).

Test Collection

The test collection is a revised version of the PAN-PC-09 corpus into which a number of novel kinds of plagiarism cases have been inserted, namely artificial plagiarism generated using a new strategy as well as a large number of simulated, man-made plagiarism cases. We will measure the performance of your detection software in detecting the plagiarism hidden in this collection. The test collection does not come with annotations. However, we will release the collection after the evaluation campaign is finished as a replacement for the PAN-PC-09.

You can download the collection from here:

The test collection is organized along the lines of the training collection; it is compressed into a 2-part RAR archive.

Test Collection Annotations

The following archive, released after the result submission deadline, reveals the plagiarism cases contained in the test collection.

You can download the collection from here:

Submission of Detection Results

The results of your plagiarism detection software are required to be formatted in XML:

<document reference="...">  <!-- file name of the suspicious document -->
  <!-- name:             type of the plagiarism annotation -->
  <!-- this_offset:      char offset within the suspicious document -->
  <!-- this_length:      number of chars beginning at the offset -->
  <!-- source_reference: file name of the source document -->
  <!-- source_offset:    char offset within the source document -->
  <!-- source_length:    number of chars beginning at the offset -->
  <!-- the three source_* attributes may be omitted if no source has been found -->
  <feature
    name="detected-plagiarism"
    this_offset="5"
    this_length="1000"
    source_reference="..."
    source_offset="100"
    source_length="1000"
  />
  <!-- ... more detections in this suspicious document -->
</document>

The result document must be valid with respect to the XML schema found here.
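As an illustration of the format above, the following minimal sketch builds one result document with Python's standard xml.etree.ElementTree module. The helper name build_result, the file names, and the offsets are hypothetical and chosen only for this example; the output is not checked against the XML schema here.

```python
import xml.etree.ElementTree as ET


def build_result(suspicious_name, detections):
    """Return a <document> element holding one <feature> per detection.

    Each detection is a dict with 'this_offset' and 'this_length' and,
    optionally, 'source_reference', 'source_offset', and 'source_length'.
    (Hypothetical helper, not part of any official tooling.)
    """
    doc = ET.Element("document", reference=suspicious_name)
    for det in detections:
        attrs = {
            "name": "detected-plagiarism",
            "this_offset": str(det["this_offset"]),
            "this_length": str(det["this_length"]),
        }
        # Source attributes are written only when a source was found.
        if "source_reference" in det:
            attrs["source_reference"] = det["source_reference"]
            attrs["source_offset"] = str(det["source_offset"])
            attrs["source_length"] = str(det["source_length"])
        ET.SubElement(doc, "feature", attrs)
    return doc


# Hypothetical file names and offsets, for illustration only.
result = build_result(
    "suspicious-document00001.txt",
    [
        {"this_offset": 5, "this_length": 1000,
         "source_reference": "source-document00042.txt",
         "source_offset": 100, "source_length": 1000},
        {"this_offset": 4200, "this_length": 350},  # no source found
    ],
)
xml_text = ET.tostring(result, encoding="unicode")
print(xml_text)
```

The second detection shows the case in which the source attributes are omitted because no source has been found.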

In order to upload your results, please follow this tutorial.

Performance Measures

The success of a plagiarism detection software will be measured in terms of its precision, recall, and granularity in detecting the plagiarized passages in the corpus. Let s denote a plagiarized passage from the set S of all plagiarized passages. Let r denote a detection from the set R of all detections, and let S_R be the subset of S for which detections exist in R. Let |s|, |r| denote the char lengths of s, r, and let |S|, |R|, |S_R| be the sizes of the respective sets. The measures are computed as follows:

[Figure: PAN'10 Plagiarism Detection Performance Measures]
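The formula image itself is not reproduced in this text. The following is a hedged reconstruction from the definitions above, not a copy of the original figure. The symbols s ⊓ r (the chars that s and r have in common, empty if r does not count as a detection of s), R_s (the subset of R that detects s), and F_1 (the harmonic mean of precision and recall) are introduced here as assumptions. The overall score in the last line is likewise an assumption; it does agree with the Overall column of the results table when evaluated on the listed values.

```latex
\begin{align*}
\mathrm{recall}(S,R) &= \frac{1}{|S|} \sum_{s \in S}
    \frac{\bigl|\bigcup_{r \in R} (s \sqcap r)\bigr|}{|s|} \\
\mathrm{precision}(S,R) &= \frac{1}{|R|} \sum_{r \in R}
    \frac{\bigl|\bigcup_{s \in S} (s \sqcap r)\bigr|}{|r|} \\
\mathrm{granularity}(S,R) &= \frac{1}{|S_R|} \sum_{s \in S_R} |R_s| \\
\mathrm{overall}(S,R) &= \frac{F_1}{\log_2\bigl(1 + \mathrm{granularity}(S,R)\bigr)},
    \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
                      {\mathrm{precision} + \mathrm{recall}}
\end{align*}
```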

Remarks.

  • We use character counts in the formulas for precision and recall instead of, for instance, word counts because we cannot know which tokenization approach you will be using. Counting the characters that overlap with plagiarized passages is therefore the safest way to compute these values.
  • Recall and precision are well-known measures to assess retrieval performance, but granularity is not. We have added this performance measure to determine whether your plagiarism detection algorithm reports a plagiarized passage as a whole, or rather divided into many small and/or overlapping phrases. The former is preferable since it makes your tool more usable.
  • External plagiarism cases and external detections comprise the chars of both the plagiarized passage and the source passage.
  • An external detection r must overlap by at least one char with both the plagiarized passage and the source passage of the corresponding s; otherwise it will not contribute to the recall of s, and the precision of r will be set to 0.
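To make the character-based counting concrete, here is a minimal sketch of the three measures in Python. It is a simplification under stated assumptions and not the reference implementation: passages are hypothetical (document, offset, length) triples on the suspicious side only, so source passages and the overlap rule from the last remark are ignored, and all passages are assumed to have a length greater than zero. The function names are made up for this example.

```python
def chars(passage):
    """Set of (document, position) pairs covered by a passage."""
    doc, offset, length = passage
    return {(doc, i) for i in range(offset, offset + length)}


def sketch_precision(cases, detections):
    """Mean fraction of each detection's chars that are plagiarized."""
    if not detections:
        return 0.0
    plagiarized = set().union(*(chars(s) for s in cases))
    return sum(len(chars(r) & plagiarized) / len(chars(r))
               for r in detections) / len(detections)


def sketch_recall(cases, detections):
    """Mean fraction of each plagiarized passage's chars that were detected."""
    if not cases:
        return 0.0
    detected = set().union(*(chars(r) for r in detections))
    return sum(len(chars(s) & detected) / len(chars(s))
               for s in cases) / len(cases)


def sketch_granularity(cases, detections):
    """Mean number of detections per detected passage (1.0 is ideal)."""
    counts = [sum(1 for r in detections if chars(s) & chars(r))
              for s in cases]
    detected_counts = [c for c in counts if c > 0]  # corresponds to S_R
    if not detected_counts:
        return 1.0
    return sum(detected_counts) / len(detected_counts)


# Two plagiarized passages; the first is found in two pieces,
# the second is missed, and one detection is a false alarm.
cases = [("d1", 0, 100), ("d1", 500, 100)]
detections = [("d1", 0, 50), ("d1", 50, 25), ("d1", 900, 50)]

p = sketch_precision(cases, detections)    # two of three detections are fully correct
r = sketch_recall(cases, detections)       # 75 of 100 chars of the first case, none of the second
g = sketch_granularity(cases, detections)  # the one detected case is reported twice
print(p, r, g)
```

In this toy example the first passage is reported as two separate pieces, which is exactly the behavior that granularity penalizes.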

Reference Implementation

A reference implementation of the aforementioned performance measures in Python can be found here. The functions to be used are macro_avg_recall_and_precision, granularity, and overall_score.

Evaluation Results

Out of 38 registered groups, 18 submitted results for this task. The plagiarism detection performances of all groups are listed in the table below, ranked by decreasing overall score. A more detailed analysis of the participants' performances will be given in the upcoming paper that overviews this task as well as in each participant's lab report.

How should the results be interpreted? Take the first row of the table as an example: the participant's precision is 0.9414, which means that 94.14% of the detections are correct and 5.86% are incorrect. The recall, on the other hand, is 0.6917, which means that the participant detected 69.17% of the plagiarism that is actually in the test collection, while 30.83% went unnoticed. The granularity value is about 1.0, which, roughly speaking, means that one can expect the participant's algorithm to detect each plagiarism case at most once.
The overall score is a combination of recall, precision, and granularity, such that values close to 1 indicate good performance. These values cannot be interpreted as percentages. We computed them to allow for an absolute ranking among the participants, which would not have been possible based on precision, recall, and granularity alone. The latter, however, are what counts.
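The text above does not spell out how the three values are combined. Assuming the combination is the harmonic mean of precision and recall divided by log2(1 + granularity), the short sketch below recomputes the overall score for the first-ranked participant from the values listed in the table. The function name plagdet_sketch is hypothetical and is not the overall_score function of the reference implementation.

```python
import math


def plagdet_sketch(recall, precision, granularity):
    """Assumed combination: F1 of precision and recall,
    discounted by log2(1 + granularity)."""
    if recall + precision == 0:
        return 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return f1 / math.log2(1 + granularity)


# Recall, precision, and granularity of the first-ranked participant.
score = plagdet_sketch(0.6917, 0.9414, 1.0006)
print(round(score, 4))  # 0.7971
```

With a granularity of exactly 1 the denominator is 1 and the score equals F1; a granularity above 1 lowers the score, which is why participants with good precision but high granularity rank low.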

Plagiarism Detection Performance
Rank  Overall  Recall  Precision  Granularity  Participant
1     0.7971   0.6917  0.9414     1.0006       J. Kasprzak and M. Brandejs (Masaryk University, Czech Republic)
2     0.7090   0.6299  0.9055     1.0675       D. Zou, W. Long, and Z. Ling (South China University of Technology, China)
3     0.6948   0.7057  0.8417     1.1508       M. Muhr, R. Kern, M. Zechner, and M. Granitzer (Know-Center Graz, Austria)
4     0.6209   0.4808  0.9085     1.0177       C. Grozea* and M. Popescu° (*Fraunhofer FIRST, Germany; °University of Bucharest, Romania)
5     0.6066   0.4768  0.8479     1.0086       G. Oberreuter, G. L'Huillier, S.A. Ríos, and J.D. Velásquez (University of Chile, Chile)
6     0.5851   0.4481  0.8507     1.0044       D.A.R. Torrejón*,° and J.M.M. Ramos° (*IES "José Caballero", Spain; °Universidad de Huelva, Spain)
7     0.5191   0.4059  0.7256     1.0039       R.C. Pereira, V.P. Moreira, and R. Galante (Universidade Federal do Rio Grande do Sul, Brazil)
8     0.5093   0.3856  0.7817     1.0195       Y. Palkovskii, A. Belov, and I. Muzika (Zhytomyr State University and SkyLine, Inc. Ukraine)
9     0.4378   0.2868  0.9561     1.0108       Sobha L., Pattabhi R.K R., Vijay S.R., A. Akilandeswari (MIT Campus of Anna University Chennai, India)
10    0.2564   0.3175  0.5062     1.8720       T. Gottron (Universität Koblenz-Landau, Germany)
11    0.2222   0.2357  0.9308     2.2332       D. Micol, Ó. Ferrández, and R. Muñoz (University of Alicante, Spain)
12    0.2148   0.3025  0.1787     1.0652       M.R. Costa-jussà, R.E. Banchs, J. Grivolla, and J. Codina (Barcelona Media Research Center, Spain)
13    0.2053   0.1656  0.4047     1.2119       R.M.A. Nawab, M. Stevenson, and P. Clough (University of Sheffield, UK)
14    0.2034   0.1446  0.4983     1.1465       P. Gupta and S. Rao (DA-IICT, India)
15    0.1375   0.2620  0.9114     6.7764       C. Vania and M. Adriani (Universitas Indonesia, Indonesia)
16    0.0558   0.0729  0.1348     2.2376       P. Suárez*, J.C. González*,°, and J. Villena-Román*,^ (*Daedalus - Data, Decisions and Language, Spain; °Universidad Politécnica de Madrid, Spain; ^Universidad Carlos III de Madrid, Spain)
17    0.0195   0.0464  0.3460     17.3057      S. Alzahrani* and N. Salim° (*Taif University, Saudi Arabia; °Universiti Teknologi Malaysia, Malaysia)
18    0.0008   0.0013  0.6035     8.6827       A. Iftene et al. (University of Iasi, Romania)