1st International Competition on Plagiarism Detection

Detecting plagiarism by hand is a laborious retrieval task, one which can be aided or automated. The PAN competition on plagiarism detection aims to foster the development of new solutions in this respect.



Competition Tasks

The competition is divided into two tasks:

  • External Plagiarism Analysis.
    Given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents which have been plagiarized, together with the corresponding text passages in the source documents.
  • Intrinsic Plagiarism Analysis.
    Given a set of suspicious documents, the task is to identify all plagiarized text passages, e.g., by detecting writing style breaches. The comparison of a suspicious document with other documents is not allowed in this task.
Participants may submit results for one or both of the tasks.

Award

Yahoo! Research will award a cash prize of 500 Euros to the winner of the competition.

Final Results

In total, we received submissions from 13 out of 21 registered participants. There were 10 submissions for the external plagiarism analysis task and 4 for the intrinsic plagiarism analysis task (1 participant submitted results for both tasks). The competition corpus contains 46,946 plagiarism cases, 36,475 of them in the corpus for the external analysis task, and the remaining 10,471 in the corpus for the intrinsic analysis task.

The following three tables summarize the detection performances of the participants: the first table lists the participants who took part in the external analysis task, the second table lists the participants who took part in the intrinsic analysis task, and the third table lists each participant's overall performance in both tasks. The participants are ranked according to the overall score, which is computed based on the F-measure, precision, recall, and granularity.

How to interpret the results? Take the first row of the first table as an example, and concentrate on the columns Precision, Recall, and Granularity. In this case the participant's precision is 0.7418 which means that 74.18% of his detections are correct, i.e., 25.82% of his detections are incorrect. The recall, on the other hand, is 0.6585 which means that the participant detected 65.85% of the plagiarism which is actually in the test collection, and 34.15% of the plagiarism has gone unnoticed. The granularity value is about 1.0 which, roughly speaking, means that one can expect that the participant's algorithm will detect each plagiarism case at most once.
The column F-measure is a combination of Precision and Recall. Note that here, the absolute values have no semantics attached; it can only be said that the closer the value is to 1, the better the participant's performance is. Likewise, the Overall score is a combination of F-measure and Granularity, so that, again, values close to 1 indicate good performance. In particular, these values cannot be interpreted as percentages. We computed these values to allow for an absolute ranking among the participants which would not have been possible based on Precision, Recall, and Granularity only. The latter, however, are what counts.
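As a rough illustration of how the two combined columns relate to the other three, the sketch below recomputes them for the first row of the external analysis table. The F-measure is taken as the harmonic mean of precision and recall, as stated in the competition rules. The formula for the overall score, F / log2(1 + granularity), is an assumption: it reproduces the published values to within rounding, whereas the rules describe the score as the F-measure "divided by granularity". The function names are illustrative only.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_score(precision: float, recall: float, granularity: float) -> float:
    """ASSUMPTION: overall = F / log2(1 + granularity).

    This form is inferred from the published tables, not quoted from the rules.
    """
    return f_measure(precision, recall) / math.log2(1 + granularity)


# First row of the external analysis table (published, rounded values).
f = f_measure(0.7418, 0.6585)
score = overall_score(0.7418, 0.6585, 1.0038)
print(f"F-measure ~ {f:.3f}, overall score ~ {score:.3f}")
```

Because the inputs are themselves rounded to four decimals, the recomputed values can differ from the table in the last digit.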

External Plagiarism Analysis Task
Rank  Overall score  F-measure  Precision  Recall  Granularity  Participant
1     0.6957         0.6976     0.7418     0.6585  1.0038       C. Grozea
                                                                Fraunhofer FIRST, Germany
2     0.6093         0.6192     0.5573     0.6967  1.0228       J. Kasprzak, M. Brandejs, and M. Křipač
                                                                Masaryk University, Czech Republic
3     0.6041         0.6491     0.6727     0.6272  1.1060       C. Basile*, D. Benedetto°, E. Caglioti°, and M. Degli Esposti*
                                                                *Università di Bologna and °Università La Sapienza, Italy
4     0.3045         0.5286     0.6689     0.4370  2.3317       Y. A. Palkovskii, A. V. Belov, and I. A. Muzika
                                                                Zhytomyr State University, Ukraine
5     0.1885         0.4603     0.6051     0.3714  4.4354       M. Granitzer, M. Muhr, M. Zechner, and R. Kern
                                                                Know-Center Graz, Austria
6     0.1422         0.6190     0.7473     0.5284  19.4327      V. A. Scherbinin* and S. Butakov°
                                                                *American University of Nigeria, Nigeria, and
                                                                °Solbridge International School of Business, South Korea
7     0.0649         0.1736     0.6552     0.1001  5.3966       R. C. Pereira, V. P. Moreira, and R. Galante
                                                                Universidade Federal do Rio Grande do Sul, Brazil
8     0.0264         0.0265     0.0136     0.4586  1.0068       E. Vallés Balaguer, using WCopyFind
                                                                Private, Spain
9     0.0187         0.0553     0.0290     0.6048  6.7780       J. A. Malcolm, P. C. R. Lane, and A. Rainer
                                                                Ferret, University of Hertfordshire, UK
10    0.0117         0.0226     0.3684     0.0116  2.8256       J. Allen
                                                                Southern Methodist University in Dallas, USA


Intrinsic Plagiarism Analysis Task
Rank  Overall score  F-measure  Precision  Recall  Granularity  Participant
1     0.2462         0.3086     0.2321     0.4607  1.3839       E. Stamatatos
                                                                University of the Aegean, Greece
2     0.1955         0.1956     0.1091     0.9437  1.0007       B. Hagbi and M. Koppel
                                                                Bar Ilan University, Israel
3     0.1766         0.2286     0.1968     0.2724  1.4524       M. Granitzer, M. Muhr, M. Zechner, and R. Kern
                                                                Know-Center Graz, Austria
4     0.1219         0.1750     0.1036     0.5630  1.7049       L. M. Seaward and S. Matwin
                                                                University of Ottawa, Canada


Overall Tasks
Rank  Overall score  F-measure  Precision  Recall  Granularity  Participant
1     0.4871         0.4884     0.5193     0.4610  1.0038       C. Grozea
                                                                Fraunhofer FIRST, Germany
2     0.4265         0.4335     0.3901     0.4877  1.0228       J. Kasprzak, M. Brandejs, and M. Křipač
                                                                Masaryk University, Czech Republic
3     0.4229         0.4544     0.4709     0.4390  1.1060       C. Basile*, D. Benedetto°, E. Caglioti°, and M. Degli Esposti*
                                                                *Università di Bologna and °Università La Sapienza, Italy
4     0.2131         0.3700     0.4682     0.3059  2.3317       Y. A. Palkovskii, A. V. Belov, and I. A. Muzika
                                                                Zhytomyr State University, Ukraine
5     0.1833         0.4001     0.4826     0.3417  3.5405       M. Granitzer, M. Muhr, M. Zechner, and R. Kern
                                                                Know-Center Graz, Austria
6     0.0996         0.4333     0.5231     0.3699  19.4327      V. A. Scherbinin* and S. Butakov°
                                                                *American University of Nigeria, Nigeria, and
                                                                °Solbridge International School of Business, South Korea
7     0.0739         0.0926     0.0696     0.1382  1.3839       E. Stamatatos
                                                                University of the Aegean, Greece
8     0.0586         0.0587     0.0327     0.2831  1.0007       B. Hagbi and M. Koppel
                                                                Bar Ilan University, Israel
9     0.0454         0.1216     0.4586     0.0701  5.3966       R. C. Pereira, V. P. Moreira, and R. Galante
                                                                Universidade Federal do Rio Grande do Sul, Brazil
10    0.0366         0.0525     0.0311     0.1689  1.7049       L. M. Seaward and S. Matwin
                                                                University of Ottawa, Canada
11    0.0184         0.0185     0.0095     0.3210  1.0068       E. Vallés Balaguer, using WCopyFind
                                                                Private, Spain
12    0.0131         0.0387     0.0203     0.4234  6.7780       J. A. Malcolm, P. C. R. Lane, and A. Rainer
                                                                Ferret, University of Hertfordshire, UK
13    0.0081         0.0157     0.2579     0.0081  2.8256       J. Allen
                                                                Southern Methodist University in Dallas, USA

Winner

We are happy to announce the following winners:

  • Task winner of the external analysis task is Cristian Grozea from Fraunhofer FIRST.
  • Task winner of the intrinsic analysis task is Efstathios Stamatatos from the University of the Aegean.
  • Overall winner of the 1st International Competition on Plagiarism Detection is Cristian Grozea from Fraunhofer FIRST.
Congratulations!

Competition Corpus

We have set up a large-scale corpus of artificial plagiarism for the competition. The corpus contains primarily English documents in which all types of plagiarism cases can be found, namely monolingual plagiarism with varying degrees of obfuscation, and translation plagiarism from Spanish or German source documents. The corpus is self-contained, i.e., the source documents of all plagiarism cases are part of the corpus.

To generate artificial plagiarism cases we have employed a random plagiarist: given a text, the plagiarist decides whether or not to plagiarize, from which documents to plagiarize, how many passages to plagiarize, and the type and length of each plagiarized passage. The type of a plagiarized passage may be either obfuscated plagiarism or translated plagiarism. The random plagiarist attempts to obfuscate his plagiarism by applying a random sequence of text operations such as shuffling a word, deleting a word, inserting a word from an external source, or replacing a word with a synonym, antonym, hypernym, or hyponym. Translated plagiarism is created using machine translation.
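The obfuscation step can be pictured with a small sketch. This is not the generator used to build the corpus: the word list, the tiny thesaurus, and the function name `obfuscate` are hypothetical stand-ins, and "shuffling a word" is interpreted here as swapping two word positions.

```python
import random

# Hypothetical toy resources; the corpus generator's actual word sources are not given.
EXTERNAL_WORDS = ["however", "indeed", "thus"]
SYNONYMS = {"big": "large", "quick": "fast"}


def obfuscate(text: str, num_operations: int, rng: random.Random) -> str:
    """Apply a random sequence of word-level operations to `text`."""
    words = text.split()
    for _ in range(num_operations):
        if not words:
            break  # nothing left to modify
        operation = rng.choice(["shuffle", "delete", "insert", "replace"])
        i = rng.randrange(len(words))
        if operation == "shuffle":
            # Swap the chosen word with another randomly chosen position.
            j = rng.randrange(len(words))
            words[i], words[j] = words[j], words[i]
        elif operation == "delete":
            del words[i]
        elif operation == "insert":
            words.insert(i, rng.choice(EXTERNAL_WORDS))
        else:
            # Replace with a related word if the toy thesaurus knows one.
            words[i] = SYNONYMS.get(words[i], words[i])
    return " ".join(words)


print(obfuscate("the quick brown fox is big", 5, random.Random(2009)))
```

Passing a seeded `random.Random` keeps the output reproducible, which makes it easier to compare detectors on the same obfuscated passage.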

Corpus Statistics

  • Corpus size: 20 611 suspicious documents, 20 612 source documents.
  • Document lengths: small (up to paper size), medium, large (up to book size).
  • Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
  • Plagiarized passage length: short (few sentences), medium, long (many pages).
  • Plagiarism types: monolingual (obfuscation degrees none, low, and high), and multilingual (automatic translation).

Corpus Format

In the corpus you will find plain text files encoded in UTF-8, and alongside each text file an XML file with meta information. The documents are divided into two folders, one with the suspicious documents and the other with the source documents. Details about the available meta information can be found within the corpus.

Release Plan

The corpus will be released partially during the competition, and in full after the competition. For each of the competition tasks, a development corpus and a competition corpus will be released. The development corpus will contain annotated artificial plagiarism cases; the competition corpus will contain artificial plagiarism cases without annotation. The former can be used to develop and evaluate your plagiarism detection software, while the latter will be used to determine the best plagiarism detection approach. Note that only your success in detecting the plagiarism in the competition corpus will be considered when selecting the winner of the competition.

Download

The full corpus, including annotations of all plagiarism cases for both tasks, can be found here.
The version of the corpus which was used during the competition is available on demand.


Performance Measures

The success of a plagiarism detection software will be measured in terms of its precision, recall, and granularity in detecting the plagiarized passages in the corpus. Let s denote a plagiarized passage from the set S of all plagiarized passages. Let r denote a detection from the set R of all detections, and let S_R be the subset of S for which detections exist in R. Let |s|, |r| denote the character lengths of s, r, and let |S|, |R|, |S_R| be the sizes of the respective sets. The measures are computed as follows:

PAN'09 Plagiarism Detection Performance Measures
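
The formulas were given as an image whose content is not reproduced in the text. A plausible reconstruction from the definitions above, offered as a best-effort reading rather than the organizers' exact notation, is the following, where "detected chars of s" are the characters of s covered by some detection and "plagiarized chars of r" are the characters of r that lie in some plagiarized passage:

```latex
\begin{align*}
\mathrm{recall}      &= \frac{1}{|S|}   \sum_{s \in S}   \frac{\#\,\text{detected chars of } s}{|s|} \\
\mathrm{precision}   &= \frac{1}{|R|}   \sum_{r \in R}   \frac{\#\,\text{plagiarized chars of } r}{|r|} \\
\mathrm{granularity} &= \frac{1}{|S_R|} \sum_{s \in S_R} \#\,\text{detections of } s \text{ in } R
\end{align*}
```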

Remarks.

  • We use character counts in the formulas for precision and recall instead of, for instance, word counts because we cannot know which tokenization approach you will be using. Counting the characters which overlap with plagiarized passages is therefore the safest way to compute these values.
  • Recall and precision are well-known measures to assess retrieval performance, but granularity is not. We have added this performance measure to determine whether your plagiarism detection algorithm reports a plagiarized passage as a whole, or rather divided into many small and/or overlapping pieces. The former is preferable since it makes your tool more usable.
  • External plagiarism cases and external detections comprise the chars of both the plagiarized passage and the source passage.
  • An external detection r must overlap by at least one char with both the plagiarized passage and the source passage of the corresponding s, otherwise it will not contribute to the recall of s and the precision of r will be set to 0.
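
To make the character-based counting concrete, here is a minimal sketch for a single suspicious document. It is a simplification under stated assumptions: only the suspicious-document side is considered (as in the intrinsic task), passages are given as hypothetical (offset, length) pairs with positive length, and granularity is reported as 1.0 when nothing is detected. It is not the official evaluation script.

```python
def char_set(offset, length):
    """Set of character positions covered by a passage."""
    return set(range(offset, offset + length))


def evaluate(cases, detections):
    """Return (precision, recall, granularity) for one suspicious document.

    `cases` and `detections` are lists of (offset, length) pairs, length > 0.
    """
    S = [char_set(o, n) for o, n in cases]
    R = [char_set(o, n) for o, n in detections]
    plagiarized = set().union(*S)  # all truly plagiarized characters
    detected = set().union(*R)     # all characters reported as plagiarized

    # Per-case and per-detection character fractions, averaged.
    recall = sum(len(s & detected) / len(s) for s in S) / len(S) if S else 0.0
    precision = sum(len(r & plagiarized) / len(r) for r in R) / len(R) if R else 0.0

    # S_R: cases overlapped by at least one detection.
    hits = [sum(1 for r in R if s & r) for s in S]
    hits = [h for h in hits if h > 0]
    # ASSUMPTION: granularity defaults to 1.0 when no case is detected.
    granularity = sum(hits) / len(hits) if hits else 1.0
    return precision, recall, granularity


# One 100-char case; two detections split it, a third misses entirely.
print(evaluate([(0, 100)], [(0, 50), (50, 25), (200, 25)]))
```

In this example the case is reported in two pieces, so the granularity is 2 even though three quarters of the case were found; a single detection covering the same characters would yield a granularity of 1.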

Registration

The registration is closed.

To register for participation in the competition send an e-mail to pan09@webis.de which includes the following information:

  • name of your group (optional),
  • full names, affiliations, and e-mail addresses of all group members,
  • the designated group leader, and
  • the competition tasks you will be participating in.
You will receive a short notification of your registration from one of the organizers.

Result Submission

The deadline for submitting detection results on the competition corpus is June 11, 2009.
The results of your plagiarism detection algorithm are required to be formatted in XML:

<document reference="...">  <!-- 'reference' refers to the analysed suspicious document -->

  <!-- Plagiarism which was detected in an external analysis:
       this_offset       the char offset within the suspicious document
       this_length       the number of chars beginning at the offset
       source_reference  reference to the source document
       source_offset     the char offset within the source document
       source_length     the number of chars beginning at the offset -->
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
           source_reference="..."
           source_offset="100"
           source_length="1000"
  />
  ...  <!-- more external analysis results in this suspicious document -->

  <!-- Plagiarism which was detected in an intrinsic analysis:
       just like above but excluding the "source" attributes -->
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
  />
  ...  <!-- more intrinsic analysis results in this suspicious document -->
</document>

The result document must be valid with respect to the XML schema found here.
In order to upload your results, please follow this tutorial.
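
A result document in the format above could be produced along the following lines. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `build_result` and the file names in the usage example are hypothetical, and the output is not validated against the competition's XML schema, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_result(suspicious_reference, detections):
    """Serialize detections for one suspicious document as an XML string.

    Each detection is a dict with `this_offset` and `this_length`, plus
    `source_reference`, `source_offset`, and `source_length` for external
    detections.
    """
    document = ET.Element("document", reference=suspicious_reference)
    for detection in detections:
        attributes = {"name": "detected-plagiarism"}
        # Attribute values must be strings in ElementTree.
        attributes.update({key: str(value) for key, value in detection.items()})
        ET.SubElement(document, "feature", attributes)
    return ET.tostring(document, encoding="unicode")


# Hypothetical file names; one external and one intrinsic detection.
print(build_result("suspicious.txt", [
    {"this_offset": 5, "this_length": 1000,
     "source_reference": "source.txt", "source_offset": 100, "source_length": 1000},
    {"this_offset": 5, "this_length": 1000},
]))
```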

Participant Network

We have set up a mailing list to connect prospective participants. Feel free to join!

Visit the mailing list.

Competition Rules

  • Agreement. Participation in the competition constitutes the participant's full and unconditional agreement and acceptance of these rules.
  • Eligibility. The contest is open to any party planning to attend the PAN competition. A person can participate in only one group. Multiple submissions per group are allowed for each task. We will not provide feedback on the performance at the time of submission: only the last submission before the deadline will be evaluated and all other submissions will be discarded.
  • Integrity. The exploitation of potential flaws in the competition corpus to gain advantages in the competition is prohibited.
  • Winner Selection. There will be one winner of the "External Plagiarism Analysis" task, one winner of the "Intrinsic Plagiarism Analysis" task, and one winner of the whole competition. The winners will be determined according to the following method. All participants are ranked according to their overall performance on the competition corpus for each task which is measured as F-measure (harmonic mean of precision and recall) divided by granularity. Winner of a task is the participant who has the highest score on the respective part of the corpus. Winner of the competition is the participant who has the highest score on the whole competition corpus.
  • Award. The winner of the whole competition will be awarded the prize money. We expect that one member of the winning group attends the forthcoming PAN workshop and presents his approach. The winner is also encouraged to submit a research paper about his approach to the workshop.

FAQ

  1. My software will not be able to detect cross-language plagiarism. Can I participate anyway?
    Yes, definitely! The corpora contain only a small percentage of cross-language plagiarism. However, when selecting the winner we will not distinguish participants who claim to detect cross-language plagiarism from those who don't.
  2. Is it mandatory to also submit a research paper to the workshop when participating in the competition?
    No, but we strongly encourage you to do so since this is a great opportunity for you to present your approach.
  3. Do I need to submit my paper in Spanish?
    No, unlike the SEPLN conference the PAN workshop will be held in English only.
  4. How often can I submit detection results?
    As often as you like; however, only the last submission counts for the competition.
  5. Is it possible to register only for the PAN workshop and not for the SEPLN conference?
    Yes.
  6. Can vendors of commercial plagiarism detection software participate?
    Yes.

Competition Organization

Martin Potthast and Andreas Eiselt (Bauhaus University Weimar), and
Alberto Barrón-Cedeño (Universidad Politécnica de Valencia)