Plagiarism Detection

Task Description

This year, we focus on external plagiarism detection, divided into the following two sub-tasks:

  • Candidate Document Retrieval:

    Given a suspicious document and a web search engine, the task is to retrieve a set of candidate source documents that may have served as an original to plagiarize from.

  • Detailed Comparison:

    Given a pair consisting of a suspicious document and a potential source document, the task is to extract all plagiarized passages from the suspicious document and their corresponding source passages from the source document.

Please prepare two pieces of software, one for each of the above tasks, that solve the task when presented with the inputs mentioned. More details about the inputs and the expected outputs of your software can be found below.

Evaluation Corpus

For each of the two sub-tasks, you will be given separate evaluation resources in the form of training and test corpora as well as web-services:

  • Candidate Document Retrieval:

    • Training corpus: a set of suspicious documents, each of which is about a specific topic and plagiarized from web pages on that topic found in the ClueWeb09 corpus. This corpus will contain annotations that reveal the plagiarism.

      pan12-plagiarism-candidate-retrieval-training-corpus-2012-03-31.zip
      (115 KB, MD5 sum: 9d030b013c823edbd68c820b70b55ba2)

    • Test corpus: a set of suspicious documents similar to the training corpus, but without annotations.

      pan12-plagiarism-candidate-retrieval-test-corpus-2012-05-21.zip
      (277 KB, MD5 sum: 44972cb3f70b2494bddb163cd6c9194c, access restricted)

      To access the test set, you need to apply for an access token, which is then also used to access the ChatNoir API. To apply for an access token, please send an email to pan@webis.de.

    • Search engine: the ChatNoir search engine indexes the ClueWeb09, and it offers a convenient API. Use this search engine to search for candidate documents for each of the suspicious documents found in the training and test corpora. In the test phase, you will be given a limited budget of queries.

      ChatNoir Search Engine

  • Detailed Comparison:

    • Training corpus: a set of pairs, each consisting of a suspicious document and a potential source document. The suspicious document may contain passages of text plagiarized from the source document. The corpus will contain automatically and manually generated plagiarism, including annotations that reveal where the plagiarized passages are. A readme file with detailed information about the corpus is included in the archive (readme.txt).

      pan12-detailed-comparison-training-corpus-2012-03-16.zip
      (499.5 MB, MD5 sum: c9f6cad903a041b6f16107814a7b9432)

    • Test corpus: the test corpus will not be made available for download. Instead, you are asked to upload your software as an executable to our evaluation platform (see below). This allows us, for the first time, to include real plagiarism cases in the test corpus. For your convenience, we provide our baseline detection program written in Python. It should help you with reading the documents and writing your detection results. A Java version of the baseline detection program is also available.

    • Evaluation platform: the TIRA evaluation platform will be used to evaluate your software. You may try it out during the training phase:

      PAN12 Evaluation Service: Training Phase

      The evaluation platform for the Test Phase will be made available as soon as possible.
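
The MD5 sums listed with the corpus archives above can be checked after download. The following is a minimal Python sketch using only the standard library; the function names are illustrative and not part of any PAN tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives (such as the
    detailed comparison training corpus) do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected_md5):
    """Compare the computed digest with the published one (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, archive_is_intact("pan12-detailed-comparison-training-corpus-2012-03-16.zip", "c9f6cad903a041b6f16107814a7b9432") should return True for an undamaged download of the detailed comparison training corpus.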

Performance Measures

For each of the two sub-tasks, the performance of your plagiarism detector will be determined based on standard performance measures:

  • Candidate Document Retrieval:

    Performance will be measured for each suspicious document in the test corpus in turn, using the following five measures; the results are then averaged over all documents:

    1. Number of queries submitted.
    2. Number of web pages downloaded.
    3. Precision and recall of web pages downloaded regarding the actual sources.
    4. Number of queries until the first actual source is found.
    5. Number of downloads until the first actual source is downloaded.
    Measures 1-3 capture the overall behavior of a system and measures 4-5 assess the time to first result. The quality of extracting plagiarized passages from the suspicious documents is not taken into account in this sub-task.

    To facilitate our performance evaluation, you will be assigned a unique access token which must be added to the URL when accessing the ChatNoir API. All other access methods will be disabled during the test phase.

  • Detailed Comparison:

    Performance will be measured using the plagdet score, which combines precision, recall, and granularity. Details about these measures can be found in this paper (Section 2). For convenience, we offer a reference implementation of the measures in Python. The functions to be used are macro_avg_recall_and_precision, granularity, and plagdet_score.

    Moreover, for the first time, the runtime of your software as measured during the test phase on the evaluation platform will be taken into account.
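
To make the measures above more concrete, here is a minimal Python sketch. It is not the reference implementation mentioned above: the function names are hypothetical, and matching downloaded pages to actual sources by exact URL is an assumption made for illustration only. The plagdet combination follows the usual definition, the harmonic mean of precision and recall divided by log2(1 + granularity); computing precision, recall, and granularity themselves at passage level is left to the reference implementation.

```python
import math


def retrieval_precision_recall(downloaded, actual_sources):
    """Precision and recall of downloaded pages with respect to the actual sources.

    Assumption: pages are identified by exact URL; the official
    evaluation may match pages differently.
    """
    downloaded = set(downloaded)
    actual_sources = set(actual_sources)
    hits = len(downloaded & actual_sources)
    precision = hits / len(downloaded) if downloaded else 0.0
    recall = hits / len(actual_sources) if actual_sources else 0.0
    return precision, recall


def plagdet(precision, recall, granularity):
    """Combine precision, recall, and granularity into a single score.

    Granularity is at least 1; a value of 1 means that every plagiarized
    passage was reported as exactly one detection, in which case the
    score equals the F1 measure.
    """
    if granularity < 1 or precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

With a granularity of 3, a detector with perfect precision and recall would be scored 1 / log2(4) = 0.5, which shows how strongly fragmented detections are penalized.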

Resources

For an overview of approaches to plagiarism detection, we would like to refer you to the proceedings of the past three plagiarism detection competitions.

For an overview of the TIRA evaluation platform visit http://tira.webis.de.

Run Submission

For each of the two sub-tasks, we expect your software to produce a certain output which is then evaluated using the aforementioned performance measures:

  • Candidate Document Retrieval:

    For each of the suspicious documents in the test corpus, create a TXT file which lists URLs pointing to the source web pages you have identified. Please include only web pages that, according to your retrieval algorithm, are actually sources for plagiarized passages, and exclude those which are not:

    http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    ...
    

    Optionally, you may rank the source web pages according to their likelihood of being sources for plagiarized passages. In that case, please prefix each URL with a weight value indicating its relevance:

    0.98 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    0.56 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    0.13 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
    ...
    

  • Detailed Comparison:

    Compile an executable of your software in binary format (x32/x86_64) along with all dependent libraries and upload it as a single archive file. Source code is welcome but not a requirement. Your software must be executable on either Windows 7, Windows XP, or Linux. To avoid misuse, internet access is granted to whitelisted resources only. You can request resources to be added to the whitelist. To execute your program, we need a program description (manual) that specifies additional dependencies, the parameters, and how to call your program.

    The output of your software shall be formatted in XML as follows:

    <document reference="...">    <!-- file name of the suspicious document        -->
    <!-- attributes of each feature element:
           name              type of the plagiarism annotation
           this_offset       char offset within the suspicious document
           this_length       number of chars beginning at the offset
           source_reference  file name of the source document
           source_offset     char offset within the source document
           source_length     number of chars beginning at the offset -->
    <feature
      name="detected-plagiarism"
      this_offset="5"
      this_length="1000"
      source_reference="..."
      source_offset="100"
      source_length="1000"
    />
    ...                           <!-- more detections in this suspicious document -->
    </document>

    The result document must be valid with respect to the XML schema found here.
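
Both output formats described above can be produced with the Python standard library. The following is a minimal sketch, not the baseline program: the names Detection, format_candidate_lines, and build_detection_xml are hypothetical, and formatting weights with two decimal places merely mirrors the example above. Whether the generated XML satisfies every constraint of the official schema is an assumption that should be checked against the schema itself.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Detection(NamedTuple):
    """One detected passage pair (hypothetical helper type)."""
    this_offset: int
    this_length: int
    source_reference: str
    source_offset: int
    source_length: int


def format_candidate_lines(urls, weights=None):
    """Return the lines of a candidate retrieval result file.

    Without weights, the URLs are listed as given. With weights, each
    URL is prefixed with its weight and the lines are sorted by
    descending weight.
    """
    urls = list(urls)
    if weights is None:
        return urls
    weights = list(weights)
    if len(weights) != len(urls):
        raise ValueError("need exactly one weight per URL")
    ranked = sorted(zip(weights, urls), key=lambda pair: pair[0], reverse=True)
    return ["%.2f %s" % (weight, url) for weight, url in ranked]


def build_detection_xml(suspicious_name, detections):
    """Serialize detections for one suspicious document as an XML string."""
    root = ET.Element("document", {"reference": suspicious_name})
    for detection in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(detection.this_offset),
            "this_length": str(detection.this_length),
            "source_reference": detection.source_reference,
            "source_offset": str(detection.source_offset),
            "source_length": str(detection.source_length),
        })
    return ET.tostring(root, encoding="unicode")
```

Joining the returned lines with newline characters yields the content of the TXT file for one suspicious document; the string returned by build_detection_xml can be written to the result file for one document pair.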

Evaluation Results

The results of the evaluation will be made available as noted in the important dates.

Task Committee

Martin Potthast, Benno Stein, Matthias Hagen, and Tim Gollub
Webis @ Bauhaus-Universität Weimar

Alberto Barrón-Cedeño, Paolo Rosso, and Parth Gupta
NLEL @ Universidad Politécnica de Valencia