Plagiarism Detection
Task Description
This year, we will focus on external plagiarism detection, and on the following two sub-tasks:
Candidate Document Retrieval:
Given a suspicious document and a web search engine, the task is to retrieve a set of candidate source documents that may have served as an original to plagiarize from.
Detailed Comparison:
Given a pair consisting of a suspicious document and a potential source document, the task is to extract all plagiarized passages from the suspicious document and their corresponding source passages from the source document.
Please prepare two pieces of software, one for each of the above tasks, that solve the task when presented with the inputs mentioned. More details about the inputs and the expected outputs of your software can be found below.
Evaluation Corpus
For each of the two sub-tasks, you will be given separate evaluation resources in the form of training and test corpora as well as web-services:
Performance Measures
For each of the two sub-tasks the performance of your plagiarism detector will be determined based on standard performance measures:
Candidate Document Retrieval:
Performance will be measured for each suspicious document in the test corpus in turn, based on the following 5 scores, averaging the results:
1. Number of queries submitted.
2. Number of web pages downloaded.
3. Precision and recall of web pages downloaded regarding the actual sources.
4. Number of queries until the first actual source is found.
5. Number of downloads until the first actual source is downloaded.
Measures 1-3 capture the overall behavior of a system and measures 4-5 assess the time to first result. The quality of extracting plagiarized passages from the suspicious documents is not taken into account in this sub-task.
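The five measures above can be approximated locally while developing a retrieval strategy. The following is a minimal sketch, not the official evaluation: the function name `retrieval_measures` and the event-log format are hypothetical, URLs are compared by exact string match, and "queries until the first actual source is found" is assumed to mean the number of queries submitted up to the first download of an actual source.

```python
def retrieval_measures(events, actual_sources):
    """Compute the five retrieval measures for one suspicious document.

    events: list of ("query", text) or ("download", url) tuples in the
            order they happened (hypothetical log format).
    actual_sources: iterable of URLs of the true source documents.
    """
    sources = set(actual_sources)
    queries = 0
    downloads = []
    queries_to_first = None    # stays None if no actual source is downloaded
    downloads_to_first = None

    for kind, value in events:
        if kind == "query":
            queries += 1
        elif kind == "download":
            downloads.append(value)
            if value in sources and downloads_to_first is None:
                queries_to_first = queries
                downloads_to_first = len(downloads)

    # Precision and recall are computed over distinct downloaded pages.
    unique = set(downloads)
    hits = unique & sources
    precision = len(hits) / len(unique) if unique else 0.0
    recall = len(hits) / len(sources) if sources else 0.0

    return {
        "queries": queries,
        "downloads": len(downloads),
        "precision": precision,
        "recall": recall,
        "queries_to_first_source": queries_to_first,
        "downloads_to_first_source": downloads_to_first,
    }
```

Averaging these per-document values over the test corpus then gives an estimate of the reported scores, under the assumptions stated above.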
To facilitate our performance evaluation, you will be assigned a unique access token, which must be added to the URL when accessing the ChatNoir API. All other access methods will be disabled during the test phase.
Detailed Comparison:
Performance will be measured using the plagdet score, which combines precision, recall, and granularity. Details about these measures can be found in this paper (Section 2). For convenience, we offer a reference implementation of the measures in Python. The functions to be used are macro_avg_recall_and_precision, granularity, and plagdet_score.
Moreover, for the first time, the runtime of your software as measured during the test phase on the evaluation platform will be taken into account.
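To illustrate how plagdet combines the three measures, the sketch below computes it as the F1 score (harmonic mean of precision and recall) divided by log2(1 + granularity). The function name `plagdet_from_scores` is hypothetical and is not one of the reference implementation's functions; the paper and the reference implementation remain authoritative for how precision, recall, and granularity themselves are computed.

```python
import math


def plagdet_from_scores(precision, recall, granularity):
    """Combine precision, recall, and granularity into a plagdet value.

    Hypothetical helper for illustration only. Granularity is expected
    to be at least 1 (exactly one detection per plagiarized passage).
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    # A granularity of 1 leaves F1 unchanged because log2(2) == 1.
    return f1 / math.log2(1 + granularity)
```

A detector that reports each plagiarized passage in several fragments has a granularity above 1 and is penalized accordingly, even if its precision and recall are unchanged.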
Resources
For an overview of approaches to plagiarism detection, we would like to refer you to the proceedings of the past three plagiarism detection competitions:
For an overview of the TIRA evaluation platform visit http://tira.webis.de.
Run Submission
For each of the two sub-tasks, we expect your software to produce a certain output which is then evaluated using the aforementioned performance measures:
Candidate Document Retrieval:
For each of the suspicious documents in the test corpus, create a TXT file which lists URLs pointing to the source web pages you have identified. Please include only web pages that, according to your retrieval algorithm, are actually sources for plagiarized passages and exclude those which are not:
http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
...
Optionally, you may rank the source web pages according to their likelihood of being sources for plagiarized passages. In that case, please prefix each URL with a weight value indicating its relevance:
0.98 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
0.56 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
0.13 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=...
...
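A run file in either of the two formats above could be produced as in the following sketch. The function name `format_run_lines` and the example URLs are hypothetical; sorting by descending weight and printing two decimals simply mirror the example above and are assumptions rather than stated requirements.

```python
def format_run_lines(sources, weighted=False):
    """Return the lines of a run file for one suspicious document.

    sources: list of (url, weight) pairs identified as actual sources.
    weighted: if True, prefix each URL with its weight and rank the
              lines by descending weight (assumed ordering).
    """
    if not weighted:
        return [url for url, _ in sources]
    ranked = sorted(sources, key=lambda pair: pair[1], reverse=True)
    return [f"{weight:.2f} {url}" for url, weight in ranked]


if __name__ == "__main__":
    # Placeholder URLs; real entries are the source page URLs you retrieved.
    found = [("http://example.org/page-a", 0.13),
             ("http://example.org/page-b", 0.98)]
    print("\n".join(format_run_lines(found, weighted=True)))
```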
Detailed Comparison:
Compile an executable of your software in binary format (x32/x86_64) along with all dependent libraries and upload it as a single archive file. Source code is welcome but not a requirement. Your software must be executable on Windows 7, Windows XP, or Linux. To avoid misuse, internet access is granted to whitelisted resources only. You can request resources to be added to the whitelist. To execute your program, we need a program description (manual) that specifies additional dependencies, the parameters, and how to call your program.
The output of your software shall be formatted in XML as follows:
<document reference="..."> <!-- file name of the suspicious document -->
<feature
name="detected-plagiarism" <!-- type of the plagiarism annotation -->
this_offset="5" <!-- char offset within the suspicious document -->
this_length="1000" <!-- number of chars beginning at the offset -->
source_reference="..." <!-- file name of the source document -->
source_offset="100" <!-- char offset within the source document -->
source_length="1000" <!-- number of chars beginning at the offset -->
/>
... <!-- more detections in this suspicious document -->
</document>
The result document must be valid with respect to the XML schema found here.
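As an illustration of the format above, the following sketch builds such a result document with Python's standard `xml.etree.ElementTree` module. The function name `build_result`, the dictionary keys, and the file names in the usage example are hypothetical. The comments shown in the template above are explanatory annotations; since comments inside a start tag are not well-formed XML, the sketch emits the attributes without them. It does not check the output against the XML schema.

```python
import xml.etree.ElementTree as ET


def build_result(suspicious_name, detections):
    """Serialize detections for one suspicious document as an XML string.

    detections: list of dicts with the keys this_offset, this_length,
                source_reference, source_offset, source_length
                (hypothetical input format; offsets and lengths in chars).
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for det in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(det["this_offset"]),
            "this_length": str(det["this_length"]),
            "source_reference": det["source_reference"],
            "source_offset": str(det["source_offset"]),
            "source_length": str(det["source_length"]),
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder file names, chosen only for illustration.
    print(build_result("suspicious.txt", [{
        "this_offset": 5, "this_length": 1000,
        "source_reference": "source.txt",
        "source_offset": 100, "source_length": 1000,
    }]))
```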
Evaluation Results
The results of the evaluation will be made available as noted in the important dates.
Task Committee
Martin Potthast, Benno Stein, Matthias Hagen, and Tim Gollub
Webis @ Bauhaus-Universität Weimar
Alberto Barrón-Cedeño, Paolo Rosso, and Parth Gupta
NLEL @ Universidad Politécnica de Valencia