If you are not in possession of the ClueWeb09 corpus, we also provide access to two search engines which index the ClueWeb, namely the Lemur Indri search engine and the ChatNoir search engine. To programmatically access these two search engines, we provide a unified search API.
Note: To better separate the source retrieval task from the text alignment task, the API provides a text alignment oracle feature. For each document you request to download from the ClueWeb, the text alignment oracle discloses whether this document is a source of plagiarism for the suspicious document in question. In addition, the plagiarized text is returned. This way, participation in the source retrieval task does not require the development of a text alignment solution. However, you are free to use your own text alignment if you want to.
For your convenience, we provide a baseline program written in Python.
The program loops through the suspicious documents in a given directory and outputs a search interaction log. The log is valid with respect to the output format described below. You may use the source code for getting started with your own approach.
For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an interaction log suspicious-documentXYZ.log which logs meta information about your retrieval process:
1358326592 barack obama family tree
1358326605 barack obama genealogy
For example, the above file would specify that at 1358326592 (Unix timestamp) the query barack obama family tree was sent, and that three of the retrieved documents were subsequently selected for download before the next query was sent.
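To make the log format concrete, here is a minimal Python sketch that reads such a log back into a list of queries. It assumes every line consists of a Unix timestamp, a space, and a payload. The excerpt above shows only query lines, so the treatment of download entries is an assumption: the sketch guesses that a payload starting with http:// or https:// is the URL of a downloaded document and attaches it to the preceding query. The names QueryEntry and parse_interaction_log are hypothetical and not part of the baseline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryEntry:
    """One query from the interaction log plus the downloads that followed it."""
    timestamp: int
    query: str
    downloads: List[str] = field(default_factory=list)


def parse_interaction_log(text: str) -> List[QueryEntry]:
    """Parse lines of the form '<unix timestamp> <payload>'."""
    entries: List[QueryEntry] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        stamp, _, payload = line.partition(" ")
        payload = payload.strip()
        # ASSUMPTION: a payload that looks like a URL is a download entry,
        # everything else is a query. The task page does not confirm this.
        if payload.startswith(("http://", "https://")):
            if entries:  # downloads before the first query are ignored
                entries[-1].downloads.append(payload)
            continue
        entries.append(QueryEntry(int(stamp), payload))
    return entries
```

Calling parse_interaction_log on the two example lines yields two QueryEntry objects, one per query, each with an empty download list.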
Performance will be measured based on the following five scores as averages over each suspicious document:
Measures 1-3 capture the overall behavior of a system and measures 4-5 assess the time to first result. The quality of identifying reused passages between documents is not taken into account here, but note that retrieving duplicates of a source document is considered a true positive, whereas retrieving more than one duplicate of a source document does not improve performance.
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.
During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the test corpus will be made available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
We ask you to prepare your software so that it can be executed via a command line call.
> mySoftware -i path/to/corpus -o path/to/output/directory -t accessToken
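As a starting point, the command line call above could be handled in Python roughly as follows. This is only a sketch under stated assumptions: the three flags -i, -o, and -t are taken from the call above, while the function names build_parser, log_path_for, and plan_outputs are hypothetical, and the retrieval step itself is left out.

```python
import argparse
from pathlib import Path
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Accept the three flags shown in the example call."""
    parser = argparse.ArgumentParser(prog="mySoftware")
    parser.add_argument("-i", dest="corpus", type=Path, required=True,
                        help="directory containing the suspicious documents")
    parser.add_argument("-o", dest="output", type=Path, required=True,
                        help="directory the output files are written to")
    parser.add_argument("-t", dest="token", required=True,
                        help="access token for the search API")
    return parser


def log_path_for(document: Path, output_dir: Path) -> Path:
    """Map suspicious-documentXYZ.txt to suspicious-documentXYZ.log."""
    return output_dir / (document.stem + ".log")


def plan_outputs(corpus: Path, output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every suspicious document with the log file it should produce."""
    documents = sorted(corpus.glob("suspicious-document*.txt"))
    return [(doc, log_path_for(doc, output_dir)) for doc in documents]
```

A main function would call build_parser().parse_args(), create the output directory, and then run the retrieval approach once per pair returned by plan_outputs.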
You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software, you retain full copyright. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for purposes other than the PAN competition.
This task may be solved in two alternative ways:
Collection: Find real-world instances of text reuse or plagiarism, and annotate them.
Generation: Given pairs of documents, generate passages of reused or plagiarized text between them. Apply a means of obfuscation of your choosing.
We ask you to prepare your corpus so that its format corresponds to that of the previous PAN plagiarism corpora. An example of a correctly formatted corpus can be downloaded here:
Enclosed in the evaluation corpora is a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml, which contains meta information about the plagiarism cases detected within:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
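The XML format described above can be written and read with Python's standard xml.etree.ElementTree module. The following is a minimal sketch, not the official tooling: the element and attribute names are taken from the example, whereas the Detection tuple and the two function names are hypothetical helpers introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One aligned passage; offsets and lengths are counted in characters."""
    this_offset: int
    this_length: int
    source_reference: str
    source_offset: int
    source_length: int


def detections_to_xml(suspicious: str, detections: List[Detection]) -> str:
    """Serialize detections for one suspicious document into the XML format."""
    root = ET.Element("document", {"reference": suspicious})
    for d in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(d.this_offset),
            "this_length": str(d.this_length),
            "source_reference": d.source_reference,
            "source_offset": str(d.source_offset),
            "source_length": str(d.source_length),
        })
    return ET.tostring(root, encoding="unicode")


def xml_to_detections(xml_text: str) -> List[Detection]:
    """Read all detected-plagiarism features back from an XML string."""
    root = ET.fromstring(xml_text)
    return [
        Detection(
            int(f.get("this_offset")),
            int(f.get("this_length")),
            f.get("source_reference"),
            int(f.get("source_offset")),
            int(f.get("source_length")),
        )
        for f in root.iter("feature")
        if f.get("name") == "detected-plagiarism"
    ]
```

For the example above, the single detection would be Detection(5, 1000, "source-documentABC.txt", 100, 1000), written to suspicious-documentXYZ-source-documentABC.xml.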
Performance will be measured by assessing the validity of your corpus in two ways.
Detection: Your corpus will be fed into the text alignment prototypes that have been submitted in previous years to the text alignment task. The performance of each text alignment prototype in detecting the plagiarism in your corpus will be measured using macro-averaged precision and recall, granularity, and the plagdet score.
Peer-review: Your corpus will be made available to the other participants of this task and be subject to peer-review. Every participant will be given a chance to assess and analyze the corpora of all other participants in order to determine corpus quality.
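For orientation on the detection measures mentioned above: plagdet is commonly defined in the PAN evaluation framework as the F1 score of precision and recall divided by log2(1 + granularity). The sketch below only combines three already computed values in this way. Computing the character-level macro-averaged precision, recall, and granularity themselves is not shown, and the function names are hypothetical.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine the three measures as F1 / log2(1 + granularity).

    Granularity is at least 1; a value of 1 means that every plagiarism
    case was reported as a single detection, so F1 is not penalized.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    return f1_score(precision, recall) / math.log2(1 + granularity)
```

With this definition, a detector that reports every case in several fragments is penalized even if its precision and recall are perfect.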
To submit your corpus, put it in a ZIP archive and make it available to us via a file sharing service of your choosing, e.g., Dropbox or Mega.
The text alignment task has been run since PAN'09; here is a quick list of the respective proceedings and overviews: