This task is divided into two sub-tasks: source retrieval and text alignment. You can choose to solve one or both of them.
To develop your software, we provide you with a training data set that consists of suspicious documents, each of which is about a specific topic and has been plagiarized from web pages on that topic found in the ClueWeb09 corpus.
If you are not in possession of the ClueWeb09 corpus, we also provide access to two search engines which index the ClueWeb, namely the Lemur Indri search engine and the ChatNoir search engine. To programmatically access these two search engines, we provide a unified search API.
Note: To better separate the source retrieval task from the text alignment task, the API provides a text alignment oracle feature. For each document you request to download from the ClueWeb, the text alignment oracle discloses whether this document is a source of plagiarism for the suspicious document in question. In addition, the plagiarized text is returned. This way, participation in the source retrieval task does not require the development of a text alignment solution. However, you are free to use your own text alignment if you want to.
For your convenience, we provide an example retrieval program written in Python.
The program loops through the suspicious documents in a given directory and outputs a search interaction log. The log is valid with respect to the output format described below. You may use the source code for getting started with your own approach.
The output of your software for each suspicious document must be an interaction log file suspicious-documentXYZ.log that looks like this:
Timestamp [Query|Download_URL]
1358326592 barack obama family tree
1358326597 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=110212744
1358326598 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=10221241
1358326599 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=100003305377
1358326605 barack obama genealogy
1358326610 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=82208332
...
For example, the above file would specify that at 1358326592 (Unix timestamp) the query barack obama family tree was sent, and that three of the retrieved documents were then selected for download before the next query was sent.
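The log format described above can be read back with a few lines of Python, which is handy for checking your own output before submitting. The following is a minimal sketch, not part of the provided example program: the function names are hypothetical, and it assumes that a payload starting with http:// or https:// is a download URL and that anything else is a query.

```python
from typing import List, Tuple


def parse_interaction_log(text: str) -> List[Tuple[int, str, str]]:
    """Turn log text into (timestamp, kind, payload) tuples.

    kind is "query" or "download". Lines that do not start with a
    numeric timestamp (for example a header line) are skipped.
    """
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        timestamp, payload = int(parts[0]), parts[1].strip()
        # Assumption: URLs are downloads, everything else is a query.
        is_url = payload.startswith(("http://", "https://"))
        entries.append((timestamp, "download" if is_url else "query", payload))
    return entries


def downloads_per_query(entries):
    """Group the downloaded URLs under the query that preceded them."""
    grouped = []
    for _, kind, payload in entries:
        if kind == "query":
            grouped.append((payload, []))
        elif grouped:
            grouped[-1][1].append(payload)
    return grouped


if __name__ == "__main__":
    sample = (
        "1358326592 barack obama family tree\n"
        "1358326597 http://webis15.medien.uni-weimar.de/chatnoir/clueweb?id=110212744\n"
        "1358326605 barack obama genealogy\n"
    )
    for query, urls in downloads_per_query(parse_interaction_log(sample)):
        print(query, len(urls))
```

Grouping downloads under their preceding query mirrors the reading of the log given above: every URL line belongs to the most recent query line.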
To measure the performance of your software, it will be run against internal test data similar to the training data. As performance measures, we employ the following five scores, averaged over all suspicious documents:
Measures 1-3 capture the overall behavior of a system and measures 4-5 assess the time to first result. The quality of identifying reused passages between documents is not taken into account in this sub-task, but note that retrieving duplicates of a source document is considered a true positive, whereas retrieving more than one duplicate of a source document does not improve performance.
We ask you to prepare your software so that it can be executed via a command line call. You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. Please test your software using one of the unit-test-scripts below. Download the script, fill in the required fields, and start it using the sh command. If the script runs without errors and if the correct output is produced, you can submit your software by sending your unit-test-script via e-mail. For more information see the PAN 2013 User Guide below.
PAN User Guide » Unit-Test Windows » Unit-Test Ubuntu »
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
To develop your software, we provide you with a training data set that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text is subject to various kinds of (automatic) obfuscation, including paraphrasing and summarization.
You may also use the training data of PAN'12.
For your convenience, we provide an example alignment program written in Python.
The program loops through the document pairs of a corpus and records the detection results in XML files. The XML files are valid with respect to the output format described below. You may use the source code for getting started with your own approach.
Your software must take as input the absolute path to an unpacked dataset, and has to output, for each pair of suspicious document and source document in the corpus' pairs file, an XML file that looks like this:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism" this_offset="5" this_length="1000" source_reference="source-documentABC.txt" source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
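An XML file of this shape can be produced with Python's standard library. The following is a minimal sketch rather than the provided example program: the function name and the tuple layout of a detection are assumptions made for illustration, while the element and attribute names are taken from the format shown above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(suspicious_name, detections):
    """Serialize detections for one document pair.

    detections is an iterable of tuples (hypothetical layout):
    (source_name, this_offset, this_length, source_offset, source_length),
    with offsets and lengths counted in characters.
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for source_name, this_offset, this_length, source_offset, source_length in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": source_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_detection_xml(
        "suspicious-documentXYZ.txt",
        [("source-documentABC.txt", 5, 1000, 100, 1000)],
    ))
```

Building the tree with ElementTree instead of string concatenation means that attribute values are escaped automatically.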
The naming of the output files is up to you; we recommend suspicious-documentXYZ-source-documentABC.xml. The output files have to be written either directly to the working directory (to "./") or to a subfolder.
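The recommended naming scheme can be derived directly from the two document names of a pair. The following is a minimal sketch with hypothetical helper names. It assumes that the pairs file lists one pair per line as two whitespace-separated file names, suspicious document first; that layout is not specified here, so check it against the training corpus.

```python
import os


def read_pairs(pairs_text):
    """Parse pairs-file content into (suspicious, source) tuples.

    Assumption: one pair per line, two whitespace-separated file names.
    Lines that do not match this layout are ignored.
    """
    pairs = []
    for line in pairs_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def output_filename(suspicious_name, source_name):
    """Build the recommended name from the two document names."""
    suspicious_stem = os.path.splitext(os.path.basename(suspicious_name))[0]
    source_stem = os.path.splitext(os.path.basename(source_name))[0]
    return f"{suspicious_stem}-{source_stem}.xml"


if __name__ == "__main__":
    content = "suspicious-documentXYZ.txt source-documentABC.txt\n"
    for suspicious, source in read_pairs(content):
        print(output_filename(suspicious, source))
```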
Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.
View details » Download measures
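To illustrate how the three measures combine, the following minimal sketch computes plagdet from precision, recall, and granularity that have already been calculated. It assumes the commonly published definition, in which the F1 score of precision and recall is divided by log2(1 + granularity). It is not the reference implementation, and the function name is chosen here for illustration only.

```python
import math


def plagdet(precision, recall, granularity):
    """Combine precision, recall, and granularity into one score.

    Assumed definition: F1(precision, recall) / log2(1 + granularity).
    A granularity of 1 means each case is detected in one piece, so the
    score then equals plain F1.
    """
    if granularity < 1:
        raise ValueError("granularity is expected to be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    print(plagdet(0.5, 0.5, 1.0))
    print(plagdet(0.5, 0.5, 3.0))
```

Under this definition, reporting the same plagiarism case in several fragments raises granularity above 1 and lowers the score even when precision and recall are unchanged.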
If you apply the performance measures program to the results produced by the example program for the corpus pan13-text-alignment-training-corpus-2013-01-21, you should get the following scores:
In addition, the runtime of each software is measured, and we will also introduce precision and recall measured at the level of plagiarism cases instead of at the character level.
The final evaluation results for text alignment are now available. We congratulate the winning team and thank all of you for your participation. We are looking forward to meeting you in Valencia!