Detecting plagiarism by hand is a laborious retrieval task that can be supported or automated. This evaluation task aims to foster the development of new solutions for it.
Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
Remark. This task combines both external plagiarism detection and intrinsic plagiarism detection, where the former refers to detecting plagiarized sections in a suspicious document and the corresponding source sections in a given set of source documents, and the latter refers to detecting plagiarized sections without comparing the suspicious document to any other documents, e.g., by detecting changes in writing style.
To develop your approach, we provide you with a training corpus which comprises a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages, the source passages of which may or may not be present in one or more of the source documents.
For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an XML file suspicious-documentXYZ.xml which contains meta information about all plagiarism cases detected within:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
           source_reference="suspicious-documentABC.txt"
           source_offset="100"
           source_length="1000" />
  ...
</document>
The source_* attributes may be omitted in case no source document can be identified for a given detected plagiarized passage.
The XML documents must be valid with respect to the XML schema found here.
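As an illustration of the output format above, the following is a minimal Python sketch that writes one result file per suspicious document. The function name `write_detections` and the dictionary layout of a detection are assumptions made for this example and are not part of the task definition; the sketch also does not validate the result against the XML schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_detections(suspicious_name, detections, out_dir="."):
    """Write the XML result file for one suspicious document.

    `detections` is a list of dicts (hypothetical layout) with the keys
    this_offset and this_length, plus the optional keys source_reference,
    source_offset and source_length.
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for det in detections:
        attrs = {
            "name": "detected-plagiarism",
            "this_offset": str(det["this_offset"]),
            "this_length": str(det["this_length"]),
        }
        # The source_* attributes are only written when a source was identified.
        for key in ("source_reference", "source_offset", "source_length"):
            if det.get(key) is not None:
                attrs[key] = str(det[key])
        ET.SubElement(root, "feature", attrs)

    # suspicious-documentXYZ.txt -> suspicious-documentXYZ.xml
    out_path = Path(out_dir) / (Path(suspicious_name).stem + ".xml")
    ET.ElementTree(root).write(str(out_path), encoding="utf-8", xml_declaration=True)
    return out_path
```

A detection without an identified source is passed as a dict that contains only `this_offset` and `this_length`; the corresponding `feature` element then carries no source_* attributes.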
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.
During the competition, the test corpus does not contain ground truth data that reveals whether or not a suspicious document contains any plagiarized passages. To find out the performance of your software on the test corpus, you must collect its output and submit it as described below.
After the competition, the test corpus is updated to include the ground truth data. This way, you have all the necessary data to evaluate your approach on your own, without submitting its output, while remaining comparable to those who took part in the competition.
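For a first impression when evaluating on your own, a simple character-level precision and recall can be computed as sketched below. This is not the official performance measure of the task, which is described in the overview paper. The sketch assumes that detected and ground truth passages of one suspicious document are available as (offset, length) pairs; the function names are hypothetical.

```python
def covered_chars(passages):
    """Return the set of character positions covered by (offset, length) pairs."""
    chars = set()
    for offset, length in passages:
        chars.update(range(offset, offset + length))
    return chars


def char_precision_recall(detected, truth):
    """Character-level precision and recall for one suspicious document.

    Both arguments are lists of (offset, length) pairs (assumed layout).
    Overlapping passages within one list are counted only once.
    """
    det = covered_chars(detected)
    ref = covered_chars(truth)
    overlap = len(det & ref)
    precision = overlap / len(det) if det else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


# One detection covering characters 0..99, one true case covering 50..149:
# 50 characters overlap, so both values are 0.5.
print(char_precision_recall([(0, 100)], [(50, 100)]))
```

Using sets of character positions keeps the sketch short; for large corpora, interval arithmetic would be more memory-efficient.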
To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to firstname.lastname@example.org.
Should the Zip archive be too large to be sent by email, please upload it to a file hosting service of your choice and share a download link with us.
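Packing the output files into a Zip archive can be done with any tool; as one possibility, here is a minimal Python sketch using the standard `zipfile` module. The function name `pack_run` and the directory layout (all XML result files in one directory) are assumptions made for this example.

```python
import zipfile
from pathlib import Path


def pack_run(output_dir, archive_path):
    """Pack all *.xml result files found in output_dir into one Zip archive.

    Returns the number of files that were added.
    """
    xml_files = sorted(Path(output_dir).glob("*.xml"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for xml_file in xml_files:
            # Store files flat, without the local directory prefix.
            zf.write(xml_file, arcname=xml_file.name)
    return len(xml_files)
```

Checking the returned count against the number of suspicious documents in the test corpus is a cheap way to notice missing result files before submitting.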
The following table lists the performances achieved by the participating teams:
|Plagiarism Detection Performance|
|0.7971|J. Kasprzak and M. Brandejs (Masaryk University, Czech Republic)|
|0.7090|D. Zou, W. Long, and Z. Ling (South China University of Technology, China)|
|0.6948|M. Muhr, R. Kern, M. Zechner, and M. Granitzer (Know-Center Graz, Austria)|
|0.6209|C. Grozea* and M. Popescu° (*Fraunhofer FIRST, Germany; °University of Bucharest, Romania)|
|0.6066|G. Oberreuter, G. L'Huillier, S.A. Ríos, and J.D. Velásquez (University of Chile, Chile)|
|0.5851|D.A.R. Torrejón*,° and J.M.M. Ramos° (*IES "José Caballero", Spain; °Universidad de Huelva, Spain)|
|0.5191|R.C. Pereira, V.P. Moreira, and R. Galante (Universidade Federal do Rio Grande do Sul, Brazil)|
|0.5093|Y. Palkovskii, A. Belov, and I. Muzika (Zhytomyr State University and SkyLine, Inc., Ukraine)|
|0.4378|Sobha L., Pattabhi R.K R., Vijay S.R., A. Akilandeswari (MIT Campus of Anna University Chennai, India)|
| |Universität Koblenz-Landau, Germany|
|0.2222|D. Micol, Ó. Ferrández, and R. Muñoz (University of Alicante, Spain)|
|0.2148|M.R. Costa-jussà, R.E. Banchs, J. Grivolla, and J. Codina (Barcelona Media Research Center, Spain)|
|0.2053|R.M.A. Nawab, M. Stevenson, and P. Clough (University of Sheffield, UK)|
|0.2034|P. Gupta and S. Rao|
|0.1375|C. Vania and M. Adriani (Universitas Indonesia, Indonesia)|
|0.0558|P. Suárez*, J.C. González*,°, and J. Villena-Román*,^ (*Daedalus - Data, Decisions and Language, Spain; °Universidad Politécnica de Madrid, Spain; ^Universidad Carlos III de Madrid, Spain)|
|0.0195|S. Alzahrani* and N. Salim° (*Taif University, Saudi Arabia; °Universiti Teknologi Malaysia, Malaysia)|
|0.0008|A. Iftene et al. (University of Iasi, Romania)|
A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.