    Web Technology & Information Systems / Prof. Dr. Benno Stein

    Picapica

    Synopsis

    Plagiarism is the malicious attempt to present another author's work as one's own. picapica (Plagiarism Indication by Computer-based Analysis) is a Web-based application for the automated detection of text plagiarism. Its underlying technologies and algorithms are developed in our research group and concern the efficient retrieval and analysis of potentially plagiarized sources from the World Wide Web. picapica combines several approaches to plagiarism analysis: identification of passages copied 1:1 from a Web document, identification of copies that have undergone certain modifications, and an in-depth analysis of an author's writing style.

    Demo

    Watch the picapica demo video.

    Project Outline

    picapica implements a plagiarism analysis process consisting of three basic steps:

    1. Heuristic retrieval of reference documents from the World Wide Web as well as from specially prepared plagiarism indexes.
    2. Detailed analysis of a suspicious document against reference documents.
    3. Knowledge-based post-processing of plagiarism indications to avoid the detection of correct citations as plagiarism.

    In the first step, a suspicious document is analyzed to identify its language, its topic, its genre, important keywords, and other characteristics that may help to narrow a Web search for plagiarized sources. In addition, a special plagiarism index of commonly used sources for plagiarism (e.g., Wikipedia) is queried. The result of both heuristic searches is a set of URLs of Web documents, which are downloaded in parallel on a distributed server architecture.
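    The keyword part of this first step can be illustrated with a minimal sketch. It assumes plain term frequency as the keyword heuristic and groups the top keywords into short search queries; the stopword list and the function names `top_keywords` and `build_queries` are hypothetical and are not taken from picapica itself.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not picapica's list).
STOPWORDS = frozenset({"from", "that", "this", "with", "which", "their"})

def top_keywords(text, k=5):
    """Return the k most frequent words longer than 3 letters.

    Ties are broken alphabetically so the result is deterministic.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]

def build_queries(keywords, terms_per_query=3):
    """Group keywords into search queries of at most `terms_per_query` terms."""
    return [
        " ".join(keywords[i:i + terms_per_query])
        for i in range(0, len(keywords), terms_per_query)
    ]

if __name__ == "__main__":
    sample = (
        "Plagiarism detection compares documents. "
        "Plagiarism detection retrieves documents from the Web. "
        "Plagiarism analysis."
    )
    for query in build_queries(top_keywords(sample, k=4), terms_per_query=2):
        print(query)
```

    Each resulting query string could then be sent to a Web search engine or to a plagiarism index; how picapica actually formulates and submits its queries is not specified here.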

    In the second step, the suspicious document is compared to each of the downloaded documents. This step encompasses the retrieval of passages that are identical or highly similar. Here, fuzzy-fingerprinting plays an important role: a fuzzy fingerprint is computed from each text passage, such that highly similar passages are likely to be mapped onto the same fingerprint. This allows for linear-time retrieval of similar text passages between the suspicious document and a reference document. Apart from the comparison with reference documents, the writing style of the suspicious document's author is analyzed. This analysis can be used to detect paragraphs copied from sources that are not available electronically.
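    The idea of hash-based passage retrieval can be sketched as follows. This is a deliberately simplified stand-in and not the published fuzzy-fingerprinting algorithm: the fingerprint here is just the set of the k most frequent non-stopwords of a passage, so that passages differing in a few rare words still collide. All names, the stopword list, and the passage size are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption).
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})

def passages(text, size=9):
    """Split text into consecutive, non-overlapping passages of `size` words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]

def fingerprint(words, k=3):
    """Coarse fingerprint: the k most frequent non-stopwords of a passage.

    Ties are broken alphabetically; the result is sorted so that word
    order inside the passage does not matter.
    """
    counts = Counter(w for w in words if w not in STOPWORDS)
    top = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    return tuple(sorted(top))

def build_index(reference_text, size=9):
    """Map each fingerprint to the positions of reference passages having it."""
    index = defaultdict(list)
    for pos, passage in enumerate(passages(reference_text, size)):
        index[fingerprint(passage)].append(pos)
    return index

def candidate_matches(suspicious_text, index, size=9):
    """Return (suspicious_pos, reference_pos) pairs with equal fingerprints.

    Each suspicious passage costs one hash lookup, so no pairwise
    comparison of all passages is needed.
    """
    matches = []
    for pos, passage in enumerate(passages(suspicious_text, size)):
        for ref_pos in index.get(fingerprint(passage), []):
            matches.append((pos, ref_pos))
    return matches
```

    Because every passage is hashed once and looked up once, the work grows linearly with the number of passages plus the number of reported collisions. Colliding passages are only candidates; a detailed comparison would still be needed to confirm them.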

    The third step of the analysis process is the subject of our current research. Solutions to the problem of distinguishing between plagiarism and correct citations will be integrated into the Web service in the near future.

    The activity diagram shows the outlined analysis process and its distribution on our middleware architecture.

    Report Generation. During the analysis process the user interface is incrementally updated as new results arrive. The figures below show snapshots of successful plagiarism analyses for an English document (first row) and a German document (second row). The first three snapshots show similarities between the uploaded file and reference documents found on the World Wide Web. The marked regions indicate different kinds of plagiarism. The fourth snapshot indicates changes in the writing style, and the fifth snapshot shows a list of duplicate documents found on the World Wide Web.

    Server Architecture. The server architecture implements a scalable distributed system based on the message-oriented middleware paradigm. A gateway Web server handles all client interactions: it receives uploaded files and delivers analysis results. A plagiarism analysis is conducted in parallel on several analysis servers. All communication, all analysis results, and the information about all currently running tasks are stored in a message queue, which is realized with a relational database system.
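    A message queue backed by a relational database can be sketched in a few lines. The example below uses SQLite from the Python standard library purely for illustration; the table layout, the status values, and the function names are hypothetical and do not describe picapica's actual schema or database system.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open (or create) a queue stored in a single relational table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id     INTEGER PRIMARY KEY AUTOINCREMENT,
               task   TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               result TEXT
           )"""
    )
    db.commit()
    return db

def enqueue(db, task):
    """Gateway side: add a new analysis task and return its id."""
    cur = db.execute("INSERT INTO messages (task) VALUES (?)", (task,))
    db.commit()
    return cur.lastrowid

def claim_next(db):
    """Analysis-server side: claim the oldest pending task.

    The UPDATE only succeeds if the row is still pending, so a task that
    another worker claimed in the meantime is skipped and the next one is
    tried. Returns (id, task) or None if nothing is pending.
    """
    while True:
        row = db.execute(
            "SELECT id, task FROM messages "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "UPDATE messages SET status = 'running' "
                "WHERE id = ? AND status = 'pending'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row

def complete(db, message_id, result):
    """Store the analysis result so the gateway can deliver it."""
    with db:
        db.execute(
            "UPDATE messages SET status = 'done', result = ? WHERE id = ?",
            (result, message_id),
        )
```

    Keeping tasks, their states, and their results in one table means the gateway and the analysis servers only need database access to cooperate, and the current state of every running task can be inspected with an ordinary query.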

    People

    • Martin Potthast
    • Benno Stein
    • Matthias Hagen
    • Sven Meyer zu Eissen

    Students: Dennis Braunsdorf, Franz Coriand, Andreas Eiselt, Jan Hühne, Alexander Kleppe, Karsten Klüger, Alexander Kümmel, Marion Kulig, Christoph Lössnitz, Fabian Loose, Hagen-Christian Tönnies, Michael Völske, André Zölitz

    Related Publications

    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010) (to appear), Beijing, China, August 2010. Association for Computational Linguistics. [paper] [bib]
    Alberto Barrón-Cedeño, Martin Potthast, Paolo Rosso, Benno Stein, and Andreas Eiselt. Corpus and Evaluation Measures for Automatic Plagiarism Detection. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias, editors, Proceedings of the Seventh International Language Resources and Evaluation Conference (LREC 10), Malta, May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. [url] [paper] [bib]
    Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. Cross-Language Plagiarism Detection. Language Resources and Evaluation (LRE), 2010. (to appear) [url] [paper] [bib]
    Benno Stein, Nedim Lipka, and Peter Prettenhofer. Intrinsic Plagiarism Analysis. Language Resources and Evaluation (LRE), 2010. (to appear) [url] [paper] [bib]
    Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, and Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. In Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors, SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), pages 1-9, September 2009. CEUR-WS.org. ISSN 1613-0073. [url] [paper] [bib] [talk]
    Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors. SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), CEUR Workshop Proceedings. Universidad Politécnica de Valencia and CEUR-WS.org, September 2009. ISSN 1613-0073. [url] [paper] [bib]
    Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. PAN Plagiarism Corpus PAN-PC-09. http://www.uni-weimar.de/medien/webis/research/corpora, 2009. [url] [bib]
    Fabian Loose, Steffen Becker, Martin Potthast, and Benno Stein. Retrieval-Technologien für die Plagiaterkennung in Programmen. In J. Baumeister and M. Atzmüller, editors, Proceedings of the Information Retrieval Workshop at LWA 2008, pages 5-12, October 2008. University of Würzburg. [paper] [bib] [talk]
    Benno Stein, Nedim Lipka, and Sven Meyer zu Eißen. Meta Analysis within Authorship Verification. In A. M. Tjoa and R. R. Wagner, editors, 19th International Conference on Database and Expert Systems Applications (DEXA 08), pages 34-39, September 2008. IEEE. ISBN 978-0-7695-3299-8. ISSN 1529-4188. [url] [paper] [bib]
    Benno Stein, Efstathios Stamatatos, and Moshe Koppel, editors. ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 08), CEUR Workshop Proceedings. National Library of Greece and CEUR-WS.org, July 2008. ISBN 978-960-6843-08-2. ISSN 1613-0073. [url] [paper] [bib]
    Martin Potthast and Benno Stein. New Issues in Near-duplicate Detection. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme, and Reinhold Decker, editors, Data Analysis, Machine Learning and Applications, pages 601-609, 2008. Springer. ISBN 978-3-540-78239-1. [paper] [bib] [talk]
    Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-Based Multilingual Retrieval Model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval: Proceedings of the 30th European Conference on IR Research (ECIR 2008), Glasgow, UK, volume 4956 of Lecture Notes in Computer Science, pages 522-530, 2008. Springer. ISBN 978-3-540-78645-0. [url] [paper] [bib] [talk]
    Benno Stein and Sven Meyer zu Eißen. Intrinsic Plagiarism Analysis with Meta Learning. In Benno Stein, Moshe Koppel, and Efstathios Stamatatos, editors, SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN 07), pages 45-50, July 2007. CEUR-WS.org. ISSN 1613-0073. [url] [paper] [bib]
    Benno Stein. Principles of Hash-based Text Retrieval. In Clarke, Fuhr, Kando, Kraaij, and de Vries, editors, 30th Annual International ACM SIGIR Conference (SIGIR 07), pages 527-534, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
    Benno Stein, Sven Meyer zu Eißen, and Martin Potthast. Strategies for Retrieving Plagiarized Documents. In Clarke, Fuhr, Kando, Kraaij, and de Vries, editors, 30th Annual International ACM SIGIR Conference (SIGIR 07), pages 825-826, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
    Benno Stein and Martin Potthast. Applying Hash-based Indexing in Text-based Information Retrieval. In Moens, Tuytelaars, and de Vries, editors, 7th Dutch-Belgian Information Retrieval Workshop (DIR 2007), pages 29-35, March 2007. Faculty of Engineering, Universiteit Leuven. ISBN 978-90-5682-771-7. [paper] [bib] [talk]
    Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Plagiarism Detection without Reference Collections. In Reinhold Decker and Hans J. Lenz, editors, Advances in Data Analysis, pages 359-366, 2007. Springer. ISBN 978-3-540-70980-0. [paper] [bib]
    Sven Meyer zu Eißen and Benno Stein. Intrinsic Plagiarism Detection. In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, editors, Advances in Information Retrieval: Proceedings of the 28th European Conference on IR Research (ECIR 06), London, UK, volume 3936 of Lecture Notes in Computer Science, pages 565-569, 2006. Springer. ISBN 3-540-33347-9. [url] [paper] [bib]
    Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, pages 430-437, 2006. Springer. ISBN 978-3-540-31313-7. [paper] [bib]
    Benno Stein. Fuzzy-Fingerprints for Text-Based Information Retrieval. In Klaus Tochtermann and Hermann Maurer, editors, Proceedings of the 5th International Conference on Knowledge Management (I-KNOW 05), Graz, Austria, Journal of Universal Computer Science, pages 572-579, July 2005. Know-Center. ISSN 0948-695x. [paper] [bib]

    © Fakultät Medien, 15.02.2010