    Web Technology · Information Systems · Prof. Dr. Benno Stein

    Picapica.net

    Synopsis

    Plagiarism is the malicious attempt to present the work of another author as one's own. picapica (Plagiarism Indication by Computer-based Analysis) is a Web-based application for the automated detection of text plagiarism. Its underlying technologies and algorithms are developed in our research group and address the efficient retrieval and analysis of potentially plagiarized sources from the World Wide Web. picapica combines several approaches to plagiarism analysis: the identification of copies taken 1:1 from a Web document, the identification of copies that have undergone certain modifications, and an in-depth analysis of an author's writing style.

    Demo

    Watch the picapica demo video.

    Project Outline

    picapica implements a plagiarism analysis process consisting of three basic steps:

    1. Heuristic retrieval of reference documents from the World Wide Web as well as from specially prepared plagiarism indexes.
    2. Detailed analysis of a suspicious document against reference documents.
    3. Knowledge-based post-processing of plagiarism indications to avoid the detection of correct citations as plagiarism.

    In the first step, a suspicious document is analyzed to identify its language, topic, genre, important keywords, and other characteristics that may help to narrow a Web search for plagiarized sources. In addition, a special plagiarism index of commonly used sources for plagiarism (e.g., Wikipedia) is queried. The result of both heuristic searches is a set of URLs of Web documents, which are downloaded in parallel on a distributed server architecture.
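
    The keyword part of this step can be illustrated with a minimal sketch. It assumes that frequent non-stop-word terms serve as keywords and that they are grouped into short queries for a search engine. The function names, the stop word list, and the grouping strategy are hypothetical and are not taken from picapica itself.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are",
              "for", "on", "with", "as", "by", "that", "this", "it", "be"}

def top_keywords(text, k=5):
    """Return the k most frequent content words, ties broken alphabetically."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]

def build_queries(text, k=6, terms_per_query=3):
    """Group the top keywords into short query strings (hypothetical strategy)."""
    keywords = top_keywords(text, k)
    return [" ".join(keywords[i:i + terms_per_query])
            for i in range(0, len(keywords), terms_per_query)]
```

    Each resulting query string would be sent to a Web search engine or to the plagiarism index, and the returned URLs would form the candidate set for the second step.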

    In the second step, the suspicious document is compared to each of the downloaded documents. This step encompasses the retrieval of passages that are identical or highly similar. Fuzzy-fingerprinting plays an important role here: a fuzzy fingerprint is computed from each text passage such that highly similar passages are likely to be mapped onto the same fingerprint. This allows similar text passages in the suspicious document and a reference document to be retrieved in linear time. Apart from the comparison with reference documents, the writing style of the suspicious document's author is analyzed. This analysis can be used to detect paragraphs copied from sources that are not available electronically.
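
    The idea of retrieving similar passages through fingerprint collisions can be sketched as follows. The fingerprint used here (the three longest distinct words of a passage) is a deliberately simplified stand-in and is not the fuzzy-fingerprint construction described in the publications. The passage size and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

def passages(text, size=8):
    """Split text into consecutive, non-overlapping passages of `size` words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]

def fingerprint(words, k=3):
    """Coarse fingerprint: the k longest distinct words, ties broken alphabetically.

    Word order and short function words do not affect the result, so lightly
    edited passages tend to collide. This is a stand-in, not fuzzy-fingerprinting.
    """
    ranked = sorted(set(words), key=lambda w: (-len(w), w))
    return tuple(sorted(ranked[:k]))

def matching_passages(suspicious, reference, size=8):
    """Return (suspicious_index, reference_index) pairs that share a fingerprint."""
    index = defaultdict(list)
    for j, passage in enumerate(passages(reference, size)):
        index[fingerprint(passage)].append(j)
    matches = []
    for i, passage in enumerate(passages(suspicious, size)):
        for j in index.get(fingerprint(passage), []):
            matches.append((i, j))
    return matches
```

    Because the reference passages are stored in a hash table keyed by fingerprint, each suspicious passage needs only one lookup instead of a comparison against every reference passage. This is what keeps the expected cost linear in the number of passages.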

    The third step of the analysis process is the subject of our current research. Solutions to the problem of distinguishing between plagiarism and correct citations will be integrated into the Web service in the near future.

    The activity diagram shows the outlined analysis process and its distribution on our middleware architecture.

    Report Generation. During the analysis process the user interface is incrementally updated as new results arrive. The figures below show snapshots of successful plagiarism analyses for an English document. The first three snapshots show similarities between the uploaded file and reference documents found on the World Wide Web. The marked regions indicate different kinds of plagiarism. The fourth snapshot indicates changes in the writing style, and the fifth snapshot shows a list of duplicate documents found on the World Wide Web.

    Server Architecture. The server architecture implements a scalable distributed system based on the message-oriented middleware paradigm. A gateway Web server handles all client interactions: it receives uploaded files and delivers analysis results. A plagiarism analysis is conducted in parallel on several analysis servers. All communication, all analysis results, and the information about all currently running tasks are stored in a message queue, which is realized with a relational database system.
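
    A message queue backed by a relational database can be sketched as below. The example uses SQLite from the Python standard library purely for illustration. The table layout, the status values, and the function names are assumptions and do not describe the actual picapica schema or database system.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open the database and create the (hypothetical) task table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )
    conn.commit()
    return conn

def enqueue(conn, payload):
    """Gateway side: store a new analysis task and return its id."""
    cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid

def claim_next(conn):
    """Analysis server side: claim the oldest pending task, or return None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check guards against another worker claiming the task first.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        return row if cur.rowcount == 1 else None

def complete(conn, task_id, result):
    """Analysis server side: store the result so the gateway can deliver it."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
            (result, task_id),
        )
```

    In such a design the gateway would call `enqueue` for each uploaded file, the analysis servers would poll `claim_next`, and the gateway would read finished rows to update the user interface incrementally.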

    People

    Students: Dennis Braunsdorf, Franz Coriand, Andreas Eiselt, Jan Hühne, Alexander Kleppe, Karsten Klüger, Alexander Kümmel, Marion Kulig, Christoph Lössnitz, Fabian Loose, Hagen-Christian Tönnies, Martin Trenkmann, Michael Völske, André Zölitz

    Related Publications

    Benno Stein, Nedim Lipka, and Peter Prettenhofer. Intrinsic Plagiarism Analysis. Language Resources and Evaluation (LRE), 45 (1) : 63-82, 2011. [doi] [paper] [bib]
    Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. Cross-Language Plagiarism Detection. Language Resources and Evaluation (LRE), 45 (1) : 45-62, 2011. [doi] [paper] [bib]
    Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December 2011. [publisher] [paper] [bib] [slides]
    Martin Potthast, Andreas Eiselt, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso. Overview of the 3rd International Competition on Plagiarism Detection. In Vivien Petras, Pamela Forner, and Paul D. Clough, editors, Notebook Papers of CLEF 11 Labs and Workshops, September 2011. ISBN 978-88-904810-1-7. [publisher] [paper] [bib] [slides]
    Benno Stein, Martin Potthast, Alberto Barrón-Cedeño, Paolo Rosso, Efstathios Stamatatos, and Moshe Koppel. Fourth International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 10). SIGIR Forum, 45 (1) : 45-48, June 2011. ACM. ISSN 0163-5840. [doi] [paper] [bib]
    Martin Potthast, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, and Paolo Rosso. Overview of the 2nd International Competition on Plagiarism Detection. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, Notebook Papers of CLEF 10 Labs and Workshops, September 2010. ISBN 978-88-904810-2-4. [publisher] [paper] [bib] [slides]
    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. An Evaluation Framework for Plagiarism Detection. In 23rd International Conference on Computational Linguistics (COLING 10), August 2010. Association for Computational Linguistics. [paper] [bib] [poster]
    Alberto Barrón-Cedeño, Martin Potthast, Paolo Rosso, Benno Stein, and Andreas Eiselt. Corpus and Evaluation Measures for Automatic Plagiarism Detection. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias, editors, 7th Conference on International Language Resources and Evaluation (LREC 10), May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7. [doi] [paper] [bib] [slides]
    Martin Potthast, Andreas Eiselt, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. PAN Plagiarism Corpus PAN-PC-09. http://www.uni-weimar.de/medien/webis/research/corpora, 2009. [corpus] [bib]
    Martin Potthast, Benno Stein, Andreas Eiselt, Alberto Barrón-Cedeño, and Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. In Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors, SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), pages 1-9, September 2009. CEUR-WS.org. ISSN 1613-0073. [publisher] [paper] [bib] [slides]
    Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors. SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), CEUR Workshop Proceedings. Universidad Politécnica de Valencia and CEUR-WS.org, September 2009. ISSN 1613-0073. [publisher] [paper] [bib]
    Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-Based Multilingual Retrieval Model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Advances in Information Retrieval. 30th European Conference on IR Research (ECIR 08), volume 4956 of Lecture Notes in Computer Science, pages 522-530, 2008. Springer. ISBN 978-3-540-78645-0. [doi] [paper] [bib] [slides] [poster]
    Martin Potthast and Benno Stein. New Issues in Near-duplicate Detection. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme, and Reinhold Decker, editors, Data Analysis, Machine Learning and Applications. Selected papers from the 31st Annual Conference of the German Classification Society (GfKl 07), Studies in Classification, Data Analysis, and Knowledge Organization, pages 601-609, 2008. Springer. ISBN 978-3-540-78239-1. [doi] [paper] [bib] [slides]
    Fabian Loose, Steffen Becker, Martin Potthast, and Benno Stein. Retrieval-Technologien für die Plagiaterkennung in Programmen. In J. Baumeister and M. Atzmüller, editors, Information Retrieval Workshop at LWA 08, pages 5-12, October 2008. University of Würzburg. [paper] [bib] [slides]
    Benno Stein, Nedim Lipka, and Sven Meyer zu Eißen. Meta Analysis within Authorship Verification. In A. M. Tjoa and R. R. Wagner, editors, TIR 08 at the 19th International Conference on Database and Expert Systems Applications (DEXA 08), pages 34-39, September 2008. IEEE. ISBN 978-0-7695-3299-8. ISSN 1529-4188. [doi] [paper] [bib]
    Benno Stein, Efstathios Stamatatos, and Moshe Koppel, editors. ECAI 08 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 08), CEUR Workshop Proceedings. National Library of Greece and CEUR-WS.org, July 2008. ISBN 978-960-6843-08-2. ISSN 1613-0073. [publisher] [paper] [bib]
    Sven Meyer zu Eißen, Benno Stein, and Marion Kulig. Plagiarism Detection without Reference Collections. In Reinhold Decker and Hans J. Lenz, editors, Advances in Data Analysis. Selected papers from the 30th Annual Conference of the German Classification Society (GfKl 06), Studies in Classification, Data Analysis, and Knowledge Organization, pages 359-366, 2007. Springer. ISBN 978-3-540-70980-0. [doi] [paper] [bib]
    Benno Stein and Sven Meyer zu Eißen. Intrinsic Plagiarism Analysis with Meta Learning. In Benno Stein, Moshe Koppel, and Efstathios Stamatatos, editors, SIGIR 07 Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN 07), pages 45-50, July 2007. CEUR-WS.org. ISSN 1613-0073. [publisher] [paper] [bib]
    Benno Stein, Sven Meyer zu Eißen, and Martin Potthast. Strategies for Retrieving Plagiarized Documents. In Charles Clarke, Norbert Fuhr, Noriko Kando, Wessel Kraaij, and Arjen P. de Vries, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 825-826, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib] [poster]
    Benno Stein. Principles of Hash-based Text Retrieval. In Charles Clarke, Norbert Fuhr, Noriko Kando, Wessel Kraaij, and Arjen P. de Vries, editors, 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR 07), pages 527-534, July 2007. ACM. ISBN 978-1-59593-597-7. [paper] [bib]
    Benno Stein and Martin Potthast. Applying Hash-based Indexing in Text-Based Information Retrieval. In Moens, Tuytelaars, and de Vries, editors, 7th Dutch-Belgian Information Retrieval Workshop (DIR 07), pages 29-35, March 2007. Faculty of Engineering, Universiteit Leuven. ISBN 978-90-5682-771-7. [paper] [bib] [slides]
    Sven Meyer zu Eißen and Benno Stein. Intrinsic Plagiarism Detection. In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, editors, Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 06), London, UK, volume 3936 of Lecture Notes in Computer Science, pages 565-569, 2006. Springer. ISBN 3-540-33347-9. [doi] [paper] [bib]
    Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagiarism Analysis. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering. Selected papers from the 29th Annual Conference of the German Classification Society (GfKl 05), Magdeburg, Germany, Studies in Classification, Data Analysis, and Knowledge Organization, pages 430-437, 2006. Springer. ISBN 978-3-540-31313-7. [doi] [paper] [bib]
    Benno Stein. Fuzzy-Fingerprints for Text-Based Information Retrieval. In Klaus Tochtermann and Hermann Maurer, editors, 5th International Conference on Knowledge Management (I-KNOW 05), Graz, Austria, Journal of Universal Computer Science, pages 572-579, July 2005. Know-Center. ISSN 0948-695x. [paper] [bib]
