Picapica.net
Synopsis
Plagiarism refers to the malicious attempt to present the work of another author as one's own. picapica (Plagiarism Indication by Computer-based Analysis) is a Web-based application for the automated detection of text plagiarism. Its underlying technologies and algorithms are developed in our research group and address the efficient retrieval and analysis of potentially plagiarized sources from the World Wide Web. picapica combines several approaches to plagiarism analysis: the identification of passages copied 1:1 from a Web document, the identification of copies that have undergone certain modifications, and an in-depth analysis of an author's writing style.
Demo
Watch the picapica demo video.
Project Outline
picapica implements a plagiarism analysis process consisting of three basic steps:
- Heuristic retrieval of reference documents from the World Wide Web as well as from specially prepared plagiarism indexes.
- Detailed analysis of a suspicious document against reference documents.
- Knowledge-based post-processing of plagiarism indications so that correct citations are not reported as plagiarism.
In the first step a suspicious document is analyzed in order to identify its language, its topic, its genre, important keywords, and other characteristics that may help to narrow a Web search for plagiarized sources. In addition, a special plagiarism index covering commonly plagiarized sources (e.g. Wikipedia) is queried. The result of both heuristic searches is a set of URLs of Web documents, which are downloaded in parallel on a distributed server architecture.
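The keyword part of this first step can be illustrated with a minimal sketch. It assumes plain term-frequency ranking over a tiny hand-written stopword list; the names `STOPWORDS`, `extract_keywords`, and `build_queries` are hypothetical and not taken from picapica, whose actual heuristics are not described here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption); a real system
# would pick a language-specific list after identifying the language.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are",
             "for", "on", "with", "as", "by", "that", "this", "it", "be"}


def extract_keywords(text, k=5):
    """Return the k most frequent non-stopword terms (ties: alphabetical)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def build_queries(keywords, terms_per_query=3):
    """Turn a keyword list into short search queries using a sliding window."""
    if not keywords:
        return []
    if len(keywords) <= terms_per_query:
        return [" ".join(keywords)]
    return [" ".join(keywords[i:i + terms_per_query])
            for i in range(len(keywords) - terms_per_query + 1)]
```

Each resulting query string could then be sent to a Web search engine or to the plagiarism index; how the returned URLs are scheduled for download is outside the scope of this sketch.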
In the second step the suspicious document is compared to each of the downloaded documents. This step encompasses the retrieval of passages that are identical or highly similar. Fuzzy-fingerprinting plays an important role here: a fuzzy fingerprint is computed from each text passage such that highly similar passages are likely to be mapped onto the same fingerprint. This allows similar text passages in the suspicious document and a reference document to be retrieved in linear time. Apart from the comparison with reference documents, the writing style of the suspicious document's author is analyzed. This analysis can be used to detect paragraphs copied from sources that are not available electronically.
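The idea of hash-based passage retrieval can be sketched as follows. This is not the fuzzy-fingerprinting algorithm used by picapica; it is a simplified stand-in that, as an assumption, builds a passage's fingerprint from its three longest distinct words, so that reordering or small insertions often leave the fingerprint unchanged. All function names and parameters are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def passages(text, size=8):
    """Split text into consecutive, non-overlapping chunks of `size` words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def fingerprint(words, k=3):
    """Hash the k longest distinct words (ties broken alphabetically).

    Simplified stand-in for a fuzzy fingerprint (assumption): passages that
    share their longest words collide on purpose.
    """
    key_terms = sorted(set(words), key=lambda w: (-len(w), w))[:k]
    return hashlib.sha256(" ".join(sorted(key_terms)).encode("utf-8")).hexdigest()


def matching_passages(suspicious, reference, size=8):
    """Return (suspicious_index, reference_index) pairs with equal fingerprints."""
    # One pass over the reference document builds the hash index ...
    index = defaultdict(list)
    for ref_pos, chunk in enumerate(passages(reference, size)):
        index[fingerprint(chunk)].append(ref_pos)
    # ... and one pass over the suspicious document looks each passage up.
    hits = []
    for pos, chunk in enumerate(passages(suspicious, size)):
        for ref_pos in index.get(fingerprint(chunk), []):
            hits.append((pos, ref_pos))
    return hits
```

Because each passage is hashed once and looked up in a hash table, the work grows linearly with the number of passages rather than with the number of passage pairs. Candidate pairs found this way would still need a detailed comparison, since equal fingerprints only suggest similarity.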
The third step of the analysis process is the subject of our current research. Solutions to the problem of distinguishing between plagiarism and correct citations will be integrated into the Web service in the near future.
The activity diagram shows the outlined analysis process and its distribution on our middleware architecture.

Report Generation. During the analysis process the user interface is incrementally updated as new results arrive. The figures below show snapshots of successful plagiarism analyses for an English document. The first three snapshots show similarities between the uploaded file and reference documents found on the World Wide Web. The marked regions indicate different kinds of plagiarism. The fourth snapshot indicates changes in the writing style, and the fifth snapshot shows a list of duplicate documents found on the World Wide Web.
Server Architecture. The server architecture implements a scalable distributed system based on the message-oriented middleware paradigm. A gateway Web server handles all client interactions: it receives uploaded files and delivers analysis results. A plagiarism analysis is conducted in parallel on several analysis servers. All communication, all analysis results, and the information about all currently running tasks are stored in a message queue, which is realized with a relational database system.
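A message queue backed by a relational database can be sketched in a few lines. The example below assumes SQLite through Python's standard `sqlite3` module and a hypothetical `messages` table; the actual database system, schema, and message format used by picapica are not stated here.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the database and create the (hypothetical) messages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " task TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn, task, payload):
    """Append a message and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO messages (task, payload) VALUES (?, ?)",
            (task, payload),
        )
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest pending message as running and return it, or None."""
    with conn:
        row = conn.execute(
            "SELECT id, task, payload FROM messages"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE guards against a second worker having
        # claimed the same row in the meantime.
        cur = conn.execute(
            "UPDATE messages SET status = 'running'"
            " WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        if cur.rowcount == 0:
            return None
    return row


def complete(conn, message_id):
    """Mark a message as done."""
    with conn:
        conn.execute(
            "UPDATE messages SET status = 'done' WHERE id = ?", (message_id,)
        )
```

In such a design the gateway would call `enqueue` for each new task, while analysis servers poll with `claim_next` and report back with `complete`. Keeping the queue in a database means that task state survives a server restart, at the cost of polling overhead.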

People
Students: Dennis Braunsdorf, Franz Coriand, Andreas Eiselt, Jan Hühne, Alexander Kleppe, Karsten Klüger, Alexander Kümmel, Marion Kulig, Christoph Lössnitz, Fabian Loose, Hagen-Christian Tönnies, Martin Trenkmann, Michael Völske, André Zölitz
© Fakultät Medien 07.02.2012