Author Identification
Task Description
This year we divided the task into two sub-tasks:
Traditional Authorship Attribution:
Within the traditional authorship tasks there are different flavors:
Traditional authorship attribution (closed-class or open-class, with varying numbers of candidate authors). In the closed-class setting, you are given a closed set of candidate authors and are asked to identify which one of them is the author of an anonymous text. In the open-class setting, you must also consider that none of the candidates may be the real author of the document.
Authorship clustering/intrinsic plagiarism: in this problem you are given a text (which, for simplicity, is segmented into a sequence of "paragraphs") and are asked to cluster the paragraphs into exactly two clusters: one that includes the paragraphs written by the "main" author of the text and another that includes all paragraphs written by anybody else. (Thus, this year intrinsic plagiarism has been moved from the plagiarism task to the author identification track.)
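The difference between the closed-class and open-class settings can be illustrated with a minimal sketch. It assumes you already have some similarity score between the anonymous text and each candidate author; the `attribute` function, the scores, and the threshold are hypothetical and not part of the task definition.

```python
from typing import Dict, Optional


def attribute(scores: Dict[str, float],
              threshold: Optional[float] = None) -> Optional[str]:
    """Pick an author from hypothetical similarity scores (higher = closer).

    threshold=None models the closed-class setting: one candidate is always
    returned. A numeric threshold models the open-class setting: None is
    returned when even the best candidate scores below the threshold.
    """
    best_author = max(scores, key=scores.get)
    if threshold is not None and scores[best_author] < threshold:
        return None  # open-class answer: none of the candidates
    return best_author


# Illustrative scores only; how to compute them is up to the participant.
scores = {"A": 0.41, "B": 0.35, "C": 0.12}
closed_answer = attribute(scores)               # always one of the candidates
open_answer = attribute(scores, threshold=0.5)  # may be None
```

How the scores are computed and how the threshold is chosen are the actual research questions; the sketch only shows where the two settings differ.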
Sexual Predator Identification:
The goal of this sub-task is to identify a class of authors, namely online predators. You will be given chat logs involving two (or more) people and have to determine which one is trying to convince the other participant(s) to provide some sexual favour. You will also need to identify the particular conversations in which that person exhibits this behavior.
The task can therefore be divided into two parts:
- Identify the predators (within all the users)
- Identify the parts (the lines) of the predator conversations that are most distinctive of the predator's bad behavior
Given the public nature of the dataset, we ask participants not to use external or online resources (e.g. search engines) to solve this task, but to extract evidence from the provided datasets only.
Evaluation Corpus
For each of the two sub-tasks you will be given separate evaluation resources.
Performance Measures
For each of the two sub-tasks, performance will be determined using standard performance measures:
Traditional Authorship Attribution:
The performance of your authorship attribution will be judged by average precision, recall, and F1 over all authors in the given training set. A reference implementation will be forthcoming.
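Until the reference implementation is available, the measure can be approximated with a small sketch like the following. It assumes that "average" means a macro-average (per-author scores averaged with equal weight) and that F1 is averaged per author; both are assumptions, and the function name `macro_prf` is hypothetical.

```python
from typing import List, Optional, Tuple


def macro_prf(authors: List[str],
              truth: List[Optional[str]],
              predicted: List[Optional[str]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over the given authors.

    truth[i] and predicted[i] are the true and predicted author of text i
    (None may stand for "none of the candidates" in the open-class setting).
    """
    precisions, recalls, f1s = [], [], []
    for author in authors:
        tp = sum(1 for t, p in zip(truth, predicted) if t == author and p == author)
        n_predicted = sum(1 for p in predicted if p == author)
        n_true = sum(1 for t in truth if t == author)
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_true if n_true else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(authors)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: four texts, two candidate authors.
p, r, f = macro_prf(["A", "B"],
                    truth=["A", "A", "B", "B"],
                    predicted=["A", "B", "B", "B"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The official scorer may treat undefined precision (an author who is never predicted) differently; here it is counted as 0.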
Sexual Predator Identification:
The performance of your predator identification will be judged by average precision, recall, and F1 over all persons involved and over the lines of the conversations.
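One plausible reading of these measures is a set-based comparison between the submitted items and the ground truth. The sketch below follows that reading; the function name `set_prf` and the toy identifiers are hypothetical, and the official scoring may differ in detail.

```python
from typing import Hashable, Set, Tuple


def set_prf(predicted: Set[Hashable], truth: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted set against a ground-truth set.

    Items can be user IDs (first part of the sub-task) or
    (conversation_id, line_number) pairs (second part).
    """
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with made-up user IDs.
user_scores = set_prf({"u1", "u2", "u3"}, {"u2", "u3", "u4", "u5"})

# The same function applied to (conversation, line) pairs.
line_scores = set_prf({("c1", 3), ("c1", 4)}, {("c1", 3)})
```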
Resources
For an overview of approaches to automated authorship attribution, we refer you to recent survey papers in the area:
- Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, December 2006.
- Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
- Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.
Run Submission
Traditional Authorship Attribution:
Results are to be submitted in an XML format similar to the training files, just without the <body>-element. An example of such a file is included in the corpus download.
Sexual Predator Identification:
For each of the two parts we require a different format.
- Identify the predators (within all the users)
Participants should submit a text file containing one user ID per line, listing only the users identified as predators:
…
a7c5056a2c30e2dc637907f448934ca3
58f15bbb100bbeb6963b4b967ce04bdf
e040eb115e3f7ad3824e93141665fc2a
3d57ed3fac066fa4f8a52432db51c019
…
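A file in this format can be written with a few lines of Python. This is a minimal sketch: the function name `write_predator_ids` and the file name `predators.txt` are hypothetical, and removing duplicate IDs is an assumption about what the organizers expect.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Union


def write_predator_ids(user_ids: Iterable[str], path: Union[str, Path]) -> None:
    """Write one user ID per line, dropping duplicates but keeping order."""
    unique_ids = list(dict.fromkeys(user_ids))
    Path(path).write_text("".join(f"{uid}\n" for uid in unique_ids),
                          encoding="utf-8")


# Example run in a temporary directory, using IDs from the listing above.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "predators.txt"  # hypothetical file name
    write_predator_ids(
        ["a7c5056a2c30e2dc637907f448934ca3",
         "58f15bbb100bbeb6963b4b967ce04bdf",
         "a7c5056a2c30e2dc637907f448934ca3"],  # duplicate is dropped
        out,
    )
    written = out.read_text(encoding="utf-8")
```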
- Identify the parts (the lines) of the predator conversations that are most distinctive of the predator's bad behavior
Participants should submit an XML file similar to those in the corpus, containing the conversation IDs and the message line numbers considered suspicious (each line number together with all the other message information: author, time, text):
<conversations>
…
<conversation id="0042762e26ed295a8576806f5548cad9">
<message line="3">
<author>f069dbec9ab3e090972d432db279e3eb</author>
<time>03:20</time>
<text>whats up?</text>
</message>
<message line="4">
<author>f069dbec9ab3e090972d432db279e3eb</author>
<time>03:21</time>
<text>how u doing?</text>
</message>
…
<message line="10">
<author>f069dbec9ab3e090972d432db279e3eb</author>
<time>04:00</time>
<text>sse you llater?</text>
</message>
</conversation>
…
<conversation id="0209b0a30c8eced86863631ada73a530">
<message line="3">
<author>0042762e26ed295a8576806f5548cad9</author>
<time>01:17</time>
<text>and that i dont touch u</text>
</message>
</conversation>
…
</conversations>
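A file with this structure can be generated with Python's standard `xml.etree.ElementTree` module. The following is a sketch under assumptions: the function name `build_submission` and the shape of its input (a mapping from conversation ID to a list of message tuples) are hypothetical, while the element and attribute names follow the example above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# (line number, author ID, time, text) for one suspicious message
Message = Tuple[int, str, str, str]


def build_submission(suspicious: Dict[str, List[Message]]) -> str:
    """Serialize suspicious messages, grouped by conversation ID, to XML."""
    root = ET.Element("conversations")
    for conversation_id, messages in suspicious.items():
        conversation = ET.SubElement(root, "conversation", id=conversation_id)
        for line, author, time, text in messages:
            message = ET.SubElement(conversation, "message", line=str(line))
            ET.SubElement(message, "author").text = author
            ET.SubElement(message, "time").text = time
            ET.SubElement(message, "text").text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_submission({
    "0042762e26ed295a8576806f5548cad9": [
        (3, "f069dbec9ab3e090972d432db279e3eb", "03:20", "whats up?"),
        (4, "f069dbec9ab3e090972d432db279e3eb", "03:21", "how u doing?"),
    ],
})
```

Building the tree through the library, rather than concatenating strings, ensures that special characters in chat messages (such as `<` or `&`) are escaped and that the root element is properly closed.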
Evaluation Results
The results of the evaluation will be made available as noted in the important dates.
Task Committee
Patrick Juola
Duquesne University
Shlomo Argamon
Illinois Institute of Technology
Efstathios Stamatatos
University of the Aegean
Moshe Koppel
Bar-Ilan University
Giacomo Inches and Fabio Crestani
IRGroup @ University of Lugano