Author Identification

Task Description

This year we divided the task into two sub-tasks:

  • Traditional Authorship Attribution:

    Within the traditional authorship tasks there are different flavors:

    • Traditional (closed-class/open-class, with varying numbers of candidate authors) authorship attribution. In the closed-class setting you are given a closed set of candidate authors and are asked to identify which one of them is the author of an anonymous text. In the open-class setting you must also consider that none of the candidates may be the real author of the document.

    • Authorship clustering/intrinsic plagiarism: in this problem you are given a text (which, for simplicity, is segmented into a sequence of "paragraphs") and are asked to cluster the paragraphs into exactly two clusters: one that includes paragraphs written by the "main" author of the text and another that includes all paragraphs written by anybody else. (Thus, this year intrinsic plagiarism detection has been moved from the plagiarism task to the author identification track.)

  • Sexual Predator Identification:

    The goal of this sub-task is to identify a class of authors, namely online predators. You will be given chat logs involving two (or more) people and have to determine who is trying to convince the other participant(s) to provide some sexual favour. You will also need to identify the particular conversation where the person exhibits this bad behavior.

    The task can therefore be divided into two parts:

    1. Identify the predators (among all the users)
    2. Identify the part (the lines) of the predator conversations which is most distinctive of the predator's bad behavior

    Given the public nature of the dataset, we ask the participants not to use external or online resources (e.g. search engines) to solve this task, but to extract evidence from the provided datasets only.

Evaluation Corpus

For each of the two sub-tasks you will be given separate evaluation resources.

Performance Measures

For each of the two sub-tasks, performance will be determined based on standard performance measures:

  • Traditional Authorship Attribution:

    The performance of your authorship attribution will be judged by average precision, recall, and F1 over all authors in the given training set. A reference implementation will be forthcoming.

  • Sexual Predator Identification:

    The performance of your predator identification will be judged by average precision, recall, and F1 over all persons involved and lines of the conversations.
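
For the traditional authorship attribution measures, the following is a minimal sketch of one common reading of "average precision, recall, and F1 over all authors": compute the three values per author and take the unweighted mean. This is an assumption, not the reference implementation mentioned above, which may differ (for example in how unattributed documents are handled). The function name `macro_prf` and the document and author labels are hypothetical.

```python
def macro_prf(gold, predicted):
    """Macro-average precision, recall and F1 over the authors in `gold`.

    `gold` and `predicted` map each test document to an author label.
    Assumption: per-author scores are averaged with equal weight.
    """
    authors = sorted(set(gold.values()))
    precisions, recalls, f1s = [], [], []
    for author in authors:
        # Documents attributed to this author that really are by this author.
        tp = sum(1 for doc, a in predicted.items()
                 if a == author and gold.get(doc) == author)
        n_pred = sum(1 for a in predicted.values() if a == author)
        n_gold = sum(1 for a in gold.values() if a == author)
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(authors)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy data: four test documents, two candidate authors.
gold = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
predicted = {"d1": "A", "d2": "B", "d3": "B", "d4": "B"}
avg_p, avg_r, avg_f1 = macro_prf(gold, predicted)
print(round(avg_p, 4), round(avg_r, 4), round(avg_f1, 4))
```

Averaging per author gives every candidate the same weight regardless of how many test documents they wrote, which is one reason this reading is plausible for problems with unevenly sized author sets.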

Resources

For an overview of approaches to automated authorship attribution, we refer you to recent survey papers in the area:

Run Submission

  • Traditional Authorship Attribution:

    As per repeated requests, here is a sample submission format to use for the Traditional Authorship Attribution Competition for PAN/CLEF. Please note that following this format is not mandatory and we will continue to accept anything we can interpret.

    For traditional authorship problems (e.g. problem A), use the following (all words in ALL CAPS should be filled out appropriately):

    team TEAM NAME : run RUN NUMBER
    task TASK IDENTIFIER
    file TEST FILE = AUTHOR IDENTIFIER
    file TEST FILE = AUTHOR IDENTIFIER
    ...
    

    For problems E and F, there are no designated sample authors, so we recommend listing paragraph numbers. Author identifier is optional and arbitrary -- if it makes you feel better to talk about authors A and B or authors 1 and 2 you can insert it into the appropriate field. Any paragraphs not listed will be assumed to be part of an unnamed default author.

    team TEAM NAME : run RUN NUMBER
    task TASK IDENTIFIER
    file TEST FILE = AUTHOR IDENTIFIER (PARAGRAPH LIST)
    file TEST FILE = AUTHOR IDENTIFIER
    ...
    

    For example:

    team Jacob : run 1
    task B
    file 12Btest01.txt = A
    file 12Btest02.txt = A
    file 12Btest03.txt = A
    file 12Btest04.txt = None of the Above
    file 12Btest05.txt = A
    file 12Btest06.txt = A
    file 12Btest07.txt = A
    file 12Btest08.txt = A
    file 12Btest09.txt = A
    file 12Btest10.txt = A
    
    task C
    file 12Ctest01.txt = A
    file 12Ctest02.txt = A
    file 12Ctest03.txt = A
    file 12Ctest04.txt = A
    file 12Ctest05.txt = A
    file 12Ctest06.txt = A
    file 12Ctest07.txt = A
    file 12Ctest08.txt = A
    file 12Ctest09.txt = A
    
    task F
    file 12Ftest01.txt = (1,2,3,6,7)
    file 12Ftest01.txt = (4,5)
    

    In this sample file, we consider anything not listed in task F (paragraphs 8 and beyond) to be a third, unnamed author.
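
The sample format above can be produced with a few lines of code. The following is a minimal sketch, assuming answers are held in memory as (test file, answer) pairs per task; the helper names `format_run` and `paragraph_list` are hypothetical, and since the format is not mandatory this is only one way to write it.

```python
def paragraph_list(paragraphs):
    """Format paragraph numbers as in the task F sample, e.g. (4,5)."""
    return "(" + ",".join(str(p) for p in paragraphs) + ")"


def format_run(team, run_number, tasks):
    """Render a run file in the sample submission format.

    `tasks` maps a task identifier to a list of (test_file, answer) pairs.
    """
    lines = [f"team {team} : run {run_number}"]
    for task_id, answers in tasks.items():
        lines.append(f"task {task_id}")
        for test_file, answer in answers:
            lines.append(f"file {test_file} = {answer}")
        lines.append("")  # blank line between tasks, as in the sample
    return "\n".join(lines)


run = format_run(
    "Jacob",
    1,
    {
        "B": [("12Btest01.txt", "A"),
              ("12Btest04.txt", "None of the Above")],
        "F": [("12Ftest01.txt", paragraph_list([1, 2, 3, 6, 7])),
              ("12Ftest01.txt", paragraph_list([4, 5]))],
    },
)
print(run)
```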

  • Sexual Predator Identification:

    For each of the two parts we require a different format.

    1. Identify the predators (among all the users)

      Participants should submit a text file containing one user-id per line, listing only the users identified as predators:

      …
      a7c5056a2c30e2dc637907f448934ca3
      58f15bbb100bbeb6963b4b967ce04bdf
      e040eb115e3f7ad3824e93141665fc2a
      3d57ed3fac066fa4f8a52432db51c019
      …
      

    2. Identify the part (the lines) of the predator conversations which is most distinctive of the predator's bad behavior

      Participants should submit an XML file similar to the corpus files, containing the conversation-ids and the line numbers of the messages considered suspicious (line numbers together with all the other message information: author, time, text):

      <conversations>
      	…
      	<conversation id="0042762e26ed295a8576806f5548cad9">
      		<message line="3">
      			<author>f069dbec9ab3e090972d432db279e3eb</author>
      			<time>03:20</time>
      			<text>whats up?</text>
      		</message>
      		<message line="4">
      			<author>f069dbec9ab3e090972d432db279e3eb</author>
      			<time>03:21</time>
      			<text>how u doing?</text>
      		</message>
      		…
      		<message line="10">
      			<author>f069dbec9ab3e090972d432db279e3eb</author>
      			<time>04:00</time>
      			<text>sse you llater?</text>
      		</message>
      	</conversation>
      	…
      	<conversation id="0209b0a30c8eced86863631ada73a530">
      		<message line="3">
      			<author>0042762e26ed295a8576806f5548cad9</author>
      			<time>01:17</time>
      			<text>and that i dont touch u</text>
      		</message>
      	</conversation>
      	…
      </conversations>
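
A file with the structure shown above can be generated with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, assuming the suspicious messages are already available as dictionaries; the function name `build_submission` and the dictionary keys are hypothetical, while the element and attribute names follow the example.

```python
import xml.etree.ElementTree as ET


def build_submission(suspicious):
    """Build the <conversations> tree for the suspicious lines.

    `suspicious` maps a conversation id to a list of message dicts
    with the keys "line", "author", "time" and "text".
    """
    root = ET.Element("conversations")
    for conv_id, messages in suspicious.items():
        conv = ET.SubElement(root, "conversation", id=conv_id)
        for m in messages:
            msg = ET.SubElement(conv, "message", line=str(m["line"]))
            ET.SubElement(msg, "author").text = m["author"]
            ET.SubElement(msg, "time").text = m["time"]
            ET.SubElement(msg, "text").text = m["text"]
    return root


author = "f069dbec9ab3e090972d432db279e3eb"
root = build_submission({
    "0042762e26ed295a8576806f5548cad9": [
        {"line": 3, "author": author, "time": "03:20", "text": "whats up?"},
        {"line": 4, "author": author, "time": "03:21", "text": "how u doing?"},
    ],
})
# Serialize to a string; special characters in the text are escaped.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Building the tree with a library rather than by string concatenation keeps the output well-formed even when chat messages contain characters such as `<` or `&`.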
      

Once prepared, please submit your runs via mail to pan@webis.de. You may submit more than one run, however, please limit yourself to a reasonable number of submissions. With regard to an overall ranking of participants, only the latest run submitted is used. If you submit more than one run at a time, please rank your runs accordingly so that it is clear to us which run is your preferred choice.

Evaluation Results

  • Traditional Authorship Attribution:

    The results of the traditional authorship attribution task can be found here.

  • Sexual Predator Identification:

    Due to an error in computing the F measure (with β=0.5 and β=3; β was not squared) and to the addition of 4 more predators (which produced 76 more lines) in the ground truth, the results have slightly changed compared to the first release. Please find the updated results below and the ground truth in the corpus section above.
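
To make the correction concrete, the sketch below contrasts the standard F measure with a variant in which β is not squared. The exact form of the erroneous computation is an assumption; it is chosen because it is consistent with the values in the preliminary table. Both function names are hypothetical.

```python
def f_beta(precision, recall, beta):
    """Standard F measure: beta is squared in both places."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def f_beta_unsquared(precision, recall, beta):
    """Assumed form of the erroneous variant: beta used without squaring."""
    denom = beta * precision + recall
    return (1 + beta) * precision * recall / denom if denom else 0.0


# villatorotello main run, preliminary table: P = 0.9608, R = 0.7840
preliminary = f_beta_unsquared(0.9608, 0.7840, 0.5)
corrected = f_beta(0.9608, 0.7840, 0.5)
print(round(preliminary, 4))  # 0.8936, the value in the preliminary table
print(round(corrected, 4))
```

With β < 1 the unsquared variant under-weights precision, so a high-precision run scores lower than it should. The updated tables also reflect the enlarged ground truth, so their P and R values differ slightly from the ones used here.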

    For the predator identification subtask we received 16 submissions for the first part of the problem (identifying the predators) and 14 for the second part (identifying the distinctive chat lines of the predator behavior).

    We provide in the tables below the results for the first problem, for the main run of each team and for all the submitted runs. We evaluate the results with Precision (P: the fraction of retrieved authors that are relevant), Recall (R: the fraction of relevant authors that are retrieved) and the F1 measure (the weighted harmonic mean of Precision and Recall, with the β factor equal to 1). In a realistic scenario, retrieving many relevant authors (Recall) is important, since a police agent would like to receive as many suspects as possible. However, it is more important that the retrieved authors are actually relevant (Precision), so that a police agent's time is spent on the "right" suspects rather than on "all" possible suspects. For this reason we introduced another F measure, with the β factor equal to 0.5, to emphasize precision. As can be observed in the table, this influences only the first 3 positions of the ranking.
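
For reference, the measures described above can be written out as follows. The formula for F is the standard one from the reference cited at the end of this section. The worked line uses the values of the villatorotello main run from the updated table, reading the "Relevant" column as the number of retrieved authors that are relevant (an interpretation that matches the reported P = 200/204).

```latex
\[
P = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{retrieved}|}, \qquad
R = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{relevant}|}, \qquad
F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R}
\]
% villatorotello main run, updated table: P = 200/204 \approx 0.9804, R = 0.7874
\[
F_{0.5} = \frac{1.25 \cdot 0.9804 \cdot 0.7874}{0.25 \cdot 0.9804 + 0.7874} \approx 0.9346
\]
```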

    Preliminary results of the Sexual Predator Identification: 1) Identify the predators
    Participant main run  Retrieved  Relevant  P  R  F(β=1)  F(β=0.5)  Rank F(β=0.5)
    villatorotello-run-2012-06-15-2157g 204 196 0.9608 0.7840 0.8634 0.8936 1
    snider12-run-2012-06-16-0032 186 181 0.9731 0.7240 0.8303 0.8730 2
    eriksson12-run-2012-06-15-1949 265 223 0.8415 0.8920 0.8660 0.8577 3
    parapar12-run-2012-06-15-0959j 181 168 0.9282 0.6720 0.7796 0.8235 4
    morris12-run-2012-06-16-0752-main 159 152 0.9560 0.6080 0.7433 0.8028 5
    peersman12-run-2012-06-15-1559 170 148 0.8706 0.5920 0.7048 0.7525 6
    grozea12-run-2012-06-14-1706b 215 160 0.7442 0.6400 0.6882 0.7059 7
    sitarz12-run-2012-0615-1515 218 156 0.7156 0.6240 0.6667 0.6822 8
    vartapetiance12-run-2012-06-15-1411 160 97 0.6063 0.3880 0.4732 0.5105 9
    kontostathis-run-2012-06-16-0317e 475 167 0.3516 0.6680 0.4607 0.4175 10
    kang12-run-2012-06-15-0904b 930 199 0.2140 0.7960 0.3373 0.2829 11
    kern12-run-2012-06-18-1827b 1172 173 0.1476 0.6920 0.2433 0.2001 12
    bogdanova12-run-2012-06-14-1117 2109 54 0.0256 0.2160 0.0458 0.0363 13
    prasath12-run-2012-06-15-2122 10289 204 0.0198 0.8160 0.0387 0.0294 14
    vilarino12-run-2012-06-14-2121b 5225 97 0.0186 0.3880 0.0354 0.0272 15
    gomezhidalgo12-2012-06-15-1900 150 1 0.0067 0.0040 0.0050 0.0055 16

    Updated results of the Sexual Predator Identification: 1) Identify the predators
    Participant main run  Retrieved  Relevant  P  R  F(β=1)  F(β=0.5)  Rank F(β=0.5)
    villatorotello-run-2012-06-15-2157g 204 200 0.9804 0.7874 0.8734 0.9346 1
    snider12-run-2012-06-16-0032 186 183 0.9839 0.7205 0.8318 0.9168 2
    parapar12-run-2012-06-15-0959j 181 170 0.9392 0.6693 0.7816 0.8691 3
    morris12-run-2012-06-16-0752-main 159 154 0.9686 0.6063 0.7458 0.8652 4
    eriksson12-run-2012-06-15-1949 265 227 0.8566 0.8937 0.8748 0.8638 5
    peersman12-run-2012-06-15-1559 170 152 0.8941 0.5984 0.7170 0.8137 6
    grozea12-run-2012-06-14-1706b 215 163 0.7581 0.6417 0.6951 0.7316 7
    sitarz12-run-2012-0615-1515 218 159 0.7294 0.6260 0.6737 0.7060 8
    vartapetiance12-run-2012-06-15-1411 160 99 0.6188 0.3898 0.4783 0.5537 9
    kontostathis-run-2012-06-16-0317e 475 170 0.3579 0.6693 0.4664 0.3946 10
    kang12-run-2012-06-15-0904b 930 203 0.2183 0.7992 0.3429 0.2554 11
    kern12-run-2012-06-18-1827b 1172 177 0.1510 0.6969 0.2482 0.1791 12
    bogdanova12-run-2012-06-14-1117 2109 55 0.0261 0.2165 0.0466 0.0316 13
    prasath12-run-2012-06-15-2122 10289 207 0.0201 0.8150 0.0393 0.0250 14
    vilarino12-run-2012-06-14-2121b 5225 98 0.0188 0.3858 0.0358 0.0232 15
    gomezhidalgo12-2012-06-15-1900 150 1 0.0067 0.0039 0.0050 0.0059 16

    Updated results (all the runs) for the Sexual Predator Identification: 1) Identify the predators
    Participant run  Retrieved  Relevant  P  R  F(β=1)  F(β=0.5)
    bogdanova12-run-2012-06-14-1117 2109 55 0.0261 0.2165 0.0466 0.0316
    eriksson12-run-2012-06-15-1949 265 227 0.8566 0.8937 0.8748 0.8638
    gomezhidalgo12-2012-06-15-1900 150 1 0.0067 0.0039 0.0050 0.0059
    grozea12-run-2012-06-14-1706a 322 142 0.4410 0.5591 0.4931 0.4604
    grozea12-run-2012-06-14-1706b 215 163 0.7581 0.6417 0.6951 0.7316
    kang12-run-2012-06-15-0904a 1049 202 0.1926 0.7953 0.3101 0.2270
    kang12-run-2012-06-15-0904b 930 203 0.2183 0.7992 0.3429 0.2554
    kern12-run-2012-06-18-1827a 1172 177 0.1510 0.6969 0.2482 0.1791
    kern12-run-2012-06-18-1827b 1172 177 0.1510 0.6969 0.2482 0.1791
    kontostathis-run-2012-06-16-0317a 5225 206 0.0394 0.8110 0.0752 0.0487
    kontostathis-run-2012-06-16-0317b 5625 221 0.0393 0.8701 0.0752 0.0486
    kontostathis-run-2012-06-16-0317c 3696 206 0.0557 0.8110 0.1043 0.0685
    kontostathis-run-2012-06-16-0317d 688 172 0.2500 0.6772 0.3652 0.2861
    kontostathis-run-2012-06-16-0317e 475 170 0.3579 0.6693 0.4664 0.3946
    morris12-run-2012-06-16-0752-main 159 154 0.9686 0.6063 0.7458 0.8652
    morris12-run-2012-06-17-0126 152 147 0.9671 0.5787 0.7241 0.8527
    parapar12-run-2012-06-15-0959a 200 128 0.6400 0.5039 0.5639 0.6072
    parapar12-run-2012-06-15-0959b 205 160 0.7805 0.6299 0.6972 0.7449
    parapar12-run-2012-06-15-0959c 169 145 0.8580 0.5709 0.6856 0.7796
    parapar12-run-2012-06-15-0959d 175 151 0.8629 0.5945 0.7040 0.7914
    parapar12-run-2012-06-15-0959e 182 164 0.9011 0.6457 0.7523 0.8350
    parapar12-run-2012-06-15-0959f 202 154 0.7624 0.6063 0.6754 0.7250
    parapar12-run-2012-06-15-0959g 171 162 0.9474 0.6378 0.7624 0.8635
    parapar12-run-2012-06-15-0959h 223 161 0.7220 0.6339 0.6751 0.7024
    parapar12-run-2012-06-15-0959i 173 161 0.9306 0.6339 0.7541 0.8510
    parapar12-run-2012-06-15-0959j 181 170 0.9392 0.6693 0.7816 0.8691
    peersman12-run-2012-06-15-1559 170 152 0.8941 0.5984 0.7170 0.8137
    prasath12-run-2012-06-15-2122 10289 207 0.0201 0.8150 0.0393 0.0250
    sitarz12-run-2012-0615-1515 218 159 0.7294 0.6260 0.6737 0.7060
    snider12-run-2012-06-16-0032 186 183 0.9839 0.7205 0.8318 0.9168
    vartapetiance12-run-2012-06-15-1411 160 99 0.6188 0.3898 0.4783 0.5537
    vilarino12-run-2012-06-14-2121a 9071 236 0.0260 0.9291 0.0506 0.0323
    vilarino12-run-2012-06-14-2121b 5225 98 0.0188 0.3858 0.0358 0.0232
    villatorotello-run-2012-06-15-2157a 108 103 0.9537 0.4055 0.5691 0.7507
    villatorotello-run-2012-06-15-2157b 204 12 0.0588 0.0472 0.0524 0.0561
    villatorotello-run-2012-06-15-2157c 211 200 0.9479 0.7874 0.8602 0.9107
    villatorotello-run-2012-06-15-2157d 240 36 0.1500 0.1417 0.1457 0.1483
    villatorotello-run-2012-06-15-2157e 305 6 0.0197 0.0236 0.0215 0.0204
    villatorotello-run-2012-06-15-2157f 269 143 0.5316 0.5630 0.5468 0.5376
    villatorotello-run-2012-06-15-2157g 204 200 0.9804 0.7874 0.8734 0.9346

    The tables below report all the results for the second problem. An expert manually evaluated all the lines that were returned by at least one participant (these account for more than 90% of all the predator lines). As in the first problem, we computed Precision (P: the fraction of retrieved lines that are relevant), Recall (R: the fraction of relevant lines that are retrieved) and the F1 measure (the weighted harmonic mean of Precision and Recall, with the β factor equal to 1). In a realistic scenario, and in contrast to the first problem, retrieving many relevant lines (Recall) is more important here than retrieving only relevant ones (Precision): having many relevant lines increases the chance of finding good evidence against a suspect. For this reason we introduced another F measure, with the β factor equal to 3, to emphasize recall; it slightly modifies the upper part of the ranking compared to the standard F1.
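
The effect of β on the ranking can be checked with a short sketch. The precision and recall values are those of the top three main runs in the updated table for this problem, with run names shortened; the helper names `f_beta` and `ranking` are hypothetical.

```python
def f_beta(precision, recall, beta):
    """F measure with beta squared; beta > 1 emphasizes recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# (P, R) of three main runs, taken from the updated table.
runs = {
    "grozea12": (0.0915, 0.8938),
    "kontostathis": (0.1663, 0.5015),
    "peersman12": (0.3579, 0.2606),
}


def ranking(beta):
    """Run names sorted by decreasing F(beta)."""
    return sorted(runs, key=lambda name: f_beta(*runs[name], beta),
                  reverse=True)


by_f1 = ranking(1)
by_f3 = ranking(3)
print(by_f1)  # ['peersman12', 'kontostathis', 'grozea12']
print(by_f3)  # ['grozea12', 'kontostathis', 'peersman12']
```

Among these three runs the order is reversed: the high-recall, low-precision run moves from last under F1 to first under F(β=3).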

    Preliminary results of the Sexual Predator Identification: 2) Identify the predators' lines
    Participant main run  Retrieved  Relevant  P  R  F(β=1)  F(β=3)  Rank F(β=3)
    kontostathis-run-2012-06-16-0317e 19535 3215 0.1646 0.5022 0.2479 0.3319 1
    grozea12-run-2012-06-14-1706b 63290 5715 0.0903 0.8927 0.1640 0.2771 2
    peersman12-run-2012-06-15-1559 4717 1650 0.3498 0.2577 0.2968 0.2759 3
    sitarz12-run-2012-0615-1515 4558 1469 0.3223 0.2295 0.2681 0.2473 4
    morris12-run-2012-06-16-0752-main 2685 1195 0.4451 0.1867 0.2630 0.2184 5
    kern12-run-2012-06-18-1827b 15533 1328 0.0855 0.2074 0.1211 0.1529 6
    eriksson12-run-2012-06-15-1949 10416 1116 0.1071 0.1743 0.1327 0.1507 7
    prasath12-run-2012-06-15-2122 77255 1041 0.0135 0.1626 0.0249 0.0432 8
    vartapetiance12-run-2012-06-15-1411 607 91 0.1499 0.0142 0.0260 0.0184 9
    parapar12-run-2012-06-15-0959j 2037 96 0.0471 0.0150 0.0228 0.0181 10
    vilarino12-run-2012-06-14-2121b 6787 47 0.0069 0.0073 0.0071 0.0072 11
    bogdanova12-run-2012-06-14-1117 49 4 0.0816 0.0006 0.0012 0.0008 12
    villatorotello-run-2012-06-15-2157g 50 1 0.0200 0.0002 0.0003 0.0002 13
    gomezhidalgo12-2012-06-15-1900 400 0 0.0000 0.0000 0.0000 0.0000 14

    Updated results of the Sexual Predator Identification: 2) Identify the predators' lines
    Participant main run  Retrieved  Relevant  P  R  F(β=1)  F(β=3)  Rank F(β=3)
    grozea12-run-2012-06-14-1706b 63290 5790 0.0915 0.8938 0.1660 0.4762 1
    kontostathis-run-2012-06-16-0317e 19535 3249 0.1663 0.5015 0.2498 0.4174 2
    peersman12-run-2012-06-15-1559 4717 1688 0.3579 0.2606 0.3016 0.2679 3
    sitarz12-run-2012-0615-1515 4558 1486 0.3260 0.2294 0.2693 0.2364 4
    morris12-run-2012-06-16-0752-main 2685 1211 0.4510 0.1869 0.2643 0.1986 5
    kern12-run-2012-06-18-1827b 15533 1357 0.0874 0.2095 0.1233 0.1838 6
    eriksson12-run-2012-06-15-1949 10416 1122 0.1077 0.1732 0.1328 0.1633 7
    prasath12-run-2012-06-15-2122 77255 1044 0.0135 0.1612 0.0249 0.0770 8
    parapar12-run-2012-06-15-0959j 2037 105 0.0515 0.0162 0.0247 0.0174 9
    vartapetiance12-run-2012-06-15-1411 607 91 0.1499 0.0140 0.0257 0.0154 10
    vilarino12-run-2012-06-14-2121b 6787 48 0.0071 0.0074 0.0072 0.0074 11
    bogdanova12-run-2012-06-14-1117 49 4 0.0816 0.0006 0.0012 0.0007 12
    villatorotello-run-2012-06-15-2157g 50 1 0.0200 0.0002 0.0003 0.0002 13
    gomezhidalgo12-2012-06-15-1900 400 0 0.0000 0.0000 0.0000 0.0000 14

    Reference for the evaluation measures (and beta values): http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-unranked-retrieval-sets-1.html (last check: August 9, 2012)

 

 

Sexual Predator Identification: Participants
Participant main run  Participant and affiliation
villatorotello-run-2012-06-15-2157g Esaú Villatoro-Tello
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) and Universidad Autónoma Metropolitana
Mexico
Antonio Juárez-González
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
Mexico
Hugo J. Escalante
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
Mexico
Manuel Montes-y-Gómez
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
Mexico
Luis Villaseñor-Pineda
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
Mexico
snider12-run-2012-06-16-0032 Tim Snider
Porfiau Inc.
Canada
eriksson12-run-2012-06-15-1949 Gunnar Eriksson
Gavagai AB
Sweden
Jussi Karlgren
Gavagai AB
Sweden
parapar12-run-2012-06-15-0959j Javier Parapar
University of A Coruña
Spain
David E. Losada
Universidade de Santiago de Compostela
Spain
Alvaro Barreiro
University of A Coruña
Spain
morris12-run-2012-06-16-0752-main Colin Morris
University of Toronto
Canada
Graeme Hirst
University of Toronto
Canada
peersman12-run-2012-06-15-1559 Claudia Peersman
University of Antwerp
Belgium
Frederik Vaassen
University of Antwerp
Belgium
Vincent Van Asch
University of Antwerp
Belgium
Walter Daelemans
University of Antwerp
Belgium
grozea12-run-2012-06-14-1706b Cristian Grozea
Fraunhofer Institute FIRST
Germany
Marius Popescu
University of Bucharest
Romania
sitarz12-run-2012-0615-1515 Rachel Sitarz
Purdue University
United States
vartapetiance12-run-2012-06-15-1411 Anna Vartapetiance
University of Surrey
UK
Lee Gillam
University of Surrey
UK
kontostathis-run-2012-06-16-0317e April Kontostathis
Ursinus College
USA
Andy Garron
The University of Maryland
USA
Kelly Reynolds
Lehigh University
USA
Will West
Lehigh University
USA
Lynne Edwards
Ursinus College
USA
kang12-run-2012-06-15-0904b In-Su Kang
Kyungsung University
South Korea
Chul-Kyu Kim
Kyungsung University
South Korea
Shin-Jae Kang
Daegu University
South Korea
Seung-Hoon Na
Electronics and Telecommunications Research Institute
South Korea
kern12-run-2012-06-18-1827b Roman Kern
Graz University of Technology and Know-Center GmbH
Austria
Stefan Klampfl
Know-Center GmbH
Austria
Mario Zechner
Know-Center GmbH
Austria
bogdanova12-run-2012-06-14-1117 Dasha Bogdanova
University of Saint Petersburg
Russia
Paolo Rosso
Universitat Politècnica de València
Spain
prasath12-run-2012-06-15-2122 Sriram Prasath Elango
KTH/Gavagai
Sweden
vilarino12-run-2012-06-14-2121b Darnes Vilariño
Benemérita Universidad Autónoma de Puebla
Mexico
Esteban Castillo
Benemérita Universidad Autónoma de Puebla
Mexico
David Pinto
Benemérita Universidad Autónoma de Puebla
Mexico
Iván Olmos
Benemérita Universidad Autónoma de Puebla
Mexico
Saul León
Benemérita Universidad Autónoma de Puebla
Mexico
gomezhidalgo12-2012-06-15-1900 José María Gómez Hidalgo
Optenet
Spain
Andrés Alfonso Caurcel Díaz
Universidad Politécnica de Madrid
Spain

Task Committee

Patrick Juola
Duquesne University

Shlomo Argamon
Illinois Institute of Technology

Efstathios Stamatatos
University of the Aegean

Moshe Koppel
Bar-Ilan University

Giacomo Inches and Fabio Crestani
IRGroup @ University of Lugano