Author Profiling
2013

Sponsor

Universitat Pompeu Fabra Barcelona - Forensic Lab

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps to identify profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance for applications in forensics, security, and marketing. From a forensic linguistics perspective, for example, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and to identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus of this task is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Award

We are happy to announce the overall winning team of the 1st International Competition on Author Profiling, which will be awarded 300 Euro sponsored by the Forensic Lab of the Universitat Pompeu Fabra Barcelona.

  • Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, Hugo Jair Escalante, and Adrián Pastor López-Monroy from INAOE, Mexico.

Congratulations!

Task
Given a document, your task is to determine its author's age and gender.
Training Corpus

To develop your software, we provide you with a training data set that consists of documents written in both English and Spanish. With regard to age, we will consider documents of three classes: 10s (13-17), 20s (23-27), and 30s (33-47). Moreover, documents from authors who pretend to be minors will also be included (e.g., documents composed of chat lines written by sexual predators).

Learn more » Download corpus
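
For illustration, the sketch below maps an exact age onto the three age classes listed above. The function name, the use of Python, and the assumption that ages outside the listed ranges do not occur in the corpus are our own choices, not part of the task definition.

def age_group(age):
    # Age classes as defined for the training corpus.
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    return None  # assumption: other ages do not occur in the corpus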

Output

Your software must take as input the absolute path to an unpacked dataset and must output, for each document of the dataset, a corresponding XML file that looks like this:

<author 
   id="{author-id}"
   lang="en|es"
   age_group="10s|20s|30s"
   gender="male|female"
/>

The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension. The output files have to be written either directly to the working directory ("./") or to a subfolder thereof. The author-id has to be extracted from each document's filename, which follows the pattern <authorid>_<lang>_<age>_<gender>.xml. Note that in the test corpus the age and gender information is replaced by "xxx".
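
A minimal sketch of how the author-id could be derived from an input filename and how the required XML could be written is shown below. The function name write_prediction, the splitting on the first underscore (assuming the author-id itself contains no underscores), and the use of Python are our own assumptions, not requirements of the task.

import os
import xml.etree.ElementTree as ET

def write_prediction(input_path, lang, age_group, gender, out_dir="."):
    # Input filenames follow the pattern <authorid>_<lang>_<age>_<gender>.xml;
    # the author-id is taken to be the part before the first underscore.
    author_id = os.path.basename(input_path).split("_")[0]

    # Build the <author .../> element exactly as specified above.
    author = ET.Element("author", {
        "id": author_id,
        "lang": lang,            # "en" or "es"
        "age_group": age_group,  # "10s", "20s", or "30s"
        "gender": gender,        # "male" or "female"
    })
    out_path = os.path.join(out_dir, author_id + ".xml")
    ET.ElementTree(author).write(out_path, encoding="utf-8", xml_declaration=True)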

Performance Measures
The performance of your author profiling solution will be ranked by its accuracy.
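
As an illustration, a minimal sketch of the accuracy measure follows: the fraction of documents for which the predicted profile matches the ground truth. Treating age group and gender as a joint label is an assumption made here for illustration; this is not the official evaluation script.

def accuracy(predictions, truth):
    # predictions and truth map author-id -> (age_group, gender)
    correct = sum(1 for author_id, label in truth.items()
                  if predictions.get(author_id) == label)
    return correct / len(truth)

# Example: 2 of 3 authors profiled correctly -> accuracy = 2/3
print(accuracy(
    {"a1": ("10s", "male"), "a2": ("20s", "female"), "a3": ("30s", "male")},
    {"a1": ("10s", "male"), "a2": ("20s", "female"), "a3": ("20s", "male")},
))
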
Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is made available, including the ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus 1 Download corpus 2

Submission

We ask you to prepare your software so that it can be executed via a command-line call. You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. Please test your software using one of the unit test scripts below: download the script, fill in the required fields, and start it using the sh command. If the script runs without errors and the correct output is produced, you can submit your software by sending your unit test script via e-mail. For more information, see the PAN 2013 User Guide below.

PAN User Guide » Unit-Test Windows » Unit-Test Ubuntu »
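
To make the calling convention concrete, here is a minimal sketch of a command-line entry point in Python. The flag names -i and -o and the placeholder classify function are hypothetical; the task only prescribes that the software takes the dataset path as input and writes one XML file per document.

import argparse
import glob
import os

def classify(document_path):
    # Placeholder prediction; a real submission would apply its trained model here.
    return "en", "20s", "male"

def main():
    parser = argparse.ArgumentParser(description="PAN 2013 author profiling submission (sketch)")
    parser.add_argument("-i", dest="input", required=True,
                        help="absolute path to the unpacked dataset")
    parser.add_argument("-o", dest="output", default=".",
                        help="directory to which the output XML files are written")
    args = parser.parse_args()

    for path in glob.glob(os.path.join(args.input, "*.xml")):
        lang, age_group, gender = classify(path)
        author_id = os.path.basename(path).split("_")[0]
        with open(os.path.join(args.output, author_id + ".xml"), "w", encoding="utf-8") as f:
            f.write('<author id="%s" lang="%s" age_group="%s" gender="%s" />\n'
                    % (author_id, lang, age_group, gender))

if __name__ == "__main__":
    main()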

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or to use it for any purpose other than the PAN competition.

Results

The following table lists the accuracies achieved by the participating teams on the English portion of the evaluation corpus:

English author profiling performance
Accuracy  Team
0.3894    Michał Meina, Karolina Brodzińska, Bartosz Celmer, Maja Czoków, Martyna Patera, Jakub Pezacki, and Mateusz Wilk
          Nicolaus Copernicus University, Poland
0.3813    A. Pastor López-Monroy°, Manuel Montes-y-Gómez°, Hugo Jair Escalante°, Luis Villaseñor-Pineda°, and Esaú Villatoro-Tello*
          °Instituto Nacional de Astrofísica, Óptica y Electrónica and *Universidad Autónoma Metropolitana-Cuajimalpa, Mexico
0.3677    Seifeddine Mechti, Maher Jaoua, and Lamia Hadrich Belguith
          University of Sfax, Tunisia
0.3508    K Santosh, Romil Bansal, Mihir Shekhar, and Vasudeva Varma
          International Institute of Information Technology, India
0.3488    Wee-Yong Lim, Jonathan Goh, and Vrizlynn L. L. Thing
          Institute for Infocomm Research, Singapore
0.3420    Susana Ladra°, Francisco Claude*, and Roberto Konow^
          °University of A Coruña, Spain, *University of Waterloo, Canada, and ^University of Chile, Chile
0.3292    Yuridiana Aleman, Nahun Loya, Darnes Vilariño, and David Pinto
          Benemérita Universidad Autónoma de Puebla, Mexico
0.3268    Lee Gillam
          University of Surrey, UK
0.3115    Roman Kern
          Know-Center GmbH, Austria
0.3114    Fermín L. Cruz°, Rafa Haro R.*, and F. Javier Ortega°
          °University of Seville and *Zaizi, Spain
0.2843    Aditya Pavan, Aditya Mogadala, and Vasudeva Varma
          International Institute of Information Technology, India
0.2840    Andrés Alfonso Caurcel Díaz° and José María Gómez Hidalgo*
          °Universidad Politécnica de Madrid and *Optenet, Spain
0.2816    Delia-Irazú Hernández°, Rafael Guzmán-Cabrera*, Antonio Reyes^, and Martha-Alicia Rocha°'
          °Universidad Politécnica de Valencia, Spain, and *Universidad de Guanajuato, ^Instituto Superior de Intérpretes y Traductores, and 'Instituto Tecnológico de León, Mexico
0.2814    Magdalena Jankowska, Vlado Kešelj, and Evangelos Milios
          Dalhousie University, Canada
0.2785    Lucie Flekova and Iryna Gurevych
          Technische Universität Darmstadt and German Institute for Educational Research and Educational Information, Germany
0.2564    Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira
          UFRGS, Brazil
0.2471    Upendra Sapkota°, Thamar Solorio°, Manuel Montes-y-Gómez*, and Gabriela Ramírez-de-la-Rosa°
          °University of Alabama at Birmingham, USA, and *Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
0.2450    Maria De-Arteaga, Sergio Jimenez, George Dueñas, Sergio Mancera, and Julia Baquero
          Universidad Nacional de Colombia, Colombia
0.2395    Erwan Moreau and Carl Vogel
          Trinity College Dublin, Ireland
0.1650    Baseline
0.1574    Braja Gopal Patra°, Somnath Banerjee°, Dipankar Das*, Tanik Saikh°, and Sivaji Bandyopadhyay°
          °Jadavpur University and *NIT Meghalaya, India
0.0741    Leticia Cagnina, Darío Funez, and Marcelo Errecalde
          Universidad Nacional de San Luis, Argentina

A more detailed analysis of the achieved performances can be found in the overview paper accompanying this task.

Learn more »

Related Work

We refer you to:

Task Chair

Francisco Rangel

Autoritas Consulting

Task Committee

Paolo Rosso

Universitat Politècnica de València

Giacomo Inches

University of Lugano

Moshe Koppel

Bar-Ilan University

Efstathios Stamatatos

University of the Aegean