Author Profiling
2014

Sponsor

Corex

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to infer the linguistic profile of the author of a harassing text message (language used by a certain type of people) and to identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus of this task is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Award

We are happy to announce the following overall winner of the 2nd International Competition on Author Profiling, who will be awarded 300 Euro sponsored by Atribus (Corex).

  • A. Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, and Luis Villaseñor-Pineda from INAOE, Mexico

Congratulations!

Task

Given a document, your task is to determine its author's age and gender.

Note. In addition, at RepLab 2014 author profiling will be approached from the online reputation monitoring perspective. Given a large number of Twitter profiles with 600 associated tweets each, participants will be asked to classify the author of a set of tweets as journalist, politician, activist, professional, client, company, authority, or citizen, since belonging to a certain category may determine the importance of a user's opinions. The dataset will contain English and Spanish tweets related to the banking and automotive domains.

Training corpus

To develop your software for age and gender identification, we provide you with a training data set that consists of blog posts, Twitter tweets, and social media texts written in both English and Spanish, as well as hotel reviews written in English. With regard to age, we will consider the following classes: 18-24, 25-34, 35-49, 50-64, 65-xx.

Download corpus (updated on April 16, 2014)

Remark. Due to Twitter's privacy policy we cannot provide tweets directly, but only URLs referring to them. You will have to download them yourself. For your convenience, we provide download software for this. We expect participants to extract gender and age information only from the textual part of a tweet and to discard any other meta information that may be provided by Twitter's API. When we evaluate your software at our site, we do not expect it to download tweets; we will do this beforehand.

Download software

Output

Your software must take as input the absolute path to an unpacked dataset and has to output, for each document of the dataset, a corresponding XML file that looks like this:

  <author id="{author-id}"
	  type="blog|twitter|socialmedia|reviews"
	  lang="en|es"
	  age_group="18-24|25-34|35-49|50-64|65-xx"
	  gender="male|female"
  />
  

The naming of the output files is up to you; we recommend using the author-id as filename and "xml" as extension. The output files have to be written either directly to the working directory ("./") or to a subfolder. The author-id has to be extracted from each document's filename, which follows the pattern <authorid>_<lang>_<age>_<gender>.xml. Note that in the test corpus the age and gender information are replaced by "xxx".
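
For illustration, the following minimal sketch in Python writes such an output file for a single input document; the function name write_prediction, the hard-coded genre, and the predicted labels are assumptions made for the example, not part of the task specification.

  # Minimal sketch (Python): parse the author-id and language from the input
  # filename and write the required XML output file.
  import os

  def write_prediction(input_path, output_dir, doc_type, age_group, gender):
      # Filenames follow <authorid>_<lang>_<age>_<gender>.xml; in the test
      # corpus, age and gender are "xxx", so only author-id and lang are known.
      basename = os.path.splitext(os.path.basename(input_path))[0]
      author_id, lang = basename.split("_")[:2]
      xml = ('<author id="%s"\n'
             '        type="%s"\n'
             '        lang="%s"\n'
             '        age_group="%s"\n'
             '        gender="%s"\n'
             '/>\n' % (author_id, doc_type, lang, age_group, gender))
      with open(os.path.join(output_dir, author_id + ".xml"), "w") as f:
          f.write(xml)

  # Example call with illustrative values:
  # write_prediction("1234_en_xxx_xxx.xml", "./", "twitter", "25-34", "female")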

Performance Measures

The performance of your author profiling solution will be ranked by accuracy.
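
Plain accuracy is the fraction of documents whose predicted label matches the gold label; the overall ranking below reports an average of such accuracies. A minimal, illustrative sketch in Python:

  # Illustrative only: plain accuracy over one list of gold/predicted labels.
  def accuracy(gold, predicted):
      correct = sum(1 for g, p in zip(gold, predicted) if g == p)
      return correct / float(len(gold))

  # accuracy(["male", "female"], ["male", "male"])  ->  0.5
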
Submission

We ask you to prepare your software so that it can be executed via command line calls. To maximize the sustainability of software submissions for this task, we encourage you to prepare your software so that it can be re-trained on demand, i.e., by offering two commands, one for training and one for testing. This way, your software can be reused on future evaluation corpora as well as on private collections submitted to PAN via our data submission initiative.

The training command shall take as input (i) an absolute path to a training corpus formatted as described above, and (ii) an absolute path to an empty output directory:

> myTrainingSoftware -i path/to/training/corpus -o path/to/output/directory

Based on the training corpus, and perhaps on the language and genre found within it, your software shall train a classification model and save the trained model to the specified output directory in serialized or binary form.
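
A minimal command-line skeleton for such a training command could look as follows (Python); the option parsing mirrors the call above, while the model format and the placeholder train_model function are illustrative assumptions:

  # myTrainingSoftware (illustrative skeleton): read the training corpus from
  # -i, train a model, and serialize it into the output directory given by -o.
  import argparse, os, pickle

  def train_model(corpus_dir):
      # Placeholder: replace with real feature extraction and learning,
      # e.g. one model per language/genre sub-corpus.
      return {"trained_on": corpus_dir}

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument("-i", dest="corpus", required=True)
      parser.add_argument("-o", dest="output", required=True)
      args = parser.parse_args()
      model = train_model(args.corpus)
      with open(os.path.join(args.output, "model.pickle"), "wb") as f:
          pickle.dump(model, f)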

The testing command shall take as input (i) an absolute path to a test corpus (not containing the ground truth), (ii) an absolute path to a previously trained classification model, and (iii) an absolute path to an empty output directory:

> myTestingSoftware -i path/to/test/corpus -m path/to/classification/model -o path/to/output/directory

Based on the classification model, the software shall classify each case found in the test corpus and write an output file as described above to the output directory.
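
Analogously, a minimal skeleton for the testing command could look like this (Python); the directory layout, the model format, and the placeholder prediction are illustrative assumptions:

  # myTestingSoftware (illustrative skeleton): load the serialized model from
  # -m, classify every document found in -i, and write one XML file per
  # document into -o, as described in the Output section.
  import argparse, glob, os, pickle

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument("-i", dest="corpus", required=True)
      parser.add_argument("-m", dest="model", required=True)
      parser.add_argument("-o", dest="output", required=True)
      args = parser.parse_args()
      with open(args.model, "rb") as f:
          model = pickle.load(f)
      # Assumes the documents lie flat in the corpus directory; adapt the
      # pattern if the corpus is organized into language/genre subfolders.
      for path in glob.glob(os.path.join(args.corpus, "*.xml")):
          basename = os.path.splitext(os.path.basename(path))[0]
          author_id, lang = basename.split("_")[:2]
          # Placeholder prediction; replace with the loaded model's output.
          age_group, gender = "25-34", "female"
          xml = ('<author id="%s" type="socialmedia" lang="%s" '
                 'age_group="%s" gender="%s"/>\n'
                 % (author_id, lang, age_group, gender))
          with open(os.path.join(args.output, author_id + ".xml"), "w") as out:
              out.write(xml)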

However, offering a command for training is optional, so if you face difficulties in doing so, you may skip the training command and omit the model option -m from the testing command.

You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below.

PAN Virtual Machine User Guide »

Once deployed on your virtual machine, you can proceed to submit your software. Before doing so, we provide you with a software submission readiness tester; please use it to verify that your software works. Since we will be calling your software automatically in much the same way as the tester does, this lowers the risk of errors.

Download PAN Software Submission Readiness Tester

When your software is submission-ready, please mail the filled out submission.txt file found along the software submission readiness tester to pan@webis.de.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Results

The following table lists the performances achieved by the participating teams:

Author profiling performance
Avg. Accuracy   Team
0.2895          A. Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, and Luis Villaseñor-Pineda
                Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
0.2802          Liau Yung Siang and Vrizlynn L. L. Thing
                Institute for Infocomm Research, Singapore
0.2760          Suraj Maharjan, Prasha Shrestha, and Thamar Solorio
                University of Alabama at Birmingham, USA
0.2349          Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira
                UFRGS, Brazil
0.2315          Julio Villena-Román and José Carlos González-Cristóbal
                DAEDALUS - Data, Decisions and Language, S.A., Spain
0.1998          James Marquardt°, Golnoosh Farnadi*, Gayathri Vasudevan°, Marie-Francine Moens*, Sergio Davalos°, Ankur Teredesai°, Martine De Cock°
                °University of Washington Tacoma, USA; *Katholieke Universiteit Leuven, Belgium
0.1677          Christopher Ian Baker
                Private, UK
0.1404          Baseline
0.1067          Seifeddine Mechti, Maher Jaoua, and Lamia Hadrich Belguith
                University of Sfax, Tunisia
0.0946          Esteban Castillo Juarez°, Ofelia Delfina Cervantes Villagomez*, Darnes Vilariño Ayala*, David Pinto Avendaño*, and Saul Leon Silverio*
                °Universidad de las Américas Puebla and *Benemérita Universidad Autónoma de Puebla, Mexico
0.0834          Gilad Gressel, Hrudya P, Surendran K, Thara S, Aravind A, Prabaharan Poornachandran
                Amrita University, India

Task Chair

Paolo Rosso

Universitat Politècnica de València

Task Committee

Francisco Rangel

Autoritas Consulting

Giacomo Inches

University of Lugano

Benno Stein

Bauhaus-Universität Weimar

Martin Potthast

Bauhaus-Universität Weimar

Walter Daelemans

University of Antwerp

Efstathios Stamatatos

University of the Aegean

Fabio Crestani

University of Lugano

RepLab@PAN Task Committee

Julio Gonzalo

Universidad Nacional de Educación a Distancia

Irina Chugur

Universidad Nacional de Educación a Distancia

Jorge Carrillo de Albornoz

Universidad Nacional de Educación a Distancia

Damiano Spina

Universidad Nacional de Educación a Distancia

© pan.webis.de