Author Profiling
2015

MeaningCloud
Sponsor

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Whereas author identification and author verification examine the style of individual authors, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps in identifying profile aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus here is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Award

We are happy to announce that the best performing team at the 3rd International Competition on Author Profiling will be awarded 300 euros, sponsored by MeaningCloud.

  • Miguel Ángel Álvarez Carmona, Adrián Pastor López Monroy, Manuel Montes y Gómez and Luis Villaseñor Pineda from INAOE, Mexico

Congratulations!

Task

This task is about predicting an author's demographics from her writing. Participants will be provided with tweets in English and Spanish and asked to predict age, gender, and personality traits. In addition, they will be provided with tweets in Italian and Dutch and asked to predict gender and personality traits.

Training corpus

To develop your software, we provide you with a training data set that consists of tweets in English, Spanish, Italian, and Dutch.

Download corpus (Updated April 23, 2015. The file is password-protected. To obtain the password, you need to register first.)

With regard to age, we will consider the following classes: 18-24, 25-34, 35-49, 50-xx.

With regard to personality traits, for each trait we will provide scores (between -0.5 and 0.5).

Output

Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

  <author id="{author-id}"
          type="twitter"
          lang="en|es|it|nl"
          age_group="18-24|25-34|35-49|50-xx"
          gender="male|female"
          extroverted="-0.5 to +0.5"
          stable="-0.5 to +0.5"
          agreeable="-0.5 to +0.5"
          conscientious="-0.5 to +0.5"
          open="-0.5 to +0.5"
  />

The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.
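As an illustration of the output format described above, the following Python sketch writes one such file per author. It is not an official tool: the function names, the two-decimal score formatting, and the choice to leave out age_group when no age is predicted are assumptions made for this example only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Attribute names taken from the XML template above.
TRAITS = ("extroverted", "stable", "agreeable", "conscientious", "open")


def clip_score(value):
    """Clamp a personality score to the documented range [-0.5, 0.5]."""
    return max(-0.5, min(0.5, float(value)))


def write_author_xml(out_dir, author_id, lang, gender, traits, age_group=None):
    """Write <author_id>.xml into out_dir and return the file path.

    Hypothetical helper: `traits` maps each name in TRAITS to a score.
    """
    attrs = {"id": author_id, "type": "twitter", "lang": lang}
    if age_group is not None:
        # Assumption: the attribute is simply left out when age is not predicted.
        attrs["age_group"] = age_group
    attrs["gender"] = gender
    for name in TRAITS:
        # Assumption: two decimals are enough; the task page fixes no precision.
        attrs[name] = f"{clip_score(traits[name]):.2f}"
    path = os.path.join(out_dir, f"{author_id}.xml")
    ET.ElementTree(ET.Element("author", attrs)).write(path, encoding="utf-8")
    return path


# Usage with made-up values, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_author_xml(
    demo_dir, "user1", "en", "female",
    {"extroverted": 0.2, "stable": -0.1, "agreeable": 0.9,
     "conscientious": 0.0, "open": 0.3},
    age_group="25-34",
)
```

Using the author-id as the filename follows the recommendation above; clamping keeps out-of-range model outputs inside the documented score interval.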

Performance Measures

The performance of your author profiling solution for age and gender will be ranked by accuracy.

For personality identification the average Root Mean Squared Error (RMSE) will be used.

To obtain a global ranking, we apply the following formula: global_ranking = ((1 - RMSE) + joint_accuracy) / 2
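The measures above can be sketched in Python as follows. This is an illustrative reading, not the official evaluation script: the function names are hypothetical, and joint accuracy is assumed here to count an author as correct only when every predicted label (for example, both age group and gender) matches the ground truth.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length score sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def average_rmse(pred_by_trait, true_by_trait):
    """Mean of the per-trait RMSE values (one score list per trait)."""
    values = [rmse(pred_by_trait[t], true_by_trait[t]) for t in true_by_trait]
    return sum(values) / len(values)


def joint_accuracy(predicted, actual):
    """Fraction of authors whose full label tuple matches (assumed definition)."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def global_ranking(avg_rmse, joint_acc):
    """global_ranking = ((1 - RMSE) + joint_accuracy) / 2, as stated above."""
    return ((1 - avg_rmse) + joint_acc) / 2


# Usage with made-up numbers for two authors and two traits.
pred_labels = [("25-34", "male"), ("18-24", "female")]
true_labels = [("25-34", "male"), ("18-24", "male")]
pred_scores = {"open": [0.5, 0.0], "stable": [0.1, 0.1]}
true_scores = {"open": [0.0, 0.0], "stable": [0.1, 0.1]}
score = global_ranking(average_rmse(pred_scores, true_scores),
                       joint_accuracy(pred_labels, true_labels))
```

Because personality scores lie between -0.5 and 0.5, the RMSE cannot exceed 1, so (1 - RMSE) stays on the same 0 to 1 scale as the accuracy term.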

Submission

We ask you to prepare your software so that it can be executed via command line calls. To maximize the sustainability of software submissions for this task, we encourage you to prepare your software so it can be re-trained on demand, i.e., by offering two commands, one for training and one for testing. This way, your software can be reused on future evaluation corpora as well as on private collections submitted to PAN via our data submission initiative.

The training command shall take as input (i) an absolute path to a training corpus formatted as described above, and (ii) an absolute path to an empty output directory:

> myTrainingSoftware -i path/to/training/corpus -o path/to/output/directory

Based on the training corpus, and perhaps on the language and genre found within it, your software shall train a classification model and save the trained model to the specified output directory in serialized or binary form.

The testing command shall take as input (i) an absolute path to a test corpus (not containing the ground truth), (ii) an absolute path to a previously trained classification model, and (iii) an absolute path to an empty output directory:

> myTestingSoftware -i path/to/test/corpus -m path/to/classification/model -o path/to/output/directory

Based on the classification model, the software shall classify each case found in the test corpus and write an output file as described above to the output directory.

However, offering a command for training is optional, so if you face difficulties in doing so, you may skip the training command and omit the model option -m from the testing command.
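A command line interface for the testing command could be parsed along these lines. This is a minimal sketch using Python's standard argparse module; only the options -i, -m, and -o come from the description above, while the function names, the destination names, and the absolute-path check are assumptions for illustration.

```python
import argparse
import os


def build_test_parser():
    """Parser for the testing command: -i corpus, optional -m model, -o output."""
    parser = argparse.ArgumentParser(description="Author profiling: testing command")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="absolute path to the test corpus")
    parser.add_argument("-m", dest="model_dir", default=None,
                        help="absolute path to a trained model (optional)")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="absolute path to an empty output directory")
    return parser


def check_absolute(args):
    """Return the names of supplied path options that are not absolute paths."""
    given = {"-i": args.input_dir, "-m": args.model_dir, "-o": args.output_dir}
    return [flag for flag, path in given.items()
            if path is not None and not os.path.isabs(path)]


# Usage: parse an explicit argument list instead of sys.argv for the demo.
demo_args = build_test_parser().parse_args(
    ["-i", "/data/test-corpus", "-o", "/data/output"])
problems = check_absolute(demo_args)  # empty list: both paths are absolute
```

Leaving -m optional mirrors the case where no training command is offered; a training command would use the same pattern with only -i and -o.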

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Results

The following table lists the performances achieved by the participating teams:

Author profiling performance
Avg. Accuracy  Team
0.8404         Miguel Ángel Álvarez Carmona, Adrián Pastor López Monroy, Manuel Montes y Gómez, Luis Villaseñor Pineda and Hugo Jair Escalante. INAOE, Mexico.
0.8346         Carlos E. González Gallardo, Azucena Montes Redón, Gerardo Eugenio Sierra Martínez, José Antonio Nuñez Juárez, Adolfo Jonathan Salinas López and Juan Rodrigo Ek Catzin. UNAM, Mexico.
0.8078         Andreas Grivas, Anastasia Krithara and George Giannakopoulos. NCSR Demokritos, Greece.
0.7875         Mirco Kocher. University of Neuchâtel, Switzerland.
0.7755         Octavia Maria Sulea and Daniel Dichiu. Bitdefender and University of Bucharest, Romania.
0.7584         Lesly Miculicich. University of Neuchâtel, Switzerland.
0.7338         Scot Nowson, Julien Perez, Caroline Brun, Shachar Mirkin and Claude Roux. Xerox Research Centre Europe, France.
0.7223         Edson Roberto Duarte Weren. Brazil.
0.7130         Adam Poulston, Mark Stevenson and Kalina Bontcheva. University of Sheffield, United Kingdom.
0.7061         Suraj Maharjan and Thamar Solorio. University of Houston, United States.
0.6960         Caitlin McCollister, Bo Lou and Shu Huang. University of Kansas, United States.
0.6875         Mounica Arroju, Aftab Hassan and Golnoosh Farnadi. University of Washington Tacoma, United States.
0.6857         Mayte Gimenez, Delia Irazú Hernández and Ferran Plá. Universitat Politècnica de València, Spain.
0.6809         Alberto Bartoli, Andrea De Lorenzo, Alessandra Laderchi, Eric Medvet and Fabiano Tarlao. University of Trieste, Italy.
0.6685         Ifrah Pervaz, Iqra Ameer, Abdul Sittar, Rao Muhammad Adeel Nawab. COMSATS Institute of Information Technology, Pakistan.
0.6495         Fahad Najib, Waqas Arshad Cheema and Rao Muhammad Adeel Nawab. Comsats Lahore, Pakistan.
0.6401         Piotr Przybyla and Pawel Teisseyre. Polish Academy of Sciences, Poland.
0.6204         Alonso Palomino Garibay, Adolfo T. Camacho González, Ricardo A. Fierro Villaneda, Irazú Hernández Farias, Davide Buscaldi and Ivan Vladimir Meza Ruiz. UNAM, Mexico.
0.6178         Roy Bayot, Teresa Gonçalves and Paolo Quaresma. Universidade de Évora, Portugal.
*              Hafiz Rizwan Iqbal, Muhammad Adnan Ashraf and Rao Muhammad Adeel Nawab. Pakistan.
*              Yasen Kiprov, Momchil Hardalov, Preslav Nakov and Ivan Koychev. Sofia University "St. Kliment Ohridski", Bulgaria.
*              Juan Pablo Posadas Durán, Ilia Markov, Helena Gómez Adorno, Grigori Sidorov, Ildar Batyrshin, Alexander Gelbukh and Obdulia Pichardo Lagunas. National Polytechnic Institute, Mexico.

* Results have been omitted for these teams since they participated in some languages only.

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Related Work and Corpora

We refer you to:

Task Chair

Paolo Rosso

Universitat Politècnica de València

Task Committee

Francisco Rangel

Autoritas Consulting

Fabio Celli

University of Trento

Benno Stein

Bauhaus-Universität Weimar

Martin Potthast

Bauhaus-Universität Weimar

Walter Daelemans

University of Antwerp

Efstathios Stamatatos

University of the Aegean

© pan.webis.de