Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Author identification and author verification examine the style of individual authors. Author profiling, by contrast, distinguishes between classes of authors by studying the sociolect aspect of their writing, that is, how language is shared by groups of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in learning, from an analysis of blogs and online product reviews, the demographics of the people who like or dislike their products. The focus here is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
The author of the best-performing solution to this task will be awarded a cash prize of 300 Euros at the conference. The award is kindly sponsored by the Forensic Lab of the Universitat Pompeu Fabra Barcelona.
We will evaluate each subtask (gender and age) independently for each language, and we will also compute a compound ranking for age + gender for each language. We will then average the results over both languages. The team obtaining the highest average will win the award.
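The evaluation scheme above can be sketched roughly as follows. The measure itself is not stated here, so this sketch assumes plain accuracy for each subtask and, for the compound score, the fraction of authors whose age and gender are both correct. The function names and the dictionary layout are hypothetical and chosen only for illustration; this is not the organizers' evaluation code.

```python
def accuracy(gold, predicted, fields):
    """Fraction of authors for whom every field in `fields` is correct.

    ASSUMPTION: accuracy is the measure; the task text does not name one.
    `gold` and `predicted` map author-id -> {"gender": ..., "age_group": ...}.
    A missing prediction counts as an error.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(
        all(predicted.get(author, {}).get(f) == labels[f] for f in fields)
        for author, labels in gold.items()
    )
    return correct / len(gold)


def language_scores(gold, predicted):
    """Scores for one language: each subtask alone, then age + gender jointly."""
    return {
        "gender": accuracy(gold, predicted, ["gender"]),
        "age": accuracy(gold, predicted, ["age_group"]),
        "joint": accuracy(gold, predicted, ["age_group", "gender"]),
    }


def average_over_languages(per_language):
    """Average each score over the languages (here: English and Spanish)."""
    names = ["gender", "age", "joint"]
    n = len(per_language)
    return {k: sum(scores[k] for scores in per_language.values()) / n for k in names}
```

For instance, calling `language_scores` once with the English data and once with the Spanish data, and passing the two results to `average_over_languages` as `{"en": ..., "es": ...}`, yields one averaged value per score.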
To develop your software, we provide you with a training data set that consists of documents written in both English and Spanish. With regard to age, we will consider posts of three classes: 10s (13-17), 20s (23-27), and 30s (33-47). Moreover, documents from authors who pretend to be minors will also be included (e.g., documents composed of chat lines of sexual predators).
Your software must take as input the absolute path to an unpacked dataset, and must output, for each document in the dataset, a corresponding XML file that looks like this:
<author id="<author-id>" lang="en|es" age_group="10s|20s|30s" gender="male|female" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension. The output files have to be written either directly to the working directory (to "./") or to a subfolder.
The author-id has to be extracted from each document's filename which follows the pattern
<authorid>_<lang>_<age>_<gender>.xml. Note that in the test corpus the age and gender information are replaced by "xxx".
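The input and output conventions described above can be sketched as follows. This is a minimal illustration, not reference software: `predict` is a hypothetical placeholder that always returns the same labels, and the function names, the optional second command-line argument, and the assumption that the documents lie directly in the dataset folder are all choices made for this example only.

```python
import os
import sys
from xml.sax.saxutils import quoteattr


def parse_filename(filename):
    """Split '<authorid>_<lang>_<age>_<gender>.xml' into its four fields."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() != ".xml":
        raise ValueError("not an XML document: %s" % filename)
    # Split from the right so an author-id containing '_' stays in one piece.
    author_id, lang, age, gender = stem.rsplit("_", 3)
    return author_id, lang, age, gender


def predict(text, lang):
    """HYPOTHETICAL placeholder classifier; replace with a real model."""
    return "20s", "male"


def format_author(author_id, lang, age_group, gender):
    """Build the single <author .../> element expected as output."""
    return "<author id=%s lang=%s age_group=%s gender=%s />" % (
        quoteattr(author_id), quoteattr(lang),
        quoteattr(age_group), quoteattr(gender))


def run(dataset_dir, output_dir="."):
    """Write one '<author-id>.xml' file per document found in dataset_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(dataset_dir)):
        try:
            author_id, lang, _, _ = parse_filename(name)
        except ValueError:
            continue  # skip files that do not follow the naming pattern
        path = os.path.join(dataset_dir, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        age_group, gender = predict(text, lang)
        out_path = os.path.join(output_dir, author_id + ".xml")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(format_author(author_id, lang, age_group, gender) + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # First argument: absolute path to the unpacked dataset.
    # Second argument (an assumption of this sketch): output folder, default "./".
    run(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else ".")
```

Because the age and gender fields read "xxx" in the test corpus, the sketch uses only the author-id and language from the filename and ignores the other two fields.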
We ask you to prepare your software so that it can be executed via a command line call. You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. Please test your software using one of the unit-test-scripts below. Download the script, fill in the required fields, and start it using the sh command. If the script runs without errors and if the correct output is produced, you can submit your software by sending your unit-test-script via e-mail. For more information see the PAN 2013 User Guide below.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.