General Information

Lecturers: Prof. Dr. Matthias Hagen, Prof. Dr. Efstathios Stamatatos, and Dr. Martin Potthast
Host: Summer Academy of the Studienstiftung des deutschen Volkes
Date: September 21 - October 3, 2015
Venue: La Colle-sur-Loup, France


The goal of this working group was to give an overview of research and development in natural language processing, with a particular focus on technologies for analyzing the authorship of text documents. These technologies are poised to eventually help answer the overarching question "Who wrote the Web?" In web search and information retrieval in particular, and in times of fake news, knowing who wrote a given text makes it possible to judge from the author's reputation whether the text's message is trustworthy. Modern authorship technologies are also used in digital text forensics, where forensic linguists and law enforcement must judge the credibility of, for example, threatening letters and suicide notes. Workshop participants were tasked with reproducing the most influential approaches to authorship attribution, both to demonstrate the viability of this technology on modern evaluation datasets and to show how easily people with a technical background (not necessarily in computer science) can get such technology running.


  • [September 22, 2015]
    Efstathios Stamatatos. Authorship attribution. [slides]

  • [September 22, 2015]
    Robert Paßmann. Delta - A measure of stylistic difference. [slides]
    Florian Friedrich. N-gram-based author profiles for authorship attribution. [slides]
    Tolga Buz. Determining if two documents are written by the same author. [slides]

  • [September 23, 2015]
    Fabian Müller. Authorship attribution in the wild. [slides]
    Sebastian Wilhelm. Unmasking pseudonymous authors. [slides]
    Marvin Gülzow. Syntactic n-grams as machine learning features for natural language processing. [slides]

  • [September 24, 2015]
    Winfried Lötzsch. Augmenting naive Bayes classifiers with statistical language models. [slides]
    Jakob Köhler. Using compression-based language models for text categorization. [slides]
    Thomas Rometsch. Authorship attribution with author-aware topic models. [slides]

  • [September 25, 2015]
    Michael Träger. Local histograms of character n-grams for authorship attribution. [slides]
    Fabian Duffhauß. Mining e-mail content for author identification forensics. [slides]
    Bernhard Reinke. Language trees and zipping. [slides]

  • [September 28, 2015]
    Timo Sommer. Feature set subspacing. [slides]
    Lucas Rettenmeier. A repetition-based measure for verification of text collections and for text categorization. [slides]
    Maike Müller. Stopword graphs and authorship attribution in text corpora. [slides]