Big Data Seminar

General Information

Lecturer: Prof. Dr. Benno Stein
Advisor: Michael Völske
Workload: 2 ECTS
Open to: M.Sc. (CS4DM, CSM, MI, HCI, DE)
Venue: Bauhausstraße 11 / Seminar room 013
Kick-off meeting: 15/04/2019 at 11.00 in Bauhausstraße 11/013
Regular sessions: Mondays at 11.00, Seminar room Bauhausstraße 11/013


The ever-increasing flood of digital information poses new challenges to data mining and machine learning practitioners. Data sets of interest routinely reach scales that call for distributed processing architectures. There is a great variety of problems to be solved in the areas of data mining, processing, and storage, and a vast landscape of software projects has arisen to address these problems. In this seminar, participants will get to know a selection of big data tools, and will gain hands-on experience in deploying, administering and using distributed systems.

In order to receive a grade, seminar participants should:

  • Give a seminar presentation (up to ~30 min)
  • Provide a demo implementation and examples
  • Actively participate in discussions on other participants' talks

Talk topics and task details will be provided in one of the early seminar sessions.

Seminar Sessions

[2019-04-15] Kick-off meeting / crash course [slides].
Before the first meeting, please install everything listed under "Software" below on your laptop, and bring the laptop to the seminar so that you can participate in the tutorial sessions.

[2019-04-29] Crash course (cont'd) / Hadoop Introduction [slides]. [hadoop-installation-script]

[2019-05-06] Crash course loose ends & Seminar Talk Topics [slides].

[2019-06-17] Seminar Talks I.
Sagar Nagaraj Simha. Apache Spark [slides].
Joel Chukwuebuka Arukwe. Apache Kafka [slides].

[2019-06-24] Seminar Talks II.
Carsten Tetens. Apache Flink [slides].
Le Anh Phuong. Apache Beam [slides].

[2019-07-01] Seminar Talks III. (cancelled)
Konka Yamini. Apache Zookeeper [slides].
Lukas Trautner. Apache Cassandra [slides].


Software

In order to participate in the seminar, please install the following software:

Oracle VirtualBox [download]
Download and install the VirtualBox platform package for your operating system (Windows/Mac/Linux).
[Note to Windows users (Win 8 and later): you need to disable the Hyper-V feature, otherwise VirtualBox will not work.]

Vagrant by HashiCorp [download]
Download and install the Vagrant VM management software for your operating system (Windows/Mac/Linux).

Git [download]
Download and install the Git source control software for your operating system (Windows/Mac/Linux).

FoxyProxy [firefox] [chrome]
Install the FoxyProxy browser extension in your preferred web browser.

In preparation for the first seminar session, install all of the above, then open a terminal (Windows users: run the app called "Git Bash"), and type or paste the following:
  vagrant box add --provider virtualbox bento/ubuntu-18.04
Afterward, press Enter.
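
If the command above fails with "command not found", one of the tools is probably not installed or not on your PATH. The following is a minimal sketch of a check you could run in the same terminal. The function name check_tools is made up for this illustration, and it assumes that the installers put the vagrant and git commands on your PATH (VirtualBox's own command-line tool is often not on the PATH on Windows, so it is not checked here).

```shell
# check_tools (hypothetical helper): report which of the given
# commands can be found on the PATH; returns non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Check the two command-line tools used in the step above.
check_tools vagrant git || echo "Install the missing tools before continuing."
```

Once the box has been downloaded, running "vagrant box list" should show an entry for bento/ubuntu-18.04.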

In addition to the software mentioned above, you should have a decent source code editor installed that you know how to use. If you don't have a preference of your own, we recommend Visual Studio Code (available for all platforms).


Literature

William E. Shotts. The Linux Command Line: A Complete Introduction. 2nd ed. No Starch Press, Incorporated, 2019.

Matotek, Turnbull, Lieverdink. Pro Linux System Administration. Apress, 2017. 

Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

Tom White. Hadoop: The Definitive Guide, 4th ed. O'Reilly Media, 2015. ISBN: 9781491901687.

Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.