Big Data Seminar

General Information

Lecturer: Prof. Dr. Benno Stein
Advisor: Michael Völske
Workload: 2 ECTS
Open to: M.Sc. (CS4DM, CSM, MI, HCI, DE)
Venue: Bauhausstraße 11 / Seminar room 013
Kick-off meeting: 15/04/2019 at 11.00 in Bauhausstraße 11/013
Regular sessions: Mondays at 11.00, Seminar room Bauhausstraße 11/013

Description

The ever-increasing flood of digital information poses new challenges to data mining and machine learning practitioners. Data sets of interest routinely reach scales that call for distributed processing architectures. There is a great variety of problems to be solved in the areas of data mining, processing, and storage, and a vast landscape of software projects has arisen to address these problems. In this seminar, participants will get to know a selection of big data tools, and will gain hands-on experience in deploying, administering and using distributed systems.

In order to receive a grade, seminar participants should:

  • Give a seminar presentation (up to about 30 minutes)
  • Provide a demo implementation and examples
  • Actively participate in discussions on other participants' talks

Talk topics and task details will be provided in one of the early seminar sessions.

Seminar Sessions

[2019-04-15] Kick-off meeting / crash course [slides].
Before the first meeting, please install everything mentioned under `Software` below on your laptop. You should bring your laptop to the seminar in order to participate in the tutorial sessions.

[2019-04-29] Crash course (cont'd) / Hadoop Introduction [slides]. [hadoop-installation-script]

[2019-05-06] Crash course loose ends & Seminar Talk Topics [slides].

[2019-06-17] Seminar Talks I.
Sagar Nagaraj Simha. Apache Spark [slides].
Joel Chukwuebuka Arukwe. Apache Kafka [slides].

[2019-06-24] Seminar Talks II.
Carsten Tetens. Apache Flink [slides].
Le Anh Phuong. Apache Beam [slides].

[2019-07-01] Seminar Talks III. (cancelled)
Konka Yamini. Apache Zookeeper [slides].
Lukas Trautner. Apache Cassandra [slides].

Software

In order to participate in the seminar, please install the following software:

Oracle VirtualBox [download]
Download and install the VirtualBox platform package for your operating system (Windows/Mac/Linux).
[Note to Windows users (Win 8 and later): you need to disable the Hyper-V feature; otherwise, VirtualBox will not work.]

Vagrant by HashiCorp [download]
Download and install the Vagrant VM management software for your operating system (Windows/Mac/Linux).

Git [download]
Download and install the Git source control software for your operating system (Windows/Mac/Linux).

FoxyProxy [firefox] [chrome]
Install the FoxyProxy browser extension in your preferred web browser.

In preparation for the first seminar session, install all of the above, then open a terminal (Windows users: run the app called "Git Bash"), and type or paste the following:
  vagrant box add --provider virtualbox bento/ubuntu-18.04
Afterward, press Enter.
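
If you want to double-check the installation before the first session, the following is a minimal sketch of a check you could run in the same terminal. It only reports whether the command-line tools are on your PATH. The helper name check_tool is made up for this example, and the sketch assumes that VirtualBox's VBoxManage command is on the PATH, which is not always the case on Windows, so a "missing" result for VBoxManage does not necessarily mean that VirtualBox is not installed.

```shell
#!/bin/sh
# Sketch: report which of the seminar's command-line tools are on the PATH.
# check_tool is a hypothetical helper name, not part of any of these tools.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# VBoxManage ships with VirtualBox; vagrant and git are the other two CLIs.
for tool in VBoxManage vagrant git; do
    check_tool "$tool"
done
```

Once the box download has finished, running `vagrant box list` should include bento/ubuntu-18.04 among the installed boxes.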

In addition to the software mentioned above, you should have a decent source code editor installed that you know how to use. If you don't have a preference of your own, we recommend Visual Studio Code (available for all platforms).

Literature

William E. Shotts. The Linux Command Line: A Complete Introduction. 2nd ed. No Starch Press, 2019. http://linuxcommand.org/tlcl.php

Matotek, Turnbull, Lieverdink. Pro Linux System Administration. Apress, 2017. 

Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press, 2014. http://infolab.stanford.edu/~ullman/mmds/book.pdf

Tom White. Hadoop: The Definitive Guide. 4th ed. O'Reilly Media, 2015. ISBN: 9781491901687.

Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/