Big Data Seminar

General Information

Lecturer:

Prof. Dr. Benno Stein

Advisor:

Michael Völske

Workload:

3 ECTS

Kick-off meeting:

April 6th, 17:00, room SR015, B11

Regular sessions:

Thursdays, 17:00, room SR014, B11

Description

The ever-increasing flood of digital information poses new challenges to data mining and machine learning practitioners: data sets of interest routinely reach scales that call for distributed processing architectures. In this seminar, participants will become familiar with a selection of data processing tools based on the Apache Hadoop platform and, in a practical part, work on relevant data mining problems. The Webis research group operates a large, modern high-performance compute cluster (about 1600 CPU cores, 2.5 petabytes of disk space), which will be used throughout the seminar. Students will be trained in the fundamentals of the hardware and software architectures behind big data cluster technologies and learn the skills necessary to apply them. Thanks to the size of the cluster and the Webis group's expertise with big data technologies, the seminar offers a level of training that is currently exceptional in an academic context.

Seminar Sessions

  • [2016-04-06] Session 1
    Kick-off meeting. [slides]

  • [2016-04-21] Session 2
    Short Talk Topics. [slides]
    Hadoop Tutorial (1). [slides]
    Configuration Files Summary. [txt]

  • [2016-04-28] Session 3
    Hadoop Tutorial (2).
    Short Talk Consultation.

  • [2016-05-19] Session 4
    Long talk topics. [slides]
    Short talks (1).

  • [2016-05-26] Session 5
    Short talks (2).

  • [2016-06-02] Session 6
    Short talks (3).
    Long talk discussion.

  • [2016-06-09] Session 7
    Short talks (4).
    Long talk discussion.

  • [2016-07-07] Session 8
    Seminar Talks.

 

Seminar Paper Final Submissions

The deadline for seminar paper submissions is August 18th, 2016, 12:00 noon CEST. Each submission should consist of a single ZIP file with the following contents:

  • The seminar paper in PDF format, 4-8 pages including figures and references.
  • All relevant code written to produce the analyses and figures in the paper.
  • A text file summarizing the layout and contents of the code folder, including instructions to compile (where applicable) and run the code.

All submissions must be sent by email to michael.voelske[at]uni-weimar.de. The file name of the attached ZIP file should include the names and matriculation numbers of all group members.
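The packaging requirements above can be sketched as a small Python script. This is only an illustration, not an official submission tool: the function name build_submission, the "seminar-paper_" file name prefix, and the way names and matriculation numbers are joined are assumptions made for this example. Any file name that contains the names and matriculation numbers of all group members satisfies the requirement.

```python
import zipfile
from pathlib import Path


def build_submission(paper_pdf, code_dir, readme_txt, members, out_dir="."):
    """Bundle the paper, the code folder, and the summary text file into one ZIP.

    members is a list of (name, matriculation_number) tuples.
    The naming scheme below is a hypothetical choice, not a prescribed format.
    """
    paper_pdf = Path(paper_pdf)
    code_dir = Path(code_dir)
    readme_txt = Path(readme_txt)

    # Put every member's name and matriculation number into the file name.
    tag = "_".join(f"{name.replace(' ', '-')}-{number}" for name, number in members)
    zip_path = Path(out_dir) / f"seminar-paper_{tag}.zip"

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Paper and summary file go to the top level of the archive.
        zf.write(paper_pdf, arcname=paper_pdf.name)
        zf.write(readme_txt, arcname=readme_txt.name)
        # Add the code folder recursively, keeping its internal layout.
        for path in sorted(code_dir.rglob("*")):
            if path.is_file():
                arcname = Path(code_dir.name) / path.relative_to(code_dir)
                zf.write(path, arcname=arcname.as_posix())
    return zip_path
```

A call such as build_submission("paper.pdf", "code", "README.txt", [("Ada Example", "100001")]) would write seminar-paper_Ada-Example-100001.zip to the current directory. The names and numbers here are placeholders.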

Seminar Talks

Short Talks

  • Tatiana Zarta. Apache Spark [slides]
  • Milad Alshomary. Spark MLLib [slides]
  • Ekaterina Fuchkina. Apache Flink [slides]
  • Patrick Saad. h2o [slides]
  • Shahbaz Syed. Apache Mahout [slides]
  • Robert Scholz. Deeplearning4j [slides]
  • Viorel Morari. Petuum [slides]

Seminar Talks 

  • Ekaterina Fuchkina, Robert Scholz. On-the-fly Indexing [slides]
  • Viorel Morari, Shahbaz Syed. Text-Summarization From Social Media Data [slides]
  • Milad Alshomary, Patrick Saad. Citation Network Analysis [slides]

Software

  • Oracle VirtualBox. [download]
    Download and install the VirtualBox platform package for your operating system (Windows/Mac/Linux).

  • BigData Seminar Virtual Machine Appliance. [download]
    Download the appliance file BigData.ova (3.8 GiB) and import it into VirtualBox ("File" -> "Import Appliance").

  • PuTTY SSH Client. [download]
    Install this if you're running Windows.

  • Apache Hadoop 2.7.2 binary package. [download]
    (will be installed during the tutorial session)

Literature

Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press, 2014. http://infolab.stanford.edu/~ullman/mmds/book.pdf

Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/