Big Data and Language Technologies

General Information

Lecturer: Prof. Dr. Benno Stein
Advisors: Michael Völske, Janek Bevendorff
Workload: 6 ECTS
Open to: M.Sc. (CS4DM, CSM, MI, HCI, DE)
Venue: SR3.09, S143
Kick-off meeting: Monday, April 4th, 2022.
Regular sessions: Monday 13:30
Moodle: moodle.uni-weimar.de/course/view.php

Description

Information on the web is growing at an exponential pace, courtesy of social media platforms, blogs, and news. Such large-scale data sources call for high-end, scalable, distributed architectures for cognitive analysis, which shape the business decisions of many industries. In addition, deep learning has been propelled into the mainstream and is now accessible to researchers and companies alike, thanks to tools such as TensorFlow and PyTorch. The Webis research group operates large-scale high-performance compute infrastructure (totaling more than 3000 CPU cores, 10+ Petabytes of storage, and 24 high-end GPUs), which will be put to use in the course of this seminar. Students will receive application-oriented training in big data and deep learning frameworks as well as language technologies, and will explore interesting research questions. This seminar requires good skills in both programming (Python) and algorithms.

Course Prerequisites

Please note that this course requires prior Python programming experience. In addition, some familiarity with Linux environments and knowledge of machine learning basics are highly recommended. To help you gauge your prior subject knowledge, we've provided a set of self-assessment questions below. Read through them, and take note of how many you can answer in the affirmative and how many answers you know without having to look them up.

Self-assessment questionnaire

This questionnaire is not meant as study material for catching up; however, the questions cover a broad range of topics within our course's scope and should highlight potential weak points.

Python

  • Have you worked with Python 3 before?
  • Do you know how to run Python scripts?
  • Have you installed pip packages before?
  • Have you used Jupyter Notebooks before?
  • Do you know how to assign variables?
  • How to use for and while loops?
  • How to define functions with default arguments?
  • How to use *args, **kwargs and variable unpacking?
  • When does call by object reference happen and why might it lead to unexpected problems?
  • How is a class defined and instantiated?
  • How does inheritance work? Why use super()?
  • How is a generator defined and iterated?
  • What is a list comprehension?
  • How can the entries of a list be doubled using a list comprehension?
  • How to get only even entries of a list using a list comprehension?
  • How to get the Collatz successor of each number in a list using a list comprehension and a conditional expression?
  • How is a dictionary defined? How can you get a value?
  • In what ways can a dict be iterated?
  • What are dictionary comprehensions?
  • What are lambda functions? Why are they used?
  • What is map()?
  • How to use the key argument in sorted()?
  • Have you worked with numpy before?
  • What are ndarrays? Why use them?
  • What are shape and dtype? How can both be altered?
  • What is the difference between reshape and transpose? When can you safely use reshape?
  • How to matrix multiply two ndarrays?
  • What is the axis argument in ndarray.sum()?
  • How to read all lines from a file?
  • What is JSON? How to read/write JSON data in Python?
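
As a rough illustration of the kind of answer the comprehension questions above are looking for, here is a minimal sketch. The sample lists and variable names are chosen purely for illustration, and reading "even entries" as entries with even values (rather than even positions) is an assumption.

```python
# Sample data, chosen only for illustration.
numbers = [1, 2, 3, 4, 5, 6]

# Double every entry of the list.
doubled = [n * 2 for n in numbers]

# Keep only the entries with an even value (assumed reading of "even entries").
evens = [n for n in numbers if n % 2 == 0]

# Collatz successor via a conditional expression:
# n / 2 if n is even, otherwise 3n + 1.
collatz = [n // 2 if n % 2 == 0 else 3 * n + 1 for n in numbers]

# Dictionary comprehension: map each number to its square.
squares = {n: n * n for n in numbers}

# sorted() with a key function: order words by their length.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)
```

If each of these lines reads naturally to you, the comprehension-related questions are probably not a weak point.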

Linux Command line/Remote work

  • How to compose shell commands with pipes and input/output redirection?
  • How to pass each line from an input file to the same command as an argument, and run all resulting processes in parallel?
  • How to find out how many lines in a text file contain the strings "cat" or "hat"?
  • How to download and then unpack a zip file from the command line?
  • How to log into a remote machine via SSH?
  • What is an SSH public key and how to create one?
  • How to make sure a program continues to run after you log out?
  • How to transfer a file to a remote machine using only an SSH connection?

Machine Learning Basics

  • What is the difference between supervised and unsupervised learning?
  • What is the difference between regression and classification?
  • What is gradient descent, and how does it work?
  • What are precision, recall and accuracy, and how are they computed?
  • What is overfitting, why is it a problem, and how to detect and avoid it?
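
For the question on precision, recall, and accuracy, the following sketch shows one way to compute all three for a binary classification task from plain Python lists. The function name `classification_metrics`, its `positive` parameter, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration, not part of the course material.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))

    # Count the confusion matrix cells.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical labels: 3 positive and 5 negative examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
```

In this example, two of the three predicted positives are correct and two of the three actual positives are found, while six of the eight predictions overall are right.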

Seminar Deliverables

In order to successfully complete this course, you will have to

  • Actively participate in seminar sessions
  • Complete a half-semester-long group research project (topics to be assigned a few weeks into the seminar)
  • After selecting your project topic, submit a short exposé describing your goals and work plan (1-2 pages)
  • Give a group presentation discussing your progress (30 minutes + discussion)
  • At the end of the semester, submit a research report discussing your approach and results (at least 4 pages)

Lecture Notes

  • Big Data and Language Technologies » Introduction » Organization, Literature [slides] [video (LE)] [video (WE)]
  • Big Data and Language Technologies » Introduction » Introduction [slides] [video (LE)] [video (WE)]
  • Big Data and Language Technologies » Machine Learning Basics » Regression [slides] [video]
  • Big Data and Language Technologies » Machine Learning Basics » Gradient Descent [slides] [video]
  • Big Data and Language Technologies » Machine Learning Basics » Recurrent Neural Networks [slides] [video]

Seminar Schedule

Each entry lists the date and session title, followed by the session's topics and, where applicable, its materials and deliverables.

04.04.2022: Deep Learning in Python (Session 1)
  • Introduction & Lab Setup
  • Keras Basics

11.04.2022: Deep Learning in Python (Session 2)
  • TensorFlow Datasets
  • Custom loss functions
  • Custom training loops
  • TensorBoard

18.04.2022: No Session (Easter Monday)

25.04.2022: Deep Learning in Python (Session 3)
  • Hugging Face & pretrained models
  Materials: [lab]

02.05.2022: Deep Learning on SLURM (Session 1)
  • 20NG Dataset
  • Local model development
  • SLURM basics

09.05.2022: Deep Learning on SLURM (Session 2)
  • SLURM deployment
  • Parameter sweeping
  Deliverable: Set up Cluster Access

16.05.2022: Project Fair
  • Introduction of available project topics

23.05.2022: Prompt Engineering (Session 1)
  • Prompt Engineering
  • OpenAI GPT-3 playground
  • Google BIG-Bench

30.05.2022: Prompt Engineering (Session 2)
  • Self-Deployed GPT-2 on SLURM
  • OpenAI GPT-3 API

06.06.2022: No Session (Whit Monday)

13.06.2022: Prompt Engineering (Session 3)
  • Prompt Engineering Projects
  Deliverable: Prompt Engineering Presentations

20.06.2022: Q&A Session
  Deliverable: Project Exposé

27.06.2022: Group Meetings
  • Individual meetings with groups

04.07.2022: Mid-Term Presentations
  • Group project presentations
  Deliverable: Project Presentation

11.07.2022: Q&A Session

29.08.2022: Project Deadline
  • Hand in your report in PDF format by email. Cutoff is 22:00 CEST.
  Deliverable: Project Report

Literature

William E. Shotts. The Linux Command Line: A Complete Introduction. 2nd ed. No Starch Press, Incorporated, 2019. http://linuxcommand.org/tlcl.php.

Matotek, Turnbull, Lieverdink. Pro Linux System Administration. Apress, 2017. 

Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press, 2014. http://infolab.stanford.edu/~ullman/mmds/book.pdf

Tom White. Hadoop: The Definitive Guide, 4th ed. O'Reilly Media, 2015. ISBN: 9781491901687.

Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/