Big Data and Language Technologies
General Information
Lecturer: | Prof. Dr. Benno Stein |
Advisor: | Michael Völske, Janek Bevendorff |
Workload: | 6 ECTS |
Open to: | M.Sc. (CS4DM, CSM, MI, HCI, DE) |
Venue: | SR3.09, S143 |
Kick-off meeting: | Monday, April 4th, 2022. |
Regular sessions: | Monday 13:30 |
Moodle: | moodle.uni-weimar.de/course/view.php |
Description
Information on the web is growing at an exponential pace, courtesy of social media platforms, blogs, and news. Such large-scale data sources call for high-end, scalable, distributed architectures for cognitive analysis, which shape the business decisions of many industries. In addition, deep learning has been propelled into the mainstream and is now accessible to researchers and companies alike, thanks to tools such as TensorFlow and PyTorch. The Webis research group operates large-scale high-performance compute infrastructure (totaling more than 3000 CPU cores, 10+ petabytes of storage, and 24 high-end GPUs), which will be put to use in the course of this seminar. Students will receive application-oriented training in big data and deep learning frameworks and in language technologies, and will explore interesting research questions. This seminar requires good skills in both programming (Python) and algorithms.
Course Prerequisites
Please note that this course requires prior Python programming experience. In addition, some familiarity with Linux environments, and knowledge of machine learning basics, is highly recommended. To help you gauge your prior subject knowledge, we've provided a set of self-assessment questions below. Read through the self-assessment questions, and take note of how many you can answer in the affirmative, and how many answers you know without having to look them up.
Self-assessment questionnaire
This questionnaire is not meant as study material for catching up; rather, the questions cover a broad range of topics within the course's scope and should highlight potential weak points.
Python
- Have you worked with Python 3 before?
- Do you know how to run Python scripts?
- Have you installed pip packages before?
- Have you used Jupyter Notebooks before?
- Do you know how to assign variables?
- How to use for and while loops?
- How to define functions with default arguments?
- How to use *args, **kwargs and variable unpacking?
- When does call by object reference happen and why might it lead to unexpected problems?
- How is a class defined and instantiated?
- How does inheritance work? Why use super()?
- How is a generator defined and iterated?
- What is a list comprehension?
- How can the entries of a list be doubled using a list comprehension?
- How to get only even entries of a list using a list comprehension?
- How to get the Collatz successor of each number in a list using a list comprehension and a conditional expression?
- How is a dictionary defined? How can you get a value?
- In what ways can a dict be iterated?
- What are dictionary comprehensions?
- What are lambda functions? Why are they used?
- What is map()?
- How to use the key argument in sorted()?
- Have you worked with numpy before?
- What are ndarrays? Why use them?
- What are shape and dtype? How can both be altered?
- What is the difference between reshape and transpose? When can you safely use reshape?
- How to matrix multiply two ndarrays?
- What is the axis argument in ndarray.sum()?
- How to read all lines from a file?
- What is JSON? How to read/write JSON data in Python?
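To give a sense of the level these questions aim at, here is a minimal sketch touching a few of them: list and dictionary comprehensions, the Collatz successor, the key argument of sorted(), a generator, a shared-mutable-object pitfall, and a JSON round trip. All names (`numbers`, `countdown`, `append_shared`, `record`) are made up for illustration, and "even entries" is read here as even values rather than even indices, which is an assumption.

```python
import json

numbers = [1, 2, 3, 4, 5, 6]

# List comprehensions: double every entry, keep only the even values.
doubled = [2 * n for n in numbers]
evens = [n for n in numbers if n % 2 == 0]

# Collatz successor via a conditional expression: n/2 if even, else 3n+1.
collatz = [n // 2 if n % 2 == 0 else 3 * n + 1 for n in numbers]

# Dictionary comprehension mapping each number to its square.
squares = {n: n * n for n in numbers}

# sorted() with a key function: order words by their length.
words = ["banana", "fig", "pear"]
by_length = sorted(words, key=len)


def countdown(start):
    """Generator that yields start, start-1, ..., 1."""
    while start > 0:
        yield start
        start -= 1


def append_shared(item, bucket=[]):
    """Pitfall: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket


first = append_shared(1)
second = append_shared(2)  # same list object as `first`, now [1, 2]

# JSON round trip using the standard library (hypothetical record).
record = {"title": "seminar", "ects": 6}
restored = json.loads(json.dumps(record))
```

Iterating the generator with `list(countdown(3))` materializes its values; the `append_shared` calls show why mutable default arguments combined with object references can lead to surprising results.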
Linux Command line/Remote work
- How to compose shell commands with pipes and input/output redirection?
- How to pass each line from an input file to the same command as an argument, and run all resulting processes in parallel?
- How to find out how many lines in a text file contain the strings "cat" or "hat"?
- How to download and then unpack a zip file from the command line?
- How to log into a remote machine via SSH?
- What is an SSH public key and how to create one?
- How to make sure a program continues to run after you log out?
- How to transfer a file to a remote machine using only an SSH connection?
Machine Learning Basics
- What is the difference between supervised and unsupervised learning?
- What is the difference between regression and classification?
- What is gradient descent, and how does it work?
- What are precision, recall and accuracy, and how are they computed?
- What is overfitting, why is it a problem, and how to detect and avoid it?
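As a rough illustration of two of the questions above, the following sketch computes precision, recall, and accuracy for binary labels and runs plain gradient descent on a one-dimensional quadratic. The function names, the toy labels, the objective f(w) = (w - 3)^2, and the learning rate are all assumptions chosen for the example, not course material.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute the three metrics for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

On the toy labels, precision and recall are both 2/3 and accuracy is 0.75; `w_min` ends up very close to 3. In practice the learning rate needs tuning: too large a value makes the iterates diverge instead of converge.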
Seminar Deliverables
In order to successfully complete this course, you will have to
- Actively participate in seminar sessions
- Complete a half-semester-long group research project (topics to be assigned a few weeks into the seminar)
- After selecting your project topic, submit a short exposé describing your goals and work plan (1-2 pages)
- Give a group presentation discussing your progress (30 minutes + discussion)
- At the end of the semester, submit a research report discussing your approach and results (at least 4 pages)
Lecture Notes
- Big Data and Language Technologies » Introduction » Organization, Literature [slides] [video (LE)] [video (WE)]
- Big Data and Language Technologies » Introduction » Introduction [slides] [video (LE)] [video (WE)]
- Big Data and Language Technologies » Machine Learning Basics » Regression [slides] [video]
- Big Data and Language Technologies » Machine Learning Basics » Gradient Descent [slides] [video]
- Big Data and Language Technologies » Machine Learning Basics » Recurrent Neural Networks [slides] [video]
- Big Data and Language Technologies » Deep Learning » RNNs for Machine Translation [slides] [video]
Seminar Schedule
Date | Title | Description | Materials | Deliverables | Stream |
---|---|---|---|---|---|
04.04.2022 | Deep Learning in Python (Session 1) | | | | |
11.04.2022 | Deep Learning in Python (Session 2) | | | | |
18.04.2022 | No Session (Easter Monday) | | | | |
25.04.2022 | Deep Learning in Python (Session 3) | | | [lab] | |
02.05.2022 | Deep Learning on SLURM (Session 1) | | | | |
09.05.2022 | Deep Learning on SLURM (Session 2) | | Set up Cluster Access | | |
16.05.2022 | Project Fair | | | | |
23.05.2022 | Prompt Engineering (Session 1) | | | | |
30.05.2022 | Prompt Engineering (Session 2) | | | | |
06.06.2022 | No Session (Whit Monday) | | | | |
13.06.2022 | Prompt Engineering (Session 3) | | | Prompt Engineering Presentations | |
20.06.2022 | Q&A Session | | Project Exposé | | |
27.06.2022 | Group Meetings | | | | |
04.07.2022 | Mid-Term Presentations | | Project Presentation | | |
11.07.2022 | Q&A Session | | | | |
29.08.2022 | Project Deadline | Hand in your report in PDF format by email. Cutoff is 22:00 CEST. | Project Report | | |
Literature
William E. Shotts. The Linux Command Line: A Complete Introduction. 2nd ed. No Starch Press, Incorporated, 2019. http://linuxcommand.org/tlcl.php.
Matotek, Turnbull, Lieverdink. Pro Linux System Administration. Apress, 2017.
Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press, 2014. http://infolab.stanford.edu/~ullman/mmds/book.pdf
Tom White. Hadoop: The Definitive Guide, 4th ed. O'Reilly Media, 2015. ISBN: 9781491901687.
Manning, Raghavan, Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/