Corpora



This page lists the corpora used by the Webis research group. Their availability for external use is as follows: (1) corpora officially released by our group can be downloaded here; (2) internal Webis corpora, which will be officially released in the future, are supplied upon request; (3) affiliated corpora, made available by courtesy of our research partners, can be downloaded here; (4) other corpora must be obtained from the original publisher or creator. We collect usage statistics and other meta information for all corpora listed below; clicking a corpus name takes you to the respective page. A note for corpus developers: if you would like your corpus listed here, send us an email.



Released Webis Corpora
Name Publisher/Creator Year Size Default Task
ArguAna TripAdvisor Webis Group & FG Engels 2014 2,100 hotel reviews Sentiment Analysis
LFA-11 Webis Group & FG Engels 2011 4.9 MB (compressed) Genre and Sentiment Analysis
PAN-PC-09 Webis Group 2009 41,000 documents Plagiarism Detection
PAN-PC-10 Webis Group 2010 27,000 documents Plagiarism Detection
PAN-PC-11 Webis Group 2011 27,000 documents Plagiarism Detection
PAN-WQF-12 Webis Group 2012 1,592,226 documents Quality Flaw Prediction in Wikipedia
PAN-WVC-10 Webis Group 2010 32,000 documents Vandalism Detection
PAN-WVC-11 Webis Group 2011 24,000 documents Vandalism Detection
WDVC-15 FG Engels & Webis Group 2015 24 million revisions Vandalism Detection
Webis-Ambient-15 Webis Group 2015 5,592 documents Clustering/Cluster Labeling
Webis-ArgRank-17 Webis Group 2017 17,877 arguments Computational Argumentation
Webis-CBC-16 Webis Group 2016 2,992 tweets Clickbait Detection
Webis-CLS-10 Webis Group 2010 800,000 documents Cross-Language Text Classification
Webis-CPC-11 Webis Group 2011 7,859 paraphrases Plagiarism Detection
Webis-Editorials-16 Webis Group 2016 300 documents Computational Argumentation
Webis-Query-Log-12 Webis Group 2012 150 search logs Exploratory Search
Webis-TRC-12 Webis Group 2012 150 interaction logs Text Reuse Detection, Paraphrasing, and Exploratory Search
Genre-KI-04 Webis Group 2004 1,239 documents Web Genre Analysis
Webis-KIQC-13 Webis Group 2013 2,755 questions Known-Item Search
Webis-Mnemonics-17 Webis Group 2017 1,048 mnemonics Password Analysis
Webis-ODP-10 Webis Group 2010 5 million documents Clustering/Cluster Labeling
Webis-PRA-12 Webis Group 2012 14,189 company names Spelling Error Detection
Webis-PC-08 Webis Group 2008 298 MB (compressed) Plagiarism Detection
Webis-QSeC-10 Webis Group 2010 1.9 MB (compressed) Query Segmentation
Webis-Sentences-17 Webis Group 2017 3.4 billion sentences Text Statistics
Webis-SMC-12 Webis Group 2012 123 KB (compressed) Search Mission Detection
Webis-Revenue-10 FG Engels & Webis Group 2010 1,000 documents Entity and Relation Extraction
Webis-SDMbridge-12 Webis Group 2012 14,641 models Simulation Data Mining
Webis-WVC-07 Webis Group 2007 1,000 documents Vandalism Detection
Webis-Tripad-13-Sentiment Webis Group 2013 2,100 hotel reviews Sentiment Analysis
Webis-Tripad-14 Webis Group 2014 266,061 hotel reviews Sentiment Analysis and Author Profiling
Webis-Debate-16 Webis Group 2016 26,689 text segments Computational Argumentation
Internal Webis Corpora
Name Publisher/Creator Year Size Default Task
Arxiv Webis Group - 550 documents -
Bauphysik Webis Group 2010 70 MB (compressed) Vertical Search
ODP Cluster Labeling Webis Group 2010 6,400 documents Cluster Labeling
Converter Testfiles Webis Group - 1.6 GB -
Wikipedia Editwars Webis Group 2008 919 MB (compressed) Editwar Detection
Genre Corpus (2008) Webis Group 2008 1,600 documents Web Genre Analysis
German Newsgroups Webis Group - 27,000 documents Cluster Analysis
Google News Crawl Webis Group - 35,000 documents -
Gutenberg Wordcount Webis Group - 3.5 MB -
Netspeak Dictionary Webis Group - 3.3 GB -
Slashdot Webis Group - 3.1 GB (compressed) -
TLDP Crawl Webis Group - 15,000 documents -
Twitter Movie Sentiments Webis Group 2010 1.3 GB (compressed) Sentiment Analysis
Webdiversity Webis Group - 225 MB -
Youtube Comments Webis Group - 324,000 documents -
Affiliated Corpora
Name Publisher/Creator Year Size Default Task
Dagstuhl-15512 ArgQuality Corpus Dagstuhl-15512 Quality breakout group 2017 304 arguments Computational Argumentation
Burrows Authorship Corpora Steven Burrows, RMIT University 2010 8 MB (compressed) Source Code Authorship Attribution
Paderborn Genre Analysis Corpus 2012 Baumann, Lettmann, Stein 2012 19.7 MB (compressed) Web Genre Analysis
Other Corpora
Name Publisher/Creator Year Size Default Task
20 Newsgroups Carnegie Mellon University 1999 20,000 documents Text Classification, Text Clustering
7Sectors-WebKB CMU World Wide Knowledge Base 2001 4,477 documents Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers University of Sheffield 2009 80 KB (compressed) Plagiarism Detection
Annotated Customer Reviews Simon Fraser University Burnaby 2004 870 KB Sentiment Analysis
AOL Query Log AOL 2006 1.5 GB (compressed) Query Log Analysis
Argument Annotated Essays, v1 TU Darmstadt 2014 90 persuasive essays Computational Argumentation
Argument Annotated Essays, v2 TU Darmstadt 2016 402 persuasive essays Computational Argumentation
Araucaria Argumentation Corpus University of Dundee 2014 664 examples Computational Argumentation
Arguing Subjectivity Corpus University of Pittsburgh 2012 84 documents Computational Argumentation
Bergsma-Wang-Corpus 2007 S. Bergsma and Q. I. Wang 2007 2.4 MB Web Search Analysis
BLOGS06 test collection University of Glasgow 2006 4 million documents Link Analysis
BNC Writing Errors J. Wagner et al. 2007 274 MB (compressed) Writing Error Detection
British National Corpus (XML) BNC Consortium 2007 5.1 GB Text Analysis (English)
Brown Corpus Brown University 2011 500 documents Text Analysis (English)
CEEAUS 2010 Beta Edition Kobe University 2010 1,800 documents Cross-Language Analysis
CLEANEVAL 2007 University of Trento and University of Leeds 2007 1,333 documents Main Content Extraction
CLEF-IP 2009 Information Retrieval Facility Society (IRF) 2009 1.9 million documents Patent Retrieval
CLEF-IP 2010 Information Retrieval Facility Society (IRF) 2010 2.6 million documents Patent Retrieval
ClueWeb09 Carnegie Mellon University 2009 10^9 documents Web Mining
ClueWeb12 Carnegie Mellon University 2012 10^9 documents Web Mining
CoNLL-2003 University of Antwerp 2003 11.5 MB Named Entity Recognition
CoPhIR Consiglio Nazionale delle Ricerche (ISTI-CNR) 2003 106 million images Image Retrieval
DBLP University of Massachusetts Amherst 2006 910 MB Network Analysis
DBpedia 3.5.1 DBpedia 2010 8.3 GB (compressed) Data Mining
DMOZ Open Directory Project 2010 11 GB (compressed) Clustering, Cluster Labeling, and Data Mining
ECML PKDD Discovery Challenge 2008 ECML 2008 304 MB (compressed) Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples Microsoft Corporation 2006 123 sentences Cross-Language Analysis
Essay Argument Strength UT Dallas 2015 1,000 scores Essay Scoring
Essay Organization UT Dallas 2010 1,003 scores Essay Scoring
Essay Prompt Adherence UT Dallas 2014 830 scores Essay Scoring
Essay Thesis Clarity UT Dallas 2013 830 scores Essay Scoring
Finegrained Sentiment Uppsala University 2011 294 Amazon reviews Sentiment Analysis
European Corpus Initiative Multilingual Corpus I European Corpus Initiative 1994 824 MB Text Analysis (Multilingual)
Europarl (v1 & v3) University of Edinburgh 2007 2.6 GB (compressed) Machine Translation
Falko Essaykorpus L2 V2.0 Institut für deutsche Sprache und Linguistik 2005 248 documents Interlanguage Analysis
German General Inquirer Dictionary Harvard University 1966 240 KB Sentiment Analysis (German Wordlist)
Google Books N-Gram 20090715 Google 2009 898 GB (compressed) Data Mining
Google Web 1T 5-gram Version 1 Google 2006 55 GB Text Analysis (English)
IBM Context-dependent Argumentation, ACL-14 IBM 2014 2,683 argument elements Computational Argumentation
IBM Context-dependent Argumentation, EMNLP-15 IBM 2015 6,984 argument elements Computational Argumentation
IBM Term-relatedness IBM 2015 9,856 term pairs Text Analysis (English)
ICWSM 2009 Data Challenge ICWSM 2009 37 GB (compressed) Network Analysis
imat2009 dataset Yandex 2009 650 MB Machine-learned Ranking
International Corpus of Learner English v2 Center for English Corpus Linguistics 2009 6,100 documents Language Analysis
The JRC-Acquis Multilingual Parallel Corpus (3.0) European Commission's Office for Official Publications (OPOCE) 2009 2.3 GB (compressed) Cross-Language Research
Koppel Authorship Corpus M. Koppel and J. Schler 2004 3.9 MB (compressed) Authorship Verification
Learning To Rank 3.0 Microsoft 2008 8.0 GB Machine-learned Ranking
Lee 50 Documents M. D. Lee et al. 2005 130 KB Text Similarity Analysis
METER Corpus Department of Journalism and Department of Computer Science at Sheffield University 2002 9.6 MB (compressed) Text Reuse
MIR Flickr 2008 LIACS Medialab at Leiden University, Netherlands 2008 25,000 documents Image Retrieval
Multi Domain Sentiment Dataset (Processed ACL) Johns Hopkins University 2007 29 MB Sentiment Analysis
Montclair Electronic Language Database Montclair State University 2001 33 documents Cross-Language Analysis
Movielens University of Minnesota - 74 MB (compressed) Collaborative Filtering
Movie Review Data Cornell University 2004-2005 219 MB Sentiment Analysis
Netflix Challenge (Partial) Netflix 2006 1.6 GB (compressed) Collaborative Filtering
New York Times Corpus New York Times 2008 1.8 million articles Text Mining
ODP239 C. Carpineto and G. Romano 2009 4.8 MB Subtopic Information Retrieval
OHSUMED Test Collection Oregon Health & Science University 1994 461 MB Text Clustering
OPUS (Europarl3_0.2b and EMEA0.3) Jörg Tiedemann 2009 9.0 GB Machine Translation
Reason Identification and Classification Dataset Kazi Saidul Hasan and Vincent Ng 2014 4.3 MB (compressed) Computational Argumentation
Reuters 21578 (22173) Reuters, David D. Lewis 1996 21,578 articles Text Clustering
Reuters RCV1 Reuters, David D. Lewis 2000 1.0 GB (compressed) Text Clustering
Reuters RCV1 - CCAT split Reuters, David D. Lewis 2002 1.6 GB Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection National Research Council of Canada 2009 166 MB (compressed) Crosslingual Categorization
Request For Comments Collections (to 4501) RFC Editor 2008 4,380 documents Data Mining
Rovereto Twitter N-Gram Corpus University of Trento, Italy 2011 75 million tweets Social Network Analysis
SILS Learner Corpus of English Waseda University 2007 16 MB (compressed) Cross-Language Analysis
SMS Spam Collection v.1 T. A. Almeida and J. M. G. Hidalgo 2011 210 KB Spam Identification
TIPSTER Complete Advanced Research Projects Agency 1993 1.2 MB (compressed) Information Retrieval
TREC vol4 National Institute of Standards and Technology (NIST) 1996 436 MB Data Mining
TREC vol5 National Institute of Standards and Technology (NIST) 1997 389 MB Data Mining
TREC web National Institute of Standards and Technology (NIST) 1999-2004 90 GB Data Mining
Tswana Learner English Corpus Center for Text Technology 2006 1.6 MB (compressed) Cross-Language Analysis
Twitter tweets Yang and Leskovec 2011 467 million tweets Social Network Analysis
UKPConvArg1 TU Darmstadt 2016 16,081 argument pairs Computational Argumentation
UKPConvArg2 TU Darmstadt 2016 9,111 argument pairs Computational Argumentation
USPTO Patents from 2001 to 2010 U.S. Patent & Trademark Office 2010 10 TB (uncompressed) Patent Analysis
Uppsala Student English Uppsala University 2001 1,500 documents Cross-Language Analysis
WaCKy: deWaC Web-As-Corpus Kool Yinitiative 2009 1.7 billion words Text Analysis (German)
WaCKy: frWaC Web-As-Corpus Kool Yinitiative 2009 1.6 billion words Text Analysis (French)
WaCKy: itWaC Web-As-Corpus Kool Yinitiative 2009 2 billion words Text Analysis (Italian)
WaCKy: sdeWaC Web-As-Corpus Kool Yinitiative 2009 0.9 billion words Text Analysis (German)
WaCKy: ukWaC Web-As-Corpus Kool Yinitiative 2009 2 billion words Text Analysis (English)
WaCKy: WaCkypedia_EN Web-As-Corpus Kool Yinitiative 2009 0.8 billion words Text Analysis (English)
Web People Search Corpus (WePS-1) NLP Group (UNED), Proteus Project (NYU) 2007 2,000 web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) NLP Group (UNED), Proteus Project (NYU) 2009 3,000 web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) NLP Group (UNED), Proteus Project (NYU) 2010 50,000 web pages Person Disambiguation, Text Clustering
Wikipedia Revision Dump Wikimedia Foundation 2006 46 GB (compressed) Data Mining
Wikipedia Revision Dump Wikimedia Foundation 2008 133 GB (compressed) Data Mining
Wikipedia Full Dump Wikimedia Foundation 2011 more than 5 TB (uncompressed) Data Mining
Wikipedia History Snapshots Wikimedia Foundation 2006-2012 32 GB (compressed) Data Mining
Wikipedia Snapshots Wikimedia Foundation 2006-2012 280 GB (compressed) Data Mining
Wikipedia Participation Challenge Wikimedia Foundation 2011 976 MB (compressed) User Behaviour Prediction
Wordsim353 L. Finkelstein et al. 2002 60 KB Word Similarities
Wortschatz Leipzig Universität Leipzig 2006 7.6 GB Text Analysis (Multilingual)
Yahoo N-Grams Yahoo 2006 13 GB (compressed) Text Analysis (English)
Yahoo Learning To Rank Challenge 2010 Yahoo 2010 421 MB (compressed) Document Ranking
TripAdvisor Data Set University of Illinois at Urbana-Champaign 2010 220 MB (compressed) Opinion Mining