The ability to reproduce and compare the results of other researchers is essential for scientific progress. In many research fields, however, it is often impossible to specify the complete experimental setup, e.g., within the scope of a scientific publication. As a consequence, a reliable comparison becomes difficult, if not impossible. TIRA is our approach to addressing this shortcoming. TIRA (jokingly, "The Incredible Research Assistant") provides a means for evaluation as a service. It focuses on hosting shared tasks and facilitates the submission of software as opposed to the output of running a piece of software on a test dataset (a so-called run). TIRA encapsulates the submitted software in virtual machines. This way, even after a shared task is over, the submitted software can be re-evaluated at the click of a button, which greatly increases the reproducibility of the corresponding shared task. An overview of existing shared tasks is available at [service].


TIRA is currently one of the few platforms (if not the only one) that supports software submissions with little extra effort. We have used it to organize 12 shared tasks within PAN@CLEF, CoNLL, and the currently running WSDM Cup. All told, 300 pieces of software have been collected to date, all archived for re-execution. This ensures replicability as well as reproducibility (e.g., re-evaluating the collected software on new datasets).

For a recent example of applied reproducibility: we hosted a shared task in which participants submitted software adversarial to software submitted to a previous shared task: author obfuscation vs. authorship verification. Evaluating the obfuscators involved running the obfuscated texts through all 44 previously submitted verifiers to check whether the authors could still be identified. This would have been virtually impossible without TIRA.

TIRA
- supports almost any working environment and software stack (incl. Windows)
- apparently does not impede participation in shared tasks
  (so far, we have not observed a drop in registrations or heard any serious complaints afterward)
- prevents participants from directly accessing the test datasets (blind evaluation)
- prevents leakage of test datasets
- allows for controlling the amount of information passed back to participants when they run software on test datasets
- for the above reasons, supports the use of proprietary and sensitive datasets
- allows for many different task setups (e.g., for the source retrieval task, participants accessed our in-house ClueWeb search engine ChatNoir)

TIRA's only requirement for participants is that
- their software is executable from a POSIX command line (Cygwin on Windows) with a number of parameters
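In concrete terms, such a submission can be as simple as the following sketch. Note that the parameter names (--input, --output) and the run file name are illustrative assumptions, not TIRA's actual interface.

```python
# Minimal sketch of a command-line-executable submission.
# The --input/--output parameters and "run.txt" are hypothetical examples.
import argparse
import os

def main(argv=None):
    parser = argparse.ArgumentParser(description="Toy shared task submission")
    parser.add_argument("--input", required=True,
                        help="directory containing the test dataset")
    parser.add_argument("--output", required=True,
                        help="directory to write the run to")
    args = parser.parse_args(argv)

    os.makedirs(args.output, exist_ok=True)
    # A real submission would process the test data here and
    # write its output (the run) to the output directory.
    with open(os.path.join(args.output, "run.txt"), "w") as f:
        f.write("run produced from %s\n" % args.input)

if __name__ == "__main__":
    main()
```

Such a program would then be invoked from the command line, e.g., `python run.py --input /path/to/test-data --output /path/to/run`.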

TIRA's requirements for organizers are that they
- supply datasets
- supply run evaluation software
- review participant runs for errors
- moderate evaluation results and decide whether they should be published
- help to answer participant questions as they arise
That is nothing more than they would be doing anyway.
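For illustration, the run evaluation software an organizer supplies can be as simple as a script that compares a run against the ground truth. The tab-separated file format and the measure name below are illustrative assumptions, not a format TIRA prescribes.

```python
# Sketch of an organizer-supplied run evaluator.
# Assumed (hypothetical) file format: one "id<TAB>label" pair per line.
def read_labels(path):
    labels = {}
    with open(path) as f:
        for line in f:
            ident, label = line.strip().split("\t")
            labels[ident] = label
    return labels

def evaluate(run_path, truth_path):
    run = read_labels(run_path)
    truth = read_labels(truth_path)
    correct = sum(1 for ident, label in truth.items()
                  if run.get(ident) == label)
    return {"accuracy": correct / len(truth)}
```

On TIRA, such an evaluator runs against a participant's run without the participant ever seeing the ground truth, which is what enables blind evaluation.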

TIRA currently does not support
- GPU acceleration inside virtual machines
- accessing cluster computers to run, e.g., MapReduce jobs
These features will become available eventually.

TIRA's operational costs include
- running the virtual machines and the servers that host them

We are currently running TIRA on our Betaweb cluster at the Digital Bauhaus Lab (130 machines with 64 GB RAM each). We can afford to host hundreds of virtual machines simultaneously, and we would be willing to offer hosting free of charge. In return,
- we'd ask to be highlighted in appropriate places on web pages, presentations, papers, etc.
- we'd ask task organizers to make sure participants properly cite TIRA if they mention it in their papers

A grain of salt: TIRA is still a prototype (beta), and it is rough around the edges in some places. We are working toward an open source release.


Students: Anna Beyer, Matthias Busse, Clement Welsch, Arnd Oberländer, Johannes Kiesel, Adrian Teschendorf, Manuel Willem


Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-19, August 2017. Association for Computational Linguistics. [doi] [paper] [bib]
Allan Hanbury, Henning Müller, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Ivan Eggel, Tim Gollub, Frank Hopfgartner, Jayashree Kalpathy-Cramer, Noriko Kando, Anastasia Krithara, Jimmy Lin, Simon Mercer, and Martin Potthast. Evaluation-as-a-Service: Overview and Outlook. ArXiv e-prints, December 2015. [publisher] [article] [bib]
Frank Hopfgartner, Allan Hanbury, Henning Müller, Noriko Kando, Simon Mercer, Jayashree Kalpathy-Cramer, Martin Potthast, Tim Gollub, Anastasia Krithara, Jimmy Lin, Krisztian Balog, and Ivan Eggel. Report on the Evaluation-as-a-Service (EaaS) Expert Workshop. SIGIR Forum, 49 (1) : 57-65, June 2015. [doi] [article] [bib]
Martin Potthast, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In Evangelos Kanoulas et al., editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pages 268-299, Berlin Heidelberg New York, September 2014. Springer. ISBN 978-3-319-11381-4. [doi] [paper] [bib] [slides]
Tim Gollub, Martin Potthast, Anna Beyer, Matthias Busse, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. Recent Trends in Digital Text Forensics and its Evaluation. In Pamela Forner et al., editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative (CLEF 13), pages 282-302, Berlin Heidelberg New York, September 2013. Springer. ISBN 978-3-642-40801-4. ISSN 0302-9743. [doi] [paper] [bib] [slides]
Tim Gollub, Benno Stein, Steven Burrows, and Dennis Hoppe. TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments. In A Min Tjoa, Stephen Liddle, Klaus-Dieter Schewe, and Xiaofang Zhou, editors, 9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA, pages 151-155, Los Alamitos, California, September 2012. IEEE. ISBN 978-1-4673-2621-6. ISSN 1529-4188. [doi] [paper] [bib] [slides]
Tim Gollub, Steven Burrows, and Benno Stein. First Experiences with TIRA for Reproducible Evaluation in Information Retrieval. In Andrew Trotman et al., editors, SIGIR 12 Workshop on Open Source Information Retrieval (OSIR 12), pages 52-55, August 2012. ISBN 978-0-473-22025-9. [paper] [bib] [slides] [poster]
Tim Gollub, Benno Stein, and Steven Burrows. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pages 1125-1126, August 2012. ACM. ISBN 978-1-4503-1472-5. [doi] [paper] [bib] [poster]