Webis-QSpell-17

Synopsis

The Webis Query Spelling Corpus 2017 (Webis-QSpell-17) contains 54,772 web queries that were manually spell-checked; for 9,171 of these queries, the corpus contains alternative spelling variants.

For segmentations of many of the queries (i.e., tagged concepts and phrases), please refer to the companion corpus Webis-QSeC-10.

Download

We provide the corpus as a single folder in a Zip archive.

To download the corpus, please use the following link:

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].

Research

The original queries were extracted from the AOL query log and range from 3 to 10 keywords in length. Two independent annotators went through all the queries and were allowed to use any tool they wanted to support their work (e.g., Hunspell, aspell, search engines, dictionaries, Wikipedia). For each query, potential alternative spellings (possibly more than one) had to be annotated. The two annotators then discussed the cases where they disagreed, which typically resulted in the different reasonable spelling variants being included in the final corpus. After this step, three annotators each independently checked one third of the queries that contained alternative spellings from the first iteration and could add or remove variants where needed, again using tools of their choice.

The two example queries below show the corpus format, with columns separated by semicolons:

  • 4030033927;new york and company;new york & company;new york and company
  • 3431465218;new york aquarium;new york aquarium;

Each query has a unique internal ID (e.g., 4030033927 in the first example); queries that are also contained in the Webis-QSeC-10 have the same IDs in both corpora. The original query spelling is in the second column, and the spelling variants annotated by our annotators are contained in the following column(s). In the first example, two spelling variants are given in the third and fourth columns, while in the second example only one spelling variant is given. In the second example, the spelling variant in the third column is identical to the original query in the second column, which indicates a case without spelling error.
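The column layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the corpus distribution: the names QueryRecord, parse_line, and parse_lines are hypothetical, and the sketch assumes that queries themselves contain no semicolons, that empty trailing columns (as in the second example) carry no variant, and that the bullet characters shown above are not part of the actual file.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class QueryRecord:
    """One corpus line: ID, original query, annotated spelling variants."""
    query_id: str
    original: str
    variants: List[str]

    @property
    def original_accepted(self) -> bool:
        # Assumption: the original spelling counts as acceptable
        # when it also appears among the annotated variants.
        return self.original in self.variants


def parse_line(line: str) -> QueryRecord:
    """Split one semicolon-separated line into its columns."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) < 2:
        raise ValueError(f"expected at least ID and query, got: {line!r}")
    query_id, original = fields[0], fields[1]
    # Columns three and onwards hold the variants; drop empty columns
    # that result from a trailing semicolon.
    variants = [field for field in fields[2:] if field]
    return QueryRecord(query_id, original, variants)


def parse_lines(lines: Iterable[str]) -> Iterator[QueryRecord]:
    """Parse an iterable of lines (e.g., an open file), skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


examples = [
    "4030033927;new york and company;new york & company;new york and company",
    "3431465218;new york aquarium;new york aquarium;",
]
records = list(parse_lines(examples))
```

For the two example lines, records[0].variants holds both annotated spellings, while records[1].variants holds only the spelling that matches the original query. Keeping the ID as a string avoids any assumption about its numeric range and makes it straightforward to join against the IDs in Webis-QSeC-10.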

For more information on the construction of the dataset see the respective publication.

People

Students: Marcel Gohsen, Anja Rathgeber

Publications

Matthias Hagen, Martin Potthast, Marcel Gohsen, Anja Rathgeber, and Benno Stein. A Large-Scale Query Spelling Correction Corpus. In Noriko Kando et al., editors, 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR 17), pages 1261-1264, August 2017. ACM. ISBN 978-1-4503-5022-8. [doi] [paper] [corpus] [bib] [poster]