Netspeak
Synopsis
Writing in a foreign language is difficult, even for an experienced author. Problems include choosing the right word or preposition in a given context, finding a wording that is commonly used, and avoiding grammatical forms that reflect the author's native language. The Netspeak Web service helps authors overcome these issues by using the World Wide Web as a source of common language. The service can be queried with short text phrases to determine how customary they are on the Web. Wildcard characters can be added to a query to search for variations and synonyms of the query phrase, which are returned as a list ranked by their occurrence frequency on the Web. A screencast shows Netspeak in action.
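The idea of a wildcard query ranked by frequency can be illustrated with a minimal sketch. This is not Netspeak's implementation or its query syntax: the function name `query`, the table `TOY_COUNTS`, its invented counts, and the convention that `?` stands for exactly one word are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy frequency table. The phrases and counts are invented for illustration
# and are not taken from the Web 1T corpus.
TOY_COUNTS: Dict[str, int] = {
    "waiting for you": 900,
    "waiting on you": 150,
    "waiting to you": 5,
    "waiting for me": 300,
}


def query(pattern: str, counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return the phrases matching `pattern`, most frequent first.

    Assumption: "?" matches exactly one arbitrary word; every other
    token must match the phrase word literally.
    """
    wanted = pattern.split()
    hits = []
    for phrase, count in counts.items():
        words = phrase.split()
        if len(words) != len(wanted):
            continue  # "?" never matches zero or several words here
        if all(w == "?" or w == p for w, p in zip(wanted, words)):
            hits.append((phrase, count))
    # Rank by occurrence frequency, highest first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


print(query("waiting ? you", TOY_COUNTS))
```

With the toy table, the query `waiting ? you` ranks "waiting for you" first and "waiting to you" last, which is the kind of signal an author could use to choose a preposition. A linear scan like this would only be practical for a toy table; a corpus of billions of phrases needs an index.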
Project Outline
Netspeak indexes the complete "Web 1T 5-gram Version 1" corpus as a source of common language on the Web. The corpus comprises about 3.8 billion phrases of up to five words (so-called n-grams), which Google collected from the English-language Web. The following table shows details on the size of the corpus:
| n-grams | count | size (compressed) | size (uncompressed) |
|---|---|---|---|
| 1-grams | 13 588 391 | 70.2 MB | 177.0 MB |
| 2-grams | 314 843 401 | 1.6 GB | 5.0 GB |
| 3-grams | 977 069 902 | 5.5 GB | 19.0 GB |
| 4-grams | 1 313 818 354 | 8.4 GB | 30.5 GB |
| 5-grams | 1 176 470 663 | 8.8 GB | 32.1 GB |
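As a quick check, the per-length counts in the table can be summed to confirm the figure of about 3.8 billion phrases. The variable names below are chosen for illustration only.

```python
# n-gram counts copied from the table above, keyed by n-gram length.
NGRAM_COUNTS = {
    1: 13_588_391,
    2: 314_843_401,
    3: 977_069_902,
    4: 1_313_818_354,
    5: 1_176_470_663,
}

total = sum(NGRAM_COUNTS.values())
print(f"{total:,}")  # 3,795,790,711, i.e. about 3.8 billion
```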
People
- Martin Potthast (Scientific Mentoring and Software Architecture)
- Benno Stein (Scientific Mentoring and Statistics)
Students: Martin Trenkmann (Software Engineering)
Related Publications
© Fakultät Medien, 19.01.2012


