In this project we develop OpinionCloud, a new opinion summarization technology for Web comments in general, and for comments on YouTube and Flickr in particular. Popular Web items often attract thousands of comments, and to get an idea of the crowd's overall opinion one would have to read all of them, which is impractical. Our summarization approach retrieves this information by generating an opinion word cloud for a given set of comments. We operationalize the technology in browser add-ons for Firefox and Chrome, which summarize the comments on a YouTube video when the user starts watching it.


Left: positive and negative words. Right: all words unfiltered.

Our research on opinion summarization of Web comments boils down to two research areas: sentiment analysis and summary visualization. The former deals with the classification of words as positive, negative, or neutral, whereas the latter deals with the design of an accessible visual representation of a set of opinions.

Sentiment Analysis & Opinion Visualization. In sentiment analysis, a word's polarity can be identified by measuring its co-occurrence with words whose polarity is known in advance: if a given word occurs with high probability in the vicinity of positive (negative) words, it can be considered positive (negative) as well. Neutral words, however, tend to occur arbitrarily next to words of both polarities. We use this idea to train a dictionary of opinion words, which also contains slang terms that are often used in comments. The dictionary is then used to classify the words of comments as positive, negative, or neutral. By default, words that are not contained in the dictionary are considered neutral.
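The co-occurrence idea can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the seed word sets, the window size, the minimum count, the score threshold, and the whitespace tokenization are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def train_polarity_dictionary(comments, pos_seeds, neg_seeds,
                              window=3, min_count=2, threshold=0.5):
    """Build a word -> polarity dictionary from seed words (illustrative only).

    A word is labeled by how often it appears within `window` tokens of
    positive versus negative seed words. All parameters are assumptions.
    """
    pos_hits, neg_hits = Counter(), Counter()
    for comment in comments:
        tokens = comment.lower().split()  # naive tokenization (assumption)
        for i, word in enumerate(tokens):
            if word in pos_seeds or word in neg_seeds:
                continue  # seed words already have a known polarity
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                if tokens[j] in pos_seeds:
                    pos_hits[word] += 1
                elif tokens[j] in neg_seeds:
                    neg_hits[word] += 1

    # Seed words enter the dictionary with their known polarity.
    dictionary = {w: "positive" for w in pos_seeds}
    dictionary.update({w: "negative" for w in neg_seeds})

    for word in set(pos_hits) | set(neg_hits):
        p, n = pos_hits[word], neg_hits[word]
        if p + n < min_count:
            continue  # too little evidence to decide
        score = (p - n) / (p + n)  # ranges from -1 (negative) to +1 (positive)
        if score >= threshold:
            dictionary[word] = "positive"
        elif score <= -threshold:
            dictionary[word] = "negative"
        # Words with a mixed score co-occur with both polarities: left out.
    return dictionary


def classify(word, dictionary):
    """Words missing from the dictionary are neutral by default."""
    return dictionary.get(word.lower(), "neutral")
```

In this sketch, a word that co-occurs about equally often with seeds of both polarities receives a score near zero and is left out of the dictionary, so `classify` treats it as neutral, matching the default described above.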

The opinions found in a set of comments are visualized as shown in the left figure. The words are arranged in a cloud, where the color of a word denotes its polarity and the size of a word denotes its frequency in the comments. This visualization is comparable to the well-known tag clouds for folksonomies.
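The mapping from comments to cloud attributes can be sketched as follows. This is only an illustration under stated assumptions: the color names, the font size range, the linear size scaling, and the function name `build_cloud_entries` are hypothetical and not taken from the add-ons, and the actual placement of words in the cloud is not covered.

```python
from collections import Counter

# Hypothetical color scheme; the colors used by the add-ons are not specified here.
POLARITY_COLORS = {"positive": "green", "negative": "red", "neutral": "gray"}


def build_cloud_entries(comments, dictionary, max_words=50,
                        min_size=10, max_size=40, opinion_only=False):
    """Compute color and font size for the most frequent words.

    `dictionary` maps words to "positive" or "negative"; all other words are
    neutral. With opinion_only=True, neutral words are filtered out.
    """
    counts = Counter(tok for c in comments for tok in c.lower().split())
    if opinion_only:
        counts = Counter({w: n for w, n in counts.items() if w in dictionary})

    top = counts.most_common(max_words)
    if not top:
        return []

    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    entries = []
    for word, count in top:
        # Linear scaling of frequency to font size (an assumption).
        scale = (count - lowest) / span if span else 1.0
        entries.append({
            "word": word,
            "count": count,
            "color": POLARITY_COLORS[dictionary.get(word, "neutral")],
            "size": round(min_size + scale * (max_size - min_size)),
        })
    return entries
```

Calling the function with `opinion_only=True` corresponds to a cloud of positive and negative words only, while the default keeps all words unfiltered.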

Why YouTube? We have chosen YouTube as a working example for our technology because a comment on YouTube usually contains little more than an exclamation of opinion, and because a large number of comments are available. For a user, reading these comments is time-consuming and boring; put another way, comments on YouTube are neither universally accessible nor useful. For an information retrieval researcher, however, these comments form a unique large-scale corpus of highly opinion-coloured language. For instance, to train our dictionary we have analyzed about 9 million YouTube comments.


Students: Steffen Becker


Martin Potthast. Technologien zur Wiederverwendung von Texten aus dem Web. In Steffen Hölldobler et al., editors, Ausgezeichnete Informatikdissertationen 2011, volume D-12 of Lecture Notes in Informatics (LNI), pages 141-150, December 2012. Gesellschaft für Informatik. ISBN 978-3-88579-416-5.
Martin Potthast, Benno Stein, Fabian Loose, and Steffen Becker. Information Retrieval in the Commentsphere. Transactions on Intelligent Systems and Technology (ACM TIST), 3(4):68:1-68:21, September 2012.
Martin Potthast. Technologies for Reusing Text from the Web. Dissertation, Bauhaus-Universität Weimar, December 2011.
Martin Potthast and Steffen Becker. Opinion Summarization of Web Comments. In Cathal Gurrin et al., editors, Advances in Information Retrieval. 32nd European Conference on Information Retrieval (ECIR 10), volume 5993 of Lecture Notes in Computer Science, pages 668-669, Heidelberg, 2010. Springer. ISBN 978-3-642-12274-3.
Antonio Reyes, Martin Potthast, Paolo Rosso, and Benno Stein. Evaluating Humor Features on Web Comments. In Nicoletta Calzolari et al., editors, 7th Conference on International Language Resources and Evaluation (LREC 10), May 2010. European Language Resources Association (ELRA). ISBN 2-9517408-6-7.
Martin Potthast, Benno Stein, and Steffen Becker. Towards Comment-based Cross-Media Retrieval. In Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti, editors, 19th International Conference on World Wide Web (WWW 10), pages 1169-1170, April 2010. ACM. ISBN 978-1-60558-799-8.
Martin Potthast. Measuring the Descriptiveness of Web Comments. In Mark Sanderson et al., editors, 32nd International ACM Conference on Research and Development in Information Retrieval (SIGIR 09), pages 724-725, July 2009. ACM. ISBN 978-1-60558-483-6.