Page heading
Languages and Services
  
    You are here menu
    Subpage heading
    Web Technology · Information Systems · Prof. Dr. Benno Stein
    Navigation
    Additional Content
    Main Content

    Person Resolution

    Synopsis

    The mapping between people and their names is not one-to-one. When searching the World Wide Web for a person, one is immediately faced with this problem: search results contain Web pages of different individuals having the same name. Since an individual can have several Web pages, search results get cluttered, especially when searching for people with rather common names.

    The problem can only be addressed with a deeper semantic analysis: Web page content must be interpreted in multiple respects in order to distinguish between different individuals even if they have the same name. This grouping problem is also referred to as "multi document person resolution"; it has recently gained much attention, among others through the Spock Challenge. The solution that is outlined below has been awarded the winner of this challenge.

    Project Outline

    The heart of our solution for the person resolution problem is similarity graph clustering. Let G=< V,E, φ > denote a weighted graph with node set V, edge set E, and edge weight function φ. Each node represents a document, and each edge e in E has a weight φ(e) in [0,1] that models the similarity of its incident documents with respect to their referents. Using such a graph-based approach, the challenge can be subdivided into the following steps.

    • Modeling. Design of a similarity function φ that reflects page similarity according to individuals (also called "referents").
    • Merging. Choice of a clustering algorithm to generate different clusterings using various parameters.
    • Evaluation. Test the validity of the generated clusterings in order to choose the best.

    In general, the modeling part constitutes the most critical and most important part in the analysis process. If the model is poor, no combination of clustering algorithm or validity measure can succeed in producing a high quality clustering.

     

    Analysis Process. For training purposes, we assume the existence of a training set of Web pages that has been labeled according to referents. The analysis process is depicted in the figure above. At training time (left hand side), a similarity measure in form of a classifier is learned from the training data, which is employed online (right hand side) for deriving similarities of previously unseen Web pages. In the following, the procedure for learning a similarity function (at training time) and the procedure for applying the learned similarity function are shown.

    Training procedure:

    (1a) For each document within an instance, several representations are computed. In particular, we compute a vector space representation using a TF*IDF weighting scheme, a core vocabulary representation, a semantic similarity model based on Wikipedia, a categorical model where the document is classified into the DMOZ hierarchy, and a high-level domain specific model that uses an NLP parser to extract specific knowledge.

    (2a) For each pair of documents, similarity values φi according to the mentioned document representations are computed.

    (3) A logistic regression is used to combine the similarity values φi of an edge e into a single value φ. The objective of the regression is to produce a value of φ(e)=1 if the incident documents of e belong to the same referent, and φ(e)=0 otherwise. In other words, the value of the logistic regression represents the probability that the documents incident to e belong to the same individual.

    Once the model is learned, it can be employed to combine the similarity values φi into one value φ for previously unseen data. The following procedure illustrates how the trained classifier is used online.

    Application of the learned model:

    (1b) For each pair of documents in an instance, the similarity values φi are computed.

    (2b) The trained classifier is used to combine the φi into one value φ, which is used as edge weight in a similarity graph G.

    (4) The similarity graph is reduced to a k-nearest-neighbor graph in order to reduce noise.

    (5) The similarity graph is clustered with the MajorClust algorithm (Stein and Niggemann 1999) using different density thresholds. The resulting clusterings are assessed with respect to their validity using the expected density measure ρ (Stein and Meyer zu Eissen 2003, Meyer zu Eissen 2007).

    People

    Students: Steffen Becker, Christof Bräutigam, Tino Rüb, and Hagen-Christian Tönnies

    Related Publications

    Benno Stein and Sven Meyer zu Eißen. Weighted Experts: A Solution for the Spock Data Mining Challenge. In Klaus Tochtermann and Hermann Maurer, editors, 8th International Conference on Knowledge Management (I-KNOW 08), Journal of Universal Computer Science, pages 358-365, September 2008. Springer. ISSN 0948-695x. [paper] [bib]
    Sven Meyer zu Eißen. On Information Need and Categorizing Search. Dissertation, University of Paderborn, Germany, Faculty of Computer Science, Electrical Engineering and Mathematics, February 2007. [doi] [paper] [bib]
    Benno Stein and Sven Meyer zu Eißen. Automatic Document Categorization: Interpreting the Performance of Clustering Algorithms. In A. Günter, R. Kruse, and B. Neumann, editors, Advances in Artificial Intelligence. 26th Annual German Conference on AI (KI 03), 2821 of Lecture Notes in Artificial Intelligence, pages 254-266, September 2003. Springer. ISBN 3-540-20059-2. [doi] [paper] [bib]
    Benno Stein and Oliver Niggemann. On the Nature of Structure and its Identification. In P. Widmayer, G. Neyer, and S. Eidenbenz, editors, Graph-Theoretic Concepts in Computer Science, 1665 of Lecture Notes in Computer Science, pages 122-134, June 1999. Springer. ISBN 3-540-66731-8. [paper] [bib]

    Content signature