    Web Technology · Information Systems · Prof. Dr. Benno Stein

    Wikipedia Revision Dump

    Synopsis

    The Wikipedia revision dump can be used to study collaborative writing on Wikipedia. We mirror the corpus to ensure its availability.

    Download

    You can download the corpus upon request.

    We hold two of the last Wikipedia revision dumps that were made available: one contains all revisions of all English articles up to August 16, 2006 (46 GB); the other contains all English article revisions up to January 1, 2008 (133 GB). Besides the original zipped XML files, the former dump is also available in a partitioned version (74 ZIP-compressed parts of 2-3 GB each), which is much easier to handle in practice.

    Corpus Outline

    Wikipedia is a valuable resource for all kinds of research, and its contents are available free of charge under the GNU Free Documentation Licence. The Wikimedia Foundation periodically dumps parts of the Wikipedia projects into zipped XML files for backup purposes, and these dumps are publicly available from the Wikimedia Foundation. However, one important dump has been unavailable for a long time due to technical difficulties: the revision dump for the English Wikipedia. This dump contains all revisions of all English articles and can be used, for instance, to evaluate algorithms for plagiarism detection, near-duplicate detection, or vandalism detection.
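
    Because a revision dump is far too large to load into memory, it is usually read as a stream. The following is a minimal sketch of that idea, not the tooling used for this corpus. It assumes the usual MediaWiki export layout, in which each page element holds a title and a sequence of revision elements with id, timestamp, and text children. The function name iter_revisions and the tiny SAMPLE document are hypothetical and serve only as illustration; the namespace URI in SAMPLE is likewise an assumption, and the code ignores namespaces so that it does not depend on a particular export version.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; real dumps follow the same nesting but are far larger.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2006-01-01T00:00:00Z</timestamp>
      <contributor><username>A</username><id>5</id></contributor>
      <text>first version</text>
    </revision>
    <revision>
      <id>11</id>
      <timestamp>2006-01-02T00:00:00Z</timestamp>
      <contributor><username>B</username><id>6</id></contributor>
      <text>second version</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (title, revision_id, timestamp, text) for every revision in the stream."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            # Only direct children are inspected, so the contributor's id is not
            # confused with the revision id.
            fields = {local_name(child.tag): child.text for child in elem}
            yield title, fields.get("id"), fields.get("timestamp"), fields.get("text") or ""
            elem.clear()  # release the revision text once it has been consumed
        elif name == "page":
            elem.clear()
            title = None


if __name__ == "__main__":
    for title, rev_id, timestamp, text in iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(title, rev_id, timestamp, len(text))
```

    For one part of the partitioned dump, the same function could be given a file object opened from the archive (for example via the standard zipfile module) instead of the in-memory sample. Consecutive revisions of the same title can then be compared pairwise, which is the basic input for near-duplicate or vandalism detection experiments.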
