Wikilinks RDF/NIF

The Wikilinks corpus is a very large coreference resolution corpus. It contains over 40 million mentions of over 3 million entities. Mentions are manually labeled links to the respective Wikipedia pages, embedded in natural-language text. They were obtained via a web crawl and are aggregated, together with the pages' source, in the extended corpus of over 180 GB in size.

We converted the corpus into the NLP Interchange Format (NIF) and publish it here as Linked Open Data, as RDF dumps and as an accompanying CSV file.

Every web page in the corpus was parsed. The text of the HTML element surrounding each Wikipedia link was extracted; if a page contained more than one link, these texts were concatenated. The position of each link within the resulting text was located and annotated via string offsets. The position of the HTML elements containing the links was annotated with XPath expressions. For every link to Wikipedia, a link to the corresponding DBpedia resource was included. The DBpedia ontology classes of the linked resource were added as well. Where a mapping exists, NERD core classes were added, too.
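The offset annotation described above can be illustrated with a minimal Python sketch. It is not the code used for the conversion: the function names, the single-space separator between concatenated element texts, and the `#char=begin,end` fragment style for mention URIs are assumptions made for illustration. The mapping from an English Wikipedia article URL to a `http://dbpedia.org/resource/` IRI follows the usual DBpedia naming scheme.

```python
from urllib.parse import urlparse

# Assumption: element texts are joined with a single space. The separator
# actually used in the published data is not documented here.
SEPARATOR = " "


def wikipedia_to_dbpedia(url):
    """Map an English Wikipedia article URL to the matching DBpedia resource IRI."""
    path = urlparse(url).path  # e.g. "/wiki/Leipzig"
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {url}")
    return "http://dbpedia.org/resource/" + path[len(prefix):]


def annotate_page(page_uri, snippets):
    """Concatenate element texts and compute string offsets for each link.

    snippets is a list of (element_text, anchor_text, wikipedia_url, xpath)
    tuples, one per Wikipedia link found on the page (hypothetical input shape).
    Returns the concatenated context string and a list of mention dicts.
    """
    parts = []
    mentions = []
    cursor = 0  # offset at which the current element text starts in the context
    for element_text, anchor_text, wikipedia_url, xpath in snippets:
        local = element_text.find(anchor_text)
        if local < 0:
            raise ValueError(f"anchor {anchor_text!r} not found in element text")
        begin = cursor + local
        end = begin + len(anchor_text)
        mentions.append({
            # Offset-based fragment identifier (illustrative URI scheme)
            "uri": f"{page_uri}#char={begin},{end}",
            "anchor": anchor_text,
            "begin": begin,
            "end": end,
            "xpath": xpath,
            "dbpedia": wikipedia_to_dbpedia(wikipedia_url),
        })
        parts.append(element_text)
        cursor += len(element_text) + len(SEPARATOR)
    return SEPARATOR.join(parts), mentions
```

Slicing the returned context with a mention's `begin` and `end` values yields the anchor text again, which is a quick sanity check for any offset-based annotation.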

The data is available in the Apache file system under http://wiki-link.nlp2rdf.org/data/. For ease of use, it is also available as a number of gzipped dump files. Additionally, there is a gzipped CSV file containing the core of the data.
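Because the dumps are large, it may be convenient to stream the gzipped CSV file row by row instead of decompressing it first. The following Python sketch shows one way to do that with the standard library. The file name in the usage note is hypothetical, and the comma delimiter and UTF-8 encoding are assumptions, since the column layout is not described here.

```python
import csv
import gzip


def iter_rows(path, limit=None):
    """Yield rows from a gzipped CSV file without unpacking it to disk.

    Assumptions: comma-separated values, UTF-8 encoding. Pass a limit to
    inspect only the first few rows of a very large file.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if limit is not None and index >= limit:
                break
            yield row
```

For example, `for row in iter_rows("wikilinks.csv.gz", limit=5): print(row)` would print the first five rows of a local copy saved under that (hypothetical) name.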

Data and Resources

Additional Info

Field           Value
Author          Martin Brümmer
Maintainer      Martin Brümmer
Last Updated    March 11, 2015, 15:39 (UTC)
Created         September 2, 2014, 10:18 (UTC)
homepage        http://wiki-link.nlp2rdf.org/
links:dbpedia   31542468
triples         533016300