DBpedia abstract corpus

This corpus contains a conversion of Wikipedia abstracts in six languages (Dutch, English, French, German, Italian and Spanish) into the NLP Interchange Format (NIF). The corpus contains the abstract texts, as well as the position, surface form and linked article of all links in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation.
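To illustrate how the position, surface form and linked article of a link relate to the abstract text, here is a minimal sketch in Python. It does not parse the corpus files. The `Link` class, the helper functions and the sample sentence are hypothetical and made up for illustration. The comments map the fields to the NIF and ITS RDF properties they would usually correspond to (`nif:beginIndex`, `nif:endIndex`, `nif:anchorOf`, `itsrdf:taIdentRef`), assuming zero-based character offsets with an exclusive end index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One link annotation (hypothetical container, not part of the corpus)."""
    begin: int      # assumed to correspond to nif:beginIndex
    end: int        # assumed to correspond to nif:endIndex (exclusive)
    surface: str    # assumed to correspond to nif:anchorOf
    resource: str   # assumed to correspond to itsrdf:taIdentRef


def check_offsets(text, links):
    """Return the links whose offsets do not reproduce their surface form."""
    return [l for l in links if text[l.begin:l.end] != l.surface]


def to_spans(text, links):
    """Return (surface form, resource) pairs in text order."""
    ordered = sorted(links, key=lambda l: l.begin)
    return [(text[l.begin:l.end], l.resource) for l in ordered]


# Made-up sample data, not taken from the corpus.
abstract = "Leipzig is a city in Saxony, Germany."
links = [
    Link(29, 36, "Germany", "http://dbpedia.org/resource/Germany"),
    Link(21, 27, "Saxony", "http://dbpedia.org/resource/Saxony"),
]

if __name__ == "__main__":
    print(check_offsets(abstract, links))
    for surface, resource in to_spans(abstract, links):
        print(surface, resource)
```

A check like `check_offsets` is a cheap sanity test before using the annotations as NER training data, since any offset mismatch would silently shift the entity labels.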

Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, following Wikipedia guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphors refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur with a different meaning, Wikipedia guidelines suggest that the new meaning should be linked again for disambiguation. In short, this type of text is highly interesting.

Acknowledgments: The conversion of this corpus was supported by the FREME H2020 project.

Data and Resources

Additional Info

Field Value
Author Martin Brümmer
Maintainer Martin Brümmer
Last Updated May 10, 2018, 16:40 (UTC)
Created September 10, 2015, 14:37 (UTC)
links:dbpedia 82318744
triples 743532157