Structured Wikinews

An NLP-processed version of the English and Spanish Wikinews articles. We used a dump of 18,862 English and 7,603 Spanish articles. The data contains links to DBpedia for entities, as well as provenance information indicating in which source each event was mentioned. The entities and events in the Wikinews articles are structured using the Simple Event Model (SEM) and the Grounded Annotation Framework (GAF).
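To make the SEM/GAF structuring concrete, here is a minimal Python sketch of how one event might be represented as triples. All URIs, labels, and the character-offset identifier below are illustrative assumptions, not values taken from the dataset; the SEM and GAF namespace URIs are the commonly published ones, used here on that assumption.

```python
# Hypothetical sketch: a SEM event with a GAF link back to its mention.
# None of these concrete URIs are drawn from the actual Wikinews files.

SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"   # Simple Event Model namespace
GAF = "http://groundedannotationframework.org/gaf#"  # assumed GAF namespace

# A SEM event linking a DBpedia actor and place, plus a gaf:denotedBy
# pointer to the token span in the source article that mentions it.
triples = [
    ("ex:event_e1", "rdf:type", SEM + "Event"),
    ("ex:event_e1", SEM + "hasActor", "http://dbpedia.org/resource/Barack_Obama"),
    ("ex:event_e1", SEM + "hasPlace", "http://dbpedia.org/resource/Washington,_D.C."),
    ("ex:event_e1", GAF + "denotedBy", "ex:article_42.naf#char=120,127"),
]

def properties_of(subject, triples):
    """Collect the predicate -> object pairs asserted about one subject."""
    return {p: o for s, p, o in triples if s == subject}

event = properties_of("ex:event_e1", triples)
```

The point of the pairing is that SEM describes *what happened* (event, actors, places, time) while GAF grounds each semantic node in the exact text spans that denote it, which is what allows mentions of the same event across articles to be aggregated.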

The English dataset contains:

  • 811,885 Events
  • 11,610 Locations
  • 103,496 Actors

The Spanish dataset contains:

  • 158,757 Events
  • 9,047 Locations
  • 144,793 Actors

For each dataset, three kinds of files are available: the original sources as a single cleaned XML file, the linguistic analyses produced by a state-of-the-art document-based NLP pipeline, and the aggregated information containing events and actors mentioned across different documents (encoded as RDF/TriG).
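A purely illustrative TriG fragment, sketching how named graphs can separate the aggregated event statements from their provenance. Every URI, prefix, and file name here is an assumption for the example, not taken from the actual dataset files.

```trig
@prefix sem: <http://semanticweb.cs.vu.nl/2009/11/sem/> .
@prefix gaf: <http://groundedannotationframework.org/gaf#> .
@prefix ex:  <http://example.org/> .

# Statements about one aggregated event, grouped in a named graph.
ex:instances {
    ex:event_e1 a sem:Event ;
        sem:hasActor <http://dbpedia.org/resource/Barack_Obama> ;
        sem:hasPlace <http://dbpedia.org/resource/Washington,_D.C.> .
}

# Provenance: the event is grounded in a span of a source article.
ex:provenance {
    ex:event_e1 gaf:denotedBy <http://example.org/article_42.naf#char=120,127> .
}
```

TriG's named graphs are what make the cross-document aggregation tractable: the same event URI can be denoted by spans in several articles, each recorded in a provenance graph.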

For more info and a list of files see:

This dataset was created in NewsReader project (ICT-316404), funded by the European Union's 7th Framework Programme.

Data and Resources

Additional Info

Field         Value
Author        Marieke van Erp
Maintainer    Marieke van Erp
Last Updated  May 14, 2014, 09:06 (UTC)
Created       May 11, 2014, 19:57 (UTC)