Reuters-128 NIF NER Corpus

This English corpus is based on the well-known Reuters-21578 corpus, which contains economic news articles. In particular, we chose 128 articles containing at least one named entity (NE). Compared to the News-100 corpus, the documents of Reuters-128 are significantly shorter and thus provide less context.

To annotate NEs with URIs, we implemented a supporting judgement tool. The input for the tool was a randomly sampled subset of more than 150 Reuters-21578 news articles. First, FOX (Ngonga Ngomo et al., 2011) was used to recognize an initial set of NEs, which reduced the manual work to a feasible amount for a dataset of this size. Afterwards, domain experts manually corrected FOX's mistakes using the annotation tool. To support this, the tool highlighted the entities in the texts and suggested initial URI candidates via simple string matching algorithms. Two scientists then manually determined the correct URI for each named entity, with an initial voter agreement of 74%. This low initial agreement rate hints at the difficulty of the disambiguation task. In some cases the judges did not agree initially but came to an agreement shortly after reviewing the cases. While annotating, we left out ticker symbols of companies (e.g., GOOG for Google Inc.), abbreviations, and job descriptions, because these are always preceded by the full company name or the person's name, respectively.
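The two mechanical steps described above, proposing URI candidates by string matching and measuring initial agreement between two judges, can be sketched roughly as follows. This is a minimal illustration and not the actual judgement tool: the function names, the toy `LABELS` index, and the sample judgements are hypothetical, and the description does not say how the 74% figure was computed, so plain percentage agreement is assumed here.

```python
from typing import Dict, List, Optional, Sequence


def candidate_uris(mention: str, label_index: Dict[str, str]) -> List[str]:
    """Return URIs whose label equals or contains the mention (case-insensitive).

    Exact label matches come first, followed by substring matches.
    """
    needle = mention.strip().lower()
    if not needle:
        return []
    exact = [uri for label, uri in label_index.items() if label.lower() == needle]
    partial = [
        uri
        for label, uri in label_index.items()
        if needle in label.lower() and label.lower() != needle
    ]
    return exact + partial


def percent_agreement(
    judge_a: Sequence[Optional[str]], judge_b: Sequence[Optional[str]]
) -> float:
    """Fraction of entities for which both judges chose the same URI.

    Assumption: raw agreement, with no correction for chance.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must annotate the same entities")
    if not judge_a:
        raise ValueError("no annotations to compare")
    matches = sum(1 for a, b in zip(judge_a, judge_b) if a == b)
    return matches / len(judge_a)


# Hypothetical toy label index; a real tool would query a knowledge base.
LABELS = {
    "Google": "http://dbpedia.org/resource/Google",
    "Google Maps": "http://dbpedia.org/resource/Google_Maps",
    "Reuters": "http://dbpedia.org/resource/Reuters",
}

# Hypothetical judgements for four entity mentions (None = no URI chosen).
judge_1 = [LABELS["Google"], LABELS["Reuters"], None, LABELS["Google Maps"]]
judge_2 = [LABELS["Google"], LABELS["Reuters"], LABELS["Google"], LABELS["Google Maps"]]

if __name__ == "__main__":
    print(candidate_uris("google", LABELS))
    print(f"initial agreement: {percent_agreement(judge_1, judge_2):.0%}")
```

Raw agreement is the simplest measure to read off; a chance-corrected statistic such as Cohen's kappa would be a stricter alternative if the label distribution is skewed.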


Data and Resources

Complete Reuters-128 Corpus (text/turtle)
Full corpus in single Turtle format file

Documentation paper (PDF)
Title: N3 - A Collection of Datasets for Named Entity Recognition and...

DataID (text/turtle)
Metadata description of the corpus


Additional Information

Field	Value
Author	Ricardo Usbeck
Maintainer	Ricardo Usbeck
Last updated	October 29, 2014, 16:25 (UTC)
Created	September 5, 2014, 07:46 (UTC)
github	https://github.com/AKSW/n3-collection
homepage	http://aksw.org/Projects/N3NERNEDNIF.html
links:dbpedia	650
triples	6967