Reuters-128 NIF NER Corpus

This English corpus is based on the well-known Reuters-21578 corpus, which contains economic news articles. In particular, we chose 128 articles containing at least one named entity (NE). Compared to the News-100 corpus, the documents of Reuters-128 are significantly shorter and thus carry less context.

To annotate the NEs with URIs, we implemented a supporting judgement tool. The input for the tool was a randomly sampled subset of more than 150 Reuters-21578 news articles. First, FOX (Ngonga Ngomo et al., 2011) was used to recognize an initial set of NEs. This reduced the manual effort to a feasible amount given the size of the dataset. Afterwards, domain experts manually corrected the mistakes of FOX using the annotation tool. To support this, the tool highlighted the entities in the texts and added initial URI candidates via simple string matching algorithms. Two scientists manually determined the correct URI for each named entity, with an initial voter agreement of 74%. This low initial agreement rate hints at the difficulty of the disambiguation task. In some cases the judges did not agree initially, but came to an agreement shortly after reviewing the cases. While annotating, we left out ticker symbols of companies (e.g., GOOG for Google Inc.), abbreviations, and job descriptions, because those are always preceded by the full company name or a person's name, respectively.
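The two mechanical steps described above, proposing URI candidates by simple string matching and measuring how often two judges picked the same URI, can be sketched as follows. This is a minimal illustration and not the tool used to build the corpus: the `LABELS` lookup table, the function names, and the substring matching rule are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical label-to-URI lookup (assumption for illustration only);
# a real setup would query a knowledge base such as DBpedia.
LABELS: Dict[str, str] = {
    "Google Inc.": "http://dbpedia.org/resource/Google",
    "Reuters": "http://dbpedia.org/resource/Reuters",
}


def candidate_uris(mention: str, labels: Dict[str, str] = LABELS) -> List[str]:
    """Return URIs whose label contains the mention, or is contained in it.

    Matching is case-insensitive substring matching, one possible reading
    of "simple string matching".
    """
    needle = mention.casefold().strip()
    if not needle:
        return []
    return [
        uri
        for label, uri in labels.items()
        if needle in label.casefold() or label.casefold() in needle
    ]


def raw_agreement(judge_a: List[Optional[str]], judge_b: List[Optional[str]]) -> float:
    """Fraction of entities for which both judges chose the same URI.

    None stands for "no URI assigned". This is plain percentage agreement,
    not a chance-corrected measure such as Cohen's kappa.
    """
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must annotate the same entities")
    if not judge_a:
        raise ValueError("no annotations to compare")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)
```

For example, `candidate_uris("google")` matches only the `Google Inc.` label, and two judges who agree on three of four entities have a raw agreement of 0.75. Whether the reported 74% is raw agreement or a chance-corrected figure is not stated in the description.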

Data and Resources

Additional Information

Field Value
Author Ricardo Usbeck
Maintainer Ricardo Usbeck
Last updated October 29, 2014, 16:25 (UTC)
Created September 5, 2014, 07:46 (UTC)
github https://github.com/AKSW/n3-collection
homepage http://aksw.org/Projects/N3NERNEDNIF.html
links:dbpedia 650
triples 6967