Search for a Dataset - the Datahub


DBpedia abstract corpus

This corpus contains a conversion of Wikipedia abstracts in six languages (Dutch, English, French, German, Italian and Spanish) into the NLP Interchange Format (NIF)....
- GZ
- text/turtle
Ontos News Portal

The Ontos News Portal extracts facts (objects such as persons or organizations, as well as relations between them, e.g. a person is working for an organization or living at a...
- text/turtle
- RDF
JRC-Names-MLODE

From their web site: JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and...
- gzip
- text/turtle
- gz:nt
- api/sparql
- example/turtle
CopyrightTermBank

Terminology on copyright and related concepts
KORE 50 NIF NER Corpus

KORE 50[1] (AIDA) is a subset of the larger AIDA corpus, which is based on the dataset of the CoNLL 2003 NER task. The dataset aims to capture hard-to-disambiguate mentions of...
- text/turtle
- PDF
LemonWiktionary

Lemon data extracted from Wiktionary
Brown Corpus in RDF/NIF

RDF version of the Brown Corpus (W. N. Francis, H. Kucera; Brown University; 1979). 1,014,312 words in 500 documents, taken from newspaper texts on diverse topics, non-fiction...
- text/turtle
- example/turtle
Multext-East

From the web site: Version 4 of the MULTEXT-East resources, a multilingual dataset for language engineering research and development. This dataset contains, for Bulgarian,...
- text/turtle
SentimentWortschatz

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity...
News-100 NIF NER Corpus

This corpus comprises 100 German news articles from the online news platform news.de. All of the articles were published in 2010 and contain the word Golf. This word...
- text/turtle
- PDF
RSS-500 NIF NER Corpus

This corpus has been created using a dataset comprising a list of 1,457 RSS feeds as compiled in (Goldhahn et al. 2012). The list includes all major worldwide newspapers and a...
- text/turtle
- PDF
DBpedia Spotlight NIF NER Corpus

Based on P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proc. of the 7th Int. Conf. on Semantic Systems,...
- text/turtle
- PDF
Reuters-128 NIF NER Corpus

This English corpus is based on the well-known Reuters-21578 corpus, which contains economic news articles. In particular, we chose 128 articles containing at least one NE....
- text/turtle
- PDF
PanLex

A lexical database documenting translations among lexemes of language varieties.
Chat Game corpus

A corpus resulting from an object arrangement game using a computer-mediated setting.
- text/turtle
MExiCo

MExiCo (short for "Multimodal Experiment Corpora") is a data model for data collections containing multimodal linguistic and interaction annotations.
- text/turtle
- example/turtle
FiESTA

FiESTA (short for "Format for extensive spatiotemporal annotations") is a generic format for linguistic and behavioral annotations.
- text/turtle

You can also access this registry using the API (see API Docs).
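
As a rough illustration of what API access to a listing like this might look like, here is a minimal sketch. It assumes the registry exposes a CKAN-style action API with a `package_search` endpoint; that is an assumption, not something stated on this page, so check the API Docs for the actual routes. `BASE_URL` is a placeholder, and the helper names `build_search_url` and `summarize` are made up for this example. The sample response is hand-written from two entries in the listing above rather than fetched from a live server.

```python
import json
from urllib.parse import urlencode

# Assumption: a CKAN-style action API. BASE_URL is a placeholder host,
# not the registry's real address.
BASE_URL = "https://example.org"


def build_search_url(query, rows=20, base_url=BASE_URL):
    """Return a CKAN-style package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def summarize(response_text):
    """Map each dataset title to the sorted list of its resource formats."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("search request was not successful")
    summary = {}
    for dataset in payload["result"]["results"]:
        formats = {
            resource["format"]
            for resource in dataset.get("resources", [])
            if resource.get("format")
        }
        summary[dataset["title"]] = sorted(formats)
    return summary


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Chat Game corpus",
             "resources": [{"format": "text/turtle"}]},
            {"title": "MExiCo",
             "resources": [{"format": "text/turtle"},
                           {"format": "example/turtle"}]},
        ],
    },
})

if __name__ == "__main__":
    print(build_search_url("nif corpus", rows=5))
    print(summarize(SAMPLE))
```

To query a real server, the text passed to `summarize` would be the body of an HTTP GET on the URL returned by `build_search_url`, for instance via `urllib.request.urlopen`.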

17 datasets found