Interact with the corpora
Download corpora hosted on the ELG directly from Python.
[1]:
from elg import Corpus
You can initialize a corpus from its id. You will be asked to authenticate on the ELG website.
[2]:
corpus = Corpus.from_id(913)
You can display the corpus information.
[3]:
print(corpus)
----------------------------------------------------------------------
Id 913
Name 2006 CoNLL Shared Task - Ten Languages
Resource type Corpus
Entity type LanguageResource
Description 2006 CoNLL Shared Task - Ten Languages consists of
dependency treebanks in ten languages used as part of
the CoNLL 2006 shared task on multi-lingual dependency
parsing. The languages covered in this release are:
Bulgarian, Danish, Dutch, German, Japanese, Portuguese,
Slovene, Spanish, Swedish and Turkish. The Conference
on Computational Natural Language Learning (CoNLL) is
accompanied every year by a shared task intended to
promote natural language processing applications and
evaluate them in a standard setting. In 2006, the
shared task was devoted to the parsing of syntactic
dependencies using corpora from up to thirteen
languages. The task aimed to define and extend the
then-current state of the art in dependency parsing, a
technology that complemented previous tasks by
producing a different kind of syntactic description of
input text. More information about CoNLL and the 2006
shared task are available respectively at:
http://ifarm.nl/signll/conll and
http://ilk.uvt.nl/conll. The source data in the
treebanks in this release consists principally of
various texts (e.g., textbooks, news, literature)
annotated in dependency format. In general, dependency
grammar is based on the idea that the verb is the
center of the clause structure and that other units in
the sentence are connected to the verb as directed
links or dependencies. This is a one-to-one
correspondence: for every element in the sentence there
is one node in the sentence structure that corresponds
to that element. In constituency or phrase structure
grammars, on the other hand, clauses are divided into
noun phrases and verb phrases and in each sentence, one
or more nodes may correspond to one element. All of the
data sets in this release are dependency treebanks.
The individual data sets are: BulTreeBank (Bulgarian)
The Danish Dependency Treebank (Danish) The Alpino
Treebank (Dutch) The TIGER Corpus (German) Treebank
Tuba-J/S (Japanese) Floresta Sinta(c)tica (Portuguese)
Slovene Dependency Treebank, SDT V0.1 (Slovene) Cast3LB
(Spanish) Talbanken05 (Swedish) METU-Sabanci Turkish
Treebank (Turkish) This corpus is distributed jointly
with LDC. LDC Catalogue Reference is:
https://catalog.ldc.upenn.edu/LDC2015T11.
Licences ['ELRA-END-USER-ACADEMIC-MEMBER-NONCOMMERCIALUSE-1.0',
'ELRA-END-USER-COMMERCIAL-NOMEMBER-
NONCOMMERCIALUSE-1.0', 'ELRA-END-USER-ACADEMIC-
NOMEMBER-NONCOMMERCIALUSE-1.0', 'ELRA-END-USER-
COMMERCIAL-MEMBER-NONCOMMERCIALUSE-1.0']
Languages ['Slovenian', 'Portuguese', 'Japanese', 'German',
'Dutch', 'Danish', 'Bulgarian', 'Turkish', 'Swedish',
'Spanish']
Status p
----------------------------------------------------------------------
You can download the corpus. Note that only corpora hosted on ELG can be downloaded with the Python SDK.
[4]:
corpus.download()
Warning: The refresh token will expire in -2839.0 seconds!
Downloading:
[913] 2006 CoNLL Shared Task - Ten Languages
Please, visit the licence of this corpus distribution by clicking: https://live.european-language-grid.eu/catalogue_backend/static/project/licences/ELG-ENT-LIC-050320-00000769.pdf
Do you accept the licence terms: (yes/[no]): yes
Downloading the corpus distribution to 2006_CoNLL_Shared_Task_Ten_Languages.zip:
100%|██████████| 19.0M/19.0M [00:02<00:00, 6.95MiB/s]
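The log above shows that the distribution is saved as a .zip file, so its contents can be listed with Python's standard zipfile module. The following is a minimal sketch, assuming the archive sits in the current directory under the default name from the log. The helper name list_archive is illustrative and is not part of the elg SDK.

```python
import zipfile
from pathlib import Path


def list_archive(path):
    """Return the names of all members of the zip archive at `path`."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Default filename taken from the download log above.
archive_path = Path("2006_CoNLL_Shared_Task_Ten_Languages.zip")
if archive_path.exists():
    # Print only the first few entries to keep the output short.
    for name in list_archive(archive_path)[:10]:
        print(name)
```

If the archive is not present, the sketch prints nothing instead of raising an error.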
By default, the corpus is downloaded to the current directory and the filename is derived from the name of the ELG corpus. You can override both with the folder and filename parameters.
[5]:
corpus.download(filename="ELG_corpus", folder="/tmp/")
Downloading:
[913] 2006 CoNLL Shared Task - Ten Languages
Please, visit the licence of this corpus distribution by clicking: https://live.european-language-grid.eu/catalogue_backend/static/project/licences/ELG-ENT-LIC-050320-00000769.pdf
Do you accept the licence terms: (yes/[no]): yes
Downloading the corpus distribution to /tmp/ELG_corpus.zip:
100%|██████████| 19.0M/19.0M [00:02<00:00, 6.52MiB/s]
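Each call to download() prompts for the licence and fetches the archive again, so it can be convenient to skip the call when the file already exists. The following is a minimal sketch, assuming that the SDK appends ".zip" to the filename argument, as the log above suggests. The helper download_if_missing is hypothetical and is not part of the elg SDK. It relies only on the download(filename=..., folder=...) call shown above.

```python
from pathlib import Path


def download_if_missing(corpus, filename, folder):
    """Call corpus.download() only when the target archive is absent.

    Assumption: the SDK saves the archive as <folder>/<filename>.zip,
    as the download log above suggests. Returns the expected path.
    """
    target = Path(folder) / f"{filename}.zip"
    if not target.exists():
        corpus.download(filename=filename, folder=folder)
    return target
```

With the corpus object from the cells above, download_if_missing(corpus, "ELG_corpus", "/tmp/") would return the path /tmp/ELG_corpus.zip and trigger a download only on the first call.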
You can create a corpus from a catalog search result. First you need to search for a corpus using the catalog. Let’s search for a German NER corpus.
[6]:
from elg import Catalog
catalog = Catalog()
results = catalog.search(resource="Corpus", languages=["German"], search="ner", limit=1)
corpus = Corpus.from_entity(next(results))
print(corpus)
----------------------------------------------------------------------
Id 5010
Name GermEval 2014 NER Shared Task
Resource type Corpus
Entity type LanguageResource
Description The data was sampled from German Wikipedia and News
Corpora as a collection of citations.The dataset covers
over 31,000 sentences corresponding to over 590,000
tokens.
Licences ['Creative Commons Attribution 4.0 International']
Languages ['German']
Status None
----------------------------------------------------------------------
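The call next(results) above raises StopIteration when the search returns nothing. The following is a minimal sketch of a safer pattern, assuming that the search result is an iterator, as the use of next() suggests. The helper first_or_none is illustrative and is not part of the elg SDK, and an empty iterator stands in for catalog.search(...) so that the sketch runs without network access.

```python
def first_or_none(results):
    """Return the first item of an iterable, or None if it is empty."""
    return next(iter(results), None)


# Stand-in for catalog.search(...); replace with the real search results.
empty_results = iter([])

entity = first_or_none(empty_results)
if entity is None:
    print("No corpus matched the query.")
```

When an entity is found, it can be passed to Corpus.from_entity as in the cell above.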