Interact with the corpora

Download corpora hosted on the ELG directly in Python.

[1]:
from elg import Corpus

You can initialize a corpus from its ID. You will be asked to authenticate on the ELG website.

[2]:
corpus = Corpus.from_id(913)

You can display the corpus information.

[3]:
print(corpus)
----------------------------------------------------------------------
Id             913
Name           2006 CoNLL Shared Task - Ten Languages
Resource type  Corpus
Entity type    LanguageResource
Description    2006 CoNLL Shared Task - Ten Languages consists of
               dependency treebanks in ten languages used as part of
               the CoNLL 2006 shared task on multi-lingual dependency
               parsing. The languages covered in this release are:
               Bulgarian, Danish, Dutch, German, Japanese, Portuguese,
               Slovene, Spanish, Swedish and Turkish.  The Conference
               on Computational Natural Language Learning (CoNLL) is
               accompanied every year by a shared task intended to
               promote natural language processing applications and
               evaluate them in a standard setting. In 2006, the
               shared task was devoted to the parsing of syntactic
               dependencies using corpora from up to thirteen
               languages. The task aimed to define and extend the
               then-current state of the art in dependency parsing, a
               technology that complemented previous tasks by
               producing a different kind of syntactic description of
               input text. More information about CoNLL and the 2006
               shared task are available respectively at:
               http://ifarm.nl/signll/conll and
               http://ilk.uvt.nl/conll.   The source data in the
               treebanks in this release consists principally of
               various texts (e.g., textbooks, news, literature)
               annotated in dependency format. In general, dependency
               grammar is based on the idea that the verb is the
               center of the clause structure and that other units in
               the sentence are connected to the verb as directed
               links or dependencies. This is a one-to-one
               correspondence: for every element in the sentence there
               is one node in the sentence structure that corresponds
               to that element. In constituency or phrase structure
               grammars, on the other hand, clauses are divided into
               noun phrases and verb phrases and in each sentence, one
               or more nodes may correspond to one element. All of the
               data sets in this release are dependency treebanks.
               The individual data sets are: BulTreeBank (Bulgarian)
               The Danish Dependency Treebank (Danish) The Alpino
               Treebank (Dutch) The TIGER Corpus (German) Treebank
               Tuba-J/S (Japanese) Floresta Sinta(c)tica (Portuguese)
               Slovene Dependency Treebank, SDT V0.1 (Slovene) Cast3LB
               (Spanish) Talbanken05 (Swedish) METU-Sabanci Turkish
               Treebank (Turkish)  This corpus is distributed jointly
               with LDC. LDC Catalogue Reference is:
               https://catalog.ldc.upenn.edu/LDC2015T11.
Licences       ['ELRA-END-USER-ACADEMIC-MEMBER-NONCOMMERCIALUSE-1.0',
               'ELRA-END-USER-COMMERCIAL-NOMEMBER-
               NONCOMMERCIALUSE-1.0', 'ELRA-END-USER-ACADEMIC-
               NOMEMBER-NONCOMMERCIALUSE-1.0', 'ELRA-END-USER-
               COMMERCIAL-MEMBER-NONCOMMERCIALUSE-1.0']
Languages      ['Slovenian', 'Portuguese', 'Japanese', 'German',
               'Dutch', 'Danish', 'Bulgarian', 'Turkish', 'Swedish',
               'Spanish']
Status         p
----------------------------------------------------------------------

You can download the corpus. Note that only corpora hosted on ELG are downloadable using the Python SDK.

[4]:
corpus.download()
Downloading:
        [913] 2006 CoNLL Shared Task - Ten Languages

Please, visit the licence of this corpus distribution by clicking: https://live.european-language-grid.eu/catalogue_backend/static/project/licences/ELG-ENT-LIC-050320-00000769.pdf

Do you accept the licence terms: (yes/[no]): yes

Downloading the corpus distribution to 2006_CoNLL_Shared_Task_Ten_Languages.zip:
100%|██████████| 19.0M/19.0M [00:02<00:00, 6.95MiB/s]

By default the corpus is downloaded to the current directory and the filename is the name of the ELG corpus. You can override this with the folder and filename parameters.

[5]:
corpus.download(filename="ELG_corpus", folder="/tmp/")
Downloading:
        [913] 2006 CoNLL Shared Task - Ten Languages

Please, visit the licence of this corpus distribution by clicking: https://live.european-language-grid.eu/catalogue_backend/static/project/licences/ELG-ENT-LIC-050320-00000769.pdf

Do you accept the licence terms: (yes/[no]): yes

Downloading the corpus distribution to /tmp/ELG_corpus.zip:
100%|██████████| 19.0M/19.0M [00:02<00:00, 6.52MiB/s]
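
The downloaded archive is a regular zip file, so it can be inspected and extracted with Python's standard zipfile module. A minimal sketch follows; it builds a tiny stand-in archive so the snippet is self-contained, but in practice you would point archive at the file produced by corpus.download(), e.g. /tmp/ELG_corpus.zip (the member path used here is purely illustrative):

```python
import tempfile
import zipfile
from pathlib import Path

# Build a throwaway archive standing in for the downloaded corpus.
# Replace `archive` with the path returned by your corpus.download() call.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "ELG_corpus.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("bulgarian/train.conll", "1\tword\t_\n")

# List the members, then extract everything next to the archive.
with zipfile.ZipFile(archive) as zf:
    members = zf.namelist()
    zf.extractall(workdir / "ELG_corpus")

print(members)
```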

You can create a corpus from a catalog search result. First you need to search for a resource using the catalog. Let’s search for a German NER corpus.

[6]:
from elg import Catalog

catalog = Catalog()
results = catalog.search(resource="Corpus", languages=["German"], search="ner", limit=1)

corpus = Corpus.from_entity(next(results))
print(corpus)
----------------------------------------------------------------------
Id             5010
Name           GermEval 2014 NER Shared Task
Resource type  Corpus
Entity type    LanguageResource
Description    The data was sampled from German Wikipedia and News
               Corpora as a collection of citations. The dataset
               covers over 31,000 sentences corresponding to over
               590,000 tokens.
Licences       ['Creative Commons Attribution 4.0 International']
Languages      ['German']
Status         None
----------------------------------------------------------------------