Corpus

class elg.corpus.Licence(name: str, urls: List[str], identifiers: List[dict])

Class to represent a licence

class elg.corpus.Distribution(pk: int, corpus_id: int, domain: str, form: str, distribution_location: str, download_location: str, access_location: str, licence: Licence, cost: str, attribution_text: str, filename: str)

Class to represent a corpus distribution

classmethod from_data(corpus_id: int, domain: str, data: dict)

Class method to init the distribution object from the metadata information.

Parameters
  • corpus_id (int) – id of the corpus the distribution is from.

  • domain (str) – ELG domain you want to use. Use “live” for the public ELG, “dev” for the development ELG, or any other value for a local ELG.

  • data (dict) – metadata information of the distribution.

Returns

The initialized distribution object.

Return type

elg.Distribution
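The domain parameter described above can be pictured as a small lookup with a pass-through fallback. The following is only a sketch: resolve_domain and ASSUMED_DOMAINS are hypothetical names, and the two base URLs are assumptions that should be checked against the SDK source before being relied on.

```python
# Assumed base URLs for the two named domains (not confirmed by this page).
ASSUMED_DOMAINS = {
    "live": "https://live.european-language-grid.eu",
    "dev": "https://dev.european-language-grid.eu",
}


def resolve_domain(domain: str = "live") -> str:
    """Hypothetical helper: map "live"/"dev" to a base URL.

    Any other value is returned unchanged, which models the
    "another value to use a local ELG" case.
    """
    return ASSUMED_DOMAINS.get(domain, domain)
```

With this sketch, resolve_domain("http://localhost:8080") returns the local address untouched, while "live" and "dev" are replaced by the assumed public and development URLs.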

is_downloadable() → bool

Method to check whether the distribution is downloadable.

Returns

True if the distribution is downloadable, False otherwise.

Return type

bool
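Because a corpus can have several distributions and only some may be downloadable, it can be useful to find the index of the first downloadable one before requesting a download. A minimal sketch, assuming a hypothetical stand-in class (FakeDistribution) in place of the real elg.corpus.Distribution so that it runs without network access; only the is_downloadable() method name is taken from the documentation above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeDistribution:
    """Hypothetical stand-in for elg.corpus.Distribution (not the real class)."""

    filename: str
    downloadable: bool

    def is_downloadable(self) -> bool:
        return self.downloadable


def first_downloadable_index(
    distributions: Sequence[FakeDistribution],
) -> Optional[int]:
    """Return the index of the first downloadable distribution, or None."""
    for idx, dist in enumerate(distributions):
        if dist.is_downloadable():
            return idx
    return None
```

The returned index is the kind of value one would pass as distribution_idx when downloading; a result of None signals that nothing can be downloaded through the SDK.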

class elg.corpus.Corpus(id: int, resource_name: str, resource_short_name: List[str], resource_type: str, entity_type: str, description: str, keywords: List[str], detail: str, licences: List[str], languages: List[str], country_of_registration: List[str], creation_date: str, last_date_updated: str, functional_service: bool, functions: List[str], intended_applications: List[str], views: int, downloads: int, size: int, service_execution_count: int, status: str, under_construction: bool, record: dict, auth_object: Authentication, auth_file: str, scope: str, domain: str, use_cache: bool, cache_dir: str, **kwargs)

Class to represent a corpus. Use it to download ELG corpora.

Examples:

from elg import Corpus

# You can initialize a corpus from its id. You will be asked to authenticate on the ELG website.
corpus = Corpus.from_id(913)

# You can display the corpus information.
print(corpus)

# You can download the corpus. Note that only corpora hosted on ELG are downloadable using the Python SDK.
corpus.download()

# By default, the corpus is downloaded to the current directory and the filename is the name of the ELG corpus.
# You can override this with the folder and filename parameters.
corpus.download(filename="ELG_corpus", folder="/tmp/")

# You can create a corpus from a catalog search result. First you need to search for a corpus using the catalog.
# Let's search for a German NER corpus.
from elg import Catalog

catalog = Catalog()
results = catalog.search(
    resource = "Corpus",
    languages = ["German"],
    search="ner",
    limit = 1,
)

corpus = Corpus.from_entity(results[0])
print(corpus)

classmethod from_id(id: int, auth_object: Optional[Authentication] = None, auth_file: Optional[str] = None, scope: Optional[str] = None, domain: Optional[str] = None, use_cache: bool = True, cache_dir: str = '~/.cache/elg')

Class method to init a Corpus class from its id. You can provide authentication information through the auth_object or the auth_file parameters. If no authentication information is provided, the Authentication object will be initialized.

Parameters
  • id (int) – id of the corpus.

  • auth_object (elg.Authentication, optional) – elg.Authentication object to use. Defaults to None.

  • auth_file (str, optional) – json file that contains the authentication tokens. Defaults to None.

  • scope (str, optional) – scope to use when requesting tokens. Can be set to “openid”, or to “offline_access” to get offline tokens. Defaults to “openid”.

  • domain (str, optional) – ELG domain you want to use. Use “live” for the public ELG, “dev” for the development ELG, or any other value for a local ELG. Defaults to “live”.

  • use_cache (bool, optional) – True if you want to use cached files. Defaults to True.

  • cache_dir (str, optional) – path to the cache directory. Set it to None to avoid storing any cached files. Defaults to “~/.cache/elg”.

Returns

The initialized corpus object.

Return type

elg.Corpus
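The cache_dir parameter has two documented behaviours: a path (by default “~/.cache/elg”) or None to skip caching. A minimal sketch of how such a value could be normalized, assuming a hypothetical helper named resolve_cache_dir; the tilde expansion is an assumption about how the default is interpreted, not a statement about the SDK's internals.

```python
import os
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str] = "~/.cache/elg") -> Optional[str]:
    """Hypothetical helper: normalize a cache_dir argument.

    None means "do not store cached files" and is passed through.
    Otherwise "~" is expanded and the path is made absolute
    (assumed behaviour, not confirmed by the documentation).
    """
    if cache_dir is None:
        return None
    return os.path.abspath(os.path.expanduser(cache_dir))
```

Returning None unchanged keeps the "no cache" case explicit for the caller instead of silently falling back to the default location.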

classmethod from_entity(entity: Entity, auth_object: Optional[str] = None, auth_file: Optional[str] = None, scope: Optional[str] = None, use_cache: bool = True, cache_dir='~/.cache/elg')

Class method to init a Corpus class from an Entity object. You can provide authentication information through the auth_object or the auth_file parameters. If no authentication information is provided, the Authentication object will be initialized.

Parameters
  • entity (elg.Entity) – Entity object to init as a Corpus.

  • auth_object (elg.Authentication, optional) – elg.Authentication object to use. Defaults to None.

  • auth_file (str, optional) – json file that contains the authentication tokens. Defaults to None.

  • scope (str, optional) – scope to use when requesting tokens. Can be set to “openid”, or to “offline_access” to get offline tokens. Defaults to “openid”.

  • domain (str, optional) – ELG domain you want to use. Use “live” for the public ELG, “dev” for the development ELG, or any other value for a local ELG. Defaults to “live”.

  • use_cache (bool, optional) – True if you want to use cached files. Defaults to True.

  • cache_dir (str, optional) – path to the cache directory. Set it to None to avoid storing any cached files. Defaults to “~/.cache/elg”.

Returns

Corpus object with authentication information.

Return type

elg.Corpus

download(distribution_idx: int = 0, filename: str = None, folder: str = './')

Method to download the corpus if possible.

Parameters
  • distribution_idx (int, optional) – Index of the distribution of the corpus to download. Defaults to 0.

  • filename (str, optional) – Name of the output file. If None, the name of the corpus will be used. Defaults to None.

  • folder (str, optional) – Path to the folder in which to save the downloaded file. Defaults to “./”.
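The interaction between the filename and folder defaults can be illustrated with a short sketch. The helper target_path below is hypothetical and only mirrors the documented defaults (corpus name as filename, “./” as folder); the real method may also append a file extension or sanitize the name, which is not modeled here.

```python
from pathlib import Path
from typing import Optional


def target_path(
    corpus_name: str,
    filename: Optional[str] = None,
    folder: str = "./",
) -> Path:
    """Hypothetical helper: compute where a downloaded corpus would be saved.

    If filename is None, the corpus name is used, as documented above.
    """
    name = filename if filename is not None else corpus_name
    return Path(folder) / name
```

For example, target_path("My corpus") points inside the current directory, while target_path("My corpus", filename="ELG_corpus", folder="/tmp/") points to /tmp/ELG_corpus.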