Corpus
- class elg.corpus.Licence(name: str, urls: List[str], identifiers: List[dict])
Class to represent a licence
- class elg.corpus.Distribution(pk: int, corpus_id: int, domain: str, form: str, distribution_location: str, download_location: str, access_location: str, licence: Licence, cost: str, attribution_text: str, filename: str)
Class to represent a corpus distribution
- classmethod from_data(corpus_id: int, domain: str, data: dict)
Class method to initialize the distribution object from its metadata.
- Parameters
corpus_id (int) – id of the corpus the distribution is from.
domain (str) – ELG domain you want to use. “live” to use the public ELG, “dev” to use the development ELG and another value to use a local ELG.
data (dict) – metadata information of the distribution.
- Returns
the distribution object initialized.
- Return type
elg.Distribution
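The rule for the domain parameter described above can be sketched as a small function. This is only an illustration of the documented behaviour, not the SDK's own code: describe_domain is a hypothetical helper name, and the labels it returns are placeholders rather than real ELG addresses.

```python
def describe_domain(domain: str) -> str:
    """Return which kind of ELG instance a `domain` value selects.

    Hypothetical helper: it mirrors the documented rule only and does not
    call the elg SDK.
    """
    if domain == "live":
        return "public ELG"
    if domain == "dev":
        return "development ELG"
    # Any other value is interpreted as pointing to a local ELG instance.
    return f"local ELG ({domain})"


for value in ("live", "dev", "localhost:8080"):
    print(value, "->", describe_domain(value))
```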
- is_downloadable() → str
Method to check whether the distribution is downloadable.
- Returns
True if the distribution is downloadable, False otherwise.
- Return type
bool
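Because a corpus can have several distributions, is_downloadable() is useful for choosing which one to fetch. The sketch below assumes you already hold a sequence of distribution objects; how that sequence is obtained from a Corpus is not covered in this section. first_downloadable_index and FakeDistribution are hypothetical names introduced for illustration, and FakeDistribution only stands in for elg.corpus.Distribution so the snippet runs without the SDK.

```python
from typing import Iterable, Optional


def first_downloadable_index(distributions: Iterable) -> Optional[int]:
    """Return the index of the first downloadable distribution, or None."""
    for idx, distribution in enumerate(distributions):
        # Only the documented is_downloadable() method is relied upon here.
        if distribution.is_downloadable():
            return idx
    return None


class FakeDistribution:
    """Stand-in for a distribution object (assumption, not part of elg)."""

    def __init__(self, downloadable: bool) -> None:
        self._downloadable = downloadable

    def is_downloadable(self) -> bool:
        return self._downloadable


candidates = [FakeDistribution(False), FakeDistribution(True)]
print(first_downloadable_index(candidates))
```

The returned index could then be passed as the distribution_idx argument of Corpus.download, provided the sequence is in the same order the corpus uses.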
- class elg.corpus.Corpus(id: int, resource_name: str, resource_short_name: List[str], resource_type: str, entity_type: str, description: str, keywords: List[str], detail: str, licences: List[str], languages: List[str], country_of_registration: List[str], creation_date: str, last_date_updated: str, functional_service: bool, functions: List[str], intended_applications: List[str], views: int, downloads: int, size: int, service_execution_count: int, status: str, under_construction: bool, record: dict, auth_object: Authentication, auth_file: str, scope: str, domain: str, use_cache: bool, cache_dir: str, **kwargs)
Class to represent a corpus and to download ELG corpora.
Examples:
    from elg import Corpus

    # You can initialize a corpus from its id. You will be asked to authenticate on the ELG website.
    corpus = Corpus.from_id(913)

    # You can display the corpus information.
    print(corpus)

    # You can download the corpus. Note that only corpora hosted on ELG are downloadable using the Python SDK.
    corpus.download()

    # By default the corpus is downloaded to the current location and the filename is the name of the ELG corpus.
    # You can override this with the folder and filename parameters.
    corpus.download(filename="ELG_corpus", folder="/tmp/")

    # You can create a corpus from a catalog search result. First you need to search for a corpus using the catalog.
    # Let's search for a German corpus matching "ner".
    from elg import Catalog

    catalog = Catalog()
    results = catalog.search(
        resource="Corpus",
        languages=["German"],
        search="ner",
        limit=1,
    )
    corpus = Corpus.from_entity(results[0])
    print(corpus)
- classmethod from_id(id: int, auth_object: Optional[Authentication] = None, auth_file: Optional[str] = None, scope: Optional[str] = None, domain: Optional[str] = None, use_cache: bool = True, cache_dir: str = '~/.cache/elg')
Class method to initialize a Corpus from its id. You can provide authentication information through the auth_object or the auth_file parameters. If no authentication information is provided, the Authentication object will be initialized.
- Parameters
id (int) – id of the corpus.
auth_object (elg.Authentication, optional) – elg.Authentication object to use. Defaults to None.
auth_file (str, optional) – json file that contains the authentication tokens. Defaults to None.
scope (str, optional) – scope to use when requesting tokens. Can be set to “openid” or “offline_access”; use “offline_access” to get offline tokens. Defaults to “openid”.
domain (str, optional) – ELG domain you want to use. “live” to use the public ELG, “dev” to use the development ELG and another value to use a local ELG. Defaults to “live”.
use_cache (bool, optional) – True if you want to use cached files. Defaults to True.
cache_dir (str, optional) – path to the cache_dir. Set it to None to not store any cached files. Defaults to “~/.cache/elg”.
- Returns
the corpus object initialized.
- Return type
elg.Corpus
- classmethod from_entity(entity: Entity, auth_object: Optional[str] = None, auth_file: Optional[str] = None, scope: Optional[str] = None, use_cache: bool = True, cache_dir='~/.cache/elg')
Class method to initialize a Corpus from an Entity object. You can provide authentication information through the auth_object or the auth_file parameters. If no authentication information is provided, the Authentication object will be initialized.
- Parameters
entity (elg.Entity) – Entity object to init as a Corpus.
auth_object (elg.Authentication, optional) – elg.Authentication object to use. Defaults to None.
auth_file (str, optional) – json file that contains the authentication tokens. Defaults to None.
scope (str, optional) – scope to use when requesting tokens. Can be set to “openid” or “offline_access”; use “offline_access” to get offline tokens. Defaults to “openid”.
domain (str, optional) – ELG domain you want to use. “live” to use the public ELG, “dev” to use the development ELG and another value to use a local ELG. Defaults to “live”.
use_cache (bool, optional) – True if you want to use cached files. Defaults to True.
cache_dir (str, optional) – path to the cache_dir. Set it to None to not store any cached files. Defaults to “~/.cache/elg”.
- Returns
Corpus object with authentication information.
- Return type
elg.Corpus
- download(distribution_idx: int = 0, filename: str = None, folder: str = './')
Method to download the corpus if possible.
- Parameters
distribution_idx (int, optional) – Index of the distribution of the corpus to download. Defaults to 0.
filename (str, optional) – Name of the output file. If None, the name of the corpus will be used. Defaults to None.
folder (str, optional) – path to the folder in which to save the downloaded file. Defaults to “./”.
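The way filename and folder combine can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's implementation: resolve_output_path is a hypothetical helper, and the real download method may add a file extension or adjust the corpus name before writing.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(
    corpus_name: str,
    filename: Optional[str] = None,
    folder: str = "./",
) -> Path:
    """Combine folder and filename, falling back to the corpus name.

    Hypothetical helper that mirrors the documented defaults only.
    """
    # When no filename is given, the name of the corpus is used instead.
    name = filename if filename is not None else corpus_name
    return Path(folder) / name


# Default call: saved in the current folder under the corpus name.
print(resolve_output_path("My corpus"))
# Explicit filename and folder, as in the example call shown earlier.
print(resolve_output_path("My corpus", filename="ELG_corpus", folder="/tmp/"))
```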