Provide a Language Resource

You can register data resources on the ELG platform, such as corpora (raw and annotated), computational lexica, terminological glossaries, models, and computational grammars. For more information on the resource types, see Overview.

Corpora are structured collections of data selected according to specific criteria so that they cover a research question as comprehensively as possible. The most common cases are:

  • text corpora: monolingual, bilingual or multilingual collections of texts in a specific domain, such as corpora of news articles, scientific publications, legal documents, medical records, tweets, etc.
  • corpora of audio recordings, e.g., of broadcast news, or lists of sentences recorded by individuals from a specific region with a dialect accent, etc.
  • collections of videos, such as interviews with politicians, sign language corpora, etc.
  • corpora combining all of the above, such as a multimedia corpus of video lectures, with their audio recordings, transcripts, subtitles and their translations.

Language descriptions comprise:

  • models, including Machine Learning models, statistical models, word embeddings, and n-gram models;
  • computational grammars of a language, a language variety, or a specific domain or phenomenon.

The vast majority of these consist of a text part, but videos and images are also foreseen for cases such as sign language grammars.

Examples of lexical/conceptual resources include:

  • computational lexica, which are used for computational processing and include morphological, syntactic and semantic information;
  • dictionaries in digital format;
  • ontologies and controlled vocabularies;
  • monolingual and multilingual terminological glossaries;
  • word lists, gazetteers of place names, proper names, etc.

They typically consist of a text part, but they may also comprise audio and video files, as in the case of:

  • multimedia lexica with sound recordings (e.g., pronunciation of a word) and images (e.g., pictures denoting the sense of a word),
  • sign language lexica with videos.

Technical requirements

All data resources must be provided as .zip, .tar or .tar.gz archives.
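
As an illustration of the archive requirement, the following minimal sketch packs a local directory into a .tar.gz archive using only the Python standard library. The function names (`is_accepted_archive`, `pack_directory`) and the example paths are hypothetical and not part of any ELG tooling; the suffix check looks only at the file name, not at the archive contents.

```python
import tarfile
from pathlib import Path

# Archive suffixes named in the technical requirements.
ACCEPTED_SUFFIXES = (".zip", ".tar", ".tar.gz")


def is_accepted_archive(filename: str) -> bool:
    """Return True if the file name ends with one of the accepted suffixes."""
    return filename.lower().endswith(ACCEPTED_SUFFIXES)


def pack_directory(source_dir: str, archive_path: str) -> Path:
    """Pack source_dir into a gzip-compressed tar archive at archive_path."""
    source = Path(source_dir)
    target = Path(archive_path)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    if not target.name.lower().endswith(".tar.gz"):
        raise ValueError("archive_path must end with .tar.gz")
    with tarfile.open(target, "w:gz") as archive:
        # arcname keeps the paths inside the archive relative to the
        # directory name instead of the full local path.
        archive.add(source, arcname=source.name)
    return target
```

For example, `pack_directory("my_corpus", "my_corpus.tar.gz")` would produce an archive whose entries all start with `my_corpus/`. Write the archive outside the directory being packed so that it does not end up containing itself.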

Describe a Language Resource

To register your resource at ELG, you must describe it according to the ELG metadata schema (at least the minimal version); that is, you must provide a metadata record and upload it to the platform.

Note

For this release, you MUST provide an ELG-compliant XML file. Upcoming releases will also include a metadata editor and other functionalities supporting an easy import of metadata records.

The full schema XSD, its documentation, and templates and examples of metadata records for all resource types are available here; some examples of already registered data resources are available here.
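
Before uploading, it can be useful to confirm locally that the metadata file is at least well-formed XML. The sketch below uses only the Python standard library; the function name `check_well_formed` is hypothetical. It does not validate against the ELG schema (that requires the XSD and a schema-aware validator); it only catches syntax errors such as unclosed or mismatched tags.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def check_well_formed(xml_path: str) -> Tuple[bool, str]:
    """Parse the file and report whether it is well-formed XML.

    This is NOT a schema validation: a file can be well-formed
    and still be rejected by the platform.
    """
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError as err:
        # ParseError carries the line and column of the first syntax error.
        return False, f"not well-formed: {err}"
    return True, f"well-formed, root element: {root.tag}"
```

A call such as `check_well_formed("my_record.xml")` returns a pair of a boolean and a short message; the file name here is only a placeholder.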

Examples of metadata records and a list of the metadata elements of the minimal version are given in separate sections.

Register a language resource on the platform

Follow these steps:

  • Provide a metadata record: Sign in to the ELG platform using your credentials and press the “upload” button on the main menu.

Then upload the XML file that contains the metadata record. In the current release, this is the only way to provide it.


On import, the metadata record is validated against the metadata schema, together with additional rules that check syntactic and partial semantic integrity. If the file is invalid, you will see a message listing the errors; correct them and re-upload the file. If it is valid, you will see a success message and the record will be imported into the database. At this stage, it is visible only to the platform administrators.

  • Resource is checked: The administrator assigns the resource to a reviewer; during the review process, the metadata record is visible only to you (the LR provider) and the reviewer.
  • LR is published: When the LR has been checked, the reviewer will approve it; the metadata record is then published and visible to all ELG users through the catalogue.
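
The steps above can be read as a small lifecycle with a different audience at each stage. The following sketch models it in Python purely for illustration; the status names, role names and functions are hypothetical and do not correspond to identifiers used by the ELG platform.

```python
from enum import Enum


class Status(Enum):
    IMPORTED = "imported"          # valid record stored in the database
    UNDER_REVIEW = "under review"  # assigned to a reviewer
    PUBLISHED = "published"        # approved and listed in the catalogue


# Who can see the metadata record at each stage, following the steps above.
VISIBLE_TO = {
    Status.IMPORTED: {"administrator"},
    Status.UNDER_REVIEW: {"provider", "reviewer"},
    Status.PUBLISHED: {"administrator", "provider", "reviewer", "user"},
}

# Allowed transitions: import -> review -> publication.
NEXT_STATUS = {
    Status.IMPORTED: Status.UNDER_REVIEW,
    Status.UNDER_REVIEW: Status.PUBLISHED,
}


def advance(status: Status) -> Status:
    """Move a record to the next stage; a published record has none."""
    if status not in NEXT_STATUS:
        raise ValueError(f"no stage follows {status.value!r}")
    return NEXT_STATUS[status]


def can_view(role: str, status: Status) -> bool:
    """Return True if the given role can see a record in this stage."""
    return role in VISIBLE_TO[status]
```

For instance, `can_view("user", Status.UNDER_REVIEW)` is false in this model, reflecting that a record under review is visible only to the provider and the reviewer.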