Contribute a corpus/dataset

This page describes how to contribute a corpus to the European Language Grid.

To contribute a corpus to ELG, you must

  • create a metadata record for it, with at least the mandatory elements,

  • provide access to the physical data (aka content files), by uploading them to ELG or including an external link in the metadata.

Recommendations for the description and organization of corpora

Corpora are composed of files that can be organized according to different criteria. For instance, a multilingual corpus of texts from various domains can be described as a whole (one metadata record) or split into subsets (and corresponding metadata records) using the language and/or domain criteria.

To help users, especially those accessing ELG through programmatic APIs, automatically identify, download and use corpora as they are, without having to download whole collections and manually search them for the subsets of interest, we offer some recommendations here.

Providers that upload their corpora into ELG can use the following recommendations to appropriately package the files and register them as one or multiple metadata records.

Providers that grant access to corpora through hyperlinks can use the availability of a corpus through a direct link (downloadLocation) as the criterion for registering one or multiple records.

We recommend registering separate metadata records in the following cases:

  • multilingual corpora: we recommend splitting them into bilingual pairs, so that users can easily find them and use them, for instance, to train bilingual models;

  • corpora of shared tasks: these are usually already split into training, development, gold, and test sets, with a direct link to each of these datasets; we suggest following this established practice and registering them as separate metadata records.

In all cases, we suggest you create a parent metadata record, to which the metadata records of the subsets can point, using the isPartOf relation.
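The split into bilingual pairs with a shared parent record can be sketched as follows. This is only an illustration of the bookkeeping, not an ELG API: the function `plan_records` and the dictionary keys `name` and `languages` are hypothetical, and only the `isPartOf` relation is taken from the recommendation above.

```python
from itertools import combinations


def plan_records(corpus_name, languages):
    """Return a parent record plus one subset record per unordered language pair.

    The record layout is a simplified stand-in for real metadata records.
    """
    ordered = sorted(languages)
    parent = {"name": corpus_name, "languages": ordered}
    subsets = [
        {
            "name": f"{corpus_name} ({first}-{second})",
            "languages": [first, second],
            # Each subset points back to the parent record.
            "isPartOf": corpus_name,
        }
        for first, second in combinations(ordered, 2)
    ]
    return parent, subsets


parent, subsets = plan_records("Example Corpus", ["en", "de", "fr"])
for record in subsets:
    print(record["name"], "->", record["isPartOf"])
```

For three languages this yields three subset records, one per pair, each of which would be registered as its own metadata record alongside the parent.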

On the other hand, the concept of distribution (see ELG schema) can be used to describe resources with the same metadata record in the following cases:

  • corpora available in multiple formats: these can be described with the same metadata record, but different distributions;

  • corpora available with different licensing terms: if the same corpus is available under different licensing terms (e.g. free of charge for non-commercial use and for a fee for commercial use), each set of terms can be described as a separate distribution of the same metadata record.

Note

Composite resources (e.g. a corpus available via an interface, a tool available with a model) may carry multiple licences, for example one for the data and one for the tool. These can be described with the same metadata record and a single distribution to which both licences are added.

0. Before you start

  • Please make sure that the corpus you want to contribute complies with our terms of use.

  • Please make sure you have registered and been assigned the provider role.

  • Check out our recommendations for the description and organization of corpora into distinct metadata records.

1. Prepare the content files (for ELG hosted resources)

If you wish to upload the corpus to ELG, you must package it in a compressed format (currently a .zip, .tar, or .gz file).

Tip

If the files are available in multiple formats (e.g. in XML, TXT and PDF formats), you are advised to package them in separate zip files, one per data format, and describe them as distinct distributions.
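Packaging by data format can be automated with the Python standard library. The following is a minimal sketch, assuming that the file extension is a reliable indicator of the data format; the function name `package_by_format` and the archive naming scheme are illustrative choices, not ELG requirements.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def package_by_format(source_dir, output_dir):
    """Create one zip archive per file extension found under source_dir.

    Returns a dict mapping each extension (lowercased) to its archive path.
    """
    source = Path(source_dir).resolve()
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Collect the files first, grouped by extension.
    groups = defaultdict(list)
    for path in sorted(source.rglob("*")):
        if path.is_file():
            ext = path.suffix.lower().lstrip(".") or "noext"
            groups[ext].append(path)

    archives = {}
    for ext, paths in groups.items():
        archive = output / f"{source.name}_{ext}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                # Store paths relative to the corpus root inside the archive.
                zf.write(path, arcname=path.relative_to(source).as_posix())
        archives[ext] = archive
    return archives
```

Calling `package_by_format("my_corpus", "packages")` on a directory containing XML, TXT and PDF files would produce three archives, each of which can then be described as a distinct distribution. Keep the output directory outside the source directory so that archives from an earlier run are not packaged again.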

2. Describe and register the corpus at ELG

You can register the item (i.e., the metadata record and, optionally, the content files)

To upload the content files for the corpus, you can follow the procedure described here.

The following figure gives an overview of the metadata elements you must provide [1] for a corpus, replicating the editor (with sections horizontally and tabs vertically) so that you can easily track each element. In the editor, all elements, mandatory or not, are explained by definitions and examples.

Corpus at a glance

To describe any resource efficiently, you need to name it, provide a short description of it and indicate its version [2]. You must then provide one or more keywords for the resource, and an email address or a landing page for anyone who wishes to obtain additional information about it.

For corpora, you must also specify the corpus subclass (if it is raw or annotated, for example) and whether personal or sensitive data are included. If this is the case, you must say whether they have been anonymized.

You also have to describe each distributable form of the corpus independently (i.e. all the ways the user can obtain it, e.g. in a downloadable form, or accessed through a data service). For each distribution, you must always specify the licence under which it is made available. If you decide not to upload the content files to ELG, you must also include a link to the location from which the corpus can be accessed (download / access / distribution location [3]).
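The two rules for distributions (a licence is always required; a location is required when the files are not hosted at ELG) can be expressed as a simple pre-submission check. This is a sketch under stated assumptions: the function `check_distribution` and the dictionary keys (`licence`, `download_location`, `access_location`, `distribution_location`) are hypothetical simplifications and do not reproduce the actual ELG schema element names.

```python
# Hypothetical keys standing in for the three location elements.
LOCATION_FIELDS = ("download_location", "access_location", "distribution_location")


def check_distribution(dist, hosted_at_elg):
    """Return a list of problems found in one distribution description."""
    problems = []
    # A licence is required for every distribution.
    if not dist.get("licence"):
        problems.append("missing licence")
    # Externally hosted content needs at least one location link.
    if not hosted_at_elg and not any(dist.get(field) for field in LOCATION_FIELDS):
        problems.append("missing download/access/distribution location")
    return problems


distributions = [
    {"licence": "CC-BY-4.0", "download_location": "https://example.org/corpus.zip"},
    {"licence": "", "access_location": "https://example.org/corpus"},
    {"licence": "CC-BY-4.0"},
]
for index, dist in enumerate(distributions):
    print(index, check_distribution(dist, hosted_at_elg=False))
```

A check of this kind only catches omissions before submission; the ELG editor and the validation process remain the authoritative checks.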

Your corpus consists of one or more media parts (namely: text, audio, video, image or numerical text parts). Each of these parts must be described separately. For instance, if you have a corpus of video recordings and their subtitles in various languages, you must provide separately information on the language(s) of each part, and, if multilingual, multilinguality type, as well as their respective distribution features (size and data format at least). The figure shows the mandatory and mandatory if applicable elements for each type of part and distribution feature group.

3. Manage and submit for publication

Through the My items page you can access your metadata record (see Manage your items) and edit it until you are satisfied. You can then submit it for publication, in line with the publication lifecycle defined for ELG metadata records.

At this stage, the metadata record can no longer be edited and is only visible to you and to us, the ELG technical team.

Before it is published, your submission undergoes a validation process, which is described in detail at CHAPTER 4: VALIDATING ITEMS.

Once approved, it will appear on the ELG catalogue and you will receive a notification email.

[1]

You must fill in at least the mandatory elements for the metadata record to be saved. In addition, you may be required to fill in specific mandatory if applicable elements (indicated in the figure with an asterisk), depending on the values you provide for other elements.

[2]

If no version number is provided, the system will automatically number it as “1.0.0” with an indication that it has been automatically assigned. We recommend, however, the use of Semantic Versioning (https://semver.org/) for labelling versions.

[3]

The three elements differ in terms of the actions that a consumer has to undertake in order to access the resource. Use download location to provide a direct link to the content files; no further action is required of the user, who can simply download the file(s). Access location is typically a page with some text, which includes a button or a link for accessing or downloading the resource itself; the user must read through the text on the page in order to find the link. Finally, distribution location is reserved for distributable forms on physical media, such as CD-ROMs or hard disks, for which the user must engage in a transaction with the provider to gain access to the resource.