Contribute a corpus/dataset

This page describes how to contribute a corpus to the European Language Grid. You can describe a corpus and upload its contents to ELG, or include in its description a link to the location from which it can be accessed.

0. Before you start

1. Prepare the content files (for ELG hosted resources)

If you wish to upload the corpus at ELG, you must package it in a compressed format (currently as a .zip file).

If the files are available in multiple formats (e.g. XML, TXT and PDF), you are advised to package them in separate zip files, one per data format.
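The packaging advice above can be sketched in a few lines of Python. This is only an illustration, not an ELG tool: the function name `package_by_format`, the `corpus_name` parameter and the archive naming pattern are assumptions made for the example, and it treats the file extension as the data format.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def package_by_format(source_dir, output_dir, corpus_name="corpus"):
    """Create one .zip archive per file extension found under source_dir.

    Hypothetical helper for illustration; the naming pattern
    "<corpus_name>_<format>.zip" is an assumption, not an ELG requirement.
    """
    source = Path(source_dir)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    # Group files by lower-cased extension, e.g. "xml", "txt", "pdf".
    groups = defaultdict(list)
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.suffix:
            groups[path.suffix.lower().lstrip(".")].append(path)

    archives = {}
    for fmt, paths in groups.items():
        archive_path = output / f"{corpus_name}_{fmt}.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                # Store paths relative to the corpus root to keep the folder layout.
                zf.write(path, arcname=path.relative_to(source).as_posix())
        archives[fmt] = archive_path
    return archives
```

Calling, for instance, `package_by_format("my_corpus", "packages", corpus_name="my_corpus")` would write one archive per format into `packages`. Keep the output folder outside the source folder so that archives from an earlier run are not packaged again.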

For recommendations on the criteria for organizing files into corpora, you can also see below.

2. Describe the corpus

Metadata overview

The corpus must be described according to the ELG schema and comply at least with the minimal version. The metadata elements that you need to provide for the corpus comprise a set of elements organized (for presentation purposes) into the following groups:

Examples

Example 1: Bilingual raw corpus

Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)

Published at: https://live.european-language-grid.eu/catalogue/corpus/2943

Example 2: Annotated corpus

Greek Textual Entailment corpus

Published at: https://live.european-language-grid.eu/catalogue/corpus/649

Recommendations for the description and organization of corpora

Corpora are composed of files that can be organized according to different criteria. For instance, a multilingual corpus of texts from various domains can be described as a whole (one metadata record) or split into subsets (and corresponding metadata records) using the language and/or domain criteria.

To help users, especially those accessing ELG through programmatic APIs, automatically identify, download and use corpora as they are, without having to download a whole corpus and manually search it for the subsets that interest them, we include some recommendations here.

Providers that upload their corpora into ELG can use the following recommendations to appropriately package the files and register them as one or multiple metadata records.

Providers that grant access to corpora through hyperlinks can decide whether to register one or multiple records based on whether each corpus or subset is available through a direct link (downloadLocation).

The following cases are recommended:

  • multilingual corpora: we recommend splitting them into bilingual pairs, so that users can easily find them and use them, for instance, to train bilingual models;

  • corpora of shared tasks: these are usually already split into training, development, gold, and test corpora, with a direct link to each of these datasets; we suggest following this established practice and registering them as separate metadata records.

In all cases, providers can create a “parent” metadata record, to which the metadata records of the subsets can point, using the isPartOf relation.
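As a rough sketch of how a multilingual corpus might be planned as a parent record plus bilingual subsets, consider the Python snippet below. The dictionary layout and the function `plan_bilingual_records` are hypothetical and do not reproduce the ELG schema; only the relation name `isPartOf` comes from this page. The sketch also assumes that every language pair exists in the corpus, which may not hold for a given resource.

```python
from itertools import combinations


def plan_bilingual_records(corpus_name, languages):
    """Return a parent record plus one subset record per language pair.

    The record dictionaries are illustrative only, not ELG schema instances.
    """
    codes = sorted(set(languages))
    parent = {"name": corpus_name, "languages": codes}
    subsets = [
        {
            "name": f"{corpus_name} ({first}-{second})",
            "languages": [first, second],
            # Each subset points back to the parent record.
            "isPartOf": corpus_name,
        }
        for first, second in combinations(codes, 2)
    ]
    return parent, subsets


parent, subsets = plan_bilingual_records("Example corpus", ["en", "bg", "el"])
for record in subsets:
    print(record["name"], "->", record["isPartOf"])
```

For three languages this yields three subset records, one per pair. If only some pairs are actually aligned in the corpus, the list of pairs should be supplied explicitly instead of generated.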

On the other hand, the concept of distribution (see ELG schema) can be used to describe resources with the same metadata record in the following cases:

  • corpora available in multiple formats: these can be described with the same metadata record, but different distributions;

  • corpora available with different licensing terms: if the same corpus is available with different licensing terms (e.g. free of charge for non-commercial use and for a fee for commercial use), it can be described with the same metadata record, with one distribution per set of licensing terms;

  • corpora available with multiple licences: composite resources (e.g. a corpus available via an interface, a tool available with a model) may be licensed with multiple licences, one for the data and one for the tool. These can be described with the same metadata record and a single distribution to which both licences are added.

3. Register the corpus at ELG

The current release of ELG offers two options for registering a catalogue item:

To upload the content files for the corpus, you can follow the procedure described here.

4. Manage and submit for publication

Through the “My items” page you can access your metadata record (see Manage your items) and edit it until you are satisfied. You can then submit it for publication, in line with the publication lifecycle defined for ELG metadata records.

Once submitted, the metadata record can no longer be edited and is visible only to you and to us, the ELG platform administrators.

Before it is published, your submission undergoes a validation process, which is described in detail in CHAPTER 4: VALIDATING ITEMS.

Once approved, it will appear in the ELG catalogue and you will receive a notification email.