Contribute a corpus/dataset

This page describes how to contribute a corpus to the European Language Grid.

Before you start

  • Please make sure that the corpus you want to contribute complies with our terms of use.

  • Please make sure you have registered and been assigned the provider role.

  • For this release, you can provide the data files of the corpus in one of two ways:

    • host them at a remote URL and include that URL in the relevant metadata element (accessLocation), or
    • if you want us to upload them to the ELG cloud area, contact us through the ELG contact form.
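
For the remote URL option, the address goes into the accessLocation element of the distribution. The fragment below is a sketch trimmed from Example 1 further down this page. The URL is a placeholder (an assumption, not a real location), and the fragment is not a complete distribution description on its own.

```xml
<!-- Fragment only: in Example 1 this block sits inside ms:Corpus -->
<ms:DatasetDistribution>
        <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
        <!-- Placeholder URL (assumption): replace with the location of your data files -->
        <ms:accessLocation>https://example.org/path/to/my-corpus.zip</ms:accessLocation>
        <!-- Remaining distribution elements (text features, licence terms) follow, as in the examples -->
</ms:DatasetDistribution>
```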

Step 1: create metadata

The first step is to describe your corpus using ELG’s metadata format, ELG-SHARE. Future releases of ELG will include an interactive editor for this. However, for now, you must create an XML file. Refer to the examples below for how to do this.

The elements you need are documented on the following pages:

For more information about ELG-SHARE, see:

At the ELG GitLab, you will find templates (that you can use to create new metadata records) and examples in XML format.
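
As orientation before the full examples, the skeleton below shows the outer structure that both examples on this page share. It is a sketch derived from those examples, not a schema-valid record. The comments mark where further elements go. Which elements are mandatory is determined by the ELG-SHARE schema, not by this outline.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
        <!-- Record-level information -->
        <ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
        <ms:metadataCreationDate>2020-10-03</ms:metadataCreationDate>
        <!-- ms:metadataCurator, ms:compliesWith and ms:metadataCreator go here -->
        <ms:DescribedEntity>
                <ms:LanguageResource>
                        <ms:entityType>LanguageResource</ms:entityType>
                        <!-- General description: resourceName, description, version, contact, keyword, domain, ... -->
                        <ms:LRSubclass>
                                <ms:Corpus>
                                        <ms:lrType>Corpus</ms:lrType>
                                        <!-- Corpus-specific part: corpusSubclass, CorpusMediaPart, DatasetDistribution,
                                             personalDataIncluded, sensitiveDataIncluded -->
                                </ms:Corpus>
                        </ms:LRSubclass>
                </ms:LanguageResource>
        </ms:DescribedEntity>
</ms:MetadataRecord>
```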

Example 1: Bilingual raw corpus

Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)

Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/734

<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
        <ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
        <ms:metadataCreationDate>2020-10-03</ms:metadataCreationDate>
        <ms:metadataCurator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>username@someDomain.com</ms:email>
        </ms:metadataCurator>
        <ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
        <ms:metadataCreator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>username@someDomain.com</ms:email>
        </ms:metadataCreator>
        <ms:DescribedEntity>
                <ms:LanguageResource>
                        <ms:entityType>LanguageResource</ms:entityType>
                        <ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)</ms:resourceName>
                        <ms:description xml:lang="en">Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency. It was offered as a collection of documents by the Bulgarian National Revenue Agency. Modules of the ILSP Focused Crawler were used for the normalization, cleaning, (near) de-duplication and identification of parallel documents. The Maligna sentence aligner was used for extracting segment alignments from crawled parallel documents. As a post-processing step, alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs which have been identified by non-aupidh methods were discarded; TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16 were discarded; Alignments of non-[1:1] type(s) were discarded; Alignments with a TUV (after normalization) that has fewer than 1 token were annotated; Alignments with a l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6 were annotated; Alignments in which different digits appear in each TUV were kept and annotated; Alignments with identical TUVs (after normalization) were annotated; Alignments with only non-letters in at least one of their TUVs were annotated; Duplicate alignments were kept and were annotated. The mean value of aligner's scores is 5.714609036504669, the std value is 1.8063256236105307. The mean value of length (in terms of characters) ratios is 1.0040012545201242 and the std value is 0.26545877788005745. There are 832 TUs with no annotation, containing 13336 words and 2604 lexical types in bul and 15010 words and 2031 lexical types in eng. The mean value of aligner's scores is 6.336834960545485, the std value is 1.53829791384023</ms:description>
                        <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_471</ms:LRIdentifier>
                        <ms:version>2.0</ms:version>
                        <ms:additionalInfo>
                                <ms:landingPage>https://elrc-share.eu/repository/browse/bilingual-bulgarian-english-corpus-from-the-national-revenue-agency-bg-processed/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:landingPage>
                        </ms:additionalInfo>
                        <ms:additionalInfo>
                                <ms:email>contact@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Rusinova</ms:surname>
                                        <ms:givenName xml:lang="en">Annie</ms:givenName>
                                        <ms:email>contact@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:iprHolder>
                                <ms:Organization>
                                        <ms:actorType>Organization</ms:actorType>
                                        <ms:organizationName xml:lang="en">National Revenue Agency (BG)</ms:organizationName>
                                        <ms:website>http://www.nap.bg/en/</ms:website>
                                </ms:Organization>
                        </ms:iprHolder>
                        <ms:keyword xml:lang="en">corpus</ms:keyword>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">FINANCE</ms:categoryLabel>
                                <ms:DomainIdentifier ms:DomainClassificationScheme="http://w3id.org/meta-share/meta-share/EUROVOC">24</ms:DomainIdentifier>
                        </ms:domain>
                        <ms:fundingProject>
                                <ms:projectName xml:lang="en">European Language Resource Coordination LOT3</ms:projectName>
                                <ms:ProjectIdentifier ms:ProjectIdentifierScheme="http://w3id.org/meta-share/meta-share/other">Tools and Resources for CEF Automated Translation - LOT3 (SMART 2015/1091 - 30-CE-0816766/00-92)</ms:ProjectIdentifier>
                                <ms:website>http://www.lr-coordination.eu</ms:website>
                        </ms:fundingProject>
                        <ms:validated>true</ms:validated>
                        <ms:validation>
                                <ms:validationDetails xml:lang="en">validated</ms:validationDetails>
                        </ms:validation>
                        <ms:relation>
                                <ms:relationType xml:lang="en">isAlignedVersionOf</ms:relationType>
                                <ms:relatedLR>
                                        <ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG)</ms:resourceName>
                                        <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
                                </ms:relatedLR>
                        </ms:relation>
                        <ms:LRSubclass>
                                <ms:Corpus>
                                        <ms:lrType>Corpus</ms:lrType>
                                        <ms:corpusSubclass>http://w3id.org/meta-share/meta-share/rawCorpus</ms:corpusSubclass>
                                        <ms:CorpusMediaPart>
                                                <ms:CorpusTextPart>
                                                        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
                                                        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
                                                        <ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
                                                        <ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
                                                        <ms:language>
                                                                <ms:languageTag>bg</ms:languageTag>
                                                                <ms:languageId>bg</ms:languageId>
                                                                <ms:scriptId>Cyrl</ms:scriptId>
                                                        </ms:language>
                                                        <ms:language>
                                                                <ms:languageTag>en</ms:languageTag>
                                                                <ms:languageId>en</ms:languageId>
                                                        </ms:language>
                                                        <ms:textType>
                                                                <ms:categoryLabel xml:lang="en">administrativeTexts</ms:categoryLabel>
                                                        </ms:textType>
                                                        <ms:TextGenre>
                                                                <ms:categoryLabel xml:lang="en">official</ms:categoryLabel>
                                                        </ms:TextGenre>
                                                        <ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
                                                        <ms:hasOriginalSource>
                                                                <ms:resourceName xml:lang="en">ELRC-447</ms:resourceName>
                                                                <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
                                                        </ms:hasOriginalSource>
                                                        <ms:creationDetails xml:lang="en">See description for creation details</ms:creationDetails>
                                                </ms:CorpusTextPart>
                                        </ms:CorpusMediaPart>
                                        <ms:DatasetDistribution>
                                                <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
                                                <ms:accessLocation>https://elrc-share.eu/repository/download/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:accessLocation>
                                                <ms:distributionTextFeature>
                                                        <ms:size>
                                                                <ms:amount>1292</ms:amount>
                                                                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
                                                        </ms:size>
                                                        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                                                        <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
                                                </ms:distributionTextFeature>
                                                <ms:licenceTerms>
                                                        <ms:licenceTermsName xml:lang="en">publicDomain</ms:licenceTermsName>
                                                        <ms:licenceTermsURL>https://elrc-share.eu/terms/publicDomain.html</ms:licenceTermsURL>
                                                        <ms:LicenceIdentifier ms:LicenceIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">publicDomain</ms:LicenceIdentifier>
                                                </ms:licenceTerms>
                                                <ms:cost>
                                                        <ms:amount>0</ms:amount>
                                                        <ms:currency>http://w3id.org/meta-share/meta-share/euro</ms:currency>
                                                </ms:cost>
                                        </ms:DatasetDistribution>
                                        <ms:personalDataIncluded>false</ms:personalDataIncluded>
                                        <ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
                                </ms:Corpus>
                        </ms:LRSubclass>
                </ms:LanguageResource>
        </ms:DescribedEntity>
</ms:MetadataRecord>

Example 2: Annotated corpus

Greek Textual Entailment Corpus

Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/649

<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
        <ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
        <ms:metadataCreationDate>2020-02-02</ms:metadataCreationDate>
        <ms:metadataCurator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>curator@somedomain.com</ms:email>
        </ms:metadataCurator>
        <ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
        <ms:metadataCreator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>curator@somedomain.com</ms:email>
        </ms:metadataCreator>
        <ms:DescribedEntity>
                <ms:LanguageResource>
                        <ms:entityType>LanguageResource</ms:entityType>
                        <ms:resourceName xml:lang="en">Greek Textual Entailment Corpus</ms:resourceName>
                        <ms:resourceShortName xml:lang="en">GTEC</ms:resourceShortName>
                        <ms:description xml:lang="en">GTEC consists of 600 T-H pairs manually annotated for entailment (i.e. whether T entails H or not) by human annotators. The dataset, which is tailored to guide training and evaluation of prospective RTE systems, is equally divided into three subsets, each one representing the output of a specific HLT application: Question Answering (QA), Comparable Documents (CD) and Machine Translation (MT), and pertaining to specific subject fields (e.g. law, politics, travel). T-H examples that correspond to success and failure cases of the aforementioned applications have been included in the corpus. The annotations provided are conformant to the RTE1 and RTE2 challenges.</ms:description>
                        <ms:version>v1.0.0 (automatically assigned)</ms:version>
                        <ms:additionalInfo>
                                <ms:email>username@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:additionalInfo>
                                <ms:email>username3@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Giouli</ms:surname>
                                        <ms:givenName xml:lang="en">Voula</ms:givenName>
                                        <ms:email>username@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Piperidis</ms:surname>
                                        <ms:givenName xml:lang="en">Stelios</ms:givenName>
                                        <ms:email>username3@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:keyword xml:lang="en">corpus</ms:keyword>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">law</ms:categoryLabel>
                        </ms:domain>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">politics</ms:categoryLabel>
                        </ms:domain>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">travel</ms:categoryLabel>
                        </ms:domain>
                        <ms:resourceCreator>
                                <ms:Organization>
                                        <ms:actorType>Organization</ms:actorType>
                                        <ms:organizationName xml:lang="en">Institute for Language and Speech Processing</ms:organizationName>
                                        <ms:website>http://www.ilsp.gr</ms:website>
                                </ms:Organization>
                        </ms:resourceCreator>
                        <ms:intendedApplication>
                                <ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
                        </ms:intendedApplication>
                        <ms:actualUse>
                                <ms:usedInApplication>
                                        <ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
                                </ms:usedInApplication>
                                <ms:actualUseDetails xml:lang="en">nlpApplications</ms:actualUseDetails>
                        </ms:actualUse>
                        <ms:isDocumentedBy>
                                <ms:title xml:lang="en">Building a Greek corpus of Textual Entailment</ms:title>
                                <ms:DocumentIdentifier ms:DocumentIdentifierScheme="http://purl.org/spar/datacite/url">http://www.lrec-conf.org/proceedings/lrec2008/pdf/427_paper.pdf</ms:DocumentIdentifier>
                        </ms:isDocumentedBy>
                        <ms:LRSubclass>
                                <ms:Corpus>
                                        <ms:lrType>Corpus</ms:lrType>
                                        <ms:corpusSubclass>http://w3id.org/meta-share/meta-share/annotatedCorpus</ms:corpusSubclass>
                                        <ms:CorpusMediaPart>
                                                <ms:CorpusTextPart>
                                                        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
                                                        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
                                                        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
                                                        <ms:language>
                                                                <ms:languageTag>el</ms:languageTag>
                                                                <ms:languageId>el</ms:languageId>
                                                        </ms:language>
                                                        <ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
                                                        <ms:originalSourceDescription xml:lang="en">web news</ms:originalSourceDescription>
                                                        <ms:originalSourceDescription xml:lang="en">EU texts</ms:originalSourceDescription>
                                                </ms:CorpusTextPart>
                                        </ms:CorpusMediaPart>
                                        <ms:DatasetDistribution>
                                                <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
                                                <ms:accessLocation>http://metashare.ilsp.gr:8080/repository/download/26dca2fe63d211e29b2c842b2b6a04d7db87c85bfbe34326bb4c2e88b8c4da85</ms:accessLocation>
                                                <ms:distributionTextFeature>
                                                        <ms:size>
                                                                <ms:amount>600</ms:amount>
                                                                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/T-HPair</ms:sizeUnit>
                                                        </ms:size>
                                                        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                                                </ms:distributionTextFeature>
                                                <ms:licenceTerms>
                                                        <ms:licenceTermsName xml:lang="en">CC-BY-4.0</ms:licenceTermsName>
                                                        <ms:licenceTermsURL>https://spdx.org/licenses/CC-BY-4.0.html</ms:licenceTermsURL>
                                                </ms:licenceTerms>
                                                <ms:attributionText xml:lang="en">Greek Textual Entailment Corpus by Athena R.C./ILSP used under CC-BY licence</ms:attributionText>
                                        </ms:DatasetDistribution>
                                        <ms:personalDataIncluded>false</ms:personalDataIncluded>
                                        <ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/Lemma</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
                                                <ms:isAnnotatedBy>
                                                        <ms:resourceName xml:lang="en">ILSP-Lemmatizer</ms:resourceName>
                                                </ms:isAnnotatedBy>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/PartOfSpeech</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:tagset>
                                                        <ms:resourceName xml:lang="en">ILSP/PAROLE tagset</ms:resourceName>
                                                </ms:tagset>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
                                                <ms:isAnnotatedBy>
                                                        <ms:resourceName xml:lang="en">ILSP FBT POS tagger</ms:resourceName>
                                                </ms:isAnnotatedBy>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">Automatic annotation followed by manual correction</ms:annotationModeDetails>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SemanticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/manual</ms:annotationMode>
                                        </ms:annotation>
                                </ms:Corpus>
                        </ms:LRSubclass>
                </ms:LanguageResource>
        </ms:DescribedEntity>
</ms:MetadataRecord>

Step 2: upload

From the ELG catalogue, click the “Upload” link as shown below:

[Screenshot: Upload menu]

Now upload the file you created in Step 1:

[Screenshot: Upload metadata XML]

If your XML file contains errors, they are shown to you. Fix them and upload the file again. Once the file is accepted, a success message is shown and the metadata is imported into the database.

Step 3: wait for approval

At this stage, the metadata record is visible only to you and to us, the ELG platform administrators. We will check your contribution and, if everything is in order, integrate it into the ELG catalogue. Otherwise, we will contact you.