Describe a corpus (dataset)

This section explains how to describe a corpus with the minimal metadata required to register it on the ELG platform. For more on the ELG resource types, see Overview. Instructions that apply to all data resources (technical requirements, registration on the platform) are given in Provide a Language Resource.

Corpora are structured collections of data, selected according to specific criteria so that they represent a research question as comprehensively as possible. The most common cases are:

  • text corpora: monolingual, bilingual or multilingual collections of texts in a specific domain, such as corpora of news articles, scientific publications, legal documents, medical records, tweets, etc.
  • corpora of audio recordings, e.g., broadcast news, or lists of sentences recorded by speakers from a specific region with a dialectal accent, etc.
  • collections of videos, such as interviews with politicians, sign language corpora, etc.
  • corpora combining all of the above, such as a multimedia corpus of video lectures, with their audio recordings, transcripts, subtitles and their translations.

Examples of metadata records for corpora

Bilingual raw corpus: Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)

Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/734

<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
        <ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
        <ms:metadataCreationDate>2020-10-03</ms:metadataCreationDate>
        <ms:metadataCurator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>username@someDomain.com</ms:email>
        </ms:metadataCurator>
        <ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
        <ms:metadataCreator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>username@someDomain.com</ms:email>
        </ms:metadataCreator>
        <ms:DescribedEntity>
                <ms:LanguageResource>
                        <ms:entityType>LanguageResource</ms:entityType>
                        <ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)</ms:resourceName>
                        <ms:description xml:lang="en">Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency.

        Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency. It was offered as collection of documents by the Bulgarian National Revenue Agency.  Modules of the ILSP Focused Crawler was used for the normalization, cleaning, (near) de-duplication and identification of parallel documents. The Maligna sentence aligner was used for extracting segment alignments from crawled parallel documents. As a post-processing step, alignments were merged into one TMX file. The following filters were applied:  TMX files generated from document pairs which have been identified by non-aupidh methods were discarded ;  TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16, were discarded ;  Alignments of non-[1:1] type(s) were discarded. ;  Alignments with a TUV (after normalization) that has less than 1 tokens, were annotated ;  Alignments with a l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6, were annotated ;  Alignments in which different digits appear in each TUV were kept and annotated. ;  Alignments with identical TUVs (after normalization) were annotated ;  Alignments with only non-letters in at least one of their TUVs were annotated ;  Duplicate alignments were kept and were annotated. The mean value of aligner's scores is 5.714609036504669, the std value is 1.8063256236105307. The mean value of length (in terms of characters) ratios is 1.0040012545201242 and the std value is 0.26545877788005745. There are 832 TUs with no annotation, containing 13336 words and 2604 lexical types in bul and 15010 words and 2031 lexical types in eng. The mean value of aligner's scores is 6.336834960545485, the std value is 1.53829791384023</ms:description>
                        <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_471</ms:LRIdentifier>
                        <ms:version>2.0</ms:version>
                        <ms:additionalInfo>
                                <ms:landingPage>https://elrc-share.eu/repository/browse/bilingual-bulgarian-english-corpus-from-the-national-revenue-agency-bg-processed/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:landingPage>
                        </ms:additionalInfo>
                        <ms:additionalInfo>
                                <ms:email>contact@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Rusinova</ms:surname>
                                        <ms:givenName xml:lang="en">Annie</ms:givenName>
                                        <ms:email>contact@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:iprHolder>
                                <ms:Organization>
                                        <ms:actorType>Organization</ms:actorType>
                                        <ms:organizationName xml:lang="en">National Revenue Agency (BG)</ms:organizationName>
                                        <ms:website>http://www.nap.bg/en/</ms:website>
                                </ms:Organization>
                        </ms:iprHolder>
                        <ms:keyword xml:lang="en">corpus</ms:keyword>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">FINANCE</ms:categoryLabel>
                                <ms:DomainIdentifier ms:DomainClassificationScheme="http://w3id.org/meta-share/meta-share/EUROVOC">24</ms:DomainIdentifier>
                        </ms:domain>
                        <ms:fundingProject>
                                <ms:projectName xml:lang="en">European Language Resource Coordination LOT3</ms:projectName>
                                <ms:ProjectIdentifier ms:ProjectIdentifierScheme="http://w3id.org/meta-share/meta-share/other">Tools and Resources for CEF Automated Translation - LOT3 (SMART 2015/1091 - 30-CE-0816766/00-92)</ms:ProjectIdentifier>
                                <ms:website>http://www.lr-coordination.eu</ms:website>
                        </ms:fundingProject>
                        <ms:validated>true</ms:validated>
                        <ms:validation>
                                <ms:validationDetails xml:lang="en">validated</ms:validationDetails>
                        </ms:validation>
                        <ms:relation>
                                <ms:relationType xml:lang="en">isAlignedVersionOf</ms:relationType>
                                <ms:relatedLR>
                                        <ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG)</ms:resourceName>
                                        <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
                                </ms:relatedLR>
                        </ms:relation>
                        <ms:LRSubclass>
                                <ms:Corpus>
                                        <ms:lrType>Corpus</ms:lrType>
                                        <ms:corpusSubclass>http://w3id.org/meta-share/meta-share/rawCorpus</ms:corpusSubclass>
                                        <ms:CorpusMediaPart>
                                                <ms:CorpusTextPart>
                                                        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
                                                        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
                                                        <ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
                                                        <ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
                                                        <ms:language>
                                                                <ms:languageTag>bg</ms:languageTag>
                                                                <ms:languageId>bg</ms:languageId>
                                                                <ms:scriptId>Cyrl</ms:scriptId>
                                                        </ms:language>
                                                        <ms:language>
                                                                <ms:languageTag>en</ms:languageTag>
                                                                <ms:languageId>en</ms:languageId>
                                                        </ms:language>
                                                        <ms:textType>
                                                                <ms:categoryLabel xml:lang="en">administrativeTexts</ms:categoryLabel>
                                                        </ms:textType>
                                                        <ms:TextGenre>
                                                                <ms:categoryLabel xml:lang="en">official</ms:categoryLabel>
                                                        </ms:TextGenre>
                                                        <ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
                                                        <ms:hasOriginalSource>
                                                                <ms:resourceName xml:lang="en">ELRC-447</ms:resourceName>
                                                                <ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
                                                        </ms:hasOriginalSource>
                                                        <ms:creationDetails xml:lang="en">See description for creation details</ms:creationDetails>
                                                </ms:CorpusTextPart>
                                        </ms:CorpusMediaPart>
                                        <ms:DatasetDistribution>
                                                <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
                                                <ms:accessLocation>https://elrc-share.eu/repository/download/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:accessLocation>
                                                <ms:distributionTextFeature>
                                                        <ms:size>
                                                                <ms:amount>1292</ms:amount>
                                                                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
                                                        </ms:size>
                                                        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                                                        <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
                                                </ms:distributionTextFeature>
                                                <ms:licenceTerms>
                                                        <ms:licenceTermsName xml:lang="en">publicDomain</ms:licenceTermsName>
                                                        <ms:licenceTermsURL>https://elrc-share.eu/terms/publicDomain.html</ms:licenceTermsURL>
                                                        <ms:LicenceIdentifier ms:LicenceIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">publicDomain</ms:LicenceIdentifier>
                                                </ms:licenceTerms>
                                                <ms:cost>
                                                        <ms:amount>0</ms:amount>
                                                        <ms:currency>http://w3id.org/meta-share/meta-share/euro</ms:currency>
                                                </ms:cost>
                                        </ms:DatasetDistribution>
                                        <ms:personalDataIncluded>false</ms:personalDataIncluded>
                                        <ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
                                </ms:Corpus>
                        </ms:LRSubclass>
                </ms:LanguageResource>
        </ms:DescribedEntity>
</ms:MetadataRecord>
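
A record like the one above can be inspected programmatically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name summarize_corpus_record and the shortened SAMPLE_RECORD are illustrative and not part of ELG. It reads the resource name, the corpus subclass, the language tags and the licence names from a record.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the records of this section.
MS = "http://w3id.org/meta-share/meta-share/"
NS = {"ms": MS}


def summarize_corpus_record(xml_text):
    """Return a few key fields of a corpus metadata record (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    lr = root.find("ms:DescribedEntity/ms:LanguageResource", NS)
    corpus = lr.find("ms:LRSubclass/ms:Corpus", NS)
    subclass = corpus.findtext("ms:corpusSubclass", default="", namespaces=NS)
    return {
        "name": lr.findtext("ms:resourceName", default="", namespaces=NS),
        # Keep only the last segment of the IRI, e.g. "rawCorpus".
        "subclass": subclass.rsplit("/", 1)[-1],
        "languages": [e.text for e in corpus.iter(f"{{{MS}}}languageTag")],
        "licences": [e.text for e in corpus.iter(f"{{{MS}}}licenceTermsName")],
    }


# Shortened illustration built from the record above; not a complete, valid record.
SAMPLE_RECORD = f"""<ms:MetadataRecord xmlns:ms="{MS}">
  <ms:DescribedEntity>
    <ms:LanguageResource>
      <ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)</ms:resourceName>
      <ms:LRSubclass>
        <ms:Corpus>
          <ms:corpusSubclass>{MS}rawCorpus</ms:corpusSubclass>
          <ms:CorpusMediaPart>
            <ms:CorpusTextPart>
              <ms:language><ms:languageTag>bg</ms:languageTag></ms:language>
              <ms:language><ms:languageTag>en</ms:languageTag></ms:language>
            </ms:CorpusTextPart>
          </ms:CorpusMediaPart>
          <ms:DatasetDistribution>
            <ms:licenceTerms>
              <ms:licenceTermsName xml:lang="en">publicDomain</ms:licenceTermsName>
            </ms:licenceTerms>
          </ms:DatasetDistribution>
        </ms:Corpus>
      </ms:LRSubclass>
    </ms:LanguageResource>
  </ms:DescribedEntity>
</ms:MetadataRecord>"""

summary = summarize_corpus_record(SAMPLE_RECORD)
print(summary["subclass"], summary["languages"], summary["licences"])
```

The sketch only reads values; it does not check the record against the ELG-SHARE schema.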

Annotated corpus: Greek Textual Entailment corpus

Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/649

<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
        <ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
        <ms:metadataCreationDate>2020-02-02</ms:metadataCreationDate>
        <ms:metadataCurator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>curator@somedomain.com</ms:email>
        </ms:metadataCurator>
        <ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
        <ms:metadataCreator>
                <ms:actorType>Person</ms:actorType>
                <ms:surname xml:lang="en">Smith</ms:surname>
                <ms:givenName xml:lang="en">John</ms:givenName>
                <ms:email>curator@somedomain.com</ms:email>
        </ms:metadataCreator>
        <ms:sourceOfMetadataRecord>META-SHARE/ILSP</ms:sourceOfMetadataRecord>
        <ms:DescribedEntity>
                <ms:LanguageResource>
                        <ms:entityType>LanguageResource</ms:entityType>
                        <ms:resourceName xml:lang="en">Greek Textual Entailment Corpus</ms:resourceName>
                        <ms:resourceShortName xml:lang="en">GTEC</ms:resourceShortName>
                        <ms:description xml:lang="en">GTEC consists of 600 T-H pairs manually annotated for entailment (i.e. whether T entails H or not) by human annotators. The dataset which is tailored to guide training and evaluation of prospect RTE systems, is equally divided in three subsets each one representing the output of a specific HLT application: Question Answering (QA), Comparable Documents (CD) and Machine Translation (MT), and pertaining to specific subject fields (e.g. law, politics, travel). T-H examples that correspond to success and failure cases of the afore-mentioned applications have been included in the corpus. The annotations provided are conformant to the RTE1 and RTE2 challenges.</ms:description>
                        <ms:version>v1.0.0 (automatically assigned)</ms:version>
                        <ms:additionalInfo>
                                <ms:email>username@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:additionalInfo>
                                <ms:email>username3@someDomain.com</ms:email>
                        </ms:additionalInfo>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Giouli</ms:surname>
                                        <ms:givenName xml:lang="en">Voula</ms:givenName>
                                        <ms:email>username@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:contact>
                                <ms:Person>
                                        <ms:actorType>Person</ms:actorType>
                                        <ms:surname xml:lang="en">Piperidis</ms:surname>
                                        <ms:givenName xml:lang="en">Stelios</ms:givenName>
                                        <ms:email>username3@someDomain.com</ms:email>
                                </ms:Person>
                        </ms:contact>
                        <ms:keyword xml:lang="en">corpus</ms:keyword>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">law</ms:categoryLabel>
                        </ms:domain>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">politics</ms:categoryLabel>
                        </ms:domain>
                        <ms:domain>
                                <ms:categoryLabel xml:lang="en">travel</ms:categoryLabel>
                        </ms:domain>
                        <ms:resourceCreator>
                                <ms:Organization>
                                        <ms:actorType>Organization</ms:actorType>
                                        <ms:organizationName xml:lang="en">Institute for Language and Speech Processing</ms:organizationName>
                                        <ms:website>http://www.ilsp.gr</ms:website>
                                </ms:Organization>
                        </ms:resourceCreator>
                        <ms:intendedApplication>
                                <ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
                        </ms:intendedApplication>
                        <ms:actualUse>
                                <ms:usedInApplication>
                                        <ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
                                </ms:usedInApplication>
                                <ms:actualUseDetails xml:lang="en">nlpApplications</ms:actualUseDetails>
                        </ms:actualUse>
                        <ms:isDocumentedBy>
                                <ms:title xml:lang="en">Building a Greek corpus of Textual Entailment</ms:title>
                                <ms:DocumentIdentifier ms:DocumentIdentifierScheme="http://purl.org/spar/datacite/url">http://www.lrec-conf.org/proceedings/lrec2008/pdf/427_paper.pdf</ms:DocumentIdentifier>
                        </ms:isDocumentedBy>
                        <ms:LRSubclass>
                                <ms:Corpus>
                                        <ms:lrType>Corpus</ms:lrType>
                                        <ms:corpusSubclass>http://w3id.org/meta-share/meta-share/annotatedCorpus</ms:corpusSubclass>
                                        <ms:CorpusMediaPart>
                                                <ms:CorpusTextPart>
                                                        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
                                                        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
                                                        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
                                                        <ms:language>
                                                                <ms:languageTag>el</ms:languageTag>
                                                                <ms:languageId>el</ms:languageId>
                                                        </ms:language>
                                                        <ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
                                                        <ms:originalSourceDescription xml:lang="en">web news</ms:originalSourceDescription>
                                                        <ms:originalSourceDescription xml:lang="en">EU texts</ms:originalSourceDescription>
                                                </ms:CorpusTextPart>
                                        </ms:CorpusMediaPart>
                                        <ms:DatasetDistribution>
                                                <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
                                                <ms:accessLocation>http://metashare.ilsp.gr:8080/repository/download/26dca2fe63d211e29b2c842b2b6a04d7db87c85bfbe34326bb4c2e88b8c4da85</ms:accessLocation>
                                                <ms:distributionTextFeature>
                                                        <ms:size>
                                                                <ms:amount>600</ms:amount>
                                                                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/T-HPair</ms:sizeUnit>
                                                        </ms:size>
                                                        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                                                </ms:distributionTextFeature>
                                                <ms:licenceTerms>
                                                        <ms:licenceTermsName xml:lang="en">CC-BY-4.0</ms:licenceTermsName>
                                                        <ms:licenceTermsURL>https://spdx.org/licenses/CC-BY-4.0.html</ms:licenceTermsURL>
                                                </ms:licenceTerms>
                                                <ms:attributionText xml:lang="en">Greek Textual Entailment Corpus by Athena R.C./ILSP used under CC-BY licence</ms:attributionText>
                                        </ms:DatasetDistribution>
                                        <ms:personalDataIncluded>false</ms:personalDataIncluded>
                                        <ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/Lemma</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
                                                <ms:isAnnotatedBy>
                                                        <ms:resourceName xml:lang="en">ILSP-Lemmatizer</ms:resourceName>
                                                </ms:isAnnotatedBy>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/PartOfSpeech</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:tagset>
                                                        <ms:resourceName xml:lang="en">ILSP/PAROLE tagset</ms:resourceName>
                                                </ms:tagset>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
                                                <ms:isAnnotatedBy>
                                                        <ms:resourceName xml:lang="en">ILSP FBT POS tagger</ms:resourceName>
                                                </ms:isAnnotatedBy>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
                                                <ms:annotationModeDetails xml:lang="en">Automatic annotation followed by manual correction</ms:annotationModeDetails>
                                        </ms:annotation>
                                        <ms:annotation>
                                                <ms:annotationType>http://w3id.org/meta-share/omtd-share/SemanticAnnotationType</ms:annotationType>
                                                <ms:annotationStandoff>false</ms:annotationStandoff>
                                                <ms:annotationMode>http://w3id.org/meta-share/meta-share/manual</ms:annotationMode>
                                        </ms:annotation>
                                </ms:Corpus>
                        </ms:LRSubclass>
                </ms:LanguageResource>
        </ms:DescribedEntity>
</ms:MetadataRecord>
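
The annotated corpus above repeats the annotation element once per annotation layer. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the function name annotation_layers and the shortened SAMPLE_CORPUS are illustrative, not part of ELG), the layers of a record can be listed like this:

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the records of this section.
MS = "http://w3id.org/meta-share/meta-share/"
OMTD = "http://w3id.org/meta-share/omtd-share/"
NS = {"ms": MS}


def annotation_layers(xml_text):
    """Return (annotation type, annotation mode) pairs (hypothetical helper).

    Only the last segment of each IRI is kept, e.g. "Lemma" and "mixed".
    """
    root = ET.fromstring(xml_text)
    layers = []
    for annotation in root.iter(f"{{{MS}}}annotation"):
        a_type = annotation.findtext("ms:annotationType", default="", namespaces=NS)
        a_mode = annotation.findtext("ms:annotationMode", default="", namespaces=NS)
        layers.append((a_type.rsplit("/", 1)[-1], a_mode.rsplit("/", 1)[-1]))
    return layers


# Shortened illustration built from the record above; not a complete, valid record.
SAMPLE_CORPUS = f"""<ms:Corpus xmlns:ms="{MS}">
  <ms:annotation>
    <ms:annotationType>{OMTD}Lemma</ms:annotationType>
    <ms:annotationStandoff>false</ms:annotationStandoff>
    <ms:annotationMode>{MS}mixed</ms:annotationMode>
  </ms:annotation>
  <ms:annotation>
    <ms:annotationType>{OMTD}SemanticAnnotationType</ms:annotationType>
    <ms:annotationStandoff>false</ms:annotationStandoff>
    <ms:annotationMode>{MS}manual</ms:annotationMode>
  </ms:annotation>
</ms:Corpus>"""

for layer in annotation_layers(SAMPLE_CORPUS):
    print(layer)
```

Note that the annotation type values come from the omtd-share namespace, while the elements themselves are in the meta-share namespace.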

Minimal version metadata for corpora

The metadata elements (mandatory or recommended) that are common to all kinds of resources, including data language resources, are presented in the section Minimal version - List of elements common to all LRTs. The additional metadata elements that are required or recommended for corpora are described below.

For a quick guide to the ELG template, see Template - Explanations.

Corpus

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus

Data type component

Optionality Mandatory

Explanation & Instructions

Wraps together the set of elements that are specific to corpora.

Example

<ms:LRSubclass>
        <ms:Corpus>
                <ms:lrType>Corpus</ms:lrType>
        </ms:Corpus>
</ms:LRSubclass>

corpusSubclass

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.corpusSubclass

Data type CV (corpusSubclass)

Optionality Mandatory

Explanation & Instructions

Classifies corpora into types (used for descriptive purposes).

Use one of the values for raw corpora, annotated corpora (raw data together with annotations), or annotations (annotations only, without the original corpus).

Example

<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/rawCorpus</ms:corpusSubclass>

<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/annotatedCorpus</ms:corpusSubclass>
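
The value of corpusSubclass is a full IRI, not a short label. The following minimal Python sketch shows one way to check a value before putting it into a record. The function name corpus_subclass_label is illustrative, and the set of known labels contains only the two values shown in the examples above; the value for annotation-only resources is not listed on this page, so it is deliberately left out.

```python
# Base of the controlled vocabulary IRIs, as used in the examples above.
MS_BASE = "http://w3id.org/meta-share/meta-share/"

# Assumption: only the two values that appear in this section are listed here.
# The vocabulary also has a value for annotation-only resources; take its exact
# spelling from the schema before adding it.
KNOWN_LABELS = {"rawCorpus", "annotatedCorpus"}


def corpus_subclass_label(value):
    """Return the short label of a corpusSubclass IRI, or None if not recognised."""
    if not value.startswith(MS_BASE):
        return None  # e.g. a bare label such as "rawCorpus" is not a valid value
    label = value[len(MS_BASE):]
    return label if label in KNOWN_LABELS else None


print(corpus_subclass_label("http://w3id.org/meta-share/meta-share/rawCorpus"))
print(corpus_subclass_label("rawCorpus"))
```

The first call prints rawCorpus; the second prints None, because the bare label lacks the namespace prefix.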

CorpusTextPart

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusTextPart

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

The part of a corpus (or a whole corpus) that consists of textual segments (e.g., a corpus of publications, transcriptions of an oral corpus, subtitles, etc.).

You can repeat the group of elements for multiple textual parts.

The mandatory or recommended elements for the text part are:

  • mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For text parts, always use the value ‘text’.
  • lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
  • multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values parallel (e.g., an original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora whose segments include text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native languages).
  • language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
  • languageVariety (Mandatory if applicable): Specifies the language variety (e.g., dialect, jargon) of the segments contained in the language resource. Please use for dialect corpora.
  • modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
  • TextGenre (Recommended): A category of text characterized by a particular style, form, or content according to a specific classification scheme. See TextGenre.

Example

<ms:CorpusTextPart>
        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>es</ms:languageTag>
                <ms:languageId>es</ms:languageId>
        </ms:language>
</ms:CorpusTextPart>

<ms:CorpusTextPart>
        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
        <ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
        <ms:language>
                <ms:languageTag>es</ms:languageTag>
                <ms:languageId>es</ms:languageId>
        </ms:language>
        <ms:language>
                <ms:languageTag>en</ms:languageTag>
                <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:TextGenre>
                <ms:categoryLabel xml:lang="en">administrative texts</ms:categoryLabel>
        </ms:TextGenre>
</ms:CorpusTextPart>

<ms:CorpusTextPart>
        <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>en</ms:languageTag>
                <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
</ms:CorpusTextPart>

CorpusAudioPart

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

The part of a corpus (or whole corpus) that consists of audio segments

You can repeat the group of elements for multiple audio parts.

The mandatory or recommended elements for the audio part are:

  • mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For audio parts, always use the value ‘audio’.
  • lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
  • multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
  • language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
  • languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to it. Please use for dialect corpora.
  • modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
  • AudioGenre (Recommended if applicable): A category of audio characterized by a particular style, form, or content according to a specific classification scheme. See AudioGenre
  • SpeechGenre (Recommended if applicable): A category for the conventionalized discourse of the speech part of a language resource, based on extra-linguistic and internal linguistic criteria. See SpeechGenre

Example

<ms:CorpusAudioPart>
        <ms:corpusMediaType>CorpusAudioPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/audio</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>en</ms:languageTag>
                <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:AudioGenre>
                <ms:CategoryLabel>conference noises</ms:CategoryLabel>
        </ms:AudioGenre>
</ms:CorpusAudioPart>

<ms:CorpusAudioPart>
        <ms:corpusMediaType>CorpusAudioPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/audio</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>en</ms:languageTag>
                <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
        <ms:SpeechGenre>
                <ms:CategoryLabel>monologue</ms:CategoryLabel>
        </ms:SpeechGenre>
</ms:CorpusAudioPart>
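
The rule that multilingualityType is required for bilingual or multilingual parts can be illustrated with a bilingual audio part. This is only a sketch: the languages are arbitrary, the ‘comparable’ IRI is assumed by analogy with the ‘parallel’ value used for text parts, and the element order mirrors the bilingual text example rather than being checked against the schema.

<ms:CorpusAudioPart>
        <ms:corpusMediaType>CorpusAudioPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/audio</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>de</ms:languageTag>
                <ms:languageId>de</ms:languageId>
        </ms:language>
        <ms:language>
                <ms:languageTag>fr</ms:languageTag>
                <ms:languageId>fr</ms:languageId>
        </ms:language>
        <!-- assumed IRI, formed by analogy with the 'parallel' value -->
        <ms:multilingualityType>http://w3id.org/meta-share/meta-share/comparable</ms:multilingualityType>
</ms:CorpusAudioPart>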

CorpusVideoPart

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusVideoPart

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

The part of a corpus (or a whole corpus) that consists of video segments (e.g., a corpus of video lectures, a part of a corpus with news, a sign language corpus, etc.)

You can repeat the group of elements for multiple video parts.

The mandatory or recommended elements for the video part are:

  • mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For video parts, always use the value ‘video’.
  • lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
  • multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
  • language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
  • languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to it. Please use for dialect corpora.
  • modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
  • VideoGenre (Recommended): A classification of video parts based on extra-linguistic and internal linguistic criteria and reflected on the video style, form or content. See VideoGenre
  • typeOfVideoContent (Mandatory): Main type of object or people represented in the video.

Example

<ms:CorpusVideoPart>
        <ms:corpusMediaType>CorpusVideoPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/video</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>en</ms:languageTag>
                <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:modalityType>http://w3id.org/meta-share/meta-share/bodyGesture</ms:modalityType>
        <ms:modalityType>http://w3id.org/meta-share/meta-share/facialExpression</ms:modalityType>
        <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
        <ms:typeOfVideoContent>people eating at a restaurant</ms:typeOfVideoContent>
</ms:CorpusVideoPart>

<ms:CorpusVideoPart>
        <ms:corpusMediaType>CorpusVideoPart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/video</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>fr</ms:languageTag>
                <ms:languageId>fr</ms:languageId>
        </ms:language>
        <ms:VideoGenre>
                <ms:CategoryLabel>documentary</ms:CategoryLabel>
        </ms:VideoGenre>
        <ms:typeOfVideoContent>birds, wild animals, plants</ms:typeOfVideoContent>
</ms:CorpusVideoPart>

CorpusImagePart

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusImagePart

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

The part of a corpus (or whole corpus) that consists of images (e.g., a corpus of photographs and their captions)

You can repeat the group of elements for multiple image parts.

The mandatory or recommended elements for the image part are:

  • mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For image parts, always use the value ‘image’.
  • lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
  • multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
  • language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
  • languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to it. Please use for dialect corpora.
  • modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource.
  • ImageGenre (Recommended): A category of images characterized by a particular style, form, or content according to a specific classification scheme. See ImageGenre.
  • typeOfImageContent (Mandatory): Main type of object or people represented in the image.

Example

<ms:CorpusImagePart>
        <ms:corpusMediaType>CorpusImagePart</ms:corpusMediaType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/image</ms:mediaType>
        <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
        <ms:language>
                <ms:languageTag>el</ms:languageTag>
                <ms:languageId>el</ms:languageId>
        </ms:language>
        <ms:ImageGenre>
                <ms:CategoryLabel>comics</ms:CategoryLabel>
        </ms:ImageGenre>
        <ms:typeOfImageContent>human figures</ms:typeOfImageContent>
</ms:CorpusImagePart>

TextGenre

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusTextPart.TextGenre

Data type component

Optionality Recommended

Explanation & Instructions

A category of text characterized by a particular style, form, or content according to a specific classification scheme

You can add only a free text value at the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the TextGenreIdentifier and the attribute TextGenreClassificationScheme.

Example

<ms:TextGenre>
        <ms:CategoryLabel>movie subtitles</ms:CategoryLabel>
</ms:TextGenre>

<ms:TextGenre>
        <ms:CategoryLabel>news articles</ms:CategoryLabel>
</ms:TextGenre>
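
If the genre label comes from an established controlled vocabulary, the record might look like the sketch below. The identifier value and the classification scheme IRI are placeholders chosen for illustration, and placing TextGenreIdentifier after CategoryLabel is an assumption; check both against the schema before use.

<ms:TextGenre>
        <ms:CategoryLabel>news articles</ms:CategoryLabel>
        <!-- placeholder identifier and scheme values, not taken from a real vocabulary -->
        <ms:TextGenreIdentifier ms:TextGenreClassificationScheme="http://w3id.org/meta-share/meta-share/other">news-articles</ms:TextGenreIdentifier>
</ms:TextGenre>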

AudioGenre

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart.AudioGenre

Data type component

Optionality Recommended if applicable

Explanation & Instructions

A category of audio characterized by a particular style, form, or content according to a specific classification scheme

You can add only a free text value at the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the AudioGenreIdentifier and the attribute AudioGenreClassificationScheme to provide further details.

Example

<ms:AudioGenre>
        <ms:CategoryLabel>conference noises</ms:CategoryLabel>
</ms:AudioGenre>

SpeechGenre

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart.SpeechGenre

Data type component

Optionality Recommended if applicable

Explanation & Instructions

A category for the conventionalized discourse of the speech part of a language resource, based on extra-linguistic and internal linguistic criteria

You can add only a free text value at the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the SpeechGenreIdentifier and the attribute SpeechGenreClassificationScheme to provide further details.

Example

<ms:SpeechGenre>
        <ms:CategoryLabel>broadcast news</ms:CategoryLabel>
</ms:SpeechGenre>

<ms:SpeechGenre>
        <ms:CategoryLabel>monologue</ms:CategoryLabel>
</ms:SpeechGenre>

VideoGenre

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusVideoPart.VideoGenre

Data type string (+ id + scheme)

Optionality Recommended if applicable

Explanation & Instructions

A classification of video parts based on extra-linguistic and internal linguistic criteria and reflected on the video style, form or content

You can add only a free text value at the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the VideoGenreIdentifier and the attribute VideoGenreClassificationScheme to provide further details.

Example

<ms:VideoGenre>
        <ms:CategoryLabel>documentaries</ms:CategoryLabel>
</ms:VideoGenre>

<ms:VideoGenre>
        <ms:CategoryLabel>video lectures</ms:CategoryLabel>
</ms:VideoGenre>

ImageGenre

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusImagePart.ImageGenre

Data type component

Optionality Recommended

Explanation & Instructions

A category of images characterized by a particular style, form, or content according to a specific classification scheme

You can add only a free text value at the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the ImageGenreIdentifier and the attribute ImageGenreClassificationScheme to provide further details.

Example

<ms:ImageGenre>
        <ms:CategoryLabel>human faces</ms:CategoryLabel>
</ms:ImageGenre>

<ms:ImageGenre>
        <ms:CategoryLabel>landscape</ms:CategoryLabel>
</ms:ImageGenre>

DatasetDistribution

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution

Data type component

Optionality Mandatory

Explanation & Instructions

Any form with which a dataset is distributed, such as a downloadable form in a specific format (e.g., spreadsheet, plain text, etc.) or an API with which it can be accessed

You can repeat the element for multiple distributions.

The mandatory and recommended elements are:

  • DatasetDistributionForm (Mandatory): The form (medium/channel) used for distributing a language resource consisting of data (e.g., a corpus, a lexicon, etc.). The typical values are ‘downloadable’, ‘accessibleThroughInterface’, ‘accessibleThroughQuery’ (see more at DatasetDistributionForm).
  • downloadLocation (Mandatory if applicable): A URL where the language resource (mainly data but also downloadable software programmes or forms) can be downloaded from. Use this element if the value of DatasetDistributionForm is ‘downloadable’ and only for direct download links (i.e., from which the dataset is downloaded without the need of further actions such as clicks on a page).
  • accessLocation (Mandatory if applicable): A URL where the resource can be accessed from; it can be used for landing pages or for cases where the resource is accessible via an interface, i.e. cases where the resource itself is not provided with a direct link for downloading. Use if the value of DatasetDistributionForm is ‘accessibleThroughInterface’ or ‘accessibleThroughQuery’ but also for links used for downloading corpora which are mentioned on a landing page or require some kind of action on the part of the user.
  • samplesLocation (Recommended): Links a resource to a URL (or URLs) with samples of a data resource or of the input or output resource of a tool/service.
  • licenceTerms (Mandatory): See licenceTerms
  • cost (Mandatory if applicable): Introduces the cost for accessing a resource, formally described as a set of amount and currency unit. Please use only for resources available at a cost and not for free resources.

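Depending on the parts of the corpus, you must also use one or more of the following:

  • distributionTextFeature (Mandatory if applicable): for corpora with text parts. See distributionTextFeature.
  • distributionAudioFeature (Mandatory if applicable): for corpora with audio parts. See distributionAudioFeature.
  • distributionVideoFeature (Mandatory if applicable): for corpora with video parts. See distributionVideoFeature.
  • distributionImageFeature (Mandatory if applicable): for corpora with image parts. See distributionImageFeature.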

Example

<ms:DatasetDistribution>
        <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
        <ms:accessLocation>https://www.someAccessURL.com</ms:accessLocation>
        <ms:samplesLocation>https://www.URLwithsamples.com</ms:samplesLocation>
        <ms:distributionTextFeature>
                <ms:size>
                        <ms:amount>17601</ms:amount>
                        <ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
                </ms:size>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
        </ms:distributionTextFeature>
        <ms:licenceTerms>
                <ms:licenceTermsName xml:lang="en">openUnder-PSI</ms:licenceTermsName>
                <ms:licenceTermsURL>https://elrc-share.eu/terms/openUnderPSI.html</ms:licenceTermsURL>
        </ms:licenceTerms>
</ms:DatasetDistribution>

<ms:DatasetDistribution>
        <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/accessibleThroughInterface</ms:DatasetDistributionForm>
        <ms:accessLocation>https://www.someAccessURL.com</ms:accessLocation>
        <ms:distributionTextFeature>
                <ms:size>
                        <ms:amount>100</ms:amount>
                        <ms:sizeUnit>http://w3id.org/meta-share/meta-share/text1</ms:sizeUnit>
                </ms:size>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Pdf</ms:dataFormat>
                <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
        </ms:distributionTextFeature>
        <ms:licenceTerms>
                <ms:licenceTermsName xml:lang="en">some commercial licence</ms:licenceTermsName>
                <ms:licenceTermsURL>https://elrc-share.eu/terms/someCommercialLicence.html</ms:licenceTermsURL>
        </ms:licenceTerms>
        <ms:cost>
                <ms:amount>10000</ms:amount>
                <ms:currency>http://w3id.org/meta-share/meta-share/euro</ms:currency>
        </ms:cost>
</ms:DatasetDistribution>
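
For a corpus that can be fetched through a direct link, the distribution could use downloadLocation instead of accessLocation, as in the sketch below. The URL is a placeholder, and the position of downloadLocation right after DatasetDistributionForm is an assumption; the remaining values are reused from the first example above.

<ms:DatasetDistribution>
        <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
        <!-- placeholder URL; it should point straight at the file, not at a landing page -->
        <ms:downloadLocation>https://www.someDownloadURL.com/corpus.zip</ms:downloadLocation>
        <ms:distributionTextFeature>
                <ms:size>
                        <ms:amount>17601</ms:amount>
                        <ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
                </ms:size>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
                <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
        </ms:distributionTextFeature>
        <ms:licenceTerms>
                <ms:licenceTermsName xml:lang="en">openUnder-PSI</ms:licenceTermsName>
                <ms:licenceTermsURL>https://elrc-share.eu/terms/openUnderPSI.html</ms:licenceTermsURL>
        </ms:licenceTerms>
</ms:DatasetDistribution>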

distributionTextFeature

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionTextFeature

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

Links to a feature that can be used for describing distinct distributable forms of text resources/parts

The following are mandatory or recommended:

  • size (Mandatory): The size of the text part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
  • dataFormat (Mandatory): Indicates the format(s) of a data resource; it takes a value from a CV (dataFormat); the dataFormat includes the IANA mimetype and pointers to additional documentation for specialized formats (e.g., GATE XML, CONLL formats, etc.).
  • characterEncoding (Recommended): Specifies the character encoding used for a language resource data distribution.

Example

<ms:distributionTextFeature>
        <ms:size>
                <ms:amount>9139</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/sentence</ms:sizeUnit>
        </ms:size>
        <ms:size>
                <ms:amount>40</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
        </ms:size>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
        <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
</ms:distributionTextFeature>

distributionAudioFeature

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionAudioFeature

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

Links to a feature that can be used for describing distinct distributable forms of audio resources/parts

The following are mandatory or recommended:

  • size (Mandatory): The size of the audio part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
  • durationOfAudio (Recommended): Specifies the duration of the audio recording including silences, music, pauses, etc., expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
  • durationOfEffectiveSpeech (Recommended): Specifies the duration of effective speech of the audio (part of a) resource, expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
  • audioFormat (Mandatory): Indicates the format(s) of the audio (part of a) data resource, expressed as a value of dataFormat (with a value from a CV) and compressed.

Example

<ms:distributionAudioFeature>
        <ms:size>
                <ms:amount>10</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
        </ms:size>
        <ms:durationOfAudio>
                <ms:amount>3</ms:amount>
                <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
        </ms:durationOfAudio>
        <ms:audioFormat>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
                <ms:compressed>true</ms:compressed>
        </ms:audioFormat>
</ms:distributionAudioFeature>
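
The difference between durationOfAudio and durationOfEffectiveSpeech can be sketched as follows, for a hypothetical set of recordings where one hour out of five is silence or music. The amounts are invented for illustration, and the position of durationOfEffectiveSpeech between durationOfAudio and audioFormat is assumed from the order of the list above.

<ms:distributionAudioFeature>
        <ms:size>
                <ms:amount>25</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
        </ms:size>
        <!-- total length of the recordings, including silences, music and pauses -->
        <ms:durationOfAudio>
                <ms:amount>5</ms:amount>
                <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
        </ms:durationOfAudio>
        <!-- speech only; illustrative amount, element position assumed -->
        <ms:durationOfEffectiveSpeech>
                <ms:amount>4</ms:amount>
                <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
        </ms:durationOfEffectiveSpeech>
        <ms:audioFormat>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
                <ms:compressed>false</ms:compressed>
        </ms:audioFormat>
</ms:distributionAudioFeature>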

distributionVideoFeature

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionVideoFeature

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

Links to a feature that can be used for describing distinct distributable forms of video resources/parts

The following are mandatory or recommended:

  • size (Mandatory): The size of the video part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
  • durationOfVideo (Recommended): Specifies the duration of the video recording, expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
  • videoFormat (Mandatory): Indicates the format(s) of the video (part of a) data resource, expressed as a value of dataFormat (with a value from a CV) and compressed.

Example

<ms:distributionVideoFeature>
        <ms:size>
                <ms:amount>9139</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/screen</ms:sizeUnit>
        </ms:size>
        <ms:size>
                <ms:amount>40</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
        </ms:size>
        <ms:durationOfVideo>
                <ms:amount>40</ms:amount>
                <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
        </ms:durationOfVideo>
        <ms:videoFormat>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
                <ms:compressed>true</ms:compressed>
        </ms:videoFormat>
</ms:distributionVideoFeature>

distributionImageFeature

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionImageFeature

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

Links to a feature that can be used for describing distinct distributable forms of image resources/parts

The following are mandatory or recommended:

  • size (Mandatory): The size of the image part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
  • imageFormat (Mandatory): Indicates the format(s) of the image (part of a) data resource, expressed as a value of dataFormat (with a value from a CV) and compressed.

Example

<ms:distributionImageFeature>
        <ms:size>
                <ms:amount>100</ms:amount>
                <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
        </ms:size>
        <ms:imageFormat>
                <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Pdf</ms:dataFormat>
                <ms:compressed>true</ms:compressed>
        </ms:imageFormat>
</ms:distributionImageFeature>

personalDataIncluded

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.personalDataIncluded

Data type boolean

Optionality Mandatory

Explanation & Instructions

Specifies whether the language resource contains personal data (mainly in the sense falling under the GDPR)

If the resource contains personal data, you can use the (optional) personalDataDetails to provide more information

Example

<ms:personalDataIncluded>true</ms:personalDataIncluded>
<ms:personalDataDetails>The corpus contains data on the place of living and place of birth of participants</ms:personalDataDetails>

sensitiveDataIncluded

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.sensitiveDataIncluded

Data type boolean

Optionality Mandatory

Explanation & Instructions

Specifies whether the language resource contains sensitive data (e.g., medical/health-related, etc.) and thus requires special handling

If the resource contains sensitive data, you can use the (optional) sensitiveDataDetails to provide more information.

Example

<ms:sensitiveDataIncluded>true</ms:sensitiveDataIncluded>
<ms:sensitiveDataDetails>The corpus contains medical data for persons with disabilities</ms:sensitiveDataDetails>

anonymized

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.anonymized

Data type boolean

Optionality Mandatory if applicable

Explanation & Instructions

Indicates whether the language resource has been anonymized

The element is mandatory if either personalDataIncluded or sensitiveDataIncluded has ‘true’ as value; anonymizationDetails must also be filled in with information on the anonymization method, etc.

Example

<ms:anonymized>true</ms:anonymized>
<ms:anonymizationDetails>pseudonymization performed manually</ms:anonymizationDetails>
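
Taken together, the three elements might appear as in the sketch below, for a hypothetical corpus that contains personal but no sensitive data. Because personalDataIncluded is ‘true’, anonymized and anonymizationDetails are filled in as well. The free-text values are invented for illustration, and the element order is assumed from the order in which the elements are described here.

<ms:personalDataIncluded>true</ms:personalDataIncluded>
<!-- illustrative free-text values -->
<ms:personalDataDetails>The recordings mention the names of the speakers</ms:personalDataDetails>
<ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
<ms:anonymized>true</ms:anonymized>
<ms:anonymizationDetails>speaker names replaced with codes in the transcripts</ms:anonymizationDetails>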

annotation

Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.annotation

Data type component

Optionality Mandatory if applicable

Explanation & Instructions

Links a corpus to its annotated part(s)

You must use it for annotated corpora and annotations. You can repeat it for corpora that have separate files for each annotation type, or if you want to give information such as the use of different annotation tools for each annotation level.

Enter at least the annotation type(s); if you want, you can give a more detailed description of the annotated parts - see the annotation component of the full schema.

Example

<ms:annotation>
        <ms:annotationType>http://w3id.org/meta-share/omtd-share/Lemma</ms:annotationType>
        <ms:annotationStandoff>false</ms:annotationStandoff>
        <ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
        <ms:isAnnotatedBy>
                <ms:resourceName xml:lang="en">Lemmatizer</ms:resourceName>
        </ms:isAnnotatedBy>
</ms:annotation>

<ms:annotation>
        <ms:annotationType>http://w3id.org/meta-share/omtd-share/PartOfSpeech</ms:annotationType>
        <ms:annotationStandoff>false</ms:annotationStandoff>
        <ms:tagset>
                <ms:resourceName xml:lang="en">Universal Dependencies</ms:resourceName>
        </ms:tagset>
        <ms:isAnnotatedBy>
                <ms:resourceName xml:lang="en">PoS tagger</ms:resourceName>
        </ms:isAnnotatedBy>
</ms:annotation>

<ms:annotation>
        <ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
</ms:annotation>