Describe a corpus (dataset)¶
This section explains how to describe a corpus with the minimal metadata required to register it on the ELG platform. For more on the ELG resource types, see Overview. Instructions that apply to all data resources (technical requirements and registration on the platform) are in Provide a Language Resource.
Corpora are structured collections of data selected according to specific criteria so as to represent a research question as comprehensively as possible. The most common cases are:
- text corpora: monolingual, bilingual or multilingual collections of texts in a specific domain, such as corpora of news articles, scientific publications, legal documents, medical records, tweets, etc.
- corpora of audio recordings, e.g., of broadcast news, or lists of sentences recorded by individuals from a specific region with a dialect accent, etc.
- collections of videos, such as interviews with politicians, sign language corpora, etc.
- corpora combining all of the above, such as a multimedia corpus of video lectures, with their audio recordings, transcripts, subtitles and their translations.
Examples of metadata records for corpora¶
Bilingual raw corpus: Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)
Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/734
<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
<ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
<ms:metadataCreationDate>2020-10-03</ms:metadataCreationDate>
<ms:metadataCurator>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Smith</ms:surname>
<ms:givenName xml:lang="en">John</ms:givenName>
<ms:email>username@someDomain.com</ms:email>
</ms:metadataCurator>
<ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
<ms:metadataCreator>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Smith</ms:surname>
<ms:givenName xml:lang="en">John</ms:givenName>
<ms:email>username@someDomain.com</ms:email>
</ms:metadataCreator>
<ms:DescribedEntity>
<ms:LanguageResource>
<ms:entityType>LanguageResource</ms:entityType>
<ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG) (Processed)</ms:resourceName>
<ms:description xml:lang="en">Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency.
Bilingual Bulgarian-English corpus of administrative documents on the Refund of Value Added Tax from the Bulgarian National Revenue Agency. It was offered as collection of documents by the Bulgarian National Revenue Agency. Modules of the ILSP Focused Crawler was used for the normalization, cleaning, (near) de-duplication and identification of parallel documents. The Maligna sentence aligner was used for extracting segment alignments from crawled parallel documents. As a post-processing step, alignments were merged into one TMX file. The following filters were applied: TMX files generated from document pairs which have been identified by non-aupidh methods were discarded ; TMX files with a zeroToOne_alignments/total_alignments ratio larger than 0.16, were discarded ; Alignments of non-[1:1] type(s) were discarded. ; Alignments with a TUV (after normalization) that has less than 1 tokens, were annotated ; Alignments with a l1/l2 TUV length ratio smaller than 0.6 or larger than 1.6, were annotated ; Alignments in which different digits appear in each TUV were kept and annotated. ; Alignments with identical TUVs (after normalization) were annotated ; Alignments with only non-letters in at least one of their TUVs were annotated ; Duplicate alignments were kept and were annotated. The mean value of aligner's scores is 5.714609036504669, the std value is 1.8063256236105307. The mean value of length (in terms of characters) ratios is 1.0040012545201242 and the std value is 0.26545877788005745. There are 832 TUs with no annotation, containing 13336 words and 2604 lexical types in bul and 15010 words and 2031 lexical types in eng. The mean value of aligner's scores is 6.336834960545485, the std value is 1.53829791384023</ms:description>
<ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_471</ms:LRIdentifier>
<ms:version>2.0</ms:version>
<ms:additionalInfo>
<ms:landingPage>https://elrc-share.eu/repository/browse/bilingual-bulgarian-english-corpus-from-the-national-revenue-agency-bg-processed/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:landingPage>
</ms:additionalInfo>
<ms:additionalInfo>
<ms:email>contact@someDomain.com</ms:email>
</ms:additionalInfo>
<ms:contact>
<ms:Person>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Rusinova</ms:surname>
<ms:givenName xml:lang="en">Annie</ms:givenName>
<ms:email>contact@someDomain.com</ms:email>
</ms:Person>
</ms:contact>
<ms:iprHolder>
<ms:Organization>
<ms:actorType>Organization</ms:actorType>
<ms:organizationName xml:lang="en">National Revenue Agency (BG)</ms:organizationName>
<ms:website>http://www.nap.bg/en/</ms:website>
</ms:Organization>
</ms:iprHolder>
<ms:keyword xml:lang="en">corpus</ms:keyword>
<ms:domain>
<ms:categoryLabel xml:lang="en">FINANCE</ms:categoryLabel>
<ms:DomainIdentifier ms:DomainClassificationScheme="http://w3id.org/meta-share/meta-share/EUROVOC">24</ms:DomainIdentifier>
</ms:domain>
<ms:fundingProject>
<ms:projectName xml:lang="en">European Language Resource Coordination LOT3</ms:projectName>
<ms:ProjectIdentifier ms:ProjectIdentifierScheme="http://w3id.org/meta-share/meta-share/other">Tools and Resources for CEF Automated Translation - LOT3 (SMART 2015/1091 - 30-CE-0816766/00-92)</ms:ProjectIdentifier>
<ms:website>http://www.lr-coordination.eu</ms:website>
</ms:fundingProject>
<ms:validated>true</ms:validated>
<ms:validation>
<ms:validationDetails xml:lang="en">validated</ms:validationDetails>
</ms:validation>
<ms:relation>
<ms:relationType xml:lang="en">isAlignedVersionOf</ms:relationType>
<ms:relatedLR>
<ms:resourceName xml:lang="en">Bilingual Bulgarian-English corpus from the National Revenue Agency (BG)</ms:resourceName>
<ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
</ms:relatedLR>
</ms:relation>
<ms:LRSubclass>
<ms:Corpus>
<ms:lrType>Corpus</ms:lrType>
<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/rawCorpus</ms:corpusSubclass>
<ms:CorpusMediaPart>
<ms:CorpusTextPart>
<ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
<ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
<ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
<ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
<ms:language>
<ms:languageTag>bg</ms:languageTag>
<ms:languageId>bg</ms:languageId>
<ms:scriptId>Cyrl</ms:scriptId>
</ms:language>
<ms:language>
<ms:languageTag>en</ms:languageTag>
<ms:languageId>en</ms:languageId>
</ms:language>
<ms:textType>
<ms:categoryLabel xml:lang="en">administrativeTexts</ms:categoryLabel>
</ms:textType>
<ms:TextGenre>
<ms:categoryLabel xml:lang="en">official</ms:categoryLabel>
</ms:TextGenre>
<ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
<ms:hasOriginalSource>
<ms:resourceName xml:lang="en">ELRC-447</ms:resourceName>
<ms:LRIdentifier ms:LRIdentifierScheme="http://w3id.org/meta-share/meta-share/other">ELRC_447</ms:LRIdentifier>
</ms:hasOriginalSource>
<ms:creationDetails xml:lang="en">See description for creation details</ms:creationDetails>
</ms:CorpusTextPart>
</ms:CorpusMediaPart>
<ms:DatasetDistribution>
<ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
<ms:accessLocation>https://elrc-share.eu/repository/download/4ed47824d04a11e7b7d400155d026706dbe4fc9f12424b5ba0a749fd6758072b/</ms:accessLocation>
<ms:distributionTextFeature>
<ms:size>
<ms:amount>1292</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
</ms:size>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
<ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
</ms:distributionTextFeature>
<ms:licenceTerms>
<ms:licenceTermsName xml:lang="en">publicDomain</ms:licenceTermsName>
<ms:licenceTermsURL>https://elrc-share.eu/terms/publicDomain.html</ms:licenceTermsURL>
<ms:LicenceIdentifier ms:LicenceIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">publicDomain</ms:LicenceIdentifier>
</ms:licenceTerms>
<ms:cost>
<ms:amount>0</ms:amount>
<ms:currency>http://w3id.org/meta-share/meta-share/euro</ms:currency>
</ms:cost>
</ms:DatasetDistribution>
<ms:personalDataIncluded>false</ms:personalDataIncluded>
<ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
</ms:Corpus>
</ms:LRSubclass>
</ms:LanguageResource>
</ms:DescribedEntity>
</ms:MetadataRecord>
Annotated corpus: Greek Textual Entailment corpus
Published at: https://live.european-language-grid.eu/catalogue/#/resource/service/corpus/649
<?xml version="1.0" encoding="UTF-8"?>
<ms:MetadataRecord xmlns="http://w3id.org/meta-share/meta-share/" xmlns:datacite="http://purl.org/spar/datacite/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:ms="http://w3id.org/meta-share/meta-share/" xmlns:omtd="http://w3id.org/meta-share/omtd-share/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://w3id.org/meta-share/meta-share/ ../../Schema/ELG-SHARE.xsd">
<ms:MetadataRecordIdentifier ms:MetadataRecordIdentifierScheme="http://w3id.org/meta-share/meta-share/elg">value automatically assigned - leave as is</ms:MetadataRecordIdentifier>
<ms:metadataCreationDate>2020-02-02</ms:metadataCreationDate>
<ms:metadataCurator>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Smith</ms:surname>
<ms:givenName xml:lang="en">John</ms:givenName>
<ms:email>curator@somedomain.com</ms:email>
</ms:metadataCurator>
<ms:compliesWith>http://w3id.org/meta-share/meta-share/ELG-SHARE</ms:compliesWith>
<ms:metadataCreator>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Smith</ms:surname>
<ms:givenName xml:lang="en">John</ms:givenName>
<ms:email>curator@somedomain.com</ms:email>
</ms:metadataCreator>
<ms:sourceOfMetadataRecord>META-SHARE/ILSP</ms:sourceOfMetadataRecord>
<ms:DescribedEntity>
<ms:LanguageResource>
<ms:entityType>LanguageResource</ms:entityType>
<ms:resourceName xml:lang="en">Greek Textual Entailment Corpus</ms:resourceName>
<ms:resourceShortName xml:lang="en">GTEC</ms:resourceShortName>
<ms:description xml:lang="en">GTEC consists of 600 T-H pairs manually annotated for entailment (i.e. whether T entails H or not) by human annotators. The dataset which is tailored to guide training and evaluation of prospect RTE systems, is equally divided in three subsets each one representing the output of a specific HLT application: Question Answering (QA), Comparable Documents (CD) and Machine Translation (MT), and pertaining to specific subject fields (e.g. law, politics, travel). T-H examples that correspond to success and failure cases of the afore-mentioned applications have been included in the corpus. The annotations provided are conformant to the RTE1 and RTE2 challenges.</ms:description>
<ms:version>v1.0.0 (automatically assigned)</ms:version>
<ms:additionalInfo>
<ms:email>username@someDomain.com</ms:email>
</ms:additionalInfo>
<ms:additionalInfo>
<ms:email>username3@someDomain.com</ms:email>
</ms:additionalInfo>
<ms:contact>
<ms:Person>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Giouli</ms:surname>
<ms:givenName xml:lang="en">Voula</ms:givenName>
<ms:email>username@someDomain.com</ms:email>
</ms:Person>
</ms:contact>
<ms:contact>
<ms:Person>
<ms:actorType>Person</ms:actorType>
<ms:surname xml:lang="en">Piperidis</ms:surname>
<ms:givenName xml:lang="en">Stelios</ms:givenName>
<ms:email>username3@someDomain.com</ms:email>
</ms:Person>
</ms:contact>
<ms:keyword xml:lang="en">corpus</ms:keyword>
<ms:domain>
<ms:categoryLabel xml:lang="en">law</ms:categoryLabel>
</ms:domain>
<ms:domain>
<ms:categoryLabel xml:lang="en">politics</ms:categoryLabel>
</ms:domain>
<ms:domain>
<ms:categoryLabel xml:lang="en">travel</ms:categoryLabel>
</ms:domain>
<ms:resourceCreator>
<ms:Organization>
<ms:actorType>Organization</ms:actorType>
<ms:organizationName xml:lang="en">Institute for Language and Speech Processing</ms:organizationName>
<ms:website>http://www.ilsp.gr</ms:website>
</ms:Organization>
</ms:resourceCreator>
<ms:intendedApplication>
<ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
</ms:intendedApplication>
<ms:actualUse>
<ms:usedInApplication>
<ms:LTClassRecommended>http://w3id.org/meta-share/omtd-share/AnnotationOfTextualEntailment</ms:LTClassRecommended>
</ms:usedInApplication>
<ms:actualUseDetails xml:lang="en">nlpApplications</ms:actualUseDetails>
</ms:actualUse>
<ms:isDocumentedBy>
<ms:title xml:lang="en">Building a Greek corpus of Textual Entailment</ms:title>
<ms:DocumentIdentifier ms:DocumentIdentifierScheme="http://purl.org/spar/datacite/url">http://www.lrec-conf.org/proceedings/lrec2008/pdf/427_paper.pdf</ms:DocumentIdentifier>
</ms:isDocumentedBy>
<ms:LRSubclass>
<ms:Corpus>
<ms:lrType>Corpus</ms:lrType>
<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/annotatedCorpus</ms:corpusSubclass>
<ms:CorpusMediaPart>
<ms:CorpusTextPart>
<ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
<ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
<ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
<ms:language>
<ms:languageTag>el</ms:languageTag>
<ms:languageId>el</ms:languageId>
</ms:language>
<ms:creationMode>http://w3id.org/meta-share/meta-share/mixed</ms:creationMode>
<ms:originalSourceDescription xml:lang="en">web news</ms:originalSourceDescription>
<ms:originalSourceDescription xml:lang="en">EU texts</ms:originalSourceDescription>
</ms:CorpusTextPart>
</ms:CorpusMediaPart>
<ms:DatasetDistribution>
<ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
<ms:accessLocation>http://metashare.ilsp.gr:8080/repository/download/26dca2fe63d211e29b2c842b2b6a04d7db87c85bfbe34326bb4c2e88b8c4da85</ms:accessLocation>
<ms:distributionTextFeature>
<ms:size>
<ms:amount>600</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/T-HPair</ms:sizeUnit>
</ms:size>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
</ms:distributionTextFeature>
<ms:licenceTerms>
<ms:licenceTermsName xml:lang="en">CC-BY-4.0</ms:licenceTermsName>
<ms:licenceTermsURL>https://spdx.org/licenses/CC-BY-4.0.html</ms:licenceTermsURL>
</ms:licenceTerms>
<ms:attributionText xml:lang="en">Greek Textual Entailment Corpus by Athena R.C./ILSP used under CC-BY licence</ms:attributionText>
</ms:DatasetDistribution>
<ms:personalDataIncluded>false</ms:personalDataIncluded>
<ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/Lemma</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
<ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
<ms:isAnnotatedBy>
<ms:resourceName xml:lang="en">ILSP-Lemmatizer</ms:resourceName>
</ms:isAnnotatedBy>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/PartOfSpeech</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:tagset>
<ms:resourceName xml:lang="en">ILSP/PAROLE tagset</ms:resourceName>
</ms:tagset>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
<ms:annotationModeDetails xml:lang="en">automatic annotation followed with manual disambiguation</ms:annotationModeDetails>
<ms:isAnnotatedBy>
<ms:resourceName xml:lang="en">ILSP FBT POS tagger</ms:resourceName>
</ms:isAnnotatedBy>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
<ms:annotationModeDetails xml:lang="en">Automatic annotation followed by manual correction</ms:annotationModeDetails>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/SemanticAnnotationType</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/manual</ms:annotationMode>
</ms:annotation>
</ms:Corpus>
</ms:LRSubclass>
</ms:LanguageResource>
</ms:DescribedEntity>
</ms:MetadataRecord>
Minimal version metadata for corpora¶
The metadata elements (mandatory or recommended) that are common to all kinds of resources, including data language resources, are presented in the section Minimal version - List of elements common to all LRTs. The additional elements that are required or recommended for corpora are described below.
For a quick guide to the ELG template, see Template - Explanations.
Corpus¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus
Data type component
Optionality Mandatory
Explanation & Instructions
Wraps together the set of elements that is specific to corpora
Example
<ms:LRSubclass>
  <ms:Corpus>
    <ms:lrType>Corpus</ms:lrType>
  </ms:Corpus>
</ms:LRSubclass>
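The dotted Path notation used in these tables corresponds to nested elements in the `ms` namespace. The following is a minimal sketch of that mapping, assuming Python's standard `xml.etree.ElementTree`; the helper name `find_by_path` is hypothetical and the embedded record is a trimmed illustration, not a complete or valid ELG record.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "ms" prefix in the example records.
MS = "http://w3id.org/meta-share/meta-share/"

# Trimmed illustration only: a real record needs many more elements.
RECORD = f"""
<ms:MetadataRecord xmlns:ms="{MS}">
  <ms:DescribedEntity>
    <ms:LanguageResource>
      <ms:LRSubclass>
        <ms:Corpus>
          <ms:lrType>Corpus</ms:lrType>
        </ms:Corpus>
      </ms:LRSubclass>
    </ms:LanguageResource>
  </ms:DescribedEntity>
</ms:MetadataRecord>
"""


def find_by_path(root, dotted_path):
    """Resolve a dotted documentation path (hypothetical helper).

    The first segment names the root element itself; the remaining
    segments are looked up as nested children in the ms namespace.
    """
    first, *rest = dotted_path.split(".")
    if root.tag != f"{{{MS}}}{first}":
        return None
    node = root
    for name in rest:
        node = node.find(f"{{{MS}}}{name}")
        if node is None:
            return None
    return node


root = ET.fromstring(RECORD)
corpus = find_by_path(
    root, "MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus"
)
lr_type = corpus.find(f"{{{MS}}}lrType").text
print(lr_type)
```

The helper returns `None` as soon as a segment is missing, which makes it usable as a quick presence check for any Path listed in this section.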
corpusSubclass¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.corpusSubclass
Data type CV (corpusSubclass)
Optionality Mandatory
Explanation & Instructions
Classifies corpora into types (used for descriptive purposes).
Use one of the values for raw corpora, annotated corpora (raw data together with annotations), or annotations (annotations only, without the original corpus).
Example
<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/rawCorpus</ms:corpusSubclass>
<ms:corpusSubclass>http://w3id.org/meta-share/meta-share/annotatedCorpus</ms:corpusSubclass>
CorpusTextPart¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusTextPart
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
The part of a corpus (or a whole corpus) that consists of textual segments (e.g., a corpus of publications, transcriptions of an oral corpus, subtitles, etc.).
You can repeat the group of elements for multiple textual parts.
The mandatory or recommended elements for the text part are:
- mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For text parts, always use the value ‘text’.
- lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
- multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values for parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
- language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
- languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to that variety. Please use for dialect corpora.
- modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
- TextGenre (Recommended): A category of text characterized by a particular style, form, or content according to a specific classification scheme. See TextGenre.
Example
<ms:CorpusTextPart>
  <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>es</ms:languageTag>
    <ms:languageId>es</ms:languageId>
  </ms:language>
</ms:CorpusTextPart>

<ms:CorpusTextPart>
  <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/bilingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>es</ms:languageTag>
    <ms:languageId>es</ms:languageId>
  </ms:language>
  <ms:language>
    <ms:languageTag>en</ms:languageTag>
    <ms:languageId>en</ms:languageId>
  </ms:language>
  <ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
  <ms:TextGenre>
    <ms:CategoryLabel>administrative texts</ms:CategoryLabel>
  </ms:TextGenre>
</ms:CorpusTextPart>

<ms:CorpusTextPart>
  <ms:corpusMediaType>CorpusTextPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>en</ms:languageTag>
    <ms:languageId>en</ms:languageId>
  </ms:language>
  <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
</ms:CorpusTextPart>
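The conditional rule for multilingualityType (required when lingualityType is bilingual or multilingual) can be checked before submitting a record. The sketch below assumes Python's standard `xml.etree.ElementTree`; the function name `multilinguality_problems` is hypothetical, and the `multilingual` IRI is assumed by analogy with the `monolingual` and `bilingual` values shown in the examples. It is an illustration, not the platform's own validation.

```python
import xml.etree.ElementTree as ET

MS = "http://w3id.org/meta-share/meta-share/"
NS = {"ms": MS}

# lingualityType values for which multilingualityType becomes mandatory.
# The "multilingual" IRI is an assumption made by analogy with "bilingual".
NEEDS_MULTILINGUALITY = {MS + "bilingual", MS + "multilingual"}


def multilinguality_problems(part):
    """Return a list of problems found in one corpus part element."""
    problems = []
    linguality = part.findtext("ms:lingualityType", default="", namespaces=NS).strip()
    has_multi = part.find("ms:multilingualityType", NS) is not None
    if not linguality:
        problems.append("lingualityType is missing")
    if not part.findall("ms:language", NS):
        problems.append("no language element found")
    if linguality in NEEDS_MULTILINGUALITY and not has_multi:
        problems.append("multilingualityType is required for this lingualityType")
    return problems


def make_part(inner):
    """Wrap child elements in a CorpusTextPart (illustration only)."""
    return ET.fromstring(
        f'<ms:CorpusTextPart xmlns:ms="{MS}">{inner}</ms:CorpusTextPart>'
    )


BILINGUAL = (
    f"<ms:lingualityType>{MS}bilingual</ms:lingualityType>"
    "<ms:language><ms:languageTag>es</ms:languageTag></ms:language>"
    "<ms:language><ms:languageTag>en</ms:languageTag></ms:language>"
)
incomplete = make_part(BILINGUAL)
complete = make_part(
    BILINGUAL + f"<ms:multilingualityType>{MS}parallel</ms:multilingualityType>"
)

print(multilinguality_problems(incomplete))
print(multilinguality_problems(complete))
```

The same function can be applied to audio, video and image parts, since they share the lingualityType, multilingualityType and language elements.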
CorpusAudioPart¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
The part of a corpus (or whole corpus) that consists of audio segments
You can repeat the group of elements for multiple audio parts.
The mandatory or recommended elements for the audio part are:
- mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For audio parts, always use the value ‘audio’.
- lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
- multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values for parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
- language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
- languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to that variety. Please use for dialect corpora.
- modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
- AudioGenre (Recommended if applicable): A category of audio characterized by a particular style, form, or content according to a specific classification scheme. See AudioGenre.
- SpeechGenre (Recommended if applicable): A category for the conventionalized discourse of the speech part of a language resource, based on extra-linguistic and internal linguistic criteria. See SpeechGenre.
Example
<ms:CorpusAudioPart>
  <ms:corpusMediaType>CorpusAudioPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/audio</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>en</ms:languageTag>
    <ms:languageId>en</ms:languageId>
  </ms:language>
  <ms:AudioGenre>
    <ms:CategoryLabel>conference noises</ms:CategoryLabel>
  </ms:AudioGenre>
</ms:CorpusAudioPart>

<ms:CorpusAudioPart>
  <ms:corpusMediaType>CorpusAudioPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/audio</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>en</ms:languageTag>
    <ms:languageId>en</ms:languageId>
  </ms:language>
  <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
  <ms:SpeechGenre>
    <ms:CategoryLabel>monologue</ms:CategoryLabel>
  </ms:SpeechGenre>
</ms:CorpusAudioPart>
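If many audio parts have to be described, the mandatory elements can be generated rather than typed by hand. The sketch below assumes Python 3.9+ (for `ET.indent`) and its standard `xml.etree.ElementTree`; the function `build_audio_part` is hypothetical, covers only the mandatory elements, and assumes a plain language subtag such as `en`, for which languageTag and languageId coincide as in the examples above.

```python
import xml.etree.ElementTree as ET

MS = "http://w3id.org/meta-share/meta-share/"
# Serialize the namespace with the "ms" prefix used in the ELG examples.
ET.register_namespace("ms", MS)


def q(name):
    """Qualify a local element name with the ms namespace."""
    return f"{{{MS}}}{name}"


def build_audio_part(language_tag, linguality="monolingual"):
    """Build a CorpusAudioPart holding only the mandatory elements.

    Assumes a plain language subtag (e.g. "en"), so the same value is
    written to languageTag and languageId.
    """
    part = ET.Element(q("CorpusAudioPart"))
    ET.SubElement(part, q("corpusMediaType")).text = "CorpusAudioPart"
    ET.SubElement(part, q("mediaType")).text = MS + "audio"
    ET.SubElement(part, q("lingualityType")).text = MS + linguality
    language = ET.SubElement(part, q("language"))
    ET.SubElement(language, q("languageTag")).text = language_tag
    ET.SubElement(language, q("languageId")).text = language_tag
    return part


part = build_audio_part("en")
ET.indent(part)  # pretty-print in place (Python 3.9+)
xml_text = ET.tostring(part, encoding="unicode")
print(xml_text)
```

Recommended elements such as AudioGenre or SpeechGenre would be appended in the same way once their values are known.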
CorpusVideoPart¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusVideoPart
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
The part of a corpus (or a whole corpus) that consists of video segments (e.g., a corpus of video lectures, a part of a corpus with news, a sign language corpus, etc.)
You can repeat the group of elements for multiple video parts.
The mandatory or recommended elements for the video part are:
- mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For video parts, always use the value ‘video’.
- lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
- multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values for parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
- language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
- languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to that variety. Please use for dialect corpora.
- modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource. For instance, you can use ‘spoken language’ to describe transcribed speech corpora.
- VideoGenre (Recommended): A classification of video parts based on extra-linguistic and internal linguistic criteria and reflected on the video style, form or content. See VideoGenre.
- typeOfVideoContent (Mandatory): Main type of object or people represented in the video.
Example
<ms:CorpusVideoPart>
  <ms:corpusMediaType>CorpusVideoPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/video</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>en</ms:languageTag>
    <ms:languageId>en</ms:languageId>
  </ms:language>
  <ms:modalityType>http://w3id.org/meta-share/meta-share/bodyGesture</ms:modalityType>
  <ms:modalityType>http://w3id.org/meta-share/meta-share/facialExpression</ms:modalityType>
  <ms:modalityType>http://w3id.org/meta-share/meta-share/spokenLanguage</ms:modalityType>
  <ms:typeOfVideoContent>people eating at a restaurant</ms:typeOfVideoContent>
</ms:CorpusVideoPart>

<ms:CorpusVideoPart>
  <ms:corpusMediaType>CorpusVideoPart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/video</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>fr</ms:languageTag>
    <ms:languageId>fr</ms:languageId>
  </ms:language>
  <ms:VideoGenre>
    <ms:CategoryLabel>documentary</ms:CategoryLabel>
  </ms:VideoGenre>
  <ms:typeOfVideoContent>birds, wild animals, plants</ms:typeOfVideoContent>
</ms:CorpusVideoPart>
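A quick way to catch a forgotten mandatory element in a video part is to compare its direct children against the elements marked as mandatory in this section. The sketch below assumes Python's standard `xml.etree.ElementTree`; the helper `missing_mandatory` and the sample fragment are hypothetical, and the check covers presence only, not the validity of the values.

```python
import xml.etree.ElementTree as ET

MS = "http://w3id.org/meta-share/meta-share/"

# Child elements marked "(Mandatory)" for the video part in this section.
MANDATORY_VIDEO_CHILDREN = (
    "mediaType",
    "lingualityType",
    "language",
    "typeOfVideoContent",
)


def missing_mandatory(part, required=MANDATORY_VIDEO_CHILDREN):
    """Return the names of required direct children that are absent."""
    present = {child.tag for child in part}
    return [name for name in required if f"{{{MS}}}{name}" not in present]


# Illustration: a video part where typeOfVideoContent was forgotten.
VIDEO_PART = f"""
<ms:CorpusVideoPart xmlns:ms="{MS}">
  <ms:corpusMediaType>CorpusVideoPart</ms:corpusMediaType>
  <ms:mediaType>{MS}video</ms:mediaType>
  <ms:lingualityType>{MS}monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>fr</ms:languageTag>
    <ms:languageId>fr</ms:languageId>
  </ms:language>
</ms:CorpusVideoPart>
"""

missing = missing_mandatory(ET.fromstring(VIDEO_PART))
print(missing)
```

Passing a different `required` tuple adapts the check to the other part types, for example with typeOfImageContent for image parts.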
CorpusImagePart¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusImagePart
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
The part of a corpus (or whole corpus) that consists of images (e.g., a corpus of photographs and their captions)
You can repeat the group of elements for multiple image parts.
The mandatory or recommended elements for the image part are:
- mediaType (Mandatory): Specifies the media type of a language resource (the physical medium of the contents representation). For image parts, always use the value ‘image’.
- lingualityType (Mandatory): Indicates whether the resource includes one, two or more languages.
- multilingualityType (Mandatory if applicable): Indicates whether the resource (part) is parallel, comparable or mixed. It is required if lingualityType is bilingual or multilingual; select one of the values for parallel (e.g., original text and its translations), comparable (e.g., a corpus of the same domain in multiple languages) or multilingualSingleText (for corpora that consist of segments including text in two or more languages, e.g., the transcription of a European Parliament session with MPs speaking in their native language).
- language (Mandatory): Specifies the language that is used in the resource part, expressed according to the BCP47 recommendation. See language.
- languageVariety (Mandatory if applicable): Relates a language resource that contains segments in a language variety (e.g., dialect, jargon) to that variety. Please use for dialect corpora.
- modalityType (Recommended if applicable): Specifies the type of the modality represented in the resource.
- ImageGenre (Recommended): A category of images characterized by a particular style, form, or content according to a specific classification scheme. See ImageGenre.
- typeOfImageContent (Mandatory): Main type of object or people represented in the image.
Example
<ms:CorpusImagePart>
  <ms:corpusMediaType>CorpusImagePart</ms:corpusMediaType>
  <ms:mediaType>http://w3id.org/meta-share/meta-share/image</ms:mediaType>
  <ms:lingualityType>http://w3id.org/meta-share/meta-share/monolingual</ms:lingualityType>
  <ms:language>
    <ms:languageTag>el</ms:languageTag>
    <ms:languageId>el</ms:languageId>
  </ms:language>
  <ms:ImageGenre>
    <ms:CategoryLabel>comics</ms:CategoryLabel>
  </ms:ImageGenre>
  <ms:typeOfImageContent>human figures</ms:typeOfImageContent>
</ms:CorpusImagePart>
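Each part type has a fixed mediaType value (text, audio, video or image), so a copy-and-paste slip between parts is easy to detect mechanically. The following sketch assumes Python's standard `xml.etree.ElementTree`; the function `media_type_matches` and the sample fragments are hypothetical illustrations built from the values shown in the examples of this section.

```python
import xml.etree.ElementTree as ET

MS = "http://w3id.org/meta-share/meta-share/"

# mediaType value expected for each part element, as used in the examples.
EXPECTED_MEDIA_TYPE = {
    "CorpusTextPart": MS + "text",
    "CorpusAudioPart": MS + "audio",
    "CorpusVideoPart": MS + "video",
    "CorpusImagePart": MS + "image",
}


def media_type_matches(part):
    """Return True if the part's mediaType fits its element name."""
    local_name = part.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
    expected = EXPECTED_MEDIA_TYPE.get(local_name)
    actual = (part.findtext(f"{{{MS}}}mediaType") or "").strip()
    return expected is not None and actual == expected


def make_part(element_name, media_value):
    """Create a bare part with only a mediaType child (illustration only)."""
    return ET.fromstring(
        f'<ms:{element_name} xmlns:ms="{MS}">'
        f"<ms:mediaType>{MS}{media_value}</ms:mediaType>"
        f"</ms:{element_name}>"
    )


good = make_part("CorpusImagePart", "image")
bad = make_part("CorpusAudioPart", "text")  # value copied from a text part

print(media_type_matches(good))
print(media_type_matches(bad))
```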
TextGenre¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusTextPart.TextGenre
Data type component
Optionality Recommended
Explanation & Instructions
A category of text characterized by a particular style, form, or content according to a specific classification scheme
You can add only a free text value in the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the TextGenreIdentifier and the attribute TextGenreClassificationScheme to provide further details.
Example
<ms:TextGenre>
  <ms:CategoryLabel>movie subtitles</ms:CategoryLabel>
</ms:TextGenre>
<ms:TextGenre>
  <ms:CategoryLabel>news articles</ms:CategoryLabel>
</ms:TextGenre>
AudioGenre¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart.AudioGenre
Data type component
Optionality Recommended if applicable
Explanation & Instructions
A category of audio characterized by a particular style, form, or content according to a specific classification scheme
You can enter just a free-text value in the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the AudioGenreIdentifier and the attribute AudioGenreClassificationScheme to provide further details.
Example
<ms:AudioGenre>
<ms:CategoryLabel>conference noises</ms:CategoryLabel>
</ms:AudioGenre>
SpeechGenre¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusAudioPart.SpeechGenre
Data type component
Optionality Recommended if applicable
Explanation & Instructions
A category for the conventionalized discourse of the speech part of a language resource, based on extra-linguistic and internal linguistic criteria
You can enter just a free-text value in the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the SpeechGenreIdentifier and the attribute SpeechGenreClassificationScheme to provide further details.
Example
<ms:SpeechGenre>
<ms:CategoryLabel>broadcast news</ms:CategoryLabel>
</ms:SpeechGenre>
<ms:SpeechGenre>
<ms:CategoryLabel>monologue</ms:CategoryLabel>
</ms:SpeechGenre>
VideoGenre¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusVideoPart.VideoGenre
Data type string (+ id + scheme)
Optionality Recommended if applicable
Explanation & Instructions
A classification of video parts based on extra-linguistic and internal linguistic criteria and reflected on the video style, form or content
You can enter just a free-text value in the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the VideoGenreIdentifier and the attribute VideoClassificationScheme to provide further details.
Example
<ms:VideoGenre>
<ms:CategoryLabel>documentaries</ms:CategoryLabel>
</ms:VideoGenre>
<ms:VideoGenre>
<ms:CategoryLabel>video lectures</ms:CategoryLabel>
</ms:VideoGenre>
ImageGenre¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.CorpusMediaPart.CorpusImagePart.ImageGenre
Data type component
Optionality Recommended
Explanation & Instructions
A category of images characterized by a particular style, form, or content according to a specific classification scheme
You can enter just a free-text value in the CategoryLabel element; if you have used a value from an established controlled vocabulary, you can use the ImageGenreIdentifier and the attribute ImageClassificationScheme to provide further details.
Example
<ms:ImageGenre>
<ms:CategoryLabel>human faces</ms:CategoryLabel>
</ms:ImageGenre>
<ms:ImageGenre>
<ms:CategoryLabel>landscape</ms:CategoryLabel>
</ms:ImageGenre>
DatasetDistribution¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution
Data type component
Optionality Mandatory
Explanation & Instructions
Any form in which a dataset is distributed, such as a downloadable form in a specific format (e.g., spreadsheet, plain text, etc.) or an API through which it can be accessed
You can repeat the element for multiple distributions.
The mandatory and recommended elements are:
DatasetDistributionForm
(Mandatory): The form (medium/channel) used for distributing a language resource consisting of data (e.g., a corpus, a lexicon, etc.). The typical values are ‘downloadable’, ‘accessibleThroughInterface’, ‘accessibleThroughQuery’ (see more at DatasetDistributionForm).
downloadLocation
(Mandatory if applicable): A URL where the language resource (mainly data but also downloadable software programmes or forms) can be downloaded from. Use this element if the value of DatasetDistributionForm is ‘downloadable’ and only for direct download links (i.e., links from which the dataset is downloaded without the need of further actions such as clicks on a page).
accessLocation
(Mandatory if applicable): A URL where the resource can be accessed from; it can be used for landing pages or for cases where the resource is accessible via an interface, i.e., cases where the resource itself is not provided with a direct link for downloading. Use if the value of DatasetDistributionForm is ‘accessibleThroughInterface’ or ‘accessibleThroughQuery’, but also for download links that are mentioned on a landing page or require some kind of action on the part of the user.
samplesLocation
(Recommended): Links a resource to a URL (or URLs) with samples of a data resource or of the input or output resource of a tool/service.
licenceTerms
(Mandatory): See licenceTerms.
cost
(Mandatory if applicable): Introduces the cost for accessing a resource, formally described as a set of amount and currency unit. Please use only for resources available at a cost and not for free resources.
Depending on the parts of the corpus, you must also use one or more of the following:
distributionTextFeature
: See distributionTextFeature
distributionAudioFeature
: See distributionAudioFeature
distributionVideoFeature
: See distributionVideoFeature
distributionImageFeature
: See distributionImageFeature
Example
<ms:DatasetDistribution>
<ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
<ms:accessLocation>https://www.someAccessURL.com</ms:accessLocation>
<ms:samplesLocation>https://www.URLwithsamples.com</ms:samplesLocation>
<ms:distributionTextFeature>
<ms:size>
<ms:amount>17601</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/unit</ms:sizeUnit>
</ms:size>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
<ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
</ms:distributionTextFeature>
<ms:licenceTerms>
<ms:licenceTermsName xml:lang="en">openUnder-PSI</ms:licenceTermsName>
<ms:licenceTermsURL>https://elrc-share.eu/terms/openUnderPSI.html</ms:licenceTermsURL>
</ms:licenceTerms>
</ms:DatasetDistribution>
<ms:DatasetDistribution>
<ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/accessibleThroughInterface</ms:DatasetDistributionForm>
<ms:accessLocation>https://www.someAccessURL.com</ms:accessLocation>
<ms:distributionTextFeature>
<ms:size>
<ms:amount>100</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/text1</ms:sizeUnit>
</ms:size>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Pdf</ms:dataFormat>
<ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
</ms:distributionTextFeature>
<ms:licenceTerms>
<ms:licenceTermsName xml:lang="en">some commercial licence</ms:licenceTermsName>
<ms:licenceTermsURL>https://elrc-share.eu/terms/someCommercialLicence.html</ms:licenceTermsURL>
</ms:licenceTerms>
<ms:cost>
<ms:amount>10000</ms:amount>
<ms:currency>http://w3id.org/meta-share/meta-share/euro</ms:currency>
</ms:cost>
</ms:DatasetDistribution>
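For a dataset that can be fetched with a direct link, downloadLocation takes the place of accessLocation. The following is a minimal sketch along the lines of the examples above; the download URL and the size are hypothetical, and the element order simply mirrors the first example.

```xml
<ms:DatasetDistribution>
  <ms:DatasetDistributionForm>http://w3id.org/meta-share/meta-share/downloadable</ms:DatasetDistributionForm>
  <!-- Hypothetical URL: it should start the download directly, with no intermediate page -->
  <ms:downloadLocation>https://www.example.com/downloads/corpus.zip</ms:downloadLocation>
  <ms:distributionTextFeature>
    <ms:size>
      <ms:amount>40</ms:amount>
      <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
    </ms:size>
    <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
    <ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
  </ms:distributionTextFeature>
  <ms:licenceTerms>
    <ms:licenceTermsName xml:lang="en">openUnder-PSI</ms:licenceTermsName>
    <ms:licenceTermsURL>https://elrc-share.eu/terms/openUnderPSI.html</ms:licenceTermsURL>
  </ms:licenceTerms>
</ms:DatasetDistribution>
```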
distributionTextFeature¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionTextFeature
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
Links to a feature that can be used for describing distinct distributable forms of text resources/parts
The following are mandatory or recommended:
size
(Mandatory): The size of the text part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
dataFormat
(Mandatory): Indicates the format(s) of a data resource; it takes a value from a CV (dataFormat); the dataFormat includes the IANA mimetype and pointers to additional documentation for specialized formats (e.g., GATE XML, CONLL formats, etc.).
characterEncoding
(Recommended): Specifies the character encoding used for a language resource data distribution.
Example
<ms:distributionTextFeature>
<ms:size>
<ms:amount>9139</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/sentence</ms:sizeUnit>
</ms:size>
<ms:size>
<ms:amount>40</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
</ms:size>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
<ms:characterEncoding>http://w3id.org/meta-share/meta-share/UTF-8</ms:characterEncoding>
</ms:distributionTextFeature>
distributionAudioFeature¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionAudioFeature
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
Links to a feature that can be used for describing distinct distributable forms of audio resources/parts
The following are mandatory or recommended:
size
(Mandatory): The size of the audio part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
durationOfAudio
(Recommended): Specifies the duration of the audio recording including silences, music, pauses, etc., expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
durationOfEffectiveSpeech
(Recommended): Specifies the duration of effective speech of the audio (part of a) resource, expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
audioFormat
(Mandatory): Indicates the format(s) of the audio (part of a) data resource, expressed as a value of dataFormat (with a value from a CV) and compressed.
Example
<ms:distributionAudioFeature>
<ms:size>
<ms:amount>10</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
</ms:size>
<ms:durationOfAudio>
<ms:amount>3</ms:amount>
<ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
</ms:durationOfAudio>
<ms:audioFormat>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
<ms:compressed>true</ms:compressed>
</ms:audioFormat>
</ms:distributionAudioFeature>
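The example above omits durationOfEffectiveSpeech. A sketch of how it might sit next to durationOfAudio is given below; the amounts are made up for illustration, and the placement of the element right after durationOfAudio is an assumption based on the order of the list above.

```xml
<ms:distributionAudioFeature>
  <ms:size>
    <ms:amount>10</ms:amount>
    <ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
  </ms:size>
  <!-- Total recording time, including silences, music and pauses -->
  <ms:durationOfAudio>
    <ms:amount>3</ms:amount>
    <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
  </ms:durationOfAudio>
  <!-- Speech only; illustrative value, expected to be no longer than durationOfAudio -->
  <ms:durationOfEffectiveSpeech>
    <ms:amount>2</ms:amount>
    <ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
  </ms:durationOfEffectiveSpeech>
  <ms:audioFormat>
    <ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
    <ms:compressed>true</ms:compressed>
  </ms:audioFormat>
</ms:distributionAudioFeature>
```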
distributionVideoFeature¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionVideoFeature
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
Links to a feature that can be used for describing distinct distributable forms of video resources/parts
The following are mandatory or recommended:
size
(Mandatory): The size of the video part, expressed as a combination of amount and sizeUnit (with a value from a CV for sizeUnit).
durationOfVideo
(Recommended): Specifies the duration of the video recording, expressed as a combination of amount and durationUnit (with a value from the CV for durationUnit).
videoFormat
(Mandatory): Indicates the format(s) of the video (part of a) data resource, expressed as a value of dataFormat (with a value from a CV) and compressed.
Example
<ms:distributionVideoFeature>
<ms:size>
<ms:amount>9139</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/screen</ms:sizeUnit>
</ms:size>
<ms:size>
<ms:amount>40</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
</ms:size>
<ms:durationOfVideo>
<ms:amount>40</ms:amount>
<ms:durationUnit>http://w3id.org/meta-share/meta-share/hour</ms:durationUnit>
</ms:durationOfVideo>
<ms:videoFormat>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/wav</ms:dataFormat>
<ms:compressed>true</ms:compressed>
</ms:videoFormat>
</ms:distributionVideoFeature>
distributionImageFeature¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.DatasetDistribution.distributionImageFeature
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
Links to a feature that can be used for describing distinct distributable forms of image resources/parts
The following are mandatory or recommended:
Example
<ms:distributionImageFeature>
<ms:size>
<ms:amount>100</ms:amount>
<ms:sizeUnit>http://w3id.org/meta-share/meta-share/file</ms:sizeUnit>
</ms:size>
<ms:imageFormat>
<ms:dataFormat>http://w3id.org/meta-share/omtd-share/Pdf</ms:dataFormat>
<ms:compressed>true</ms:compressed>
</ms:imageFormat>
</ms:distributionImageFeature>
personalDataIncluded¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.personalDataIncluded
Data type boolean
Optionality Mandatory
Explanation & Instructions
Specifies whether the language resource contains personal data (mainly in the sense falling under the GDPR)
If the resource contains personal data, you can use the (optional) personalDataDetails to provide more information.
Example
<ms:personalDataIncluded>true</ms:personalDataIncluded>
<ms:personalDataDetails>The corpus contains data on the place of living and place of birth of participants</ms:personalDataDetails>
sensitiveDataIncluded¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.sensitiveDataIncluded
Data type boolean
Optionality Mandatory
Explanation & Instructions
Specifies whether the language resource contains sensitive data (e.g., medical/health-related, etc.) and thus requires special handling
If the resource contains sensitive data, you can use the (optional) sensitiveDataDetails to provide more information.
Example
<ms:sensitiveDataIncluded>true</ms:sensitiveDataIncluded>
<ms:sensitiveDataDetails>The corpus contains medical data for persons with disabilities</ms:sensitiveDataDetails>
anonymized¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.anonymized
Data type boolean
Optionality Mandatory if applicable
Explanation & Instructions
Indicates whether the language resource has been anonymized
The element is mandatory if either personalDataIncluded or sensitiveDataIncluded has ‘true’ as value; anonymizationDetails must also be filled in with information on the anonymization method, etc.
Example
<ms:anonymized>true</ms:anonymized>
<ms:anonymizationDetails>pseudonymization performed manually</ms:anonymizationDetails>
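To see how the three elements interact, here is a sketch of a corpus that contains personal but no sensitive data; because one of the two flags is ‘true’, anonymized is filled in as well. The values are reused from the examples above, and the order of the elements is an assumption that follows the order of the sections in this page.

```xml
<ms:personalDataIncluded>true</ms:personalDataIncluded>
<ms:personalDataDetails>The corpus contains data on the place of living and place of birth of participants</ms:personalDataDetails>
<ms:sensitiveDataIncluded>false</ms:sensitiveDataIncluded>
<!-- Required here because personalDataIncluded is true -->
<ms:anonymized>true</ms:anonymized>
<ms:anonymizationDetails>pseudonymization performed manually</ms:anonymizationDetails>
```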
annotation¶
Path MetadataRecord.DescribedEntity.LanguageResource.LRSubclass.Corpus.annotation
Data type component
Optionality Mandatory if applicable
Explanation & Instructions
Links a corpus to its annotated part(s)
You must use it for annotated corpora and annotations. You can repeat it for corpora that have separate files for each annotation type, or if you want to give information such as the use of different annotation tools for each annotation level.
Enter at least the annotation type(s); if you want, you can give a more detailed description of the annotated parts - see the annotation component of the full schema.
Example
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/Lemma</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:annotationMode>http://w3id.org/meta-share/meta-share/mixed</ms:annotationMode>
<ms:isAnnotatedBy>
<ms:resourceName xml:lang="en">Lemmatizer</ms:resourceName>
</ms:isAnnotatedBy>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/PartOfSpeech</ms:annotationType>
<ms:annotationStandoff>false</ms:annotationStandoff>
<ms:tagset>
<ms:resourceName xml:lang="en">Universal Dependencies</ms:resourceName>
</ms:tagset>
<ms:isAnnotatedBy>
<ms:resourceName xml:lang="en">PoS tagger</ms:resourceName>
</ms:isAnnotatedBy>
</ms:annotation>
<ms:annotation>
<ms:annotationType>http://w3id.org/meta-share/omtd-share/SyntacticAnnotationType</ms:annotationType>
</ms:annotation>