Internal LT Service API specification

Note

This specification details the API that LT tool containers need to implement in order to be runnable as functional services within the ELG infrastructure. This is distinct from (though closely related to) the public-facing service execution API that outside users use to send requests to ELG services - the public APIs are documented separately.

Where possible, this document uses the MUST/SHOULD/MAY terms from RFC 2119 to indicate requirement levels.

Basic API pattern

In order to integrate an LT tool as a functional service in the ELG infrastructure, the tool MUST offer at least one endpoint that can accept HTTP (1.1 or 2 - preferably cleartext HTTP/2) POST requests conforming to the appropriate request schema, and return an appropriate response as application/json. This specification also details a response pattern based on Server-Sent Events (SSE, a protocol defined as part of HTML5) that long-running tools can use to report progress information - support for this mechanism is RECOMMENDED for all tools but not required.

Endpoints may be sent multiple parallel requests by the ELG platform, and there is no requirement that a service must respond to requests in any particular order - certain services may, for example, be more efficient if they can batch up several requests into one back end process (e.g. for GPU computing) and send the responses in one go. If a tool has limits on the number of concurrent requests a single instance can handle then this information should be supplied to the ELG platform administrators as part of the on-boarding process, so the platform can use this data to decide how to scale the pod replicas to match the level of load on the service at any given time.

Where a tool already has its own native HTTP API it may be more convenient for integrators to provide a separate service adapter image which can handle requests matching the ELG specification and transform them into calls on the tool’s native API. The tool container and the adapter container will run within the same “pod” in Kubernetes and can access each other as localhost.

Utility datatypes

The following JSON structures are used in several places in this specification, so they are documented here to avoid duplication.

Status message

Since the ELG is supposed to be a multilingual platform, error and other status messages are handled using an approach modelled on the i18n mechanism from the Spring Framework - the message is represented by a code, along with a template text with numbered placeholders that are zero-based indices into an array of params replacement values.

{
  "code":"elg.example.no.translation",
  "text":"Default text to use for the {0} if no {1} can be found",
  "params":["message", "translation"],
  "detail":{
    // arbitrary further details that don't need translation,
    // such as a stack trace, service-native error code, etc.
  }
}

ELG provides a common library of fully-translated message codes for service developers to use, as detailed below - developers are free to use their own codes in their own namespaces (i.e. not prefixed elg.) on the understanding that it is their responsibility to provide translations. A mechanism for developers to contribute their translated messages to the platform is under development but not yet generally available.
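
As a rough illustration of how the numbered placeholders relate to the params array, the following Python sketch fills in a status message's template text. The function name render_status_message is hypothetical and not part of any ELG library, and it handles only plain {n} placeholders rather than the full message format syntax used by Spring.

```python
import re


def render_status_message(message):
    """Fill the {n} placeholders in message["text"] from message["params"].

    Hypothetical helper for illustration only: plain numbered placeholders
    are substituted, and indices with no matching param are left untouched.
    """
    params = message.get("params", [])

    def substitute(match):
        index = int(match.group(1))
        # Keep the placeholder as-is if there is no replacement value.
        return str(params[index]) if index < len(params) else match.group(0)

    return re.sub(r"\{(\d+)\}", substitute, message["text"])


example = {
    "code": "elg.example.no.translation",
    "text": "Default text to use for the {0} if no {1} can be found",
    "params": ["message", "translation"],
}
print(render_status_message(example))
```

A client with a translation for the code would apply the same substitution to the translated template instead of the default text.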

Annotations

Many of the request and response types need to represent annotations - pieces of metadata about specific parts of a text or audio data stream, rather than about the stream as a whole. For example, a named entity recogniser might want to state that characters 10 to 15 in the request text represent the name of a female person, or a speech recogniser might want to state that characters 75 to 80 in the transcription represent a word, and map to the time period 1.37 to 1.6 seconds in the source audio. Such structures are represented in a consistent way across all the ELG API messages:

"annotations":{
    "<annotation type>":[
      {
        "start":number,
        "end":number,
        "sourceStart":number,
        "sourceEnd":number,
        "features":{ /* arbitrary JSON */ }
      }
    ]
  }

The <annotation type> is an arbitrary string representing the type of annotation, e.g. “Person” or “Word” in the examples above. For each type of annotation, the matching value is a JSON array of objects, each object representing one annotation of that type. Note that when generating these structures in your API responses the value here MUST be an array even if there is only one annotation of the relevant type - some JSON generation libraries “unwrap” singleton arrays by default. The properties of each annotation object are:

start and end: The position of the annotation in the main data stream to which it refers - this is typically the content directly associated with this annotations structure (for example the text of a translation). When the stream is text these would be Unicode character offsets from the start of the text, for audio they would typically be time points in seconds, etc. Subtracting the start value from the end value should give the length of the annotated area - there are several equivalent ways to conceptualise this, for example with text you could consider the characters as numbered from zero with the start offset inclusive and the end offset exclusive, or you could consider the offsets to represent the positions between characters (so 0 is before the first character, 1 is between the first and second, etc.).
sourceStart and sourceEnd: Where these annotations are relative to a data stream that has been generated from another “source” data stream (e.g. a translation of text in another language, or a transcription of audio), these properties can be optionally used to link to the positions in the source stream (e.g. to align words in the translation with words in the original).
features: Arbitrary JSON representing other properties of the annotation, e.g. a “Person” annotation might have a feature for “gender”, a “Word” from a morphological analyser might have “root” and “suffix”, etc.
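
The offset convention above can be sketched in Python, whose strings are indexed by Unicode code point, so a slice with an inclusive start and exclusive end matches the description directly. The helper names and the sample sentence are illustrative assumptions, not part of the specification.

```python
def make_annotation(start, end, features=None):
    """Build one annotation object; features defaults to an empty object."""
    return {"start": start, "end": end, "features": features or {}}


def add_annotation(annotations, annotation_type, annotation):
    """Append to the list for this type, so the value is always an array,
    even when there is only one annotation of that type."""
    annotations.setdefault(annotation_type, []).append(annotation)
    return annotations


def covered_text(text, annotation):
    """Return the annotated span: start inclusive, end exclusive."""
    return text[annotation["start"]:annotation["end"]]


text = "Yesterday Alice Smith visited Vienna."
annotations = {}
add_annotation(annotations, "Person", make_annotation(10, 21, {"gender": "female"}))

person = annotations["Person"][0]
print(covered_text(text, person))        # the annotated characters
print(person["end"] - person["start"])   # length of the annotated area
```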

Request structure

There are two main types of endpoint currently supported for this specification, one for services whose input is structured or unstructured text and one for services whose input is audio.

Text requests

Services that take plain text (or something from which plain text can be extracted, e.g. HTML) as their input are expected to offer an endpoint that accepts POST requests with Content-Type: application/json whose body conforms to the following structure.

{
  "type":"text",
  "params":{...},   /* optional */
  "content":"The text of the request",
  // mimeType optional - this is the default if omitted
  "mimeType":"text/plain",
  "features":{ /* arbitrary JSON metadata about this content, optional */ },
  "annotations":{ /* optional */
    "<annotation type>":[
      {
        "start":number,
        "end":number,
        "features":{ /* arbitrary JSON */ }
      }
    ]
  }
}

We expect that, from the large number of possible document types, a smaller set will emerge across the ELG as preferred and well supported (for example, plain text, HTML, XML). We do not intend to support binary formats such as PDF or Word as “text” requests, but may introduce other formats to this specification at a later date.

The only parts of this request that are guaranteed to be present are the type (which will always be “text”) and the content. So a minimal request would look like this:

{"type":"text", "content":"This is an example request"}

The optional elements are:

mimeType: the MIME type of the content, if it is not simply plain text
params: vendor-specific parameters - it is up to the individual service implementor to decide how (or indeed whether) to interpret these
features: metadata about the input as a whole
annotations: as described above - the start and end are Unicode character offsets within the content and the sourceStart and sourceEnd are ignored.

Tools that are able to accept text requests are RECOMMENDED to also offer an endpoint that can accept just the plain text (or other types of) “content” posted directly, and treat that the same as they would a message with the "content" property equal to the post data, the "mimeType" taken from the request Content-Type header, and no features or annotations. The "params" should be populated from the URL query string parameters. This endpoint will not be called by the ELG platform internally but it will make the service easier to test outside of the ELG platform infrastructure, and for open-source tools it will allow users to easily download and run the tool locally in Docker on their own hardware.
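
The mapping from a directly-posted body to the JSON request structure could look roughly like the Python sketch below. The function name request_from_raw_post and its arguments are hypothetical; how the body, header and query string are obtained depends on the web framework in use.

```python
from urllib.parse import parse_qs


def request_from_raw_post(body, content_type="text/plain", query_string=""):
    """Wrap directly-posted content in the JSON text request structure.

    Hypothetical helper: body is the decoded post data, content_type the
    value of the Content-Type header, query_string the raw URL query.
    """
    # Drop any header parameters such as "; charset=utf-8".
    mime_type = content_type.split(";", 1)[0].strip() or "text/plain"
    # parse_qs returns a list per key; keep single values as plain strings.
    params = {
        key: values[0] if len(values) == 1 else values
        for key, values in parse_qs(query_string).items()
    }
    request = {"type": "text", "content": body, "mimeType": mime_type}
    if params:
        request["params"] = params
    return request


print(request_from_raw_post("Hello", "text/html; charset=utf-8", "lang=en"))
```

The result can then be passed to the same handler that serves the JSON endpoint, so both entry points share one code path.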

Structured text request

This is very similar to the plain text request, but for services that require some structure to their input, for example a list of sentences for some MT services, a list of words for a service that re-segments a stream of ASR output into a list of sentences, etc. Again, services that accept this kind of input should provide a POST endpoint that accepts Content-Type: application/json conforming to the following structure:

{
  "type":"structuredText",
  "params":{...},   /* optional */
  "texts":[
    {
      "content":"The text of this node",           // either
      "texts":[/* same structure, recursive */],   // or
      // mimeType optional - this is the default if omitted
      "mimeType":"text/plain",
      "features":{ /* arbitrary JSON metadata about this node, optional */ },
      "annotations":{ /* optional */
        "<annotation type>":[
          {
            "start":number,
            "end":number,
            "features":{ /* arbitrary JSON */ }
          }
        ]
      }
    }
  ]
}

The type will always be “structuredText”, params (optional) allows for vendor-specific parameters whose interpretation is up to the individual service implementor, and texts will always be an array of at least one JSON object. The texts property forms a recursive tree-shaped data structure: each object will be either a leaf node containing a piece of content or a branch node containing another list of texts.

Leaf nodes have one required property content containing the text of this node, plus zero or more of the following optional properties:

mimeType: the MIME type of the content, if it is not simply plain text
features: metadata about this node as a whole
annotations: as described above - the start and end are Unicode character offsets within the content and the sourceStart and sourceEnd are ignored.

Branch nodes have one required property texts containing an array of child nodes (which may in turn be branch or leaf nodes), plus zero or more of the following optional properties:

features: metadata about this node as a whole
annotations: as described above - the start and end are array offsets within the texts array (e.g. "start":0, "end":2 would refer to the first and second children - treat them as zero-based array indices where the start is inclusive and the end is exclusive) and the sourceStart and sourceEnd are ignored.

Here is the simplest possible example of a structured text request representing two sentences, each with several words, with no features and no annotations.

{
  "type":"structuredText",
  "texts":[
    {
      "texts":[
        {"content":"The"},{"content":"European"},{"content":"Language"},{"content":"Grid"}
      ]
    },
    {
      "texts":[
        {"content":"An"},{"content":"API"},{"content":"example"}
      ]
    }
  ]
}
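
Because the structure is recursive, a service will usually walk it with a recursive function. The following Python sketch (the function name leaf_contents is an illustrative assumption) collects the leaf contents of the example request above in document order.

```python
def leaf_contents(node):
    """Yield the content of every leaf node under `node`, in order.

    A node with a "content" property is treated as a leaf; otherwise its
    "texts" children are visited recursively.
    """
    if "content" in node:
        yield node["content"]
    else:
        for child in node.get("texts", []):
            yield from leaf_contents(child)


request = {
    "type": "structuredText",
    "texts": [
        {"texts": [{"content": "The"}, {"content": "European"},
                   {"content": "Language"}, {"content": "Grid"}]},
        {"texts": [{"content": "An"}, {"content": "API"},
                   {"content": "example"}]},
    ],
}

# One string per top-level branch node, i.e. per sentence in this example.
sentences = [" ".join(leaf_contents(sentence)) for sentence in request["texts"]]
print(sentences)
```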

Audio requests

Services that accept audio as input (e.g. speech recognition) are slightly more complex, given that the input data cannot be easily encoded directly in JSON. Audio services must accept a POST of Content-Type: multipart/form-data with two parts: the first part, named “request”, will be application/json conforming to the following structure, and the second part, named “content”, will be audio/x-wav or audio/mpeg containing the actual audio data.

{
  "type":"audio",
  "params":{...}, // optional
  "format":"string", // LINEAR16 for WAV or MP3 for MP3, other types are service specific
  "sampleRate":number,
  "features":{ /* arbitrary JSON metadata about this content, optional */ },
  "annotations":{ /* optional */
    "<annotation type>":[
      {
        "start":number,
        "end":number,
        "features":{ /* arbitrary JSON */ }
      }
    ]
  }
}

The ELG platform typically expects audio to be a single channel, but this is not guaranteed, as it depends on what the requesting user submits. A service receiving multiple audio channels may handle this situation in any way it sees fit, including processing only the first channel or mixing down the multi-channel stream to mono before processing.

As with text requests we expect that there will be a small number of standard audio formats that are well supported across services (e.g. 16kHz uncompressed WAV) but individual services may support other types. The format and sample rate parameters may be ignored if the audio is in a format with a self-describing file header (e.g. WAV) which specifies other values.

Optional properties of this request type are:

params: vendor-specific parameters - it is up to the individual service implementor to decide how (or indeed whether) to interpret these
features: metadata about the input as a whole
annotations: as described above - the start and end are floating point timestamps in seconds from the start of the audio and the sourceStart and sourceEnd are ignored.
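
For testing a service locally it can help to see what such a two-part body looks like on the wire. The Python sketch below assembles one by hand using only the standard library; the function name, the fixed boundary string and the placeholder audio bytes are all assumptions for illustration, and a real client would normally rely on an HTTP library to build the body and choose a boundary that cannot occur in the audio data.

```python
import json


def build_audio_multipart(metadata, audio_bytes, audio_mime="audio/x-wav",
                          boundary="elg-example-boundary"):
    """Return (content_type_header, body_bytes) for an audio request.

    Part one is named "request" and carries the JSON metadata; part two is
    named "content" and carries the raw audio.
    """
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        b'Content-Disposition: form-data; name="request"',
        b"Content-Type: application/json",
        b"",  # blank line separates part headers from part body
        json.dumps(metadata).encode("utf-8"),
        delimiter,
        b'Content-Disposition: form-data; name="content"',
        b"Content-Type: " + audio_mime.encode("ascii"),
        b"",
        audio_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(lines)


content_type, body = build_audio_multipart(
    {"type": "audio", "format": "LINEAR16", "sampleRate": 16000},
    b"RIFF\x00\x01",  # placeholder bytes, not a valid WAV file
)
print(content_type)
```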

Response structure

Services are expected to return their responses as JSON as described in the rest of this document. The minimal requirement is for services to be able to respond with Content-Type: application/json containing a successful or failed response message, but long-running services may also choose to offer Content-Type: text/event-stream to be able to stream progress reports during processing of the request. This mechanism is described at the end of this document.

Failure message

If processing fails for any reason (whether due to bad input, overloading of the service, or internal errors during processing) then the service should return the following JSON structure to describe the failure.

{
  "failure":{
    "errors":[ /* array of status messages */ ]
  }
}

The errors property is an array of i18n status messages (JSON objects with properties “code”, “text” and “params”) as described above - standard message codes are given in the appendix to this document.
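
A failure message built from one of the standard codes might be assembled as in this Python sketch. The helper names status_message and failure_response are hypothetical; the code and default text are taken from the appendix.

```python
def status_message(code, text, params=None, detail=None):
    """Build an i18n status message object."""
    message = {"code": code, "text": text, "params": list(params or [])}
    if detail is not None:
        # Untranslated extras such as a service-native error code.
        message["detail"] = detail
    return message


def failure_response(*errors):
    """Wrap one or more status messages in the failure envelope."""
    return {"failure": {"errors": list(errors)}}


unsupported = failure_response(
    status_message(
        "elg.request.type.unsupported",
        "Request type {0} not supported by this service",
        ["audio"],
    )
)
print(unsupported)
```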

Successful response message

All the successful responses follow this basic format:

{
  "response":{
    "type":"Response type code",
    "warnings":[/* array of status messages, optional*/],
    // other properties type-specific
  }
}

As with the request, the response type code will likely be constant for any given service. The exact format of the rest of a successful response message depends on the type of the service.

The warnings list is a slot to report warning messages that did not cause processing to fail entirely but may need to be fed back to the user (e.g. if the process involves several independent steps and only some of the steps failed, or the input was too long and the service chose to truncate it rather than fail altogether). Again, the individual messages in this array are i18n status messages as described above.

Annotations response

This response is suitable for any service that returns standoff annotations that are anchored to locations in text (e.g. named entity recognition) or time points in an audio/video stream (in general: anything compatible with a 1-dimensional coordinate system that uses a single number).

{
  "response":{
    "type":"annotations",
    "warnings":[...], /* optional */
    "features":{...}, /* optional */
    "annotations":{
      "<annotation type>":[
        {
          "start":number,
          "end":number,
          "features":{ /* arbitrary JSON */ }
        }
      ]
    }
  }
}

features (optional): metadata about the input as a whole
annotations (required, but may be empty "annotations":{}): as described above - for plain text data start and end would be character offsets into the text (Unicode code points), for audio data they would be the time point within the audio in seconds. The sourceStart and sourceEnd are ignored since there are no separate “source” and “target” data streams in this situation.
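
To make the shape of this response concrete, here is a toy Python annotator that marks capitalised words and wraps them in the annotations response envelope. The annotation type “Capitalised”, the feature name “string” and the function name are invented for this example; Python regular expression offsets on str are code point offsets, which matches the convention above.

```python
import re


def annotate_capitalised(text):
    """Toy service logic: one "Capitalised" annotation per capitalised word."""
    found = [
        {"start": m.start(), "end": m.end(), "features": {"string": m.group(0)}}
        for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
    ]
    # "annotations" is required but may be empty when nothing was found.
    annotations = {"Capitalised": found} if found else {}
    return {"response": {"type": "annotations", "annotations": annotations}}


result = annotate_capitalised("we met Alice in Vienna")
for annotation in result["response"]["annotations"]["Capitalised"]:
    print(annotation["start"], annotation["end"], annotation["features"]["string"])
```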

Classification response

For document-level (or more generally whole-input-level) classification services, e.g. language identification:

{
  "response":{
    "type":"classification",
    "warnings":[...], /* optional */
    "classes":[
      {
        "class":"string",
        "score":number /* optional */
      }
    ]
  }
}

We allow for zero or more classifications, each with an optional score. Services should return multiple classes in whatever order they feel is most useful (e.g. “most probable class” first). This order need not correspond to a monotonic ordering by score - we don’t assume scores are all mutually comparable - and the order will be preserved by any subsequent processing steps.

Classification tools that classify segments of the input rather than the whole input should use the annotations or texts response formats instead of this one.
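
A classification response for a language identifier could be built as in the Python sketch below. The function name and the class labels are illustrative assumptions; the point is that the caller's order is preserved and the score is omitted when it is not available.

```python
def classification_response(classes):
    """Build a classification response from (label, score) pairs.

    The input order is kept as given; a score of None is simply omitted
    because "score" is optional.
    """
    items = []
    for label, score in classes:
        item = {"class": label}
        if score is not None:
            item["score"] = score
        items.append(item)
    return {"response": {"type": "classification", "classes": items}}


lang_id = classification_response([("en", 0.92), ("de", 0.05), ("und", None)])
print(lang_id)
```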

Texts response

A response consisting of one or more new texts with optional annotations, for example multiple alternative possible translations from an MT service or transcriptions from an ASR service.

{
  "response":{
    "type":"texts",
    "warnings":[...], /* optional */
    "texts":[
      {
        "role":"string", /* optional */
        "content":"string of translated/transcribed text", // either
        "texts":[/* same structure, recursive */],         // or
        "score":number, /* optional */
        "features":{ /* arbitrary JSON, optional */ },
        "annotations":{ /* optional */
          "<annotation type>":[
            {
              "start":number,
              "end":number,
              "sourceStart":number, // optional
              "sourceEnd":number,   // optional
              "features":{ /* arbitrary JSON */ }
            }
          ]
        }
      }
    ]
  }
}

As with the structured text request format above, this texts response structure is recursive, so it is possible for each object in the list to be a branch node containing a set of child texts or a leaf node containing a single string.

Leaf nodes have one required property content, plus zero or more of the following optional properties:

role: the role of this node in the response, “alternative” if it represents one of a list of alternative translations/transcriptions, “segment” if it represents a segment of a longer text, or “paragraph”, “sentence”, “word” etc. for specific types of text segment.
score: if this is one of a list of alternatives, each alternative may have a score representing the quality of the alternative
features: metadata about this node as a whole
annotations: as described above - the start and end are Unicode character offsets within the content and the sourceStart and sourceEnd are the offsets into the source data (the interpretation depends on the nature of the source data).

Branch nodes have one required property texts containing an array of child nodes (which may in turn be branch or leaf nodes), plus zero or more of the following optional properties:

role: the role of this node in the response, “alternative” if it represents one of a list of alternative translations/transcriptions, “segment” if it represents a segment of a longer text, or “paragraph”, “sentence”, “word” etc. for specific types of text segment.
features: metadata about this node as a whole
annotations: as described above - the start and end are array offsets within the texts array (e.g. "start":0, "end":2 would refer to the first and second children - treat them as zero-based array indices where the start is inclusive and the end is exclusive) and the sourceStart and sourceEnd are the offsets into the source data (the interpretation depends on the nature of the source data).

The texts response type will typically be used in two different ways, either

the top-level list of texts is interpreted as a set of alternatives for the whole result - in this case we would expect the content property to be populated but not the texts one, and a “role” value of “alternative” - tools should return the alternatives in whatever order they feel is most useful, typically descending order of likelihood (though as for classification results we don’t assume scores are mutually comparable and the order of alternatives in the array need not correspond to a monotonic ordering by score).
the top-level list of texts is interpreted as a set of segments of the result, where each segment can have N-best alternatives (e.g. a list of sentences, with N possible translations for each sentence). In this case we would expect texts to be populated but not content, and a “role” value of either “segment” or something more detailed indicating the nature of the segmentation such as “sentence”, “paragraph”, “turn” (for speaker detection), etc. - in this case the order of the texts should correspond to the order of the segments in the result.
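
The two usage patterns can be contrasted in a short Python sketch. The function names and the sample translations are hypothetical; the first builder produces a flat list of alternatives for the whole result, the second produces one branch node per segment with that segment's alternatives as children.

```python
def alternatives_response(candidates):
    """First pattern: top-level texts are alternatives for the whole result."""
    texts = [
        {"role": "alternative", "content": content, "score": score}
        for content, score in candidates
    ]
    return {"response": {"type": "texts", "texts": texts}}


def segmented_response(segments, role="sentence"):
    """Second pattern: top-level texts are segments, in result order, each a
    branch node whose children are the N-best alternatives for that segment."""
    texts = [
        {"role": role,
         "texts": alternatives_response(candidates)["response"]["texts"]}
        for candidates in segments
    ]
    return {"response": {"type": "texts", "texts": texts}}


whole = alternatives_response([("Guten Morgen", 0.8), ("Guten Tag", 0.2)])
by_sentence = segmented_response([
    [("Guten Morgen.", 0.8), ("Guten Tag.", 0.2)],
    [("Wie geht es dir?", 0.9)],
])
print(by_sentence)
```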

Audio response

A response consisting of a piece of audio (e.g. an audio rendering of text in a text-to-speech tool), optionally with annotations linked to either or both of the source and target data.

{
  "response":{
    "type":"audio",
    "warnings":[...], /* optional */
    "content":"base64 encoded audio for shorter snippets",
    "format":"string",
    "features":{/* arbitrary JSON, optional */},
    "annotations":{
      "<annotation type>":[
        {
          "start":number,
          "end":number,
          "sourceStart":number, // optional
          "sourceEnd":number,   // optional
          "features":{ /* arbitrary JSON */ }
        }
      ]
    }
  }
}

Here the content property contains base64-encoded audio data, and the format specifies the audio format used - in this version of the ELG platform the supported formats are LINEAR16 (uncompressed WAV) or MP3. In addition the response may contain zero or more of the following optional properties:

features: metadata about this node as a whole
annotations: as described above - the start and end are time offsets within the audio content expressed as floating point numbers of seconds, and the sourceStart and sourceEnd are the offsets into the source data (the interpretation depends on the nature of the source data).

As an alternative to embedding the audio data in base64 encoding within the JSON payload, a service MAY simply return the audio data directly with the appropriate Content-Type (audio/x-wav or audio/mpeg), however this approach means the service will be unable to return features or annotations over the audio, and will be unable to report partial progress.
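
Embedding audio in the JSON payload amounts to base64-encoding the bytes, as in this Python sketch. The function name and the placeholder bytes are assumptions for illustration only.

```python
import base64


def audio_response(audio_bytes, audio_format="LINEAR16", annotations=None):
    """Build an audio response with the audio embedded as base64 text."""
    response = {
        "type": "audio",
        # b64encode returns bytes; JSON needs a str.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
        "format": audio_format,
    }
    if annotations:
        response["annotations"] = annotations
    return {"response": response}


raw = b"RIFF\x00\x01\x02\x03"  # placeholder bytes, not a valid WAV file
reply = audio_response(raw)
# A client reverses the encoding to recover the original bytes.
decoded = base64.b64decode(reply["response"]["content"])
print(decoded == raw)
```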

Progress Reporting

Some LT services can take a long time to process each request, and in these cases it may be useful to be able to send intermediate progress reports back to the caller. This serves both to reassure the caller that processing has not silently failed, and also to ensure the HTTP connection is kept alive. The mechanism for this in ELG leverages the standard “Server-Sent Events” (SSE) protocol format - if the client sends an Accept header that announces that it is able to understand the text/event-stream response type, then the service may choose to immediately return a 200 “OK” response with Content-Type: text/event-stream and hold the connection open (using chunked transfer encoding in HTTP/1.1 or simply not sending a Content-Length in HTTP/2). It may then dispatch zero or more SSE “events” with JSON data in the following structure:

{
  "progress":{
    "percent":number, // between 0.0 and 100.0
    "message":{
      // optional status message, with code, text and params as above
    }
  }
}

followed by exactly one successful or failed response in the usual format. Services should not send any further progress messages once the success or failure response has been sent. Note that if a message is provided in a progress report it must be an i18n status message, not simply a plain string.

For example:

Content-Type: text/event-stream

data:{"progress":{"percent":0.0}}

data:{"progress":{"percent":20.0}}

data:{"progress":{
data:    "percent":70.0
data:  }
data:}

data:{"response":{...}}

As per the SSE specification, each line of data within an event is prefixed data:, and an event is terminated by a blank line - there MUST be two consecutive newlines or CRLF sequences between the end of one event and the start of the next.

One would normally expect the progress percentage to increase over time but this is not necessarily a requirement of the specification - services are free to publish progress messages without a "percent" property if they wish to provide a status update message but cannot quantify their progress numerically, or even with a lower percentage than the previous message if they now have information to suggest that the overall process will take longer than first estimated.

Services are RECOMMENDED to support this response format, and to send it if the client indicates they can accept text/event-stream, but it is not required. The clients which will call your services within the ELG infrastructure will accept both text/event-stream and application/json responses, and you are encouraged to return an event stream if you can, but you are free to return application/json if it makes more sense for your service, and you MUST return application/json if the calling client does not indicate in the Accept header that they can understand text/event-stream.
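
The event framing and the Accept check described in this section can be sketched in Python as follows. All three function names are hypothetical, and the Accept parsing is deliberately rough (it ignores q-values and wildcards); a production service would normally use its web framework's content negotiation instead.

```python
import json


def sse_event(payload):
    """Serialise one JSON payload as an SSE event: every line of the data
    is prefixed "data:" and the event ends with a blank line."""
    text = json.dumps(payload)
    return "".join("data:" + line + "\n" for line in text.split("\n")) + "\n"


def wants_event_stream(accept_header):
    """Rough check for text/event-stream in an Accept header value."""
    media_types = [part.split(";", 1)[0].strip().lower()
                   for part in (accept_header or "").split(",")]
    return "text/event-stream" in media_types


def stream_reply(percentages, final_message):
    """Yield zero or more progress events, then exactly one final message."""
    for percent in percentages:
        yield sse_event({"progress": {"percent": percent}})
    yield sse_event(final_message)


final = {"response": {"type": "classification", "classes": []}}
if wants_event_stream("application/json, text/event-stream"):
    events = list(stream_reply([0.0, 50.0], final))
    print("".join(events), end="")
```

When the check fails, the service would instead return the final message on its own as application/json.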

Appendix: Standard status message codes

#
#   Copyright 2019 The European Language Grid
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#
# This file contains the standard ELG status messages, translations should
# be placed in files named elg-messages_LANG.properties alongside this file.
#

# general bad request errors
elg.request.invalid=Invalid request message
elg.request.missing=No request provided in message
elg.request.type.unsupported=Request type {0} not supported by this service
elg.request.property.unsupported=Unsupported property {0} in request

elg.request.too.large=Request size too large

# Errors specific to text requests
elg.request.text.mimeType.unsupported=MIME type {0} not supported by this service

# Errors specific to audio requests
elg.request.audio.format.unsupported=Audio format {0} not supported by this service
elg.request.audio.sampleRate.unsupported=Audio sample rate {0} not supported by this service

# Errors specific to structured text requests
elg.request.structuredText.property.unsupported=Unsupported property {0} in "texts" of structuredText request

# General bad response errors
elg.response.invalid=Invalid response message
elg.response.type.unsupported=Response type {0} not supported

# Unknown property in response
elg.response.property.unsupported=Unsupported property {0} in response
elg.response.texts.property.unsupported=Unsupported property {0} in "texts" of texts response
elg.response.classification.property.unsupported=Unsupported property {0} in "classes" of classification response

# User requested a service that does not exist
elg.service.not.found=Service {0} not found

# generic internal error when there's no more specific option
elg.service.internalError=Internal error during processing: {0}