Guide to: Running an Audio Transcription Job – Appen Success Center

Overview

The cml:audio_transcription tag allows users to create an audio transcription job with custom labels and tag sets.

Note: For Dedicated customers, this feature is not currently available On-Premises; it is only available via the cloud multi-tenant Appen Data Annotation Platform.

Fig. 1: Audio Transcription tool interface for Contributors

Building a Job

Data

The audio transcription tool supports the transcription of .wav, .mp3, and .ogg file types, as well as .mp4 and .mov (see the video parameter below).
Your data must be CORS-configured.
All data access control features are supported for this tool, including Secure Data Access.

CML

As this tool is in open beta, there is no Graphical Editor support yet. For access to the tool's CML gem, please reach out to your Appen contact.

Parameters

Below are the parameters available for the job design. Some are required in the element, while some are optional.

type (required - as of 16 January 2023)
- transcription: enables the transcription field, as well as the tags and timestamps.
  - timestamps also require allow-timestamping
- labeling: enables the labels, as configured in the ontology.
- segmentation: allows contributors to create or modify segments, to subsequently be transcribed and/or labeled.
  - NOTE: If contributors would like to delete a segment, they can use the CTRL + Backspace hotkey (compatible with Mac and Windows)
- play-only: only the audio player (and video, if configured) will be available to contributors.
  - this type can only be used by itself. For example: type="['play-only']"
- none: if no type is configured, only the audio player will be available to contributors.
- Examples:
  - type="['labeling', 'transcription']": allows labeling and transcription (including tags/timestamps)
  - type="['labeling', 'segmentation']": allows labeling and segmentation (including tags/timestamps)
- A note about type and its interaction with review-data (see below): When you load data for review into the tool using review-data, all annotations provided will be visible, including transcriptions, tags, and labels. Use type to control which parts of the data a contributor can edit. For example, if you use type="['transcription']" and your review data contains both transcription and labels, contributors will be able to see both the labels and the transcriptions, but they will only be able to edit the transcriptions.

source-data (required)
- The column header from your source data containing the audio URLs to be annotated.

name (required)
- The results header where the annotations will be stored.

segments-data (optional)
- The column header from your source data containing the audio segmentation data (the start and end timestamps of each segment).
  - The tool uses this data to create the transcription box for each segment.
  - The tool expects the data to be in the format found below.
  - If you do not have segmentation data, omit this parameter.

label (optional)
- The question label that the contributors will see.

validates (optional)
- validates="required"
  - Defines whether or not the element is required to be answered
  - Defaults to not required if not present
  - Defaults to "required" if this is the only CML tag present in the job design, as there must be at least one required element
- validates="timestamp_direction"
  - Checks whether the timestamps are in the right order upon submission
  - Regardless of the specified text direction or language, the waveform always runs left to right; therefore, if timestamps are not placed in left-to-right order, submission will be blocked and contributors will see an error
- validates="minTimestamps:1"
  - Checks whether the contributor has placed the minimum number of specified timestamps
  - Defaults to minTimestamps:0 if not present
- to use multiple validators, separate them with spaces:
  - example: validates="required timestamp_direction"
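As a rough illustration of what the timestamp_direction and minTimestamps validators check, here is a minimal sketch that works on a transcription string containing inline timestamps in the <12.345/> form described under allow-timestamping. The function names are hypothetical and this is not the tool's actual implementation; in particular, treating two equal timestamps as "in order" is an assumption.

```typescript
// Hypothetical helpers for illustration only; not the tool's own validators.

/** Collect every inline timestamp such as <12.345/> from a transcription. */
function extractTimestamps(transcription: string): number[] {
  const pattern = /<(\d+(?:\.\d+)?)\/>/g;
  const found: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(transcription)) !== null) {
    found.push(Number(match[1]));
  }
  return found;
}

/** timestamp_direction: each timestamp must be >= the one before it (assumed rule). */
function timestampsInOrder(timestamps: number[]): boolean {
  let previous = -Infinity;
  for (const current of timestamps) {
    if (current < previous) {
      return false;
    }
    previous = current;
  }
  return true;
}

/** Returns a list of problems; an empty list means the transcription passes. */
function validateTranscription(transcription: string, minTimestamps = 0): string[] {
  const errors: string[] = [];
  const timestamps = extractTimestamps(transcription);
  if (timestamps.length < minTimestamps) {
    errors.push("expected at least " + minTimestamps + " timestamp(s)");
  }
  if (!timestampsInOrder(timestamps)) {
    errors.push("timestamps are not in ascending order");
  }
  return errors;
}
```

For example, validateTranscription("this is a <12.345/> transcription", 1) returns an empty list, while a transcription whose timestamps go backwards in time returns one error.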

review-data (optional)
- This will read in existing transcriptions on an audio file.
- If used, your source data must include a column with links to the transcriptions, formatted like the output of the audio transcription tool (see the 'Reviewing Results' section below).
- This parameter may be used to do a peer review or model validation job.
- Please see the “Review Mode” section for more details.
- You can use raw text input as your review-data (e.g. for prompt-audio validation) as long as you have no other annotation data as input.
- As mentioned under type, all annotations loaded through review-data (transcriptions, tags, and labels) are visible to contributors; use type to control which of them can be edited.

subset (optional)
- This parameter sets up the tool to display only a subset of the segments in each unit.
- Only use this if the review-data parameter is present.
- Accepts a value from 0 to 1.
- Defaults to 1.
- See the “Review Mode” section for more details.

force-fullscreen (optional)
- Accepts 'true' or 'false'.
- If 'true', a page of work contains a preview of each data row. When a contributor clicks to open a data row, the audio transcription tool loads into a fullscreen view.
- If 'false', a page of work contains a view of the audio transcription tool for each data row. The contributor can open a fullscreen view of the tool at their discretion by clicking an icon in the top right corner of the tool or using a hotkey.
- Defaults to 'false'.

task-type (optional)
- Set task-type="qa" when designing a review or QA job. This parameter must be used in conjunction with review-data. See this article for more details.

listen-to (optional)
- This parameter allows you to configure how much of the audio the contributor must listen to before submitting the task. If it is not configured, it defaults to requiring the contributor to listen to 100% of the audio. The parameter accepts an array of triplets. Each triplet specifies, in order:
  - the beginning point of a range of audio (described as out of 1)
    - e.g., [0.3,0.6,0.5]: beginning at 30% of the way through the audio
  - the closing point of a range of audio (described as out of 1)
    - e.g., [0.3,0.6,0.5]: ending at 60% of the way through the audio
  - how much of that range of audio the contributor must listen to (described as out of 1)
    - e.g., [0.3,0.6,0.5]: listen to 50% of the audio in the specified range
    - The contributor can listen to any audio within the range so long as it cumulatively sums to the required quantity.
    - If a contributor listens to the same portion of audio more than once, it only counts towards validation once.
  - The differences between the beginnings and endings specified in the array must sum to 1. That is, the entire audio must have a listening requirement specified:
    - Invalid: "[[0.3,0.6,0.5],[0.6,1,0.25]]" because the range from 0 to 0.3 is unspecified
    - Valid: "[[0,0.3,0.75],[0.3,0.6,0.5],[0.6,1,0.25]]" because the ranges sum to 1
  - If you want to turn off validation, set the parameter as: listen-to="[[0,1,0]]" (this requires the contributor to listen to 0% of the audio between 0 (beginning) and 1 (end)).
  - Listening validation makes the “validation” checkbox that used to appear with task-type="qa" obsolete, since contributors can now be required to listen to the audio before submitting.
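The listen-to rules above can be sketched in code. The following is a minimal illustration under stated assumptions, not the tool's implementation: isValidListenTo checks that the ranges sum to 1 (it does not detect overlapping ranges), and meetsListenTo checks a set of played intervals against each triplet, counting replayed audio only once. All names, and the representation of played audio as [start, end] fractions of the file, are hypothetical.

```typescript
// Hypothetical sketch of the listen-to rules; names and data shapes are assumptions.
type ListenRange = [number, number, number]; // [begin, end, required share of the range]
type PlayedInterval = [number, number];      // [start, end], both as fractions of the audio

const EPSILON = 1e-9; // tolerance for floating-point sums

/** The ranges must lie within 0..1 and their lengths must sum to 1. */
function isValidListenTo(ranges: ListenRange[]): boolean {
  let covered = 0;
  for (const [begin, end, required] of ranges) {
    if (begin < 0 || end > 1 || begin >= end) return false;
    if (required < 0 || required > 1) return false;
    covered += end - begin;
  }
  return Math.abs(covered - 1) < EPSILON;
}

/** Amount of audio inside [begin, end] that was played, counting replays once. */
function listenedWithin(played: PlayedInterval[], begin: number, end: number): number {
  const clipped = played
    .map(([start, stop]): PlayedInterval => [Math.max(start, begin), Math.min(stop, end)])
    .filter(([start, stop]) => stop > start)
    .sort((a, b) => a[0] - b[0]);
  let total = 0;
  let cursor = begin; // everything before the cursor has already been counted
  for (const [start, stop] of clipped) {
    const from = Math.max(start, cursor);
    if (stop > from) {
      total += stop - from;
      cursor = stop;
    }
  }
  return total;
}

/** True when every range has had at least its required share played. */
function meetsListenTo(ranges: ListenRange[], played: PlayedInterval[]): boolean {
  return ranges.every(([begin, end, required]) =>
    listenedWithin(played, begin, end) + EPSILON >= required * (end - begin));
}
```

With the two examples from the text, isValidListenTo returns false for [[0.3,0.6,0.5],[0.6,1,0.25]] and true for [[0,0.3,0.75],[0.3,0.6,0.5],[0.6,1,0.25]].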

speed (optional)
- This parameter allows you to specify the permitted playback speeds for the audio according to the following syntax: speed="[0.5,1,1.5,3]"
  - If only one speed is provided, all contributors can listen only at that speed
  - If multiple speeds are provided, contributors can select the speed they want to listen at from the dropdown in the tool UI
  - When the tool opens, the speed control defaults to the value closest to 1; if two values are equally close, the larger one is used (e.g. 1.5 for [0.5,1.5])
  - If the speed configuration is not provided, the tool offers the contributor a default set of speed options.
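The rule for the opening speed (closest to 1, with ties going to the larger value) can be sketched as follows. This is an illustrative reading of the rule as described, not the tool's code; the function name is hypothetical.

```typescript
// Hypothetical sketch of the "closest to 1, larger wins a tie" rule.
const SPEED_TOLERANCE = 1e-9; // guards against floating-point noise when comparing gaps

function defaultSpeed(speeds: number[]): number {
  if (speeds.length === 0) {
    throw new Error("at least one speed is required");
  }
  return speeds.reduce((best, candidate) => {
    const bestGap = Math.abs(best - 1);
    const candidateGap = Math.abs(candidate - 1);
    if (candidateGap < bestGap - SPEED_TOLERANCE) {
      return candidate; // strictly closer to 1
    }
    const tied = Math.abs(candidateGap - bestGap) <= SPEED_TOLERANCE;
    return tied && candidate > best ? candidate : best;
  });
}
```

For speed="[0.5,1.5]" this selects 1.5, and for speed="[0.5,1,1.5,3]" it selects 1.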

allow-timestamping (optional)
- set allow-timestamping="true" to enable the timestamping functionality in your transcription task
  - contributors will see the “add timestamp” button in each transcription box that allows them to insert timestamps within their transcription
  - timestamps can be used to generate more granular text/audio alignment and/or to allow contributors to correct and improve the segmentation points in the source data
  - timestamps appear in the output data like this: this is a <12.345/> transcription
- Note: This parameter defaults to "false" if not declared
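To show how inline timestamps might be used for text/audio alignment downstream, here is a small sketch that splits an output transcription at each <12.345/> marker. It assumes each marker gives the audio time (in seconds) at that position in the text; the type and function names are hypothetical.

```typescript
// Hypothetical post-processing helper; not part of the tool.
interface TimedChunk {
  text: string;
  startsAt: number | null; // null for text that precedes the first timestamp
}

function splitAtTimestamps(transcription: string): TimedChunk[] {
  // split() with a capture group keeps the captured times at the odd indexes.
  const parts = transcription.split(/<(\d+(?:\.\d+)?)\/>/);
  const chunks: TimedChunk[] = [];
  let startsAt: number | null = null;
  parts.forEach((part, index) => {
    if (index % 2 === 1) {
      startsAt = Number(part); // a timestamp: applies to the text that follows
      return;
    }
    const text = part.trim();
    if (text.length > 0) {
      chunks.push({ text, startsAt });
    }
  });
  return chunks;
}

/** The transcription with all timestamp markers removed. */
function stripTimestamps(transcription: string): string {
  return splitAtTimestamps(transcription).map((chunk) => chunk.text).join(" ");
}
```

Applied to the example above, splitAtTimestamps yields "this is a" with no start time, followed by "transcription" starting at 12.345 seconds.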

text-direction (only compatible with type="transcription", optional)
- Set text-direction="rtl" to specify that the language you are transcribing is written from right-to-left (e.g. Arabic). This will ensure that any tags and timestamps are placed in the correct sequential location in the text.
- It is recommended to use this in combination with the timestamp_direction validator described above.
- If the tags themselves are in English or another left-to-right language (e.g. </noise>), they will continue to be displayed in the correct direction.
- Note: this parameter defaults to "ltr" if not explicitly defined.

video (optional)
- Set video="true" and beta="true" to enable display of video along with the audio.
- Ensure that your data is in one of the supported formats: .mp4 or .mov
- Video data must include an audio track to ensure the tool is usable.
- Note: this parameter defaults to "false" if not explicitly defined. If your data is .mp4 or .mov, the tool will play audio, but the video will not be displayed.
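To see how the parameters above fit together in one element, the sketch below assembles a cml:audio_transcription tag from a map of parameter values. Several things here are assumptions rather than facts from this guide: the helper names, the self-closing tag form, the Liquid-style {{audio_url}} column reference, and the sample column and results header names. Check the generated CML against your own job design before using it.

```typescript
// Hypothetical helper for illustration; the tag form and sample values are assumptions.
type CmlAttributes = { [parameter: string]: string };

const REQUIRED_PARAMETERS = ["type", "source-data", "name"];

function escapeAttribute(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

function buildAudioTranscriptionTag(attributes: CmlAttributes): string {
  for (const parameter of REQUIRED_PARAMETERS) {
    if (!Object.prototype.hasOwnProperty.call(attributes, parameter)) {
      throw new Error("missing required parameter: " + parameter);
    }
  }
  const rendered = Object.keys(attributes)
    .map((key) => key + '="' + escapeAttribute(attributes[key] ?? "") + '"')
    .join(" ");
  return "<cml:audio_transcription " + rendered + " />";
}

const exampleTag = buildAudioTranscriptionTag({
  "type": "['labeling', 'transcription']",
  "source-data": "{{audio_url}}", // assumed column name and reference syntax
  "name": "annotation",           // assumed results header
  "validates": "required timestamp_direction",
  "allow-timestamping": "true",
  "listen-to": "[[0,1,0.5]]",
});
```

Keeping the parameters in a map makes it easy to reuse one design across a transcription job and its review job, where only type, review-data, and task-type would differ.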


Ontology

The audio transcription ontology is where you define the metadata that transcribers will use to label and tag audio files or segments.

You can access the ontology by clicking the link to 'Manage Audio Transcription Ontology' that appears on the right corner of the job's Design Page.

Fig 2: Audio Transcription Ontology Manager

The top-level metadata defined in the audio transcription ontology consists of labels, event tags, and span tags.

Labels
- Contributors will apply your labels at the segment level.
- Labels are defined as members of label groups. Groups are just a way to keep related labels together according to common attributes and rules; groups themselves are not metadata to be labeled.
- Label groups require:
  - a name
  - at least one label inside them.
    - Labels must be unique, even between groups.
- At the “label group” level, we can define:
  - if selecting a label from the group is mandatory
  - if users can select multiple labels from the group
  - if the group is not transcribable
    - By default, we assume a segment is transcribable and show the transcribable labels.
      - In this case, mandatory|transcribable applies.
    - If a segment is marked as “nothing to transcribe”, only non-transcribable labels should be available.
      - In this case, mandatory|non-transcribable applies.
      - The transcription box is not available.
- You do not need to create any labels (i.e. you needn’t create any groups if you need no labels), but if you want to include labels, they must be inside a group.
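The label-group rules above can be expressed as a small check. The LabelGroup shape below is a hypothetical representation invented for this sketch (the ontology manager's real storage format is not described in this guide); only the rules themselves come from the text.

```typescript
// Hypothetical data shape; only the rules being checked come from the guide.
interface LabelGroup {
  name: string;
  labels: string[];
  mandatory?: boolean;      // selecting a label from the group is required
  multiSelect?: boolean;    // contributors may select several labels
  transcribable?: boolean;  // assumed to default to true when omitted
}

/** Returns a list of rule violations; empty means the groups are acceptable. */
function validateLabelGroups(groups: LabelGroup[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const group of groups) {
    if (group.name.trim() === "") {
      errors.push("every label group needs a name");
    }
    if (group.labels.length === 0) {
      errors.push('group "' + group.name + '" needs at least one label');
    }
    for (const label of group.labels) {
      if (seen.has(label)) {
        errors.push('label "' + label + '" is used more than once');
      }
      seen.add(label); // uniqueness is enforced across all groups
    }
  }
  return errors;
}

/** Groups offered for a segment, depending on its "nothing to transcribe" state. */
function groupsForSegment(groups: LabelGroup[], nothingToTranscribe: boolean): LabelGroup[] {
  return groups.filter((group) => (group.transcribable ?? true) !== nothingToTranscribe);
}
```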

Event Tags
- Event tags are optional.
- Event tags require a name.
- Event tags are initially displayed in alphabetical order. However, you can customize their order by dragging and arranging the tags as you prefer.
- You may also provide a description for each tag, which will be visible to the contributor in the tool when they click on the info icon for that tag.

Event Groups
- Event groups are optional.
- Event groups require a name.
- Event groups are initially displayed in alphabetical order. However, you can customize their order by dragging and arranging the groups as you prefer.
- You can assign event tags to event groups using a drop-down within the event tag OR by dragging an existing event tag inside an event group.

Span Tags
- Span tags are optional.
- Span tags require a name.
- Span tags are initially displayed in alphabetical order. However, you can customize their order by dragging and arranging the tags as you prefer.
- You may also provide a description for each tag, which will be visible to the contributor in the tool when they click on the info icon for that tag.

Span Groups
- Span groups are optional.
- Span groups require a name.
- Span groups are initially displayed in alphabetical order. However, you can customize their order by dragging and arranging the groups as you prefer.
- You can assign span tags to span groups using a drop-down within the span tag OR by dragging an existing span tag inside a span group.

Reviewing Results

Results will be provided as a secure link to a JSON file describing the annotations.

Important note: For security reasons, JSON result links expire 15 days after generation. To receive new, unexpired result links, please re-generate the result reports.

The objects in the JSON include the following:

For each segment:
- id
  - The universally unique identifier of every segment.
- startTime and endTime
  - This will be displayed in seconds, to the millisecond.
  - These fields are inherited from the segmentation data.
- labels
  - This field will contain the labels as indicated by the contributor.
- transcription
  - This field will contain the transcription text as entered by the contributor. Tags and timestamps will also appear in the transcription field.

For the entire audio file:
- nothingToTranscribe
  - This will be a Boolean: true or false
  - This will be true if the contributor has indicated they were unable to transcribe the entire audio file.
- ableToAnnotate
  - This will be a Boolean: true or false
  - This will be false if the tool was unable to load the audio file.

Annotation Schema

interface AudioToolNewOutput {
  annotation: {
    segments: {
      // main information about the segment
      id: string; // prefixed with segment so as not to confuse this as annotation id
      startTime: number;
      endTime: number;

      // Kept the following one as extra info until the layers feature is removed from the tool entirely,
      // otherwise users would see layers in first load and if they load judgment or autosaved data (on page refresh) subsequently, they wouldn't.
      // this inconsistency would bring confusion
      layerId: string;

      labels: string[]; // comes if 'type' includes labelling & 'task-type' is labelling or qa
      transcription: string[]; // comes if 'type' includes transcription & 'task-type' is labelling or qa

      metadata: {
        // all other info for the segment
        comment?: string; // can be present in any qa judgment
        feedbackAcknowledged: boolean; // comes from acknowledgment task
      };
    }[];
  };

  ableToAnnotate: boolean;
  nothingToTranscribe: boolean; // did not change it to nothingToAnnotate since this field is not being used for anything else so it was not useful to update
}
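As an illustration of working with a downloaded result file, the sketch below parses a JSON document shaped like the schema above and computes a few summary figures. The trimmed-down types, the function name, and the sample data are all made up for this example; real result files may carry additional fields.

```typescript
// Hypothetical consumer of a result JSON; the sample data is invented.
interface ResultSegment {
  id: string;
  startTime: number;
  endTime: number;
  labels?: string[];
  transcription?: string[];
}

interface AudioResult {
  annotation: { segments: ResultSegment[] };
  ableToAnnotate: boolean;
  nothingToTranscribe: boolean;
}

interface ResultSummary {
  segmentCount: number;
  totalSeconds: number;     // summed segment durations
  transcribedCount: number; // segments with non-empty transcription text
  usable: boolean;          // audio loaded and there was something to transcribe
}

function summarizeResult(result: AudioResult): ResultSummary {
  const segments = result.annotation.segments;
  let totalSeconds = 0;
  let transcribedCount = 0;
  for (const segment of segments) {
    totalSeconds += segment.endTime - segment.startTime;
    const text = (segment.transcription ?? []).join(" ").trim();
    if (text.length > 0) {
      transcribedCount += 1;
    }
  }
  return {
    segmentCount: segments.length,
    totalSeconds,
    transcribedCount,
    usable: result.ableToAnnotate && !result.nothingToTranscribe,
  };
}

// Invented sample shaped like the schema above.
const sampleResult: AudioResult = JSON.parse(
  '{"annotation":{"segments":[' +
  '{"id":"segment-1","startTime":0,"endTime":1.5,"labels":["speech"],"transcription":["hello there"]},' +
  '{"id":"segment-2","startTime":1.5,"endTime":4,"labels":[],"transcription":[]}' +
  ']},"ableToAnnotate":true,"nothingToTranscribe":false}'
);
const sampleSummary = summarizeResult(sampleResult);
```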

Review Mode

We have created an experience specially designed for the purpose of quality management.

For job creators

On the job creator’s side, when using the task-type="qa" and review-data CML parameters, you can also specify a subset so that the tool displays only a random sample of the segments in each audio file. Reviewers still review every audio file, but much more quickly. This feature is ideal if your goal is to get an idea of the transcription quality of each audio file.

In the review job’s output, you will see two additional fields under “metadata”:

“original_text”: if the reviewer changes the transcription, this field records the original transcription that was loaded as input data. This field makes it easy to calculate the word error rate of each unit.
“review_status”: every segment that was randomly selected for the review job has this attribute set to “reviewed”.
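One way to compute a word error rate from these fields is the standard word-level edit distance, sketched below. It treats the reviewer's final transcription as the reference and original_text as the hypothesis, which is an assumption about how you want to score; the function names are hypothetical, and you may want to strip tags and timestamps from both texts first.

```typescript
// Sketch of a standard word-level WER; not a calculation provided by the tool.
function tokenize(text: string): string[] {
  return text.trim().split(/\s+/).filter((word) => word.length > 0);
}

/** Levenshtein distance over words, keeping only the previous row of the table. */
function wordEditDistance(reference: string[], hypothesis: string[]): number {
  let previous: number[] = [];
  for (let j = 0; j <= hypothesis.length; j++) {
    previous.push(j);
  }
  for (let i = 1; i <= reference.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= hypothesis.length; j++) {
      const cost = reference[i - 1] === hypothesis[j - 1] ? 0 : 1;
      const substitution = previous[j - 1]! + cost;
      const deletion = previous[j]! + 1;
      const insertion = current[j - 1]! + 1;
      current.push(Math.min(substitution, deletion, insertion));
    }
    previous = current;
  }
  return previous[hypothesis.length]!;
}

/** Edits needed to turn the original into the reviewed text, per reviewed word. */
function wordErrorRate(reviewedText: string, originalText: string): number {
  const reference = tokenize(reviewedText);
  const hypothesis = tokenize(originalText);
  if (reference.length === 0) {
    // Convention chosen for this sketch: an empty reference scores 0 or 1.
    return hypothesis.length === 0 ? 0 : 1;
  }
  return wordEditDistance(reference, hypothesis) / reference.length;
}
```

For a segment the reviewer did not change, the transcription itself can be compared against itself, which gives a rate of 0.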

Please note that this feature is designed to review jobs with only 1 judgment per row.

For contributors

We have added a “reviewed” button for review jobs, which ensures that the reviewer goes through every single segment before being able to submit. For each segment, the reviewer needs to click the “reviewed” button after reviewing the transcription.

Additional Notes:

This product is in BETA, so please consider the following important notes:

Reviewing/viewing submitted annotations via the unit page is not currently supported.
The job must be set up in the Code Editor; the tool is not supported in the Graphical Editor yet.
Audio Transcription jobs do not support test questions or aggregation at this stage.
Launching this type of job requires one of our trusted contributor channels. Please reach out to your Customer Success Manager to set this up.

Accepted Segmentation Input Schemas

OLD:

interface SegmentInput {
  id: string;
  startTime: number;
  endTime: number;
  ontologyName?: string;
  layerId?: string;
}

interface SegmentsDataInput {
  annotation: SegmentInput[][];
  nothingToAnnotate: boolean;
}

NEW:

interface SegmentInput {
  id: string;
  startTime: number;
  endTime: number;
  ontologyName?: string;
  layerId?: string;
}

interface AudioAnnotationData {
  annotation: SegmentInput[][];
  nothingToAnnotate: boolean;
}

type SegmentsDataInput = SegmentInput[] | AudioAnnotationData;

Segments input data is not required.
The old input format (cml:audio_annotation output) is partially supported by the tool. Segment boundaries will be respected, but ontology classes and names will not work in the new tool or ontology format.
If you have segments or other pre-annotations to display, they should be in the above format.
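Because the NEW schema accepts either a flat list of segments or the nested annotation object, code that prepares or checks segments-data may want to normalize both into one list. The sketch below shows one way to do that; the function names are hypothetical, and sorting by start time and the boundary check are choices made for this example rather than documented tool behavior.

```typescript
// Types copied from the NEW schema above; the helpers are hypothetical.
interface SegmentInput {
  id: string;
  startTime: number;
  endTime: number;
  ontologyName?: string;
  layerId?: string;
}

interface AudioAnnotationData {
  annotation: SegmentInput[][];
  nothingToAnnotate: boolean;
}

type SegmentsDataInput = SegmentInput[] | AudioAnnotationData;

/** Returns a new flat list of segments, ordered by start time. */
function flattenSegments(input: SegmentsDataInput): SegmentInput[] {
  const segments = Array.isArray(input)
    ? input.slice() // copy so the caller's array is not reordered
    : input.annotation.reduce<SegmentInput[]>((all, layer) => all.concat(layer), []);
  return segments.sort((a, b) => a.startTime - b.startTime);
}

/** Ids of segments whose boundaries are negative or not strictly increasing. */
function findInvalidSegments(input: SegmentsDataInput): string[] {
  return flattenSegments(input)
    .filter((segment) => segment.startTime < 0 || segment.endTime <= segment.startTime)
    .map((segment) => segment.id);
}
```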