Guide to: Text Annotation Job Design – Appen Success Center

Overview

The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.

💥 Exciting update! The text annotation tool now supports character-level annotation. Enable the 'non-tokenized' configuration to try it out!

Glossary

Token - the smallest annotatable fragment of a string. Tokens are predefined for the contributor by the tokenizer.
Tokenizer - the rules used to split text/strings into tokens.
Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment.
Nested Span - a smaller span annotated within a bigger span, or a single span with multiple annotations.
Non-Tokenized - a mode in which annotations are not constrained by predefined tokens. Instead of working with preset word or subword units, contributors interact directly with the raw text and can select and label spans at the character level.

Note: For the Secure Data option, please contact your Customer Success Manager for additional information.

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation source-data="{{source_data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>

Figure 1. How to Edit Text Annotation Job via the Graphical Editor

🚧 Note: The Non-Tokenized version of the tool is currently in beta. The following features are not yet supported:

Basic QA settings in Quality Flow (Advanced QA settings are supported)
Test questions
Aggregation
Multi-selection of spans

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are required; others are optional.

type="non-tokenized" or type="tokenized" (required)
- This parameter determines the type of annotation tool you want to use:
  - If type="non-tokenized":
    - Tokenizers are not required.
    - Contributors can annotate at the character level.
  - If type="tokenized":
    - Tokenizers are required.
    - Contributors can annotate at the token level.
    - Learn more about tokenizers in the section below.
source-type="text" or source-type="json" (required)
- This attribute tells the tool whether to expect text or JSON.
  - If source-type="text", a text string will be expected in the {{source_data_column}} and (when type="tokenized") you must specify a tokenizer to use on the text. You can optionally also specify the language of the text.
    - tokenizer (required if type="tokenized" and source-type="text")
      - This tool accepts "Spacy" (spacy), "NLTK" (nltk), "Stanford NLP" (stanford), or "Split on &nbsp;" (nbsp). See below for GUI view.
      - Note: Use the nbsp tokenizer if you'd like to bring in a custom tokenization via a text upload. If used, tokens will be created based on the location of non-breaking spaces (&nbsp;) in the text. You can use this to create irregular tokens to label, such as whole sentences or partial clauses.
    - language (optional, only supported with type="tokenized")
      - Sets the language of the text being tokenized; this is required if the data is non-English.
        
        Example: language="fr"
      - The available options by tokenizer are as follows (default is English):
        
        Spacy: en, fr, de, pt, it, nl, es
        
        NLTK: en, de, es, pt, fr, it, nl
        
        Stanford NLP: en, fr, de, es
  - If source-type="json", the tool will attempt to access the files containing tokens, spans, and predictions whenever it loads. JSON annotations must be provided via a URL to a cloud storage location; JSON cannot be read directly from a CSV (see the Upload Data section for more information). The ontology used in the JSON input must match the ontology of the job. See Text_Annotation_Sample_Tokenization_Schema.json below for an example of uploading pre-tokenized text to the text annotation tool.
  - Important note: The text annotation tool is likely to experience latency if the passage is over 2,000 characters. To avoid performance issues, we recommend splitting long passages into smaller segments and using the context feature to provide context.
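
One way to prepare long passages before upload is sketched below. This is a minimal illustration, not part of the tool: the function name split_passage and the naive sentence splitter are assumptions, and the 2,000-character limit is taken from the note above.

```python
import re

MAX_CHARS = 2000  # latency threshold mentioned in the note above


def split_passage(passage: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack sentences into segments of at most max_chars characters."""
    # Naive sentence split (an assumption): break after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then become one row of the source data, with the surrounding passage supplied in the column referenced by context-column. Whether a very long context value has its own performance cost is not covered here, so test with your data.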
source-data (required)
- The name of the column containing the source data to be annotated.
name (required)
- The results header under which the annotations will be stored.
validates (optional)
- Accepts "required" (default) or "all_tokens"
  - If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case.
  - If validates="required all_tokens" (only supported with type="tokenized"): Contributors must assign a class label to every token before they can submit. The "none" class is considered a valid label in this case.
search-url (optional)
- Include a search engine URL to enable the tool's lookup function
- Replace the query term in the URL with "%s"
  - Example: search-url="https://www.google.com/search?q=%s"
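
To check that a search-url value is well formed before saving the CML, the placeholder substitution can be simulated. The sketch below is an illustration only: build_lookup_url is a hypothetical helper, and the assumption that the tool URL-encodes the selected text is ours, not something this guide documents.

```python
from urllib.parse import quote_plus


def build_lookup_url(search_url: str, selected_text: str) -> str:
    """Replace the %s placeholder with the URL-encoded selected text."""
    if "%s" not in search_url:
        raise ValueError("search-url must contain a %s placeholder")
    # Assumption: spaces and special characters are encoded for a query string.
    return search_url.replace("%s", quote_plus(selected_text), 1)


print(build_lookup_url("https://www.google.com/search?q=%s", "New York"))
# prints https://www.google.com/search?q=New+York
```

A URL without the "=" before "%s" still passes this check but produces a broken query, so it is worth opening the printed URL once in a browser.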
allow-nesting (optional)
- Accepts "true" and "false"
  - Defaults to "false" if not set
  - This option affects how test questions are evaluated. Please refer to this article for more information.
span-creation (optional)
- Accepts "true" and "false"
  - Defaults to "true" if not present
  - Setting this attribute to "false" prevents the merging of tokens
context-column (optional)
- A larger piece of text in your source data containing the text to annotate.
  - Please note: This must be a column of text strings, even if your source-type="json"
review-data (optional)
- This parameter accepts the column header containing pre-created annotations. The format must match the output of the text annotation tool (JSON in a hosted URL).
task-type (optional)
- Please set task-type="qa" when designing a review or QA job. This parameter needs to be used in conjunction with review-data. See this article for more details.
direction (optional)
- Renders text in a specific direction
- Accepts RTL and LTR for right-to-left and left-to-right scripts, respectively
- Defaults to left-to-right if not set
trim-whitespace (optional, only supported with type="non-tokenized")
- When enabled, any leading or trailing spaces in a contributor’s annotation will be ignored by the tool and excluded from the output.
- By default, trim-whitespace is set to "true".
- Example:
  - If a contributor annotates _big_red_dog_ (where _ represents a space), the tool will capture and label only big_red_dog in the output.
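
The effect of trim-whitespace on character offsets can be pictured with the sketch below. It is an illustration under assumptions, not the tool's implementation: trim_span is a hypothetical name, and the span is treated as an inclusive [start, end] pair, following the "start" and "end" descriptions in the Results section.

```python
def trim_span(full_text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink an inclusive [start, end] character span to drop surrounding whitespace.

    If the selection is all whitespace, the returned start is greater than end.
    """
    while start <= end and full_text[start].isspace():
        start += 1
    while end >= start and full_text[end].isspace():
        end -= 1
    return start, end


text = "a big red dog barks"
start, end = trim_span(text, 1, 13)  # selection " big red dog " with both spaces
print(start, end, repr(text[start:end + 1]))
# prints 2 12 'big red dog'
```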

Ontology

The Ontology Manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation Jobs require an ontology to launch.

When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.


Figure 2. Ontology Manager for Text Annotation

Ontology Manager Best Practices

An ontology can contain up to 1,000 classes. As a best practice, however, we recommend not exceeding 16 classes in a job so that contributors can understand and process the different classes.
Choose from 16 preselected colors, or upload custom colors as hex codes via the CSV ontology upload.
If you uploaded model predictions as JSONs, the predicted classes should also be added to the ontology.

Upload Data

Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:

Text
- CML attribute: source-type="text"
- File content:
  - 1 column of text strings (required)
  - 1 column of context (optional)
Links to JSON
- CML attribute: source-type="json"
- File content:
  - 1 column of links to hosted JSON files
    - Note: Bucket must be CORS configured and publicly viewable. For more information on secure hosting, check out this article.
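
A source file for the Text option can be generated with a short script. The sketch below is a minimal example under assumptions: the column names "text" and "context" are placeholders, and they must match whatever columns your CML references in source-data and context-column.

```python
import csv
import io


def source_csv(rows, text_column="text", context_column="context") -> str:
    """Build CSV content with one text column and one optional context column.

    rows is an iterable of (text, context) pairs; the column names are
    hypothetical and should match the columns used in your CML.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=[text_column, context_column])
    writer.writeheader()
    for text, context in rows:
        writer.writerow({text_column: text, context_column: context})
    return buffer.getvalue()


content = source_csv([
    ("She moved to Lyon, France.", "She moved to Lyon, France. It rained all week."),
])
# Write the content to disk before uploading it to the job.
with open("text_annotation_source.csv", "w", newline="", encoding="utf-8") as f:
    f.write(content)
```

Using the csv module rather than joining strings by hand keeps commas and quotation marks inside the text from breaking the column layout.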

Below are example files showing how to structure source data.

💡 Note: The Tokenized and Non-Tokenized versions of the tool have different output formats.

Results

Results of Text Annotation are stored as JSON files that contain the original text, the spans, and the classes associated with each token.
Results can be acquired using the Full, Aggregated, Download Annotations Only, or Download Annotations and Aggregations report types.
- When using Full or Aggregated report types, see this article on how to retrieve individual JSON files from "url" or "valueRef" metadata.
- Result links in reports will expire 15 days after generation. To get new result links, you can re-generate reports via the Results page.
- Download Annotations and Aggregations report:
  - This report directly downloads a zip file with all the judgments from the full report and aggregate report as JSON files. Large reports may take several hours to generate.
- Download Annotations Only report:
  - This report directly downloads a zip file with the judgments from the full report (without aggregation).

Below is a partial snapshot of an output example.

Tokenized Tool output example:


- "text" = original line of text that requires annotation from your source data
  - The "text" key-value pair can refer to individual tokens or to the entire source text for a row, depending on its location in the JSON.
- "classnames" = class assigned to the token or span by an annotator. If the token did not get a classname, the resulting value will be an empty array.
- "tokens" = the tokens (strings) associated with the "classnames"
  - "startIDx" = start of the annotation; indexing starts at 0
  - "endIDx" = end of the annotation
- "annotated_by" = "human" if the span is annotated by a contributor in this job, "machine" if the classname was pre-loaded to the job via data upload.
- "parent" = the parent span of the current span. This is only relevant if you have enabled span nesting.
- "children" = the child spans of the current span. This is only relevant if you have enabled span nesting.

Non-Tokenized Tool output example:

"full_text" = original line of text that requires annotation from your source data
"text" = the text that was annotated
"class_name" = class assigned to the annotation
"start" = the index of the first character in the annotation
"end" = the index of the last character in the annotation