
Guide to: Text Annotation Job Design

Overview

The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.

Glossary

    • Token - the smallest possible annotatable fragment of a string; tokens are predefined for the contributor by the tokenizer.
    • Tokenizer - the rules used to split text/strings into tokens.
    • Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment.
    • Nested Span - a smaller span annotated within a bigger span, or a single span with multiple annotations.

Note: For the Secure Data option, please contact your Customer Success Manager for additional information.

Build a Job

The following CML contains the possible parameters for a text annotation job:

<cml:text_annotation source-data="{{source_data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>


Figure 1. How to Edit a Text Annotation Job via the Graphical Editor

Parameters

Below are the parameters available for the cml:text_annotation tag. Some are required in the element; others are optional and can be left out.

  • source-type='text' or source-type='json' (required)
    • This attribute tells the tool whether to expect text or JSON
        • If source-type='text', a text string is expected in the {{source_data_column}} and you must specify a tokenizer to use on the text. You can optionally also specify the language of the text.
          • tokenizer (required if source-type='text')
          • This tool accepts "Spacy" (spacy), "NLTK" (nltk), "Stanford NLP" (stanford), or "Split on &nbsp;" (nbsp). The tokenizer can also be selected via the graphical editor.
          • Note: Use the nbsp tokenizer if you'd like to bring a custom tokenization in via a text upload. If used, tokens will be created based on the location of "&nbsp;" in the text. For example, the string "New York&nbsp;is a large city" would be split into the tokens "New York" and "is a large city". You can use this to create irregular tokens to label, such as whole sentences or partial clauses.
          • language (optional)
          • Set the language of the text being tokenized; this is required if the data is non-English.
            • Example: language="fr" (see the combined example after this parameter list)
          • The available options by tokenizer are as follows (default is English):
            • Spacy: en, fr, de, pt, it, nl, es
            • NLTK: en, de, es, pt, fr, it, nl
            • Stanford NLP: en, fr, de, es
        • If source-type='json', the tool will attempt to access the files containing tokens, spans, and predictions whenever it loads. JSON annotations must be provided via a URL to a cloud storage location; JSON cannot be read directly from a CSV (see the Upload Data section for more information). The ontology used in the JSON input must match the ontology of the job. Reference Text_Annotation_Sample_Tokenization_Schema.json below as an example for uploading pre-tokenized text to the text annotation tool.

        • Important note: The text annotation tool is likely to experience latency if the passage is over 2,000 characters. To avoid performance issues, we recommend splitting long passages into smaller segments and using the context-column feature to provide the surrounding context.
  • source-data (required)
    • The name of the column containing the source data to be annotated.
  • name (required)
    • The results header where the annotations will be stored
  • validates (optional)
    • Accepts "required" (default), or "required all-tokens"
        • If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case.
        • If validates="required all-tokens": Contributors must assign a class label to each token before being allowed to submit. The "none" class is considered a valid label in this case.
  • search-url (optional)
    • Include a search engine URL to enable the tool's lookup function
    • Replace the query term with "%s"
        • Example: search-url="https://www.google.com/search?q=%s"
  • allow-nesting (optional)
      • Accepts "true" and "false"
      • Defaults to "false" if not set
      • This option affects how test questions are evaluated. Please refer to this article for more information.
  • span-creation (optional)
      • Accepts "true" and "false"
      • Defaults to "true" if not present
      • Setting this attribute to "false" prevents contributors from merging tokens into multi-token spans
  • context-column (optional)
      • A larger piece of text in your source data containing the text to annotate.
      • Please note: This must be a column of text strings, even if your source-type="json"
  • review-data (optional)
      • This parameter accepts the column header containing pre-created annotations. The format must match the output of the text annotation tool (JSON in a hosted URL).
  • task-type (optional)
      • Please set task-type="qa" when designing a review or QA job. This parameter needs to be used in conjunction with review-data; see the QA example after this parameter list and refer to this article for more details.
  • direction (optional)
    • Renders text in a specific direction
    • Accepts "rtl" and "ltr" for right-to-left and left-to-right scripts respectively
    • Defaults to left-to-right if not set
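
To illustrate how these parameters combine, below is a sketch of a text-type job that sets the tokenizer, language, nesting, and lookup options. This is an illustration rather than a prescribed configuration: the column name source_text and the output name annotation are placeholders for your own data and results headers.

<cml:text_annotation source-data="{{source_text}}" name="annotation" source-type="text" tokenizer="spacy" language="fr" allow-nesting="true" span-creation="true" search-url="https://www.google.com/search?q=%s" direction="ltr" validates="required"/>

Similarly, a review/QA job sketch might pair task-type="qa" with review-data. Here the column name annotation_link is a placeholder, and the assumption that review-data is referenced with {{...}} liquid syntax (like source-data) should be verified for your job.

<cml:text_annotation source-data="{{source_text}}" name="reviewed_annotation" source-type="text" tokenizer="spacy" review-data="{{annotation_link}}" task-type="qa" validates="required"/>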

Ontology

The Ontology Manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation Jobs require an ontology to launch.

When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.


Figure 2. Ontology Manager for Text Annotation

Ontology Manager Best Practices 

  • The ontology is limited to 1,000 classes; however, as a best practice, we recommend not exceeding 16 classes in a job to ensure contributors can understand and process the different classes.
  • Choose from the 16 pre-selected colors or upload custom colors as hex codes via the CSV ontology upload.
  • If you upload model predictions as JSON, the predicted classes should also be added to the ontology.

Upload Data

Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:

  • Text
    • CML attribute: source-type="text"
    • File content: 
      • 1 column of text strings (required)
      • 1 column of context (optional) 
  • Links to JSON
    • CML attribute: source-type="json"
    • File content: 
      • 1 column of links to hosted JSON files
        • Note: Bucket must be CORS configured and publicly viewable. For more information on secure hosting, check out this article.

Note: Below are example files showing how to structure source data.
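
As an illustration of the text option, a source CSV could look like the sketch below; the column names text_column and context_column are placeholders that you would reference in source-data and context-column respectively, and the rows are invented examples.

text_column,context_column
"Jane Smith visited Paris in June.","Optional surrounding paragraph providing context for this sentence."
"The quick brown fox jumped over the lazy dog.","Another optional context passage."

For the JSON option, the CSV would instead contain a single column of links, each pointing to a hosted, CORS-configured JSON file; the column name and URLs below are placeholders.

json_link
https://your-bucket.example.com/unit_001.json
https://your-bucket.example.com/unit_002.json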

Results

  • Results of Text Annotation are stored as JSON files that contain the original text, spans, and classes associated with each token. 

  • Results can be acquired using the Full, Aggregated, Download Annotations Only, or Download Annotations and Aggregations report types.

    • When using Full or Aggregated report types, see this article on how to retrieve individual JSON files from "url" or "valueRef" metadata. 

    • Result links in reports will expire 15 days after generation. To get new result links, you can re-generate reports via the Results page.

    • Download Annotations and Aggregations report:

      • This report directly downloads a zip file with all the judgments from the full report and aggregate report as JSON files. Large reports may take several hours to generate.

    • Download Annotations Only report:

      • This report directly downloads a zip file with the judgments from the full report (without aggregation).

  • Below is a breakdown of the key fields in an example output.


    • "text" = original line of text that requires annotation from your source data
        • the "text" key-value pair can refer to individual tokens or to the entirety of the source data for a row, depending on its location in the JSON.
    • "classnames" = class assigned to the token or span by an annotator. If the token was not assigned a classname, the resulting value will be an empty array.
    • "tokens" = the string(s) associated with "classnames"
        • "startIDx" = start of annotation, which starts from index 0 
        • "endIDx" = end of annotation
    • "annotated_by" = "human" if the span is annotated by a contributor in this job, "machine" if the classname was pre-loaded to the job via data upload.
    • "parent" = the parent span of the current span. This is only relevant if you have enabled span nesting.
    • "children" = the child spans of the current span. This is only relevant if you have enabled span nesting.
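
For illustration only, a single span entry in an output file might look like the sketch below. The sentence, class names, and index values are hypothetical, and the top-level "spans" grouping key and exact key casing (e.g. "startIDx" vs. "startIdx") should be confirmed against a report generated from your own job.

{
  "text": "Jane Smith visited Paris in June.",
  "spans": [
    {
      "text": "Jane Smith",
      "classnames": ["person"],
      "tokens": ["Jane", "Smith"],
      "startIDx": 0,
      "endIDx": 1,
      "annotated_by": "human",
      "parent": null,
      "children": []
    }
  ]
}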
