Overview
The cml:text_annotation tag allows users to create a text annotation job with a custom ontology, test questions, and aggregation.
Glossary
- Token - the smallest possible unit of data able to be annotated in a string, predefined for the contributor by the tokenizers, or provided by the user.
- Span - a set of tokens (1 or more) with an assigned class label - the output of a model or a contributor judgment.
- Tokenizer - the rules by which to split text/strings into tokens.
- Nested Spans - a structure in which smaller spans can be annotated within larger spans. This also allows one span to be assigned multiple classes.
*Note: For the Secure Data option, please contact your Customer Success Manager for additional information.
Build a Job
The following CML contains the possible parameters for a text annotation job:
<cml:text_annotation source-data="{{source_data_column}}" name="output_column" tokenizer="spacy" source-type="text" search-url="https://www.google.com/search?q=%s" validates="required"/>
Figure 1. How to Edit Text Annotation Job via the Graphical Editor
Parameters
Below are the parameters available for the cml:text_annotation tag. Some are required in the element; others can be left out.
source-type (required)
- Accepts source-type='text' or source-type='json'.
- This attribute tells the tool whether to expect text or JSON.
  - If text, a text string is expected, and you must specify the language and tokenizer to use on the text.
  - If JSON, the tool will attempt to access the files with tokens, spans, and predictions whenever it loads. JSON annotations must be provided via a URL to a cloud storage location. JSON cannot be read directly from a CSV (see the Upload Data section for more information). The ontology used in the JSON input must match the ontology of the job.
  - Important note: The text annotation tool is likely to experience latency if the passage is over 2,000 characters. To avoid performance issues, we recommend splitting long passages into smaller segments and using the context feature to provide context.
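The recommendation to split long passages can be sketched in Python as follows. This is a minimal illustration, not part of the tool: the function name split_passage and the naive sentence-boundary rule are assumptions, and only the 2,000-character figure comes from the note above.

```python
import re

MAX_CHARS = 2000  # latency threshold mentioned in the note above


def split_passage(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into segments of at most max_chars, breaking after sentences."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Each segment could then be uploaded as its own row, with the full passage (or a summary of it) placed in the column referenced by context-column.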
source-data (required)
- The name of the column containing the source data to be annotated.
name (required)
- The results header where the results links will be stored.
validates (optional)
- Accepts "required" (default) or "all_tokens".
  - If validates="required": Contributors must assign a class label to at least one token on each unit. The "none" class does not count as a valid class label in this case.
  - If validates="required all_tokens": Contributors must assign a class label to each token before being allowed to submit. The "none" class is considered a valid label in this case.
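The two validation rules can be expressed as a small Python check. This only illustrates the rules as described above; it is not the tool's implementation, and the function name, the all_tokens flag, and the use of None for an unlabeled token are assumptions made for the example.

```python
from typing import Optional, Sequence


def passes_validation(labels: Sequence[Optional[str]], all_tokens: bool = False) -> bool:
    """labels holds one entry per token: a class name, "none", or None if unlabeled."""
    if all_tokens:
        # Stricter rule: every token needs a label, and "none" is accepted.
        return all(label is not None for label in labels)
    # Default rule: at least one token carries a real class ("none" does not count).
    return any(label not in (None, "none") for label in labels)
```

For instance, a unit labeled only with "none" fails the default rule but passes the stricter rule, provided every token has been given that label.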
search-url (optional)
- Include a search engine URL to link to the tool's lookup function.
- Replace the query with "%s".
  - Example: search-url="https://www.google.com/search?q=%s"
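To make the "%s" placeholder concrete, the Python sketch below shows how a selected piece of text might be substituted into the template. How the tool itself encodes the query is not documented here, so the URL encoding via quote_plus and the helper name build_lookup_url are assumptions.

```python
from urllib.parse import quote_plus

# Template taken from the CML example in this article.
SEARCH_URL = "https://www.google.com/search?q=%s"


def build_lookup_url(template: str, selected_text: str) -> str:
    """Replace the %s placeholder with the URL-encoded selected text."""
    if "%s" not in template:
        raise ValueError('search-url template must contain "%s"')
    return template.replace("%s", quote_plus(selected_text), 1)
```

A template without "%s" is rejected early, which is a quick way to catch a malformed search-url value before launching a job.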
allow-nesting (optional)
- Accepts "true" and "false".
- Defaults to "false" if not set.
- This option affects how test questions are evaluated. Please refer to this article for more information.
span-creation (optional)
- Accepts "true" and "false".
- Defaults to "true" if not present.
- Setting this attribute to "false" prevents the merging of tokens.
context-column (optional)
- A larger piece of text in your source data containing the text to annotate.
- Please note: This must be a column of text strings, even if your source-type="json".
review-data (optional)
- This parameter accepts the column header containing pre-created annotations. The format must match the output of the text annotation tool (JSON in a hosted URL).
task-type (optional)
- Please set task-type="qa" when designing a review or QA job. This parameter needs to be used in conjunction with review-data. See this article for more details.
direction (optional)
- Renders text in a specific direction.
- Accepts rtl and ltr for right-to-left and left-to-right scripts respectively.
- Defaults to left-to-right if not set.
If your source data is in text, you can use the following parameters:
tokenizer
- This is required if source-type="text".
- This tool accepts "Spacy" (spacy), "NLTK" (nltk), "Stanford NLP" (stanford), or "Split on &nbsp;" (nbsp).
  - Note: Use the nbsp tokenizer if you'd like to bring a custom tokenizer in via a text upload. If used, tokens will be created based on the location of non-breaking spaces ("&nbsp;") in the text. You can use this to create irregular tokens to label, like whole sentences or partial clauses.
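As an illustration of preparing text for the nbsp tokenizer, the Python sketch below joins pre-computed tokens with a separator. It assumes the separator is the Unicode non-breaking space character (U+00A0); if your job expects the literal string "&nbsp;" instead, pass that as the separator. The helper name join_custom_tokens is hypothetical.

```python
# Assumption: the nbsp tokenizer splits on the Unicode non-breaking space.
NBSP = "\u00a0"


def join_custom_tokens(tokens: list[str], separator: str = NBSP) -> str:
    """Join custom tokens so that each one is delimited by the separator."""
    for token in tokens:
        # A separator inside a token would silently create an extra token.
        if separator in token:
            raise ValueError(f"token already contains the separator: {token!r}")
    return separator.join(tokens)
```

Tokens may contain ordinary spaces, which is what makes whole sentences or partial clauses usable as single tokens.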
language (optional)
- Sets the language of the text being tokenized; this is required if the data is non-English.
- The available options by tokenizer are as follows (default is English):
  - Spacy: en, fr, de, pt, it, nl, es
  - NLTK: en, de, es, pt, dr, it, nl
  - Stanford NLP: en, fr, de, es
- Example: language="fr"
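When the tag is generated programmatically, a small helper can guard against unbalanced quotes and missing required attributes. The Python sketch below is one possible approach and not an official utility: the function name is hypothetical, and it assumes standard XML attribute quoting (via quoteattr) is acceptable to the CML editor.

```python
from xml.sax.saxutils import quoteattr

# Attributes marked as required in the parameter list above.
REQUIRED = ("source-type", "source-data", "name")


def build_text_annotation_tag(attrs: dict[str, str]) -> str:
    """Render a cml:text_annotation tag with properly quoted attributes."""
    missing = [key for key in REQUIRED if key not in attrs]
    if missing:
        raise ValueError(f"missing required attributes: {missing}")
    if attrs["source-type"] == "text" and "tokenizer" not in attrs:
        raise ValueError('tokenizer is required when source-type="text"')
    # quoteattr adds the surrounding quotes and escapes &, < and >.
    rendered = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<cml:text_annotation {rendered}/>"
```

Attributes are emitted in the order they appear in the dictionary, so the output can be compared directly against a hand-written tag.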
Ontology
The Ontology Manager allows job owners to create and edit the ontology within a Text Annotation job. Text Annotation Jobs require an ontology to launch.
When the CML for a text annotation job is saved, the Ontology Manager link will appear at the top of the Design page.
Figure 2. Ontology Manager for Text Annotation
Ontology Manager Best Practices
- An ontology is limited to 1,000 classes; however, as a best practice, we recommend not exceeding 16 classes in a job so that contributors can understand and process the different classes.
- Choose from 16 pre-selected colors or upload custom colors as hex codes via the CSV ontology upload.
- If you uploaded model predictions as JSONs, the predicted classes should also be added to the ontology.
Upload Data
Upload data into the job as a CSV where each row represents text that will be tokenized and annotated. There are two options for uploading data:
- Text
  - CML attribute: source-type="text"
  - File content:
    - 1 column of text strings (required)
    - 1 column of context (optional)
- Links to JSONs
  - CML attribute: source-type="json"
  - File content:
    - 1 column of links to hosted JSONs
      - Note: Bucket must be CORS configured and publicly viewable. For more information on secure hosting, check out this article.
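For the text option, a source CSV can be produced with Python's csv module, as in the sketch below. The column headers "text" and "context" are placeholders chosen for this example; use whichever headers your source-data and context-column attributes reference.

```python
import csv
import io


def build_source_csv(rows: list[dict[str, str]]) -> str:
    """Serialize rows into CSV text with one text column and one context column."""
    buffer = io.StringIO()
    # Hypothetical headers; they must match the columns named in the CML.
    writer = csv.DictWriter(buffer, fieldnames=["text", "context"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Using the csv module rather than manual string joins means commas and quotes inside a passage are escaped correctly.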
*Note: Below are example files on how to structure source data.
Results
- Results are links to a JSON file that contains original text, spans, and classes associated with each token.
- The links are found in the Full or Aggregated report under the column header that was specified as the value for the name attribute.
- Below is a snapshot taken from this output example.
- "text" = original line of text that requires annotation from your source data
- "classnames" = class assigned
- "tokens" = text associated with "classname"
- "startIDx" = start of annotation, which starts from index 0
- "endIDx" = end of annotation
- Text without classes associated will be noted by null as its "classname"
- "annotated_by" = "human" if the span is annotated by a contributor in this job, "machine" if the span is pre-loaded to the job via data upload.
- "parent" = the parent span of the current span. This is only relevant if you have enabled span nesting.
- "children" = the child spans of the current span. This is only relevant if you have enabled span nesting.
- Result links in Full and Aggregated reports will expire 15 days after generation. To get new result links, you can force re-generate the reports via the Results page.
- Please note the attached image displays only part of the results.
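Once a result file has been downloaded, it can be summarized with a few lines of Python. The sketch below is illustrative only: the key names "text", "classnames", "tokens", and "annotated_by" come from the list above, but the exact nesting is not shown in this article, so the top-level "spans" list and the sample content are assumptions. The "startIDx" and "endIDx" fields are left out because the article does not say whether the end index is inclusive.

```python
import json
from collections import Counter

# Hypothetical result file; verify the real layout against your own output.
SAMPLE = """
{
  "text": "Ada Lovelace lived in London",
  "spans": [
    {"classnames": ["PERSON"], "tokens": ["Ada", "Lovelace"], "annotated_by": "human"},
    {"classnames": null, "tokens": ["lived", "in"], "annotated_by": null},
    {"classnames": ["LOCATION"], "tokens": ["London"], "annotated_by": "machine"}
  ]
}
"""


def count_classes(result: dict) -> Counter:
    """Count (class name, annotator type) pairs, skipping unlabeled text."""
    counts: Counter = Counter()
    for span in result.get("spans", []):
        # A null "classnames" value marks text with no class assigned.
        for name in span.get("classnames") or []:
            counts[(name, span.get("annotated_by"))] += 1
    return counts


summary = count_classes(json.loads(SAMPLE))
```

Grouping by "annotated_by" makes it easy to compare contributor annotations with spans that were pre-loaded via data upload.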
We now have a new report type for text annotation jobs: Download All Annotations
- This report is only available in text annotation jobs via the Results page.
- This report directly downloads the JSON files in the job rather than links to them, as the full/aggregated reports do. If metadata from the full or aggregated report is needed, those reports should be downloaded separately.
- The report downloads as a zip file. Once the zip is opened, you will see a new folder containing two folders:
  - One folder contains all the aggregated judgments, labeled by unit id.
  - The other folder contains the full report judgments, labeled by judgment id.
- This report is more secure than the aggregated report, as no URLs are generated.
Note: This report may take a while to generate and download because of the size of its data files. However, the download will still be much faster than running scripts to scrape the results.
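The downloaded archive can be read without extracting it to disk. The Python sketch below loads every JSON file in a zip into a dictionary keyed by its path inside the archive. The folder names inside the archive are not specified in this article, so the code makes no assumption about them; the function name load_annotations is hypothetical.

```python
import io
import json
import zipfile


def load_annotations(zip_bytes: bytes) -> dict[str, dict]:
    """Map each JSON member path in the archive to its parsed content."""
    loaded: dict[str, dict] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            # Skip directories and any non-JSON files.
            if member.lower().endswith(".json"):
                loaded[member] = json.loads(archive.read(member))
    return loaded
```

Because the member paths are preserved, aggregated files (labeled by unit id) and full-report files (labeled by judgment id) can be separated afterwards by their parent folder.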