Scale and grow your text utterance collection use cases by including smart validators that automatically detect, flag and reject invalid utterances before they can be submitted.
The CML attributes cml:smart_text, cml:text, and cml:textarea can be used in these use cases (as well as in any other text-based application, such as transcription and translation).
Note
Smart Validators can be enabled on your Team account. Please reach out to your Customer Success Manager if you are interested in turning this feature on.
Glossary
- Utterance - The piece of text to be collected from the contributor
- Gibberish - Text that is illogical or incoherent
- Prompt - Data provided to the contributors to give them guidance on what utterances to collect
Job Design
Smart validators can be enabled in the graphical editor or specified directly in the CML via the code editor. They can also be used in conjunction with basic validators such as Word count (see this article and this article).
Example Job design - Code Editor:
<cml:text label="Sample text field:" validates="required unique:['within_job', 0.8] lang:['en', 0.10] gibberish:[0.10] wordCountMin:8" />
Smart Validators
Language Detector
required lang:['{language}', {threshold_number}]
- This validator is used to ensure contributors are submitting text in the correct language. Learn more about our Language Detector model in this article.
- The currently supported languages are:
  - English: en
  - German: de
  - French: fr
  - Spanish: es
  - Japanese: ja
  - Portuguese: pt
  - Italian: it
- You will need to provide a threshold that will be used to evaluate each contributor's submission. The lower the threshold, the more lenient the evaluation will be.
- Note: you may only validate for one target language per field.
Fig 2. Language Detection Validator
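As an illustration, a field that should only accept German text could be written in the code editor roughly as follows. This is a minimal sketch: the label text is hypothetical, and the threshold of 0.10 is simply borrowed from the job design example above rather than being a recommended value.

```xml
<!-- Hypothetical label; 'de' is the German code from the supported languages list. -->
<!-- The 0.10 threshold is an assumed value; tune it for your own data. -->
<cml:textarea label="Describe your last holiday in German:" validates="required lang:['de', 0.10]" />
```

Because only one target language can be validated per field, a job collecting text in two languages would need a separate field for each.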
Gibberish Detector
required gibberish:[{threshold_number}]
- This validator is used to ensure contributors are submitting text that is cohesive and coherent. The model auto-detects the language they are typing in and then evaluates the probability that what they’re typing is valid text in that language. Learn more about our Gibberish Detector model in this article.
- These validators work best on text longer than 10 words.
- The currently supported languages are:
  - English: en
  - German: de
  - French: fr
  - Spanish: es
  - Japanese: ja
  - Portuguese: pt
  - Italian: it
- Note: You can use this in conjunction with other validators, but you may only set one threshold per individual field.
Fig 3. Gibberish Detection Validator
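A gibberish check might be combined with a minimum word count so that submissions are long enough for the model to evaluate reliably. The sketch below reuses the wordCountMin validator from the job design example; the label, the threshold of 0.10, and the minimum of 10 words are illustrative assumptions, not recommended settings.

```xml
<!-- Hypothetical label; threshold and word count are assumed values. -->
<!-- wordCountMin keeps submissions near the length at which the detector works best. -->
<cml:textarea label="Describe the problem you had with your order:" validates="required gibberish:[0.10] wordCountMin:10" />
```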
Duplicate Detection
The following validators give you the option of enforcing only unique submissions of text. This is helpful if you need many diverse examples of responses to the same prompt. Note that two of these validators also include a threshold parameter; see below for more detail.
Fig 4. Duplicate Detection Validator
- In this job, across all contributors
  validates = "required unique:['within_job', {threshold_number}]"
  - If your job is collecting multiple utterances per prompt and you only want unique utterances across the whole job, use this option.
- In this job, across all contributors, across a unique prompt value
  validates = "required unique:['within_column_name']"
  - If your job is collecting utterances for multiple prompts and you want to ensure that you are getting unique responses for each prompt, you can use this validator to specify the column that will be validated on. This enforces unique utterances for each prompt, but there may be duplicate utterances across the whole dataset.
- In this job, across the unique contributor's submissions
  validates = "required unique:['within_contributor', {threshold_number}]"
  - If your job is collecting multiple utterances and you are interested in the range of likely utterances, or would like to understand the frequency of certain utterances across different contributors, use this setting. This ensures each contributor's submissions are unique, but there may be duplicate utterances across contributors.
- In multiple jobs, across all contributors
  validates = "required unique:[{job_id}, {job_id}]"
  - This validator allows you to compare against data collected for a completed job or jobs. You can use this validator if you want to collect additional unique utterances.
In the graphical editor, you will be asked to enter the job ID(s) to compare against, separated by commas. All jobs to be compared should have identical CML.
Fig 5. Duplicate Detection Validator: Enter Job IDs
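To make the options above more concrete, the following sketch shows how two of them might look in the code editor. The labels and job IDs are hypothetical placeholders, and the 0.8 threshold is taken from the job design example earlier in this article; substitute values from your own jobs.

```xml
<!-- Unique per contributor: the same contributor cannot repeat an utterance. -->
<!-- 0.8 is an assumed threshold value. -->
<cml:text label="Ask the assistant to set an alarm:" validates="required unique:['within_contributor', 0.8]" />

<!-- Unique against completed jobs: 1111111 and 2222222 are made-up job IDs. -->
<cml:text label="Ask the assistant to set an alarm:" validates="required unique:[1111111, 2222222]" />
```

These are alternatives for the same field rather than two fields to be used together; pick the scope that matches how much variety you need.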
Duplicate Detection Threshold
An additional duplicate detection setting is available in Text Collection jobs that are running in Quality Flow (to find out more about Quality Flow see this article and this article). This setting will allow you to specify a threshold at which utterances should be considered duplicates.
Learn more about the threshold settings in this article.
Duplicate detection threshold is available in Quality Flow work jobs for the following settings:
- In this job, across all contributors
- In this job, across the unique contributor's submissions
Fig 6. Duplicate threshold
The first submitted utterance serves as the baseline for comparison with subsequent submissions. On submission, any utterance that meets or exceeds the specified threshold is considered a duplicate and is not accepted; the contributor receives an error message, as in the screenshot below.
Fig 7. Contributor view: duplicate detection threshold set to 0.9
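In the code editor, a threshold of 0.9 like the one in Fig 7 might be expressed as shown below. This sketch assumes the "In this job, across all contributors" scope and a hypothetical label; the figure itself does not state which scope was used.

```xml
<!-- Assumed scope: within_job. The 0.9 threshold matches the value shown in Fig 7. -->
<!-- A submission that meets or exceeds 0.9 against an earlier utterance is rejected. -->
<cml:text label="Ask the assistant to play a song:" validates="required unique:['within_job', 0.9]" />
```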