
Guide to: Text Utterance Collection with Smart Validators

Scale and grow your text utterance collection use cases by including smart validators that automatically detect, flag and reject invalid utterances before they can be submitted.

The CML elements cml:smart_text, cml:text, and cml:textarea can be used in these use cases (as well as in any other text-based application, such as transcription and translation).

Note

Smart Validators can be enabled on your Team account. Please reach out to your Customer Success Manager if you are interested in turning this feature on.

Glossary 

  • Utterance - The piece of text to be collected from the contributor 
  • Gibberish - Text that is illogical or incoherent
  • Prompt - Data provided to the contributors to give them guidance on what utterances to collect

 

Job Design 

Smart validators can be added in the graphical editor or specified in the CML via the code editor. They can also be combined with basic validators such as word count (see this article and this article).

Example Job design - Code Editor:

<cml:text label="Sample text field:" validates="required unique:['within_job', 0.8] lang:['en', 0.10] gibberish:[0.10] wordCountMin:8" />

 

Smart Validators 

Language Detector

required lang:['{language}', {threshold_number}]

  • This validator is used to ensure contributors are submitting text in the correct language. Learn more about our Language Detector model in this article.
  • The currently supported languages are:
    • English en
    • German de
    • French fr
    • Spanish es
    • Japanese ja
    • Portuguese pt
    • Italian it

  • You will need to provide a threshold that is used to evaluate contributors' submissions. The lower the threshold, the more lenient the evaluation.
  • Note: You may only validate for one target language per field.
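
As a minimal sketch, a field that collects French utterances might be configured as follows in the code editor. The label text and the 0.10 threshold are illustrative assumptions, not recommended values; the validates syntax follows the pattern shown in the Job Design example.

```xml
<!-- Hypothetical field: accept only text detected as French (fr). -->
<!-- 0.10 is an illustrative threshold; a lower value makes the check more lenient. -->
<cml:textarea label="Enter your sentence in French:" validates="required lang:['fr', 0.10]" />
```

Because only one target language can be validated per field, a job that collects several languages would likely need a separate field for each language.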


Fig 2. Language Detection Validator

 

Gibberish Detector 

required gibberish:[{threshold_number}]

  • This validator is used to ensure contributors are submitting text that is cohesive and coherent. The model auto-detects the language they are typing in and then evaluates the probability that what they’re typing is valid text in that language.  Learn more about our Gibberish Detector model in this article.
  • This validator works best on text longer than 10 words.
  • The currently supported languages are:
    • English en
    • German de
    • French fr
    • Spanish es
    • Japanese ja
    • Portuguese pt
    • Italian it
  • Note: You can use this in conjunction with other validators, but you may only set one threshold per individual field.
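
Since the gibberish detector works best on longer text, one option is to pair it with a minimum word count. The sketch below assumes the wordCountMin validator from the Job Design example; the label and both numeric values are illustrative only.

```xml
<!-- Hypothetical field: wordCountMin:10 sets a minimum word count, -->
<!-- and gibberish:[0.10] sets the single gibberish threshold allowed per field. -->
<cml:textarea label="Describe your request in a full sentence:" validates="required gibberish:[0.10] wordCountMin:10" />
```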


Fig 3. Gibberish Detection Validator

Duplicate Detection

The following validators give you the option of enforcing only unique submissions of text. This is helpful if you need many diverse examples of responses to the same prompt. Note that two of these validators also include a threshold parameter; see below for more detail.

 


Fig 4. Duplicate Detection Validator

  • In this job, across all contributors
    • validates = "required unique:['within_job', {threshold_number}]"
    • If your job is collecting multiple utterances per prompt and you only want unique utterances across the whole job, use this option.
  • In this job, across all contributors, across a unique prompt value
    • validates = "required unique:['within_column_name']"
    • If your job is collecting utterances for multiple prompts and you want to ensure unique responses for each prompt, use this validator to specify the column to validate on. This enforces unique utterances for each prompt, but there may be duplicate utterances across the whole dataset.
  • In this job, across the unique contributor's submissions
    • validates = "required unique:['within_contributor', {threshold_number}]"
    • If your job is collecting multiple utterances and you are interested in the range of likely utterances, or would like to understand the frequency of certain utterances across different contributors, use this setting. This ensures each contributor's submissions are unique, but there may be duplicate utterances across contributors.
  • In multiple jobs, across all contributors
    • validates = "required unique:[{job_id}, {job_id}]"
    • This validator allows you to compare against data collected for a completed job or jobs. You can use this validator if you want to collect additional unique utterances.

In the graphical editor, you will be asked to enter the job ID(s) to compare against, separated by commas. All jobs to be compared should have identical CML.


Fig 5. Duplicate Detection Validator: Enter Job IDs
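
In the code editor, the duplicate detection options above might be written as in the sketch below. The labels, the 0.8 thresholds, and the job IDs 1234567 and 2345678 are hypothetical placeholders; substitute your own values.

```xml
<!-- Unique across the whole job, for all contributors. -->
<cml:text label="Utterance:" validates="required unique:['within_job', 0.8]" />

<!-- Unique within each contributor's own submissions. -->
<cml:text label="Utterance:" validates="required unique:['within_contributor', 0.8]" />

<!-- Unique against data from previously completed jobs (hypothetical job IDs). -->
<cml:text label="Utterance:" validates="required unique:[1234567, 2345678]" />
```

Each element is an alternative configuration for a single field, not three fields to be used together.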

 

Duplicate Detection Threshold

An additional duplicate detection setting is available in Text Collection jobs that are running in Quality Flow (to find out more about Quality Flow see this article and this article). This setting will allow you to specify a threshold at which utterances should be considered duplicates.

Learn more about the threshold settings in this article.

Duplicate detection threshold is available in Quality Flow work jobs for the following settings:

  • In this job, across all contributors
  • In this job, across the unique contributor's submissions


Fig 6. Duplicate threshold

 

The first submitted utterance serves as the baseline for comparison with subsequent submissions. Upon submission, any utterance that is considered a duplicate because it meets or exceeds the specified threshold will not be accepted, and the contributor will receive an error message, as in the screenshot below.


Fig 7. Contributor view: duplicate detection threshold set to 0.9
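
A field using a threshold of 0.9, as in Fig 7, might look like the following in the code editor. This is a sketch that assumes the threshold syntax shown under Duplicate Detection also applies to Quality Flow work jobs; the label and the choice of the within_job option are hypothetical.

```xml
<!-- Threshold 0.9: a submission that meets or exceeds it is treated as a duplicate and rejected. -->
<cml:text label="Utterance:" validates="required unique:['within_job', 0.9]" />
```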

 

