Guide to: Smart Validators – Appen Success Center

Scale and grow your text utterance collection, AI chat feedback and Audio Transcription use cases by including smart validators that automatically detect, flag and reject invalid utterances before they can be submitted.

Spelling & Grammar and Regex Validation are available for cml:smart_text, cml:ai_chat_feedback and cml:audio_transcription.
Language detection, Gibberish detection and Duplicate detection are available for cml:smart_text, cml:text and cml:textarea.

Note

Smart Validators can be enabled on your Team account. Please reach out to your Customer Success Manager if you are interested in turning this feature on.

Glossary

Utterance - The piece of text to be collected from the contributor
Gibberish - Text that is illogical or incoherent
Prompt - Data provided to the contributors to give them guidance on what utterances to collect

Smart Validators

Smart validators can be accessed via the graphical editor, or specified in the cml via the code editor. Smart validators can also be used in conjunction with basic validators such as Word count (see this article and this article).

Example Job design - Code Editor:

<cml:text label="Sample text field:" validates="required unique:['within_job', 0.8] lang:['en', 0.10] gibberish:[0.10] wordCountMin:8" />
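If you generate CML for many fields, a small helper can keep the spacing and quoting of the validates value consistent. The Python sketch below is illustrative only: build_validates and cml_text are hypothetical helper names, not part of CML or any Appen tooling, and the rule strings are copied from the example above.

```python
from xml.sax.saxutils import quoteattr


def build_validates(*rules: str, required: bool = True) -> str:
    """Join validator rules into one space-separated validates value."""
    parts = ["required"] if required else []
    # Skip empty entries so stray blanks never produce double spaces.
    parts += [rule.strip() for rule in rules if rule.strip()]
    return " ".join(parts)


def cml_text(label: str, validates: str) -> str:
    """Render a cml:text tag; quoteattr adds the quotes and escapes &, < and >."""
    return f"<cml:text label={quoteattr(label)} validates={quoteattr(validates)} />"


# Rule strings taken from the example job design in this article.
rules = [
    "unique:['within_job', 0.8]",
    "lang:['en', 0.10]",
    "gibberish:[0.10]",
    "wordCountMin:8",
]
print(cml_text("Sample text field:", build_validates(*rules)))
```

The helper only concatenates strings; it does not check that a rule is supported for a given CML tag, so the availability notes at the top of this article still apply.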

Spelling and Grammar detection

validates="required smart_spelling_and_grammar:['en', 'xx-variety']"

When using Smart Text, Audio Transcription, or AI Chat Feedback in English, you can enable a grammar and spelling check for the input text. Our grammar and spelling check will catch issues with spelling, punctuation, agreement and conjugations.

To add these checks to your job:

Expand "Smart Validation"
Under Add Formatting Rule, choose "Spelling & Grammar Detection"

Select the language and the locale you want to target. Click Save and the grammar and spelling validation will be applied to the job. Currently the varieties of English shown in the screenshot below are supported.

Choosing "Don't suggest any changes based on English variety" will identify only those errors that would be errors in any of the supported varieties of English.

Spelling and grammar issues will be underlined in red as the contributor types.
By clicking on the underline, they will receive a suggestion to fix the issue.
To accept a suggestion, contributors will click the corrected form of the word. If contributors do not find the suggestions correct, they can click the trash icon to reject the suggestion and either leave the text as is, or make any manual corrections needed before going on to click "Submit" again.

enforced (optional, defaults to "false")
- By default contributors are not required to click on the underlines; i.e. they can submit their response without viewing, fixing or trashing the suggestions. This is to avoid the suggestions inadvertently slowing down throughput.
- Where corrections are of very high priority, you can enforce inspection of the suggestions by using the parameter enforced in the code editor.
- When enforced="true", contributors will receive the following message upon submission until they have clicked on each underline and accepted or rejected the suggestion.

Regex Detector

smart_regex:[['regex','error_description','fix_suggestion']]

Regex detection allows you to validate contributor input against any regular expression allowed in JavaScript (in cml:smart_text, cml:ai_chat_feedback or cml:audio_transcription jobs).
When a contributor's input contains text that matches the specified expression anywhere in the input, the matching text will be flagged.
To add this validator to your job:
- Expand "Smart Validation"
- Under Add Formatting Rule, choose "Regex Detection"
- Enter the regular expression the tool should be detecting
- Enter the error description that explains why the regular expression was flagged

- Optionally tick the "Enable fix suggestion" box to also suggest a correction for the error

- Test the regex before applying it to your job by using the "Test the rule" input box. If your inputted text doesn't match your regex, the message "Input does not match Regex" will be displayed

For the above example, the contributor will see a red underline where the regex matches.
When they click on the line they will see the message "fix this" with a trash icon. Clicking on the trash icon allows them to ignore the flag.
If a fix suggestion has been set, they will also see the fix suggestion:

It is possible to have multiple regex detection patterns and corrections running in the same job.
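To get a feel for how several regex rules behave on the same input before adding them to a job, you can prototype them offline. The sketch below is a rough approximation, not the platform's implementation: it uses Python's re module, whereas the validator uses JavaScript regular expressions, so only patterns whose syntax is shared by both will behave the same. RegexRule, find_flags and check_rule are hypothetical names, and the two sample rules are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class RegexRule(NamedTuple):
    pattern: str                          # the regular expression to detect
    error_description: str                # explanation of why the text was flagged
    fix_suggestion: Optional[str] = None  # optional correction


class Flag(NamedTuple):
    start: int
    end: int
    text: str
    error_description: str
    fix_suggestion: Optional[str]


def find_flags(text: str, rules: List[RegexRule]) -> List[Flag]:
    """Return every match of every rule, anywhere in the input, ordered by position."""
    flags = []
    for rule in rules:
        for match in re.finditer(rule.pattern, text):
            flags.append(Flag(match.start(), match.end(), match.group(0),
                              rule.error_description, rule.fix_suggestion))
    return sorted(flags, key=lambda flag: flag.start)


def check_rule(text: str, rule: RegexRule) -> str:
    """Loosely mimic the "Test the rule" box for a single rule."""
    if re.search(rule.pattern, text):
        return "Match"
    return "Input does not match Regex"


# Illustrative rules: one without a useful suggestion text, one with a correction.
sample_rules = [
    RegexRule(r"\s{2,}", "Use a single space", " "),
    RegexRule(r"\bteh\b", "fix this", "the"),
]

for flag in find_flags("I saw  teh cat", sample_rules):
    print(flag.start, flag.end, repr(flag.text), flag.error_description, flag.fix_suggestion)
```

Each flag keeps its own error description and suggestion, which mirrors how multiple detection patterns can run side by side in one job.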

Language Detector

required lang:['{language}', {threshold_number}]

This validator is used to ensure contributors are submitting text in the correct language. Learn more about our Language Detector model in this article.
The currently supported languages are:

English en
German de
French fr
Spanish es
Japanese ja
Portuguese pt
Italian it

You will need to provide a threshold that will be used to evaluate each contributor's submission. The lower the threshold, the more lenient the evaluation will be.
Note: you may only validate for one target language per field.

Fig 2. Language Detection Validator
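If you write the lang rule by hand in the code editor, it is easy to mistype a language code. The following Python sketch checks a code against the supported list above before producing the rule string. It is a hypothetical helper, not an Appen API; the 0 to 1 threshold range and the two-decimal formatting are assumptions based on the example values in this article.

```python
# Language codes listed as supported in this article.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "ja": "Japanese",
    "pt": "Portuguese",
    "it": "Italian",
}


def lang_rule(code: str, threshold: float) -> str:
    """Build a lang rule for exactly one target language."""
    if code not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language code {code!r}; expected one of: {supported}")
    # Assumption: thresholds are probabilities between 0 and 1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is assumed to lie between 0 and 1")
    return f"lang:['{code}', {threshold:.2f}]"


print("required " + lang_rule("de", 0.1))
```

Because only one target language can be validated per field, the helper accepts a single code rather than a list.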

Gibberish Detector

required gibberish:[{threshold_number}]

This validator is used to ensure contributors are submitting text that is cohesive and coherent. The model auto-detects the language they are typing in and then evaluates the probability that what they’re typing is valid text in that language. Learn more about our Gibberish Detector model in this article.
These validators work best on text longer than 10 words.
The currently supported languages are:

English en
German de
French fr
Spanish es
Japanese ja
Portuguese pt
Italian it

Note: You can use this in conjunction with other validators, but you may only set one threshold per individual field.

Fig 3. Gibberish Detection Validator

Duplicate Detection

The following validators give you the option of enforcing only unique submissions of text. This is helpful if you need many diverse examples of responses to the same prompt. Note that two of these validators also include a threshold parameter; see below for more detail.

Fig 4. Duplicate Detection Validator

In this job, across all contributors
- validates = "required unique:['within_job', {threshold_number}]"
- If your job is collecting multiple utterances per prompt and you want only unique utterances across the whole job, use this option.
In this job, across all contributors, across a unique prompt value
- validates = "required unique:['within_column_name']"
- If your job is collecting utterances for multiple prompts, and you want to ensure that you are getting unique responses for each prompt, you can use this validator to specify the column that will be validated on. This enforces unique utterances for each prompt, but there may be duplicate utterances across the whole dataset.
In this job, across the unique contributor's submissions
- validates = "required unique:['within_contributor', {threshold_number}]"
- If your job is collecting multiple utterances and you are interested in the range of likely utterances, or would like to understand the frequency of certain utterances across different contributors, you will use this setting. This ensures each contributor's submissions are unique, but there may be duplicate utterances across contributors.
In multiple jobs, across all contributors
- validates = "required unique:[{job_id}, {job_id}]"
- This validator allows you to compare against data collected for a completed job or jobs. You can use this validator if you want to collect additional unique utterances.

In the graphical editor, you will be asked to enter the job id(s) to compare against, separated by commas. All jobs to be compared should have identical cml.

Fig 5. Duplicate Detection Validator: Enter Job IDs
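If you prefer the code editor, the comma-separated job IDs entered in the graphical editor correspond to the unique:[{job_id}, {job_id}] form shown above. The Python sketch below converts one into the other. The function name is hypothetical, the job IDs are placeholders, and the check that IDs are purely numeric is an assumption.

```python
def cross_job_unique_rule(job_ids_text: str) -> str:
    """Turn "123, 456" into the rule string unique:[123, 456]."""
    ids = [part.strip() for part in job_ids_text.split(",") if part.strip()]
    if not ids:
        raise ValueError("at least one job id is required")
    for job_id in ids:
        # Assumption: job ids consist of digits only.
        if not job_id.isdigit():
            raise ValueError(f"job id {job_id!r} is not numeric")
    # Drop repeated ids while keeping the order in which they were entered.
    unique_ids = list(dict.fromkeys(ids))
    return "unique:[" + ", ".join(unique_ids) + "]"


# Placeholder job ids, for illustration only.
print('validates="required ' + cross_job_unique_rule("1234567, 7654321") + '"')
```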

Duplicate Detection Threshold

An additional duplicate detection setting is available in Text Collection jobs that are running in Quality Flow (to find out more about Quality Flow see this article and this article). This setting will allow you to specify a threshold at which utterances should be considered duplicates.

Learn more about the threshold settings in this article.

Duplicate detection threshold is available in Quality Flow work jobs for the following settings:

In this job, across all contributors
In this job, across the unique contributor's submissions

Fig 6. Duplicate threshold

The first submitted utterance will serve as the baseline for comparison to subsequent submissions. Upon submission, any utterance whose similarity meets or exceeds the specified threshold is considered a duplicate and will not be accepted, and the contributor will receive an error message as in the screenshot below.

Fig 7. Contributor view: duplicate detection threshold set to 0.9
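To build intuition for how a threshold separates duplicates from acceptable variations, the comparison can be simulated offline. The Python sketch below is only an approximation under stated assumptions: this article does not describe how the platform measures similarity, so the sketch substitutes difflib.SequenceMatcher from the Python standard library, and it assumes each new utterance is compared against every previously accepted one. The function names and sample utterances are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between 0 and 1 (not the platform's metric)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_duplicate(candidate: str, accepted: List[str], threshold: float) -> bool:
    """A candidate is a duplicate if it meets or exceeds the threshold against any accepted utterance."""
    return any(similarity(candidate, previous) >= threshold for previous in accepted)


def collect(utterances: List[str], threshold: float) -> Tuple[List[str], List[str]]:
    """Process submissions in order; the first utterance becomes the initial baseline."""
    accepted: List[str] = []
    rejected: List[str] = []
    for utterance in utterances:
        if is_duplicate(utterance, accepted, threshold):
            rejected.append(utterance)
        else:
            accepted.append(utterance)
    return accepted, rejected


submissions = [
    "Book a table for two",
    "book a table for two ",
    "What is the weather tomorrow?",
]
accepted, rejected = collect(submissions, threshold=0.9)
print("accepted:", accepted)
print("rejected:", rejected)
```

Running the same submissions at a few different thresholds is a quick way to see how the setting trades strictness against contributor friction, although real results will depend on the platform's own similarity measure.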