The model is essentially an unsupervised Markov model: it looks at character pairs to learn which characters frequently appear next to each other, and can then identify whether a given text falls within the range of coherent text or is gibberish.
The Markov model first “studies” examples of English text and records how often characters appear next to each other. For example, given the text “Appen helps train machine learning algorithm”, it sees the pairs Ap, pp, pe, en, n[space], [space]h, ... and counts them. After it has finished reading the training data, it normalizes the counts, so that each character ends up with a probability distribution over 27 follow-up characters (26 letters plus space) following the given initial.
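The counting-and-normalizing step can be sketched as follows. This is a minimal illustration, not the production implementation; the add-one smoothing (starting every count at 1 so unseen pairs never get probability zero) is an assumption not stated in the text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space

def train_bigram_model(corpus_lines):
    """Count adjacent character pairs in the training text and
    normalize each row into a probability distribution over the
    27 follow-up symbols."""
    # Add-one smoothing (an assumption): every pair starts at 1
    # so that unseen pairs never get probability zero.
    counts = {a: {b: 1.0 for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        # Lowercase and keep only the 27 modeled symbols.
        text = "".join(c for c in line.lower() if c in ALPHABET)
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    # Normalize: each initial character gets a distribution over follow-ups.
    prob = {}
    for a, row in counts.items():
        total = sum(row.values())
        prob[a] = {b: n / total for b, n in row.items()}
    return prob
```

After training, each `prob[a]` row sums to 1, giving the distribution of follow-up characters for the initial `a`.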
Given a string at inference time, the algorithm calculates a transition probability by multiplying the probabilities of the adjacent pairs of characters in that string. For the “Appen helps train machine learning algorithm” string, it would compute prob['a']['p'] * prob['p']['p'] * prob['p']['e'] * ... This probability can be read as the amount of ‘expectedness’ the model assigns to the string, according to the data it observed during training. If the amount of ‘expectedness’ is greater than a threshold, the text is classified as coherent; otherwise it is classified as gibberish.
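A minimal sketch of the scoring step, assuming a nested `prob[a][b]` table like the one described above. The per-pair geometric mean at the end is an assumption: the raw product shrinks quickly with string length, so some length normalization is needed for the score to be comparable against a fixed threshold.

```python
def coherence_score(text, prob):
    """Multiply the probabilities of adjacent character pairs, then
    take the per-pair geometric mean so that short and long strings
    are scored on the same scale (the normalization is an assumption)."""
    # Keep only characters the model knows about.
    text = "".join(c for c in text.lower() if c in prob)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # nothing to score
    p = 1.0
    for a, b in pairs:
        p *= prob[a][b]
    return p ** (1.0 / len(pairs))
```

The gibberish substrings in an utterance contain rare pairs with very low probabilities, so they drag the whole score down, which is why a partially gibberish utterance ends up below the threshold.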
If the threshold is set too high, the algorithm risks producing false positives: texts that are actually coherent are classified as gibberish. If the threshold is set too low, it risks producing false negatives: texts that are actually gibberish are classified as coherent.
In practice, it is usually better to let some gibberish slip through than to misclassify coherent data as gibberish and discard valuable information (i.e. to avoid false positives), so we recommend setting the threshold to a reasonably low value. The default is 0.50.
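A sketch of how the threshold might be applied. Only the 0.50 default comes from the text; the function name and its pairing with a normalized score are assumptions for illustration.

```python
DEFAULT_THRESHOLD = 0.50  # the default mentioned above

def classify(score, threshold=DEFAULT_THRESHOLD):
    """Label a coherence score. Lowering the threshold makes the
    classifier more permissive: fewer coherent texts are flagged as
    gibberish (false positives), at the cost of letting more actual
    gibberish through (false negatives)."""
    return "coherent" if score > threshold else "gibberish"
```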
When a text contains only one or two words, or is a short text with many brand/personal names (some brand names, such as “shapr” or “issuu”, do not follow normal spelling patterns), the algorithm may misclassify it as gibberish. If an utterance collection job expects contributors to mainly enter one or two words consisting of brand/personal names, then we recommend setting the threshold lower.
The coherence detector should also work when part of the utterance makes sense and part of it is gibberish (e.g. “enviar foto a bob dsfasfasdfasdf” or “usa el email sdfsdfsdfsd”). In such cases, the text is classified as not coherent.
Our model is based on a pre-trained fastText model; fastText is a lightweight library that allows users to learn text representations and text classifiers. Our model is trained on Wikipedia, Tatoeba and SETimes.
Details of the fastText model’s performance can be found here, and the model architecture can be found here. The paper describing fastText can be found here.
There are some edge cases where the algorithm doesn’t work well. For example, a text that contains only one or two words, or a short text with brand/personal names (e.g. some brand names contain foreign words, like “bosch”), may be detected as a language mismatch. As noted above, if an utterance collection job expects contributors to mainly enter one or two words consisting of brand/personal names, we recommend setting the threshold lower if you want highly confident answers.