The model is essentially an unsupervised Markov model: it looks at character pairs to learn which characters frequently appear next to each other, and can then identify whether a given text falls within the range of coherent text or is gibberish.
The Markov model first “studies” examples of English text and records how often characters appear next to each other. For example, given the text “Appen helps train machine learning algorithm”, it sees the pairs Ap, pp, pe, en, n[space], [space]h, ... and counts them. After it has finished reading the training data, it normalizes the counts, so that each character ends up with a probability distribution over 27 follow-up characters (26 letters plus space) following the given initial.
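The counting-and-normalizing step can be sketched as follows. This is a minimal illustration, not the production implementation; the add-one smoothing (starting every count at 1 so unseen pairs never get probability zero) is an assumption not stated in the text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space

def train_bigram_model(corpus_lines):
    """Count adjacent character pairs in the training text and
    normalize each row into a probability distribution over the
    27 follow-up symbols."""
    # Add-one smoothing (an assumption): every pair starts at 1
    # so that unseen pairs never get probability zero.
    counts = {a: {b: 1.0 for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        # Lowercase and keep only the 27 modeled symbols.
        text = "".join(c for c in line.lower() if c in ALPHABET)
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    # Normalize: each initial character gets a distribution over follow-ups.
    prob = {}
    for a, row in counts.items():
        total = sum(row.values())
        prob[a] = {b: n / total for b, n in row.items()}
    return prob
```

After training, each `prob[a]` row sums to 1, giving the distribution of follow-up characters for the initial `a`.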
Given a string at inference time, the algorithm calculates a transition probability by multiplying the probabilities of the adjacent pairs of characters in that string. For the “Appen helps train machine learning algorithm” string, it would compute prob['a']['p'] * prob['p']['p'] * prob['p']['e'] * ... This probability can be read as the amount of ‘expectedness’ the model assigns to the string, according to the data it observed during training. If the amount of ‘expectedness’ is greater than a threshold, the text is classified as coherent; otherwise it is classified as gibberish.
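A minimal sketch of the scoring step, assuming a nested `prob[a][b]` table like the one described above. The per-pair geometric mean at the end is an assumption: the raw product shrinks quickly with string length, so some length normalization is needed for the score to be comparable against a fixed threshold.

```python
def coherence_score(text, prob):
    """Multiply the probabilities of adjacent character pairs, then
    take the per-pair geometric mean so that short and long strings
    are scored on the same scale (the normalization is an assumption)."""
    # Keep only characters the model knows about.
    text = "".join(c for c in text.lower() if c in prob)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # nothing to score
    p = 1.0
    for a, b in pairs:
        p *= prob[a][b]
    return p ** (1.0 / len(pairs))
```

The gibberish substrings in an utterance contain rare pairs with very low probabilities, so they drag the whole score down, which is why a partially gibberish utterance ends up below the threshold.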
If the threshold is set too high, the algorithm risks producing false positives: texts that are actually coherent are classified as gibberish. If the threshold is set too low, it risks producing false negatives: texts that are actually gibberish are classified as coherent.
In practice, it is usually better to let some gibberish slip through than to misclassify coherent data as gibberish and discard valuable information (i.e. to avoid false positives), so we recommend setting the threshold to a reasonably low value. The default is 0.50.
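A sketch of how the threshold might be applied. Only the 0.50 default comes from the text; the function name and its pairing with a normalized score are assumptions for illustration.

```python
DEFAULT_THRESHOLD = 0.50  # the default mentioned above

def classify(score, threshold=DEFAULT_THRESHOLD):
    """Label a coherence score. Lowering the threshold makes the
    classifier more permissive: fewer coherent texts are flagged as
    gibberish (false positives), at the cost of letting more actual
    gibberish through (false negatives)."""
    return "coherent" if score > threshold else "gibberish"
```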
When a text contains only one or two words, or is a short text with many brand/personal names (some brand names, such as “shapr” or “issuu”, do not follow normal spelling patterns), the algorithm may misclassify it as gibberish. If an utterance collection job expects contributors to mainly enter one or two words consisting of brand/personal names, then we recommend setting the threshold lower.
The coherence detector should also work when part of the utterance makes sense and part of it is gibberish (e.g. “enviar foto a bob dsfasfasdfasdf” or “usa el email sdfsdfsdfsd”). In such cases, the text is classified as not coherent.
Our model is based on a pre-trained fastText model; fastText is a lightweight library that allows users to learn text representations and text classifiers. Our model is trained on Wikipedia, Tatoeba and SETimes.
Details of the fastText model’s performance can be found here, and the model architecture can be found here. The paper describing fastText can be found here.
There are some edge cases where the algorithm doesn’t work well. For example, a text that contains only one or two words, or a short text with brand/personal names (e.g. some brand names contain foreign words, like “bosch”), may be detected as a language mismatch. As noted above, if an utterance collection job expects contributors to mainly enter one or two words consisting of brand/personal names, we recommend setting the threshold lower if you want highly confident answers.