Overview
As part of the Pre-labeling (Beta) feature, Appen offers various machine learning models in the model template library to provide initial "best-guess" hypotheses for an annotation project. This feature is helpful for specialized annotation projects: providing contributors with model-predicted annotation hypotheses can dramatically cut down annotation time while maintaining, or even improving, annotation quality. Below you can learn more about the models available in the model template library.
Fig. 1: Model Template Library
Important note: This feature is part of our managed services offering; contact your Customer Success Manager for access or more information.
Blur Faces in Images
This model accepts URLs of images and returns URLs pointing to copies of the source images with any pictured faces blurred.
- The face encoding model uses Adam Geitgey’s face recognition library, which is built using dlib’s face recognition model.
- It was trained on a dataset containing about 3 million images of faces grouped by individual.
- The model achieves an accuracy of 99.38% on the standard Labeled Faces in the Wild dataset, which means that given two images of faces, it correctly predicts if the images are of the same person 99.38% of the time.
- To learn more about the model, please read this blog post.
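For context, here is a minimal local sketch of the same idea using the open-source face_recognition library named above. The hosted model additionally handles downloading the source image from its URL and returning a URL for the blurred copy; the file paths, function name, and blur radius below are illustrative assumptions.

```python
# Minimal sketch: detect faces with the face_recognition library and blur them with Pillow.
import face_recognition
from PIL import Image, ImageFilter

def blur_faces(input_path: str, output_path: str, blur_radius: int = 30) -> int:
    """Blur every detected face in the image, save a copy, and return the face count."""
    image = face_recognition.load_image_file(input_path)     # RGB numpy array
    face_locations = face_recognition.face_locations(image)  # (top, right, bottom, left) boxes

    result = Image.fromarray(image)
    for top, right, bottom, left in face_locations:
        face_region = result.crop((left, top, right, bottom))
        blurred = face_region.filter(ImageFilter.GaussianBlur(radius=blur_radius))
        result.paste(blurred, (left, top))

    result.save(output_path)
    return len(face_locations)

# Example: blur_faces("street_photo.jpg", "street_photo_blurred.jpg")
```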
Box and Transcribe Words
This model is designed to be used as part of a document transcription workflow with the following steps:
- A contributor draws bounding boxes around lines of text in an image
- Given a bounding box around a line of text, the model predicts the bounding box coordinates and transcriptions corresponding to each word in the line of text
- A contributor reviews the model’s predictions
In addition to input columns for the images and the text-line bounding boxes, this model takes an input for the language in which to make the transcription prediction. The accepted values are the following language codes (based on ISO 639-1): English ('en'), Spanish ('es'), German ('de'), French ('fr'), Italian ('it'), Portuguese ('pt'), Dutch ('nl'), Hebrew ('he'), Hungarian ('hu'), Swedish ('sv'), Norwegian Nynorsk ('nn'), Danish ('da'), Finnish ('fi'), Chinese ('zh'), Chinese Simplified ('zh-sim'), Chinese Traditional ('zh-tra'), Japanese ('jp'), Arabic ('ar'), Russian ('ru').
- The underlying optical character recognition (OCR) model is the Tesseract Open-Source OCR Engine.
- The model is designed to recognize printed text and is unlikely to work well on handwriting.
- Please refer to the model documentation for further support.
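To illustrate the per-word prediction step, here is a rough sketch using the Tesseract engine named above via the pytesseract wrapper. It is not the hosted template itself: the bounding-box field names, the mapping from the language codes above to Tesseract language packs, and the output structure are illustrative assumptions.

```python
# Sketch: crop a contributor-drawn text-line box and run word-level OCR with Tesseract.
import pytesseract
from PIL import Image

# Assumed mapping from a few of the accepted language codes to Tesseract language packs.
ISO_TO_TESSERACT = {"en": "eng", "es": "spa", "de": "deu", "fr": "fra", "it": "ita"}

def transcribe_line(image_path: str, line_box: dict, language: str = "en") -> list:
    """Return word-level boxes and transcriptions for one text-line bounding box."""
    image = Image.open(image_path)
    # line_box is assumed to hold absolute pixel coordinates of the contributor's box.
    crop = image.crop((line_box["x"], line_box["y"],
                       line_box["x"] + line_box["width"],
                       line_box["y"] + line_box["height"]))

    data = pytesseract.image_to_data(
        crop, lang=ISO_TO_TESSERACT[language], output_type=pytesseract.Output.DICT
    )

    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty detections
            words.append({
                "text": text,
                # Offset word boxes back into the full image's coordinate system.
                "x": line_box["x"] + data["left"][i],
                "y": line_box["y"] + data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return words
```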
Classify Images
This trainable model classifies images into user-defined categories. Once trained, it accepts images and returns classification predictions, along with the confidence of each prediction as a value between 0 and 1. The model is trained on whole-image class tags by providing positive examples of each class; the training data can also include negative examples, i.e. images that should not be assigned any class. For instructions on how to train and evaluate the model, refer to this article.
- This model was developed and is hosted by the IBM Watson Visual Recognition service.
- For more information, please visit the IBM Watson Visual Recognition service webpage here.
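Because the model is backed by the Watson Visual Recognition service, a trained custom classifier can also be queried directly with IBM's Python SDK. The sketch below illustrates the underlying service rather than the Appen template itself; the API key, service URL, file name, and classifier ID are placeholders.

```python
# Sketch: classify an image with a previously trained Watson Visual Recognition classifier.
from ibm_watson import VisualRecognitionV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")  # placeholder credentials
visual_recognition = VisualRecognitionV3(version="2018-03-19", authenticator=authenticator)
visual_recognition.set_service_url(
    "https://api.us-south.visual-recognition.watson.cloud.ibm.com"  # placeholder URL
)

with open("example.jpg", "rb") as images_file:
    result = visual_recognition.classify(
        images_file=images_file,
        classifier_ids=["my_custom_classifier_123"],  # hypothetical trained classifier ID
    ).get_result()

# Each predicted class comes with a confidence score between 0 and 1.
for image in result["images"]:
    for classifier in image["classifiers"]:
        for cls in classifier["classes"]:
            print(cls["class"], cls["score"])
```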
Detect Explicit Content
This model can be used to scan images for explicit or adult content and is helpful in monitoring chat, social media, and user-generated content. It accepts images and returns boolean values; if the value returned is “true,” the model has predicted that the image contains explicit content. The model also returns the confidence of its prediction as a value between 0 and 1 in the “explicit_confidence” output data column.
- This model was developed and is hosted by the IBM Watson Visual Recognition service.
- For more information, please visit the IBM Watson Visual Recognition service webpage here.
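As a small illustration of how the output columns described above might be consumed downstream, the sketch below filters pre-labeled rows by confidence. The "explicit_confidence" column is named in this article; the boolean "explicit" column name, the file name, and the 0.8 threshold are assumptions.

```python
# Sketch: route high-confidence explicit predictions to moderation,
# and send low-confidence ones to contributors for review.
import pandas as pd

rows = pd.read_csv("prelabeled_results.csv")  # hypothetical export of model predictions

needs_review = rows[(rows["explicit"] == True) & (rows["explicit_confidence"] < 0.8)]
auto_flagged = rows[(rows["explicit"] == True) & (rows["explicit_confidence"] >= 0.8)]

print(f"{len(auto_flagged)} images auto-flagged, {len(needs_review)} queued for human review")
```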
Identify Face Landmarks
This model accepts URLs of images and returns JSON strings containing the coordinates of predicted facial landmarks.
- The face encoding model uses Adam Geitgey’s face recognition library, which is built using dlib’s face recognition model.
- It was trained on a dataset containing about 3 million images of faces grouped by individual.
- The model achieves an accuracy of 99.38% on the standard Labeled Faces in the Wild dataset, which means that given two images of faces, it correctly predicts if the images are of the same person 99.38% of the time.
- To learn more about the model, please read this blog post.
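A minimal local sketch of the same idea with the open-source face_recognition library is shown below: face_landmarks() returns, for each detected face, a dictionary of named features (chin, left_eye, nose_tip, and so on) mapped to lists of (x, y) points, which can be serialized to a JSON string much like the model's output column. The file name is an illustrative assumption.

```python
# Sketch: predict facial landmarks locally and serialize them as a JSON string.
import json
import face_recognition

image = face_recognition.load_image_file("portrait.jpg")  # illustrative file name
landmarks = face_recognition.face_landmarks(image)        # one dict per detected face

print(json.dumps(landmarks))
```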
Label Pixels in Street Scene Images
This model generates pixel-level semantic segmentation masks for street scene images, such as images containing cars, trucks, buildings, pedestrians and signs. It can be useful for enhancing annotation efficiency and quality for autonomous vehicle and related use cases.
- The underlying model used is the High-Resolution Networks (HRNet) semantic segmentation model.
- It is trained on the Cityscapes dataset and achieves a mean intersection-over-union of 0.82.
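To make the reported metric concrete, mean intersection-over-union compares the predicted and ground-truth class label of every pixel, computes the overlap-over-union ratio per class, and averages across classes. A minimal sketch of the computation follows; the class IDs and tiny example masks are illustrative.

```python
# Sketch: mean intersection-over-union between a predicted and a ground-truth label mask.
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes for one pair of label masks."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(intersection / union)
    return float(np.mean(ious))

# Example with two tiny 2x3 masks over 3 classes (0 = road, 1 = car, 2 = pedestrian):
pred  = np.array([[0, 0, 1], [1, 2, 2]])
truth = np.array([[0, 0, 1], [2, 2, 2]])
print(mean_iou(pred, truth, num_classes=3))  # class IoUs: 1.0, 0.5, 0.67 -> ~0.72
```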
Segment Audio
This model can be used to classify periods of time in an audio file according to the type of sound they contain; pre-labeling with this model can streamline audio segmentation and transcription workflows. The model segments audio into the classes “speech”, “music”, “noise” and “silence”. It accepts audio files and returns the class labels, start times and end times of the identified segments.
- This model uses the SpeechSegmenter library, which was trained on INA’s Speaker Dictionary.
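The exact library release is not specified here; assuming it behaves like INA's open-source inaSpeechSegmenter package, the per-segment output (class label plus start and end time) can be reproduced locally roughly as follows. The mapping from the library's raw labels to the four classes listed above, and the audio file name, are assumptions.

```python
# Hedged sketch: segment an audio file into labeled time ranges with inaSpeechSegmenter.
from inaSpeechSegmenter import Segmenter

# Assumed mapping from the library's raw labels to the four classes listed above.
LABEL_MAP = {"male": "speech", "female": "speech", "music": "music",
             "noise": "noise", "noEnergy": "silence"}

segmenter = Segmenter()
segments = segmenter("interview.wav")  # returns (label, start_time, end_time) tuples

for label, start, end in segments:
    print(LABEL_MAP.get(label, label), round(start, 2), round(end, 2))
```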