
Guide to: The Model Template Library (Beta)

Overview

As part of the Pre-labeling (Beta) feature, Appen offers a library of machine learning model templates that provide initial "best-guess" annotation hypotheses for a project. Presenting contributors with model-predicted hypotheses can dramatically reduce annotation time while maintaining, or even improving, annotation quality. The models available in the model template library are described below.


Fig. 1: Model Template Library

Important note: This feature is part of our managed services offering; contact your Customer Success Manager for access or more information.

Blur Faces in Images 

This model accepts URLs of images and returns URLs pointing to copies of the source images with any pictured faces blurred. 

  • The face encoding model uses Adam Geitgey’s face recognition library, which is built using dlib’s face recognition model.
  • It was trained on a dataset containing about 3 million images of faces grouped by individual. 
  • The model achieves an accuracy of 99.38% on the standard Labeled Faces in the Wild dataset, which means that given two images of faces, it correctly predicts if the images are of the same person 99.38% of the time. 
  • To learn more about the model, please read this blog post. 
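
As a rough sketch of what this template does, the snippet below blurs rectangular face regions with Pillow. The boxes are assumed to arrive as (top, right, bottom, left) tuples, the format returned by face_recognition.face_locations(); the detection step itself is omitted here, and the helper name is illustrative.

```python
# Sketch: blur given face regions in an image with Pillow.
# Face boxes use the (top, right, bottom, left) convention of
# face_recognition.face_locations(); detection is not shown.
from PIL import Image, ImageFilter

def blur_faces(image: Image.Image, face_boxes) -> Image.Image:
    """Return a copy of `image` with each face box Gaussian-blurred."""
    out = image.copy()
    for top, right, bottom, left in face_boxes:
        region = out.crop((left, top, right, bottom))
        region = region.filter(ImageFilter.GaussianBlur(radius=12))
        out.paste(region, (left, top))
    return out
```

The blur radius is a tunable choice; a larger radius obscures identity more aggressively at the cost of a more visible edit.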

Box and Transcribe Words

This model is designed to be used as part of a document transcription workflow with the following steps: 

  1. A contributor draws bounding boxes around lines of text in an image 
  2. Given a bounding box around a line of text, the model predicts the bounding box coordinates and transcriptions corresponding to each word in the line of text 
  3. A contributor reviews the model’s predictions 

In addition to input columns for the images and the text-line bounding boxes, this model takes an input specifying the language of the transcription. The accepted values are two-letter ISO 639-1 language codes.

  • The following language codes are accepted:
    • 'af': 'Afrikaans', 'ar': 'Arabic', 'cs': 'Czech', 'da': 'Danish', 'de': 'German', 'en': 'English', 'el': 'Greek', 'es': 'Spanish', 'fi': 'Finnish', 'fr': 'French', 'ga': 'Irish', 'he': 'Hebrew', 'hi': 'Hindi', 'hu': 'Hungarian', 'id': 'Indonesian', 'it': 'Italian', 'ja': 'Japanese', 'ko': 'Korean', 'nn': 'Norwegian', 'nl': 'Dutch', 'pl': 'Polish', 'pt': 'Portuguese', 'ro': 'Romanian', 'ru': 'Russian', 'sv': 'Swedish', 'th': 'Thai', 'tr': 'Turkish', 'vi': 'Vietnamese', 'zh': 'Chinese', 'zh-sim': 'Chinese (Simplified)', 'zh-tra': 'Chinese (Traditional)'
  • The underlying optical character recognition (OCR) model is the Tesseract Open-Source OCR Engine.
  • The model is designed to recognize printed text and is unlikely to work well on handwriting.
  • Please refer to the model documentation for further support.  
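
The per-line step of the workflow can be sketched as follows. The SUPPORTED_LANGS subset and the crop_line helper are illustrative assumptions, not the template's actual API; the commented call shows how pytesseract would produce word-level boxes and transcriptions (note that pytesseract itself expects Tesseract's three-letter codes such as 'eng', not ISO 639-1).

```python
# Sketch of step 2: validate the language code, crop the contributor-drawn
# text-line box, and (hypothetically) hand the crop to Tesseract for
# word-level boxes and text. SUPPORTED_LANGS is an illustrative subset.
from PIL import Image

SUPPORTED_LANGS = {"en": "English", "de": "German", "fr": "French"}

def crop_line(image: Image.Image, box, lang: str) -> Image.Image:
    """Check the ISO 639-1 code and crop the text-line region.
    `box` is (left, top, right, bottom) in pixel coordinates."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return image.crop(box)

# With pytesseract and the Tesseract binary installed, the cropped line
# could then be passed to the OCR engine, e.g.:
#   data = pytesseract.image_to_data(line_img, lang="eng",
#                                    output_type=pytesseract.Output.DICT)
# which yields per-word bounding boxes and transcriptions for review.
```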

Identify Face Landmarks

This model accepts URLs of images and returns JSON strings containing the coordinates of predicted facial landmarks.

  • The face encoding model uses Adam Geitgey’s face recognition library, which is built using dlib’s face recognition model.
  • It was trained on a dataset containing about 3 million images of faces grouped by individual.
  • The model achieves an accuracy of 99.38% on the standard Labeled Faces in the Wild dataset, which means that given two images of faces, it correctly predicts if the images are of the same person 99.38% of the time. 
  • To learn more about the model, please read this blog post.  
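
A minimal sketch of the output shape: face_recognition.face_landmarks() returns, for each face, a dict mapping feature names (e.g. 'left_eye', 'nose_tip') to lists of (x, y) points, which serializes naturally to a JSON string like the one this template returns. The coordinate values below are made up for illustration.

```python
# Sketch: serialize facial landmarks to a JSON string. The dict shape
# mirrors face_recognition.face_landmarks() output; values are invented.
import json

def landmarks_to_json(landmarks: dict) -> str:
    """Serialize {feature: [(x, y), ...]} to a JSON string."""
    return json.dumps(
        {name: [list(pt) for pt in pts] for name, pts in landmarks.items()}
    )

example = {"left_eye": [(38, 44), (41, 42)], "nose_tip": [(50, 60)]}
```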

Label Pixels in Street Scene Images 

This model generates pixel-level semantic segmentation masks for street scene images, such as images containing cars, trucks, buildings, pedestrians and signs. It can be useful for enhancing annotation efficiency and quality for autonomous vehicle and related use cases. 

Segment Audio

This model classifies periods of time in audio according to the sound within them; pre-labeling with this model can streamline audio segmentation and transcription workflows. The model segments audio into the classes speech, music, noise, and silence. It accepts audio files and returns the class label, start time, and end time of each identified segment.
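
Downstream code might consume the returned segments along these lines. The exact output format is an assumption here: each segment is taken to carry a class label plus start and end times in seconds.

```python
# Sketch: summarize the model's segment output per class.
# The Segment shape (label, start, end in seconds) is an assumption.
from collections import defaultdict, namedtuple

Segment = namedtuple("Segment", ["label", "start", "end"])

def duration_by_class(segments) -> dict:
    """Total seconds of audio assigned to each class label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.label] += seg.end - seg.start
    return dict(totals)
```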

Appen Automatic Speech Recognition (ASR) - (en-us)

This model is trained to provide a hypothesized transcription of input speech audio. The input should be US English sampled at 16 kHz.

  • This model and ASR service were developed based on the Kaldi toolkit.
  • This model was trained on the LibriSpeech corpus and Appen's OTS US English ASR corpora.
  • It achieves Word Error Rates of 5.01% on the LibriSpeech "clean" test set; 14.49% on a LibriSpeech test set containing more challenging speech; 4.92% on the Appen OTS USE-ASR001 test set (read sentences); and 23.18% on the Appen OTS USE-ASR003 test set (conversational talk-show speech).
  • Variation in the acoustic properties of the input audio, such as noise level, channel type, accent, speaking style, and speaker overlap, may negatively affect accuracy.
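
Since the model expects 16 kHz input, it can be worth verifying the sample rate of WAV files before launching a job. A minimal standard-library check (the helper name is illustrative; actual resampling would be done with a tool such as ffmpeg or librosa):

```python
# Sketch: confirm a WAV file matches the expected sample rate before
# submitting it for ASR pre-labeling. Resampling itself is out of scope.
import wave

def check_sample_rate(path: str, expected_hz: int = 16000) -> bool:
    """Return True if the WAV file at `path` is sampled at `expected_hz`."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == expected_hz
```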

Appen Automatic Speech Recognition (ASR) - (en-uk)

This model is trained to provide a hypothesized transcription of input speech audio. The input should be UK English sampled at 8 kHz.

  • This model and ASR service were developed based on the Kaldi toolkit.
  • This model was trained using Appen's OTS UK English ASR corpora, Voxforge UK English, and OpenSLR-83.
  • This model can achieve a Word Error Rate of around 20% on call center conversational UK English speech test data.
  • Variation in the acoustic properties of the input audio, such as noise level, channel type, accent, speaking style, and speaker overlap, may negatively affect accuracy.
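
The Word Error Rates quoted for both ASR templates are word-level edit distance, i.e. the number of substitutions, insertions, and deletions in the hypothesis divided by the number of reference words. A minimal reference implementation:

```python
# Word Error Rate: edit distance over word tokens, divided by the
# reference length, computed with classic dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a single dropped word in a six-word reference yields a WER of 1/6, about 16.7%.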
