Overview

Features

Low Prices
ASR is billed on a pay-as-you-go basis at less than 0.2 USD per audio hour, keeping your recognition costs low.
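
The following minimal sketch estimates an upper bound on recognition cost from the "less than 0.2 USD per audio hour" figure above. The function name and the flat rate are assumptions for illustration only; actual billing tiers and rounding rules are not described here.

```python
# Upper-bound cost estimate, assuming a flat rate of 0.2 USD per audio hour.
# The real price is stated only as "below 0.2 USD", so the true cost is lower.
ASSUMED_MAX_RATE_USD_PER_HOUR = 0.2


def max_cost_usd(audio_seconds: float,
                 rate_per_hour: float = ASSUMED_MAX_RATE_USD_PER_HOUR) -> float:
    """Return the maximum cost in USD for the given amount of audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    audio_hours = audio_seconds / 3600
    return audio_hours * rate_per_hour


if __name__ == "__main__":
    # 1,000 hours of call recordings
    print(f"At most {max_cost_usd(1000 * 3600):.2f} USD")
```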

More Languages Supported

High Accuracy
ASR runs on the same services that power speech-to-text conversion in WeChat and Honor of Kings, which deliver an industry-leading word recognition accuracy of 97%.

Powerful Algorithms
Built on the innovative TLC-BLSTM network structure, ASR uses an attention mechanism to model speech signals effectively and a teacher-student approach to improve system robustness. This delivers industry-leading recognition accuracy and efficiency across diverse scenarios in both general and vertical fields.

Self-Service Accuracy Improvement

Wide Scenario Support
ASR has been fully verified in Alto’s internal high-traffic products such as WeChat, Alto Video, and Honor of Kings. It has also been optimized on massive amounts of data for diverse scenarios in the internet, finance, and education sectors, and the resulting best practices are available to many industries.
Scenarios

Call Quality Inspection
Call quality inspection at call centers has traditionally relied on random spot checks because of labor and cost constraints, which makes it difficult to assess the performance of customer service reps. ASR can recognize call recordings, convert them to text, and analyze the text in real time to identify non-compliant calls. This greatly improves performance management at call centers, makes large-scale inspection of call recordings feasible where manual review is not, and ultimately raises the service quality of call center staff.

Short Video Subtitling
In user-generated short video (UGSV) scenarios, users talk while shooting videos and generally need to edit the videos and add subtitles manually before posting them. The real-time speech recognition feature of ASR can generate subtitles directly while users are talking, which significantly reduces post-processing costs and lets users post videos immediately after creating them.

Video Understanding
Live streaming and audio-sharing platforms host large volumes of audio and video that must be understood for quality inspection, tagging, and recommendation, which is difficult to do manually. ASR can transcribe standalone audio as well as the audio streams in videos using its audio/video transcription model. It meets the differing latency requirements of different input sources and helps platform staff quickly understand large volumes of content, which substantially reduces labor costs and speeds up quality inspection, tagging, and recommendation.
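
Both the call quality inspection and video understanding scenarios above come down to analyzing recognized text once it is available. The following minimal sketch shows one way that step might look, assuming transcripts have already been produced and are held as plain strings. The phrase list, the `Finding` type, and the function names are hypothetical and are not part of ASR; a production system would likely use richer rules than exact phrase matching.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical rule set: phrases a reviewer would want surfaced.
FLAGGED_PHRASES = ["guaranteed returns", "no refund"]


@dataclass
class Finding:
    record_id: str   # identifier of the call recording or video
    phrase: str      # the rule that matched
    offset: int      # character position of the match in the transcript


def inspect_transcript(record_id: str, text: str,
                       phrases: Sequence[str] = FLAGGED_PHRASES) -> List[Finding]:
    """Find every case-insensitive occurrence of each phrase in one transcript."""
    lowered = text.lower()
    findings = []
    for phrase in phrases:
        needle = phrase.lower()
        start = lowered.find(needle)
        while start != -1:
            findings.append(Finding(record_id, phrase, start))
            start = lowered.find(needle, start + 1)
    return findings


def inspect_all(transcripts: Dict[str, str]) -> Dict[str, List[Finding]]:
    """Return only the records that contain at least one flagged phrase."""
    flagged = {}
    for record_id, text in transcripts.items():
        findings = inspect_transcript(record_id, text)
        if findings:
            flagged[record_id] = findings
    return flagged


if __name__ == "__main__":
    sample = {
        "call-001": "We offer Guaranteed Returns, really guaranteed returns.",
        "call-002": "Thank you for calling. How can I help?",
    }
    for record_id, findings in inspect_all(sample).items():
        print(record_id, [(f.phrase, f.offset) for f in findings])
```

Because every transcript is scanned rather than a random sample, a reviewer only needs to look at the records this step returns.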