Amazon Transcribe, Microsoft Azure Speech to Text and Google Cloud Speech-to-Text enable developers to create dictation applications that can automatically generate transcriptions for audio files, as well as captions for video files. Development teams can weave these capabilities into timesaving apps for a range of uses, including call center analytics, video indexing services, web conference indexing and business transcription workflows.

These speech-to-text services, part of the AI portfolios that public cloud providers continue to build out, are still in their early days. But they are gaining new capabilities, such as enhanced and automated punctuation, and will likely improve further as providers develop more accurate speech processing models.

The biggest benefit of these services, which the cloud providers deliver as APIs, is their ability to integrate with the broader platform of tools and services on which they run. However, the services also differ in some important ways.

Here’s a closer look at these speech-to-text services from AWS, Microsoft and Google and some of the key features they offer.

Azure Speech to Text

One of the strengths of Microsoft Azure Speech to Text is its support for custom language and acoustic models, which enables developers to customize speech recognition for a particular environment. A custom language model, for example, could improve transcription accuracy for a regional dialect, while a custom acoustic model could improve accuracy for a headset used in a call center. However, Microsoft charges an additional fee for the use of these custom models.

Developers can also code applications to deliver recognition results in real time; this could enable an application to give users feedback to speak more clearly or to pause when their words are not being properly recognized.

Developers can call Azure Speech to Text from any app through its REST API. In addition, Microsoft provides several client libraries that simplify integration with apps written in C#, Java, JavaScript and Objective-C. In some cases, client apps use the WebSocket protocol to improve performance. The service currently supports 29 languages, as well as the WAV and Opus audio formats.
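As a rough illustration of the REST route, the sketch below builds, but does not send, a recognition request using only Python's standard library. The endpoint pattern, the header names and the content-type strings reflect the short-audio REST interface as commonly documented and should be checked against Microsoft's current reference; the region, key and function name are placeholders invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Content types for the two formats named in the article (WAV and Opus).
# Assumption: 16 kHz PCM for WAV and Opus in an Ogg container.
CONTENT_TYPES = {
    "wav": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "opus": "audio/ogg; codecs=opus",
}


def build_azure_request(audio_bytes, region, subscription_key,
                        language="en-US", audio_format="wav"):
    """Build (but do not send) a speech recognition request.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if audio_format not in CONTENT_TYPES:
        raise ValueError(f"unsupported audio format: {audio_format!r}")
    # Assumed endpoint pattern; verify against the current Azure docs.
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1?"
        + urlencode({"language": language, "format": "detailed"})
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": CONTENT_TYPES[audio_format],
        "Accept": "application/json",
    }
    return Request(url, data=audio_bytes, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit the audio; that step is left out here because it requires a valid subscription key.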

Amazon Transcribe

Amazon Transcribe enables developers to submit audio from any device, via a standard REST interface, in several formats, including WAV, MP3, MP4 and FLAC. Additionally, Amazon offers software development kits (SDKs) that simplify use of the transcription service from .NET, Go, Java, JavaScript, PHP, Python and Ruby.
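With the Python SDK (boto3), submitting a file might look roughly like the sketch below. The parameter names follow boto3's `start_transcription_job` call as commonly documented; the helper names and the bucket path in the usage note are made up for illustration, and the format check covers only the four formats listed above.

```python
import os

# Formats the article lists as accepted by Amazon Transcribe.
SUPPORTED_FORMATS = {"wav", "mp3", "mp4", "flac"}


def build_transcription_job(job_name, media_uri, language_code="en-US"):
    """Return keyword arguments for boto3's start_transcription_job.

    Hypothetical helper: it only assembles and validates the parameters.
    """
    # Derive the media format from the file extension of the URI.
    media_format = os.path.splitext(media_uri)[1].lstrip(".").lower()
    if media_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {media_format!r}")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},
    }


def submit(params):
    """Send the job; requires the boto3 package and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3
    return boto3.client("transcribe").start_transcription_job(**params)
```

For example, `submit(build_transcription_job("call-001", "s3://example-bucket/calls/call-001.mp3"))` would start an asynchronous job, where `example-bucket` stands in for a real S3 bucket.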

Amazon’s offering automatically recognizes multiple speakers and can provide a timestamp, which makes it easier for users to locate the audio or video segment associated with a specific sentence. However, the service currently supports only English and Spanish.
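To show how speaker labels and timestamps can be used together, the sketch below groups recognized words into per-speaker turns. The `SAMPLE` dictionary is a hand-written fragment shaped like the JSON that Amazon Transcribe is understood to return when speaker identification is requested (times as strings, punctuation items without timestamps); the exact layout should be confirmed against real output, and `speaker_turns` is an illustrative helper, not part of any SDK.

```python
# Hand-written sample, not real service output.
SAMPLE = {
    "results": {
        "speaker_labels": {
            "segments": [
                {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "1.2"},
                {"speaker_label": "spk_1", "start_time": "1.5", "end_time": "2.9"},
            ]
        },
        "items": [
            {"type": "pronunciation", "start_time": "0.0", "end_time": "0.4",
             "alternatives": [{"content": "Hello"}]},
            {"type": "pronunciation", "start_time": "0.5", "end_time": "0.9",
             "alternatives": [{"content": "there"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
            {"type": "pronunciation", "start_time": "1.5", "end_time": "1.8",
             "alternatives": [{"content": "Hi"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
        ],
    }
}


def speaker_turns(transcript):
    """Return (speaker, start, end, text) tuples, one per speaker segment."""
    results = transcript["results"]
    turns = [
        {"speaker": seg["speaker_label"], "start": float(seg["start_time"]),
         "end": float(seg["end_time"]), "words": []}
        for seg in results["speaker_labels"]["segments"]
    ]
    current = None
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation carries no timestamp; attach it to the previous word.
            if current is not None and current["words"]:
                current["words"][-1] += content
            continue
        start = float(item["start_time"])
        # Assign the word to the segment whose time range contains its start.
        current = next(
            (t for t in turns if t["start"] <= start <= t["end"]), None)
        if current is not None:
            current["words"].append(content)
    return [(t["speaker"], t["start"], t["end"], " ".join(t["words"]))
            for t in turns]
```

Calling `speaker_turns(SAMPLE)` yields one tuple per turn, so an application can jump straight to the start time of the sentence a user is looking for.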

Google Cloud Speech-to-Text

Google has updated its speech-to-text engine to process both short audio snippets for voice interfaces and longer audio for transcription. The service can transcribe 120 languages in real time or from prerecorded audio files. It also includes a new proper noun processing engine that improves formatting for words that involve company or celebrity names.


AWS, Microsoft and Google all provide a free tier that lets developers kick the tires on these speech-to-text services for a limited number of minutes or hours per month.

From there, Azure Speech to Text costs $0.50 per hour and Amazon Transcribe costs $1.44 per hour, while Google Cloud Speech-to-Text costs $1.44 per hour for audio and $2.88 per hour for video. Although Azure is the cheapest, Microsoft offers various ancillary services, such as speaker identification and audio analysis, as paid add-ons, while the other providers include these features in the base price.
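For a quick comparison at a given monthly volume, the hourly rates quoted above can be dropped into a small calculator. This is only a sketch: the rates are the ones cited in this article and may have changed, the free-tier allowance is left as a parameter because the exact amounts are not given here, and the paid add-ons Microsoft charges for are not modeled.

```python
from decimal import Decimal

# Hourly rates in U.S. dollars, as quoted in the article.
HOURLY_RATES = {
    "azure": Decimal("0.50"),
    "amazon": Decimal("1.44"),
    "google-audio": Decimal("1.44"),
    "google-video": Decimal("2.88"),
}


def monthly_cost(service, hours, free_hours=0):
    """Estimate the monthly bill for `hours` of audio.

    `free_hours` is an assumed free-tier allowance; it defaults to 0
    because the article does not state the actual amounts.
    """
    billable = max(Decimal(str(hours)) - Decimal(str(free_hours)), Decimal(0))
    return (HOURLY_RATES[service] * billable).quantize(Decimal("0.01"))


def compare(hours, free_hours=0):
    """Return (service, cost) pairs sorted from cheapest to most expensive."""
    costs = {name: monthly_cost(name, hours, free_hours) for name in HOURLY_RATES}
    return sorted(costs.items(), key=lambda pair: pair[1])
```

At 100 hours per month with no free allowance, this works out to $50.00 for Azure, $144.00 for Amazon Transcribe and Google audio, and $288.00 for Google video.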

Google Cloud Speech-to-Text also supports several prebuilt transcription models tuned to improve accuracy for specific use cases, such as phone calls, video recordings or professionally recorded video. It supports audio formats such as FLAC, AMR, PCMU and WAV. SDKs are available for C#, Go, Java, Node.js, PHP, Python and Ruby. Google has also optimized the service to transcribe noisy audio without requiring additional noise cancellation.
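Choosing a prebuilt model and an audio encoding comes down to a few fields in the request. The sketch below assembles a JSON body in the shape of Google's v1 REST `speech:recognize` method as commonly documented; the field names, model identifiers and the mapping from file format to encoding are assumptions to verify against Google's reference, and the helper name is invented for this example.

```python
import base64

# Assumed mapping from the formats named in the article to encoding values.
ENCODINGS = {"flac": "FLAC", "wav": "LINEAR16", "amr": "AMR", "pcmu": "MULAW"}

# Assumed identifiers for the prebuilt models.
MODELS = {"default", "phone_call", "video", "command_and_search"}


def build_recognize_body(audio_bytes, audio_format, language_code="en-US",
                         model="default", sample_rate_hertz=None):
    """Return a dict that can be serialized as the JSON request body."""
    if audio_format not in ENCODINGS:
        raise ValueError(f"unsupported audio format: {audio_format!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    config = {
        "encoding": ENCODINGS[audio_format],
        "languageCode": language_code,
        "model": model,
        "enableAutomaticPunctuation": True,
    }
    # Headerless formats such as AMR and PCMU need an explicit sample rate;
    # FLAC and WAV files carry it in their headers.
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    return {
        "config": config,
        # Inline audio is sent base64-encoded.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
```

A call center recording, for instance, might use `build_recognize_body(data, "pcmu", model="phone_call", sample_rate_hertz=8000)`.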

Quality still a factor

For the moment, these speech-to-text services are likely to complement — rather than replace — other input modalities. Still, they can provide value, especially by indexing large blocks of audio for compliance and customer service purposes or automatically generating captions for audio and video streams.

In cases where accuracy is paramount, developers should bake these tools into workflows that complement human transcribers. Developers can also use recording samples from existing sources to test the accuracy of these engines — similar to an approach taken by Florida Institute of Technology researchers who developed a tool to analyze the quality of the different cloud speech engines.
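One common way to score such a test is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the engine's output into a human-verified reference transcript, divided by the length of the reference. The sketch below is a minimal implementation; WER is offered here as a widely used metric, not necessarily the one the Florida Institute of Technology tool applies, and the normalization (lowercasing and whitespace splitting only) is deliberately simple.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Running the same recordings through each service and comparing the resulting scores gives a like-for-like view of accuracy on an organization's own audio.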