🏆 Speech Recognition Leaderboard

This leaderboard compares the best automatic speech recognition (ASR) models and APIs across real-world conditions like background noise, non-native accents, and technical vocabulary.

We built this benchmark to find the best speech recognition API for Voice Writer — an AI writing tool that transcribes speech and corrects grammar and wording in real time. Only real-world results matter here, so this benchmark is fully independent and unbiased.

System              Mean WER   Std Dev   Price (per hour)
GPT-4o Transcribe   11.9%      4.3%      $0.36
ElevenLabs          14.8%      6.6%      $0.35
Whisper Large       14.9%      5.2%      Local
Gemini 1.5 Pro      15.7%      7.2%      $0.11
Gemini 2.0 Flash    15.8%      6.5%      $0.06
AssemblyAI          17.5%      5.1%      $0.37
Speechmatics        17.6%      5.8%      $1.04
Deepgram            17.7%      6.8%      $0.26
AWS Transcribe      18.8%      5.8%      $1.44
Microsoft           20.0%      6.0%      $0.18
Whisper Small       20.1%      6.0%      Local
Google Speech       31.4%      6.7%      $0.96

Speech Conditions

Many existing benchmarks use unrealistic language (e.g., LibriSpeech) or very short clips (e.g., Common Voice), or are so widely used that API providers are likely to overfit to them. This makes them less reliable for real-world evaluation, especially for longer audio.

To create a fair and relevant comparison for applications like Voice Writer, we built a custom dataset covering four categories:

  • Clean Speech: High-quality TED Talk clips from native English speakers.
  • Noisy Speech: The clean TED clips with added hospital background noise to simulate real-world environments (a mixing sketch follows this list).
  • Accented Speech: Wikipedia readings by speakers with non-native English accents (such as Chinese and Indian accents).
  • Specialist Speech: Technical abstracts from recent machine learning and physics papers, synthesized by a text-to-speech (TTS) system.
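
As a rough illustration of how the noisy-speech category can be produced, the snippet below mixes a background-noise track into a clean clip at a fixed signal-to-noise ratio. The file names and the 10 dB SNR are illustrative assumptions, not the benchmark's exact settings.

```python
# Minimal sketch: mix a noise track into a clean clip at a target SNR.
# Paths and the 10 dB SNR are placeholders, not the benchmark's exact setup.
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, noise_path, out_path, snr_db=10.0):
    clean, sr = sf.read(clean_path)
    noise, noise_sr = sf.read(noise_path)
    assert sr == noise_sr, "resample the noise track to match the clip first"

    # Loop or trim the noise so it covers the whole clip.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    # Scale the noise so the clean/noise power ratio equals the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    sf.write(out_path, clean + noise, sr)

mix_at_snr("ted_clip.wav", "hospital_noise.wav", "ted_clip_noisy.wav")
```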

Each clip is 1–2 minutes long, allowing us to evaluate not only word recognition but also punctuation and capitalization over longer passages.

Evaluation Metrics

We measured performance using two types of Word Error Rate (WER):

  1. Formatted WER (also called unnormalized WER) — errors in capitalization, punctuation, and formatting count against the model. This produces higher WER values and is the more relevant metric for most writing-related applications.
  2. Raw WER (also called normalized WER) — punctuation and capitalization are stripped, and spelling and number variations are standardized (e.g., "color" vs "colour", "two" vs "2"). This metric is more relevant for machine-processing use cases where formatting is not required. A small sketch of both computations follows this list.
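
Here is a minimal sketch of the two metrics using the jiwer package. The normalizer below only lowercases and strips punctuation; the benchmark's raw WER additionally standardizes spellings and numbers (for example via a fuller normalizer such as Whisper's EnglishTextNormalizer), so treat the exact normalization shown here as an assumption.

```python
# Sketch of formatted vs raw WER with jiwer. The simple normalizer only
# lowercases and removes punctuation; spelling/number standardization
# ("color"/"colour", "two"/"2") would need a fuller text normalizer.
import re
import jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)      # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

reference = "Dr. Smith prescribed 2 mg of Adenosine."
hypothesis = "dr smith prescribed two milligrams of adenosine"

formatted_wer = jiwer.wer(reference, hypothesis)                   # case/punctuation count as errors
raw_wer = jiwer.wer(normalize(reference), normalize(hypothesis))   # formatting stripped first

print(f"Formatted WER: {formatted_wer:.2f}, Raw WER: {raw_wer:.2f}")
```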

We also distinguish between batch and streaming transcription on the leaderboard:

  • Batch — the full audio file is uploaded and transcribed at once.
  • Streaming — audio is streamed to the provider in real time, and transcripts are generated incrementally. Streaming models typically have lower accuracy than batch models because they see only limited context; the sketch below illustrates the difference.
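
To make the distinction concrete, here is a rough sketch using the open-source whisper package. Real streaming APIs receive audio over a persistent connection; the "streaming" branch below simply cuts the file into independent 10-second chunks, which mimics the reduced context a streaming model works with. The file name and chunk length are illustrative assumptions.

```python
# Batch vs (simulated) streaming transcription with open-source Whisper.
# Streaming is approximated by transcribing fixed-size chunks independently.
import whisper

model = whisper.load_model("small")
audio = whisper.load_audio("ted_clip.wav")   # 16 kHz mono numpy array
sr = whisper.audio.SAMPLE_RATE               # 16000

# Batch: the whole clip is transcribed with full context.
batch_text = model.transcribe(audio)["text"]

# "Streaming": 10-second chunks transcribed one at a time, no shared context.
chunk_len = 10 * sr
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
streaming_text = " ".join(model.transcribe(chunk)["text"].strip() for chunk in chunks)

print("Batch:    ", batch_text)
print("Streaming:", streaming_text)
```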

More Technical Details

For a deeper dive into the benchmark setup, audio samples, and methodology, see this blog post or this video.

Note: the blog and video reflect an earlier version of the benchmark from January 2025, so some data, evaluation methods, and models may differ.

For questions or inquiries, please contact bai@voicewriter.io.

Try Voice Writer today!

© 2025 Efficient NLP. All rights reserved.