🏆 Speech Recognition Leaderboard

This leaderboard compares the best automatic speech recognition (ASR) models and APIs across real-world conditions like background noise, non-native accents, and technical vocabulary.

We built this benchmark to find the best speech recognition API for Voice Writer — an AI writing tool that transcribes speech and corrects grammar and wording in real time. Only real-world results matter here, so this benchmark is fully independent and unbiased.

System              Mean WER   Std Dev   Price (per hour)
GPT-4o Transcribe   11.9%      4.3%      $0.36
ElevenLabs          14.8%      6.6%      $0.35
Whisper Large       14.9%      5.2%      Local
Gemini 1.5 Pro      15.7%      7.2%      $0.11
Gemini 2.0 Flash    15.8%      6.5%      $0.06
AssemblyAI          17.5%      5.1%      $0.37
Speechmatics        17.6%      5.8%      $1.04
Deepgram            17.7%      6.8%      $0.26
AWS Transcribe      18.8%      5.8%      $1.44
Microsoft           20.0%      6.0%      $0.18
Whisper Small       20.1%      6.0%      Local
Google Speech       31.4%      6.7%      $0.96

Speech Conditions

Many existing benchmarks use unrealistic language (e.g., LibriSpeech) or very short clips (e.g., Common Voice), or are so widely used that API providers are likely to overfit to them. This makes them less reliable for real-world evaluation, especially for longer audio.

To create a fair and relevant comparison for applications like Voice Writer, we built a custom dataset covering four categories:

  • Clean Speech: High-quality TED Talk clips from native English speakers.
  • Noisy Speech: The clean TED clips with added hospital background noise to simulate real-world environments (a mixing sketch follows this list).
  • Accented Speech: Wikipedia readings by speakers with non-native English accents (such as Chinese and Indian accents).
  • Specialist Speech: Technical abstracts from recent machine learning and physics papers, synthesized by a text-to-speech (TTS) system.
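
As a rough illustration of how the noisy-speech category can be produced, the snippet below mixes a background-noise track into a clean clip at a fixed signal-to-noise ratio. The file names and the 10 dB SNR are illustrative assumptions, not the benchmark's exact settings.

```python
# Minimal sketch: mix a noise track into a clean clip at a target SNR.
# Paths and the 10 dB SNR are placeholders, not the benchmark's exact setup.
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, noise_path, out_path, snr_db=10.0):
    clean, sr = sf.read(clean_path)
    noise, noise_sr = sf.read(noise_path)
    assert sr == noise_sr, "resample the noise track to match the clip first"

    # Loop or trim the noise so it covers the whole clip.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    # Scale the noise so the clean/noise power ratio equals the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    sf.write(out_path, clean + noise, sr)

mix_at_snr("ted_clip.wav", "hospital_noise.wav", "ted_clip_noisy.wav")
```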

Each clip is 1–2 minutes long, allowing us to evaluate not only word recognition but also punctuation and capitalization over longer passages.

Evaluation Metrics

We measured performance using two types of Word Error Rate (WER):

  1. Formatted WER (also called unnormalized WER) — errors in capitalization, punctuation, and formatting count against the model. This produces higher WER values and is the more relevant metric for most writing-related applications.
  2. Raw WER (also called normalized WER) — punctuation and capitalization are stripped, and spelling and number variations are standardized (e.g., "color" vs "colour", "two" vs "2"). This metric is more relevant for machine-processing use cases where formatting is not required. A small sketch of both computations follows this list.
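
Here is a minimal sketch of the two metrics using the jiwer package. The normalizer below only lowercases and strips punctuation; the benchmark's raw WER additionally standardizes spellings and numbers (for example via a fuller normalizer such as Whisper's EnglishTextNormalizer), so treat the exact normalization shown here as an assumption.

```python
# Sketch of formatted vs raw WER with jiwer. The simple normalizer only
# lowercases and removes punctuation; spelling/number standardization
# ("color"/"colour", "two"/"2") would need a fuller text normalizer.
import re
import jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)      # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

reference = "Dr. Smith prescribed 2 mg of Adenosine."
hypothesis = "dr smith prescribed two milligrams of adenosine"

formatted_wer = jiwer.wer(reference, hypothesis)                   # case/punctuation count as errors
raw_wer = jiwer.wer(normalize(reference), normalize(hypothesis))   # formatting stripped first

print(f"Formatted WER: {formatted_wer:.2f}, Raw WER: {raw_wer:.2f}")
```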

We also distinguish between batch and streaming transcription on the leaderboard:

  • Batch — the full audio file is uploaded and transcribed at once.
  • Streaming — audio is streamed to the provider in real time, and transcripts are generated incrementally. Streaming models typically have lower accuracy than batch models because they see only limited context; the sketch below illustrates the difference.
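
To make the distinction concrete, here is a rough sketch using the open-source whisper package. Real streaming APIs receive audio over a persistent connection; the "streaming" branch below simply cuts the file into independent 10-second chunks, which mimics the reduced context a streaming model works with. The file name and chunk length are illustrative assumptions.

```python
# Batch vs (simulated) streaming transcription with open-source Whisper.
# Streaming is approximated by transcribing fixed-size chunks independently.
import whisper

model = whisper.load_model("small")
audio = whisper.load_audio("ted_clip.wav")   # 16 kHz mono numpy array
sr = whisper.audio.SAMPLE_RATE               # 16000

# Batch: the whole clip is transcribed with full context.
batch_text = model.transcribe(audio)["text"]

# "Streaming": 10-second chunks transcribed one at a time, no shared context.
chunk_len = 10 * sr
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
streaming_text = " ".join(model.transcribe(chunk)["text"].strip() for chunk in chunks)

print("Batch:    ", batch_text)
print("Streaming:", streaming_text)
```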

More Technical Details

For a deeper dive into the benchmark setup, audio samples, and methodology, see this blog post or this video.

Note: the blog and video reflect an earlier version of the benchmark from January 2025, so some data, evaluation methods, and models may differ.

For questions or inquiries, please contact bai@voicewriter.io.

Try Voice Writer today!

© 2025 Efficient NLP. All rights reserved.