Methodology

How the BizCrush STT benchmark works, end to end.

Why this benchmark

We publish this benchmark to give STT users a scoreboard for the apps they actually use — real-world audio, methodology fully visible — and to keep the numbers, audio, and transcripts here for anyone to review directly.

The conflict of interest is real (we build BizCrush and this benchmark); the rest of this page is what we do to bound it.

Test setup

The benchmark runs on a dedicated Mac mini. Reference audio plays through a virtual audio bridge into the Android emulator's mic input, where a real STT app picks it up. We use the speaker-to-mic path instead of calling the engine's API directly because what users experience in practice is the app's full audio pipeline, and the score should reflect that.

Reference audio (WAV) → Mac virtual audio (BlackHole + Multi-Output) → Android emulator mic input → STT app → Captured live transcript

Multi-Output Device duplicates the audio to both the Mac's speakers and BlackHole, so playback can be monitored while BlackHole feeds the emulator's mic. Without it, BlackHole would silently consume the playback.

How a single test runs

  1. The harness opens the target app on the emulator and taps record.
  2. It plays the reference audio clip through BlackHole into the emulator's mic input.
  3. While the recording is still running, it captures the engine's live transcript directly from the app — via the in-recording copy button on apps that expose one, or by polling the visible transcript view on apps that don't.
  4. After playback finishes, it waits a few seconds for any trailing recognition to flush, then captures the final portion of the Live transcript.
  5. It taps stop. What happens next depends on which transcript the test is capturing:
    • Live test — the Live transcript was already captured in the steps above, so the recording session simply ends here.
    • Post test — the app runs its AI post-processing after the stop tap, so the harness waits for it to finish and then captures the polished result.
  6. Each captured transcript is normalized and scored against the canonical reference, then enters the internal review queue. It isn't visible on the public site yet.
  7. A reviewer listens to the clip and compares the engine output against the audio. They can mark word-pairs as equivalent when the difference is orthographic but acoustically indistinguishable: Korean word-spacing (띄어쓰기) variations, fast-speech spelling variants (e.g. 있지마는 vs. 있지만은), or cases the auto-normalizer can't catch generically (e.g. 100 kilometers vs. a hundred kilometers). Each mark drops the corresponding errors from the score. The marks are shown on each published run alongside the transcripts, so any reader can see which equivalences were applied. Reviewers cannot edit the engine output, the reference text, or the audio. Every mark's (reference, engine) word-pair is logged for later normalization-rule mining; reviewer identity is not stored.
  8. If the reviewer approves, the run becomes visible on this site.

Why we capture during recording

Most apps — ours included — run a post-stop pass over the whole transcript: punctuation, capitalization, occasional word fixes. The polish is helpful for end users but hides recognition errors. The displayed transcript ends up better than what the engine actually output.

Capturing live sidesteps that pass. The live capture still includes any inline AI cleanup the app layers on top as words arrive — that's a feature of the app's pipeline, not the engine itself, and we don't try to defeat it since it's what users actually see in the app.

We score the engine as users experience it in the app: inline cleanup included, post-stop polish excluded.

Live and Post transcripts

Many apps don't expose a Live transcript — they only reveal the polished transcript after you tap stop. We added a Post transcript capture (the post-stop, AI-rewritten result) so those apps can be benchmarked too.

Where an app exposes both, each is published as its own run, labeled Live or Post, so you can see exactly what the post-processing changed.

Disclosure

To keep every test verifiable, each published run shows the normalized reference and the normalized engine transcript side by side, so any reader can audit the score.

We build a custom automation harness for each app we benchmark. Because the implementation details would identify the apps being tested, we don't publish how each harness is built.

Scoring

Every transcript is normalized symmetrically on both sides (reference and engine output) before comparison so the metric reflects acoustic accuracy rather than orthographic preference. Specifically:

  • Casing & punctuation. Text is lowercased and Unicode NFKC-normalized; parentheticals and stage directions like (laughter) are stripped, as are timestamps; punctuation is removed, except apostrophes inside contractions.
  • HTML entities decoded. Source transcripts sometimes contain raw entities like &amp;. We decode them before normalization, so 'R&amp;D' in the source and 'R&D' from the engine become the same token.
  • Thousand-grouping commas collapsed. 300,000 and 300000 are treated as the same token, since engines vary in whether they emit the comma.
  • Currency & percent equivalence. If the reference says $3.3 billion and the engine emits 3.3 billion dollars, we count that as a match (and similarly for 0.6% vs. 0.6 percent). The engine's form wins on both sides, so the diff renders them as identity rather than a substitution.
  • Speaker labels stripped. Diarization artifacts like Speaker 1 that engines sometimes emit aren't in the canonical reference, so they don't count as insertions.
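
The rules above can be sketched as one function applied identically to the reference and the engine output. This is a minimal illustration, not the benchmark's actual normalizer: the regular expressions, the timestamp format (mm:ss or hh:mm:ss), the magnitude words, and the choice to keep decimal points inside numbers are all assumptions. It also rewrites both sides to a spelled-out form ("dollars", "percent") rather than letting the engine's form win, which gives the same match result with less machinery.

```python
import html
import re
import unicodedata

# All patterns below are hypothetical stand-ins for the published rules.
_PAREN = re.compile(r"\([^)]*\)")                       # (laughter), (applause)
_TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")  # assumed mm:ss / hh:mm:ss
_SPEAKER = re.compile(r"\bspeaker\s+\d+\b:?")           # diarization labels
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")         # 300,000 -> 300000
_CURRENCY = re.compile(
    r"\$(\d+(?:\.\d+)?)(\s+(?:thousand|million|billion|trillion))?"
)
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
# Keep apostrophes between letters and dots between digits; group 1 is
# every other punctuation character, which gets dropped.
_PUNCT = re.compile(r"(?<=[^\W\d_])'(?=[^\W\d_])|(?<=\d)\.(?=\d)|([^\w\s]|_)")


def _spell_currency(match: re.Match) -> str:
    # "$3.3 billion" -> "3.3 billion dollars"
    return f"{match.group(1)}{match.group(2) or ''} dollars"


def normalize(text: str) -> str:
    text = html.unescape(text)                           # R&amp;D -> R&D
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")                   # curly apostrophe (assumption)
    text = _PAREN.sub(" ", text)
    text = _TIMESTAMP.sub(" ", text)
    text = _SPEAKER.sub(" ", text)
    text = _THOUSANDS.sub("", text)
    text = _CURRENCY.sub(_spell_currency, text)
    text = _PERCENT.sub(r"\1 percent", text)
    text = _PUNCT.sub(lambda m: "" if m.group(1) else m.group(0), text)
    return " ".join(text.split())                        # collapse whitespace
```

Because the same function runs on both sides, a reference line such as "We spent $3.3 billion on R&amp;D" and an engine line such as "we spent 3.3 billion dollars on R&D" reduce to the same token sequence.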

With normalization applied, the metric is:

  • WER for Latin-script languages and Korean. Tokens are whitespace-separated; the score counts substitutions, deletions, and insertions relative to the number of reference tokens. Raw Korean WER counts orthographic differences as errors even when they aren't audible, such as word-spacing (띄어쓰기) variations and spelling variants indistinguishable in fast speech (e.g. 있지마는 vs. 있지만은). A reviewer listens to each clip and marks such pairs as equivalent, so the score reflects acoustic accuracy. WER also keeps a distinction CER erases: 전 체조 선수 ("former gymnast") vs. 전체 조 선수 ("all-team player").
  • CER for Japanese and Chinese. Character-level scoring; whitespace stripped from both sides before comparison.
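
For readers who want to reproduce a score from the published normalized transcripts, here is a minimal sketch of both metrics using a standard Levenshtein alignment. The function names and the "word"/"char" switch are illustrative assumptions; the benchmark's own scorer may break ties between equally cheap alignments differently.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost alignment."""
    # prev[j] holds (total, subs, dels, ins) for ref[:i-1] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cur.append(prev[j - 1])          # match: no cost
                continue
            t, s, d, n = prev[j - 1]
            sub = (t + 1, s + 1, d, n)           # substitution
            t, s, d, n = prev[j]
            dele = (t + 1, s, d + 1, n)          # reference token missing
            t, s, d, n = cur[j - 1]
            ins = (t + 1, s, d, n + 1)           # extra engine token
            cur.append(min(sub, dele, ins))
        prev = cur
    return prev[-1][1:]


def tokenize(text, unit):
    if unit == "char":
        # CER: strip all whitespace, score character by character.
        return [ch for ch in text if not ch.isspace()]
    # WER: whitespace-separated tokens.
    return text.split()


def error_rate(ref_text, hyp_text, unit="word"):
    ref, hyp = tokenize(ref_text, unit), tokenize(hyp_text, unit)
    if not ref:
        raise ValueError("reference is empty")
    subs, dels, ins = edit_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)
```

Scoring 전 체조 선수 against 전체 조 선수 with unit="word" yields two substitutions over three reference tokens, while unit="char" yields zero errors, which is the distinction the WER bullet describes.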

Cases the auto-normalizer can't catch generically (e.g. 100 kilometers vs. a hundred kilometers — same meaning, but neither a number-format issue nor a currency equivalence) are handled in the reviewer step described under How a single test runs above. Reviewers can mark such pairs as equivalent without editing the engine output or the reference.
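
One way to picture how a reviewer mark drops errors without anyone editing a transcript: the stored engine output stays untouched, and the scorer works on a copy in which each marked engine phrase is read as its reference phrase. The sketch below assumes marks are stored as (reference, engine) phrase pairs and applies them wherever the engine phrase occurs; the real system presumably ties each mark to a specific position in the diff, so treat this as a simplification.

```python
def apply_marks(hyp_tokens, marks):
    """Return a scoring copy of the engine tokens with marked phrases
    rewritten to their reference form. The input list is not modified."""
    out = list(hyp_tokens)
    for ref_phrase, hyp_phrase in marks:
        ref_toks, hyp_toks = ref_phrase.split(), hyp_phrase.split()
        if not hyp_toks:
            continue
        rewritten, i = [], 0
        while i < len(out):
            if out[i:i + len(hyp_toks)] == hyp_toks:
                rewritten.extend(ref_toks)   # treat as equivalent
                i += len(hyp_toks)
            else:
                rewritten.append(out[i])
                i += 1
        out = rewritten
    return out


# Hypothetical mark, using the example from the text above.
marks = [("100 kilometers", "a hundred kilometers")]
engine = "it is a hundred kilometers away".split()
scoring_copy = apply_marks(engine, marks)
```

Here scoring_copy becomes the tokens of "it is 100 kilometers away", so the pair no longer counts as errors, while engine still holds the original output for display.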

Reliability

The Android emulator's audio bridge can miss its real-time deadlines for delivering frames. When that happens, the recording captures silence and the transcript is degraded. The emulator log records these as producerThread late events; the harness counts them and flags a run as chaotic when the rate exceeds a threshold. Reliability is the share of each engine's runs that were not flagged.
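
A rough sketch of that bookkeeping follows. The exact log line format, the unit of the rate (here, events per second of playback), and the threshold value are not stated on this page, so all three are assumptions chosen for illustration.

```python
def late_event_rate(log_lines, playback_seconds):
    """Late events per second of playback (the unit is an assumption)."""
    if playback_seconds <= 0:
        raise ValueError("playback_seconds must be positive")
    late = sum(
        1 for line in log_lines
        if "producerThread" in line and "late" in line.lower()
    )
    return late / playback_seconds


def is_chaotic(log_lines, playback_seconds, max_rate=0.5):
    # max_rate is a hypothetical threshold, not the benchmark's value.
    return late_event_rate(log_lines, playback_seconds) > max_rate


def reliability(chaotic_flags):
    """Share of an engine's runs that were not flagged chaotic."""
    if not chaotic_flags:
        raise ValueError("no runs to summarize")
    clean = sum(1 for flagged in chaotic_flags if not flagged)
    return clean / len(chaotic_flags)


# Hypothetical log excerpt; real emulator output will look different.
sample_log = [
    "W audio: producerThread late by 12 ms",
    "I audio: stream opened",
    "W audio: producerThread late by 30 ms",
]
```

With this sample, two late events over a 10-second clip give a rate of 0.2 per second, and an engine with one chaotic run out of four has a reliability of 0.75.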

Limitations

  • Today we test engines through their consumer apps, which means scores reflect any inline AI cleanup the app layers on top of the engine. API-direct testing is on the roadmap and will be reported separately; in-app and API numbers for the same engine generally won't be directly comparable.
  • Live-transcript capture is app-specific. Each engine exposes its in-progress transcript differently, so the harness has a per-app capture path. Adding a new engine requires manually mapping the right UI hooks.
  • App version isn't auto-captured yet, so the site doesn't yet distinguish results across major engine releases.
  • The corpus is small, and grows deliberately. License-free audio clips are hard to source, and well-written transcripts harder still — human reviewers create or refine each ground-truth transcript by listening to the audio directly, so the reference is accurate before any engine is scored against it.

Licenses & credits

Original BizCrush-generated audio and reference transcripts are © BizCrush — all rights reserved. They are published here so any reader can verify transcript accuracy by ear and against the displayed ground truth. Redistribution, derivative works, commercial use, and use as machine-learning training data are prohibited without prior written permission.

Third-party-sourced test audio and reference transcripts are reproduced from their original sources for the purpose of objective benchmarking. Copyright remains with the original creators; we credit each source on its clip page. The output transcripts produced by each engine are derivatives of the source audio and inherit its copyright posture.

Engine output transcripts are produced by the respective STT services and are reproduced here for accuracy comparison only. Trademark and product names belong to their respective owners.

For licensing inquiries about BizCrush-original content, or if you are a rights holder of any third-party clip in our corpus and would like attribution updated or content removed, contact us at help@bizcrush.ai.