Why we built our speech model for 8 kHz — and why most voice AI fails on a real phone call

Here is a thing that sounds like a detail and is actually the whole game: the audio on a phone call is not the audio you record into your laptop. A telephone line carries narrowband audio, roughly 300 Hz to 3.4 kHz, sampled at 8 kHz. Studio and headset audio is wideband, sampled at 16 kHz or higher. That is not a small difference in fidelity. An 8 kHz stream can represent nothing above 4 kHz, while a 16 kHz stream reaches 8 kHz, so at least half the frequency range is simply gone.

And the half that's gone is the half that carries the consonants. The crispness that separates an 's' from an 'f', a 't' from a 'p', the high-frequency air that tells you which word a caller actually said — most of that lives above 4 kHz, in exactly the band a phone line throws away. Vowels survive narrowband fine. Consonants are where transcription goes to die.
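The "half the range" arithmetic follows from the Nyquist limit: a stream sampled at a given rate can only represent frequencies below half that rate. A minimal sketch in Python, with function names that are purely illustrative and not taken from any library:

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return float(sample_rate_hz) / 2.0


def alias_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Where a pure tone lands if it is sampled with no anti-alias filter."""
    folded = float(tone_hz) % float(sample_rate_hz)
    return min(folded, float(sample_rate_hz) - folded)


phone = nyquist_hz(8_000)      # 4000.0
wideband = nyquist_hz(16_000)  # 8000.0
print(phone / wideband)        # 0.5

# A 6 kHz component, in the range where 's' frication carries much of its
# energy, cannot exist in an 8 kHz stream. Unfiltered it would fold down to
# 2 kHz; a real telephony chain filters it out before sampling.
print(alias_hz(6_000, 8_000))  # 2000.0
```

Either way, the information is not recoverable downstream: whatever the model sees at 8 kHz contains no trace of the original high band.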

Studio audio (wideband, ≥16 kHz): what most voice AI is trained on. Full detail, top to bottom.

A phone line (narrowband, 8 kHz): half the audio detail. Everything above ~4 kHz is gone, including the crisp part of s, f, and th.

Why this quietly wrecks most voice AI

Almost every speech model on the market is trained, by default, on clean wideband audio — audiobooks, podcasts, read-aloud corpora, headset recordings. That's what's abundant and easy to label. A model trained that way learns to lean on the high-frequency detail that makes consonants easy. It performs beautifully in a demo, where you're talking into a good mic in a quiet room.

Then you point it at a real phone call. The high band is gone, there's line noise and codec compression, the caller is in a car or a warehouse, and the model is suddenly guessing at the exact cues it was trained to depend on. Word error rate climbs, names and numbers get mangled, and your agent mishears the one digit that mattered. The model didn't get worse — its world did, and it was never built for this world.
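For readers who have not computed it by hand, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic textbook implementation, not our evaluation code, and the transcripts are made up for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# Hypothetical transcripts: one misheard digit in a six-word utterance.
ref = "my number is five five nine"
hyp = "my number is five nine nine"
print(word_error_rate(ref, hyp))  # one substitution out of six words
```

Note that WER weights every word equally. A single wrong digit barely moves the score here, yet it is exactly the error that breaks the call.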

A voice model that demos perfectly on a headset and fumbles on a phone call isn't broken. It was trained for the wrong world.

What we did differently

We didn't take a wideband model and downsample at the edges to cope with phone audio. We built for 8 kHz narrowband as the primary target — tuned on the kind of audio a real call actually carries, line noise, codecs, and all. When the high band is missing, the model isn't surprised; that's the condition it was optimized for.

We could do that because we'd already lived in this world. Across the platforms we power, we've handled over a billion calls — so we weren't guessing at where telephony audio breaks. We'd measured it. The 8 kHz decision wasn't a checkbox; it was the consequence of watching what real call audio does to a model that wasn't ready for it.

Hear, our speech-to-text, is tuned for 8 kHz call audio with streaming partials: built for the phone, not the podcast.
Omni, our speech-to-speech model, hears and speaks over a single socket with natural turn-taking and barge-in, built on the same telephony-first foundation.

The result is accuracy that holds up on the lines you actually run on, not just in the quiet-room demo.

Why it matters for what you're building

If you're building a phone agent — a receptionist, an appointment booker, a support line — the audio you'll see in production is narrowband, noisy, and nothing like your test recordings. A model that wasn't built for that will look great the day you ship and disappoint the day real callers arrive. Telephony isn't a degraded version of the studio. It's its own problem, and it deserves a model built for it.
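One way to make test recordings look a little more like production is to push them through a rough phone-line approximation before evaluating. The sketch below assumes 16 kHz float samples in [-1, 1] and uses only the standard library. It low-pass filters near 3.4 kHz, drops to 8 kHz, and applies continuous mu-law companding quantized to 8 bits. It is an approximation under stated assumptions: the 300 Hz low edge, line noise, packet loss, and exact G.711 segment encoding are not modeled, and every function name is hypothetical.

```python
import math


def lowpass_taps(cutoff_hz: float, sample_rate_hz: float, num_taps: int = 63) -> list[float]:
    """Hamming-windowed sinc low-pass FIR, normalized to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(samples: list[float], taps: list[float]) -> list[float]:
    """Direct-form FIR convolution, treating samples before the start as zero."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out


def mu_law_roundtrip(x: float, mu: int = 255) -> float:
    """Compress with continuous mu-law, quantize to 256 levels, expand again."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    q = round((y + 1) / 2 * 255) / 255 * 2 - 1
    return math.copysign(math.expm1(abs(q) * math.log1p(mu)) / mu, q)


def simulate_phone_line(samples_16k: list[float]) -> list[float]:
    """16 kHz samples in [-1, 1] -> rough 8 kHz telephony approximation."""
    filtered = fir_filter(samples_16k, lowpass_taps(3_400, 16_000))
    decimated = filtered[::2]  # keep every second sample: 16 kHz -> 8 kHz
    return [mu_law_roundtrip(s) for s in decimated]


# Usage with a synthetic 1 kHz tone standing in for a real recording.
tone = [0.5 * math.sin(2 * math.pi * 1_000 * n / 16_000) for n in range(1_600)]
phone_like = simulate_phone_line(tone)
print(len(tone), "->", len(phone_like))  # 1600 -> 800
```

Running the same evaluation set with and without this step gives a first, optimistic estimate of how much a model depends on the high band. Real call audio will still be harsher than anything simulated this way.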

That's the bet we made: be the best voice model for the phone specifically, not the most impressive one in a demo. The phone is where the calls are.
