Performance you can verify.
Every PyAI model, scored on the metrics that matter for phone calls — accuracy, latency, turn-taking, grounding — with the exact method, the thresholds, and verification on request. We measure our own engine the way we'd want a vendor to measure theirs.
How each model scores
Each metric carries its value, the band it's graded against (good / warn / critical), a plain-English explainer, and where the number came from.
Hear
Speech-to-text
Word Error Rate
Fixture scorecard (telephony-8k + accented-noisy corpus)
4.76% across the telephony/accented corpus; 0% on clean audio in the same fixture scorecard. The live Speak→Hear round-trip, also on clean audio, measured 1.59%.
In plain terms: Word Error Rate is the share of words the transcript gets wrong (substituted, dropped, or inserted). Under 5% is roughly human-parity for clear speech; we hold that bar on real 8 kHz phone audio, which is the hard case.
Speak
Text-to-speech
Time to first audio (warm path)
Tested, warm-path streaming synthesis
32–98 ms to first audio byte on the warm streaming path (clone_v3 prefix-state cache). Under load on a shared/sandbox key, requests can queue; that is a capacity issue, tracked separately, and it does not change the warm-path figure.
In plain terms: Time-to-first-byte is how long until the caller hears the first sound. Streaming audio as soon as the first byte is ready (instead of waiting to render the whole clip) is what makes a voice agent feel responsive rather than frozen. Under 100 ms, most callers will not notice the delay.
Format & sample-rate correctness
Live run (wav/pcm/g711_ulaw)
Every requested container and sample rate is honored exactly — a hard contract check across wav @24kHz, pcm @8kHz, and g711_ulaw @8kHz.
In plain terms: When you ask for 8 kHz μ-law for Twilio, you get exactly that, not a resampled surprise. A 100% pass rate means the audio drops into your telephony pipeline without a conversion step.
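As an illustration of what a sample-rate contract check can look like on your side, here is a minimal sketch using only Python's standard library. It covers the WAV case only, because raw pcm and g711_ulaw streams carry no header to inspect. The helper names (check_wav_contract, make_silent_wav) are hypothetical and are not part of the PyAI harness.

```python
import io
import wave


def wav_sample_rate(data: bytes) -> int:
    """Read the sample rate declared in a PCM WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate()


def check_wav_contract(data: bytes, requested_hz: int) -> bool:
    """True only if the WAV header reports exactly the requested rate."""
    return wav_sample_rate(data) == requested_hz


def make_silent_wav(sample_rate_hz: int, n_frames: int = 160) -> bytes:
    """Stand-in for a synthesis response: mono 16-bit PCM silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


# A 24 kHz clip passes a 24 kHz request and fails an 8 kHz one.
clip = make_silent_wav(24000)
print(check_wav_contract(clip, 24000), check_wav_contract(clip, 8000))
```

In a real check, the bytes returned by the API would replace the output of make_silent_wav.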
Omni
Realtime speech-to-speech agent
Turn latency p50 (utterance-end → first audio)
Live in-region, ua_first (p50: Omni 402 ms / Agents 386 ms), self-tracking
Median ~390 ms from the moment the caller stops speaking to the first byte of the agent's reply, measured in-region. This is the felt turn latency — not session-anchored. Continuously tracked by pyai-latency-30m.
In plain terms: This is the number that defines how human the agent feels. ~390 ms sits inside the natural pause between people talking; above ~800 ms it starts to feel like a walkie-talkie. We measure utterance-end to first audio, which is the real metric — anchoring on session-open inflates it by the length of what the caller said.
VAQI (voice-agent quality index)
Fixture scorecard (composite)
Composite 0–100: interruptions 40% + missed-response 40% + latency 20%. 94.3 reflects low false-barge and missed-response rates alongside the turn latency.
In plain terms: A single number for 'is this a good voice agent.' It penalizes talking over people (interruptions), failing to reply (missed responses), and lagging — the three things that make a call feel broken. Above 70 is good; we're at 94.
KB-grounded answer rate
Fixture scorecard
9 of 10 answers were both grounded in the bound knowledge base and keyword-correct.
In plain terms: When the agent answers, does it answer from your content (not made up), and is it right? 90% means it pulled the answer from your knowledge base and got it correct in 9 of the 10 fixture cases.
What these metrics mean (and why they matter on a call)
WER
Word Error Rate = (substitutions + deletions + insertions) / reference words, over normalized tokens.
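The formula above can be sketched as a word-level edit distance divided by the reference length. The normalization step here (lowercase, strip ASCII punctuation, split on whitespace) is an assumption for illustration; the page does not specify the normalizer the harness uses.

```python
import string


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word dropped)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words gives 0.2 (20% WER).
print(word_error_rate("Please call me back tomorrow.", "please call back tomorrow"))
```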
TTFB
Time-to-first-byte: wall time from request send to the first audio byte received. Reported on the warm, uncontended streaming path.
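A minimal sketch of this measurement: start a monotonic clock just before the request is sent and stop it when the first non-empty chunk arrives. The open_stream callable is a hypothetical stand-in for whatever client call sends the request and yields audio chunks; fake_stream simulates it here so the example runs without network access.

```python
import time
from typing import Callable, Iterable


def time_to_first_byte(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from request send to the first non-empty audio chunk."""
    start = clock()  # taken immediately before the request is sent
    for chunk in open_stream():  # hypothetical: sends the request, yields chunks
        if chunk:  # skip empty keep-alive chunks
            return clock() - start
    raise RuntimeError("stream ended before any audio arrived")


def fake_stream() -> Iterable[bytes]:
    # Stand-in for a streaming response: first audio after about 50 ms.
    time.sleep(0.05)
    yield b"\x00\x01"
    time.sleep(1.0)  # never reached: timing stops at the first chunk
    yield b"\x02\x03"


ttfb_ms = time_to_first_byte(fake_stream) * 1000
print(f"time to first audio: {ttfb_ms:.0f} ms")
```

Because the timer stops at the first chunk, the total clip length has no effect on the result, which is the point of reporting this on a streaming path.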
ua_first (turn latency)
Utterance-end (EOU / commit) to the first agent audio frame — the felt turn latency. NOT session-anchored, which would inflate it by the caller's speech length.
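To make the anchoring difference concrete, here is a small worked example with made-up timestamps. The event names are hypothetical, and the caller is assumed to start speaking at session open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnEvents:
    # Milliseconds on a single monotonic clock (hypothetical event names).
    session_open_ms: float       # caller starts speaking here in this example
    utterance_end_ms: float      # end of utterance (EOU / commit)
    first_agent_audio_ms: float  # first agent audio frame


def ua_first_ms(t: TurnEvents) -> float:
    """Felt turn latency: utterance-end to first agent audio."""
    return t.first_agent_audio_ms - t.utterance_end_ms


def session_anchored_ms(t: TurnEvents) -> float:
    """Session-anchored latency: includes the caller's own speech time."""
    return t.first_agent_audio_ms - t.session_open_ms


# Illustrative numbers only: a 3.2 s utterance, reply audio 390 ms after it ends.
turn = TurnEvents(session_open_ms=0.0, utterance_end_ms=3200.0,
                  first_agent_audio_ms=3590.0)
print(ua_first_ms(turn))          # 390.0
print(session_anchored_ms(turn))  # 3590.0
```

The gap between the two figures (3200 ms) is exactly the length of the caller's utterance, which says nothing about how fast the agent responded.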
VAQI
Voice Agent Quality Index (0–100): interruptions 40% + missed-response 40% + latency 20%.
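The weighting can be sketched as below. Only the three weights come from this page. How each component is normalized is not stated here, so this sketch assumes each one arrives as a penalty between 0.0 (perfect) and 1.0 (worst) and that the composite is 100 × (1 − weighted penalty). Treat it as an illustration of the weighting, not as the harness's scorer.

```python
# Weights as stated on this page.
WEIGHTS = {"interruptions": 0.40, "missed_response": 0.40, "latency": 0.20}


def vaqi(penalties: dict[str, float]) -> float:
    """Composite 0-100 score from per-component penalties.

    Assumption (not from the source): each penalty is already normalized to
    0.0 (perfect) .. 1.0 (worst), and the score is 100 * (1 - weighted sum).
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        penalty = penalties[name]
        if not 0.0 <= penalty <= 1.0:
            raise ValueError(f"{name} penalty must be in [0, 1], got {penalty}")
        total += weight * penalty
    return 100.0 * (1.0 - total)


# Hypothetical penalties, chosen only to show the arithmetic.
example = {"interruptions": 0.10, "missed_response": 0.05, "latency": 0.50}
print(round(vaqi(example), 1))
```

With these made-up inputs the weighted penalty is 0.04 + 0.02 + 0.10 = 0.16, so the score comes out at about 84. Interruptions and missed responses each move the score twice as much as latency does.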
How we measure — and the honest caveats
An in-repo, CI-gated harness (evals/) plus a self-tracking in-region latency profile (pyai-latency-30m). The same scorers grade recorded fixtures and live API calls, so a live number is directly comparable to a fixture number.
- Fixture metrics (WER corpus, VAQI, KB-grounded) are recorded golden cases scored offline; live numbers (turn latency, Speak TTFB, format correctness) are measured against api.pyai.com in-region.
- Live Hear WER uses a Speak→Hear round-trip (synthesize, then transcribe), so it reflects cleaner-than-telephony audio; the fixture WER on real 8 kHz call audio is the more demanding number.
- Turn latency is measured on the live hybrid path; the p99 tail on a shared key can be inflated by per-key rate limits — a dedicated monitoring key is the fix.
How these are measured — and how to verify them
We run an in-repo, CI-gated benchmark harness (offline scorer fixtures) plus a self-tracking in-region latency profile. The harness and a walk-through of the exact method are available to any team evaluating PyAI — just ask.
What we run, continuously
- A scorer harness grading accuracy (WER), format correctness, grounding, and a VAQI composite — the same cases gate engine regressions in CI.
- A self-tracking in-region latency profile (every 30 min) measuring ua_first (utterance-end to first agent audio) at p50/p95/p99.
- Every metric on this page carries its source (fixture vs live) and the band it's graded against.
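For readers re-running the latency profile on their own traffic, p50/p95/p99 can be computed from a batch of ua_first samples as sketched below. This uses the nearest-rank definition of a percentile, which is an assumption; the page does not say which percentile method the profile uses, and the sample values are invented.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical ua_first samples in milliseconds, including one slow outlier.
latencies_ms = [380, 390, 400, 410, 1200]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With so few samples the outlier dominates p95 and p99 while leaving p50 untouched, which is why the median and the tail are reported separately.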
Want the harness, or a live re-run?
The benchmark harness, the fixture corpus, and a live walk-through are available to any team evaluating PyAI for production — tell us what you're building and we'll share the method and help you re-run it against your own traffic.
One voice stack. Numbers with receipts.
Start with $50 in free credit and 10,000 free Hear minutes every month. No card.
No credit card - OpenAI-compatible - cancel anytime