PASS · As of 2026-06-20 · Measured, not marketed

Performance you can verify.

Every PyAI model, scored on the metrics that matter for phone calls — accuracy, latency, turn-taking, grounding — with the exact method, the thresholds, and verification on request. We measure our own engine the way we'd want a vendor to measure theirs.

  • VAQI (voice-agent quality index): 94.3/100
  • Hear WER on clean audio: 1.59%
  • Median turn-taking (ua_first): ~390 ms
  • Speak format & sample-rate correctness: 100%
Scorecards

How each model scores

Each metric carries its value, the band it's graded against (good / warn / critical), a plain-English explainer, and where the number came from.
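As a rough illustration of how a value might be graded against a band, here is a minimal Python sketch. The `Band` type and `grade` function are hypothetical names, not part of the PyAI harness. The "good" thresholds (800 ms turn latency, VAQI 70) come from this page; the "warn" thresholds are invented for the example, since the page does not publish them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    good: float             # threshold for "good"
    warn: float             # threshold for "warn"; anything past it is "critical"
    higher_is_better: bool  # True for scores (VAQI), False for costs (latency, WER)


def grade(value: float, band: Band) -> str:
    """Return 'good', 'warn', or 'critical' for a value under the given band."""
    if band.higher_is_better:
        if value >= band.good:
            return "good"
        return "warn" if value >= band.warn else "critical"
    if value <= band.good:
        return "good"
    return "warn" if value <= band.warn else "critical"


# "good" thresholds are taken from this page; "warn" thresholds are assumptions.
turn_latency_band = Band(good=800.0, warn=1200.0, higher_is_better=False)
vaqi_band = Band(good=70.0, warn=50.0, higher_is_better=True)

print(grade(390.0, turn_latency_band))  # good
print(grade(94.3, vaqi_band))           # good
```

Keeping the direction (`higher_is_better`) on the band itself avoids having one grading function per metric.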

Hear

Speech-to-text

PASS

Word Error Rate

Fixture scorecard (telephony-8k + accented-noisy corpus)

PASS
4.76%
Band: good at or below 5% (lower is better)

4.76% across the telephony/accented fixture corpus and 0% on the clean-audio fixtures. A separate live round-trip run on clean audio measured 1.59%.

In plain terms: Word Error Rate is the share of words the transcript gets wrong (substituted, dropped, or inserted). Under 5% is roughly human-parity for clear speech; we hold that bar on real 8 kHz phone audio, which is the hard case.

Speak

Text-to-speech

PASS

Time to first audio (warm path)

Tested, warm-path streaming synthesis

PASS
32–98 ms
Band: good at or below 400 ms (lower is better)

32–98 ms to first audio byte on the warm streaming path (clone_v3 prefix-state cache). Under load on a shared/sandbox key, requests can queue; that is a capacity issue, tracked separately, and it does not change the warm-path figure.

In plain terms: Time-to-first-byte is how long until the caller hears the first sound. Streaming it from the first byte (instead of waiting to render the whole clip) is what makes a voice agent feel responsive rather than frozen. Sub-100 ms is imperceptible.

Format & sample-rate correctness

Live run (wav/pcm/g711_ulaw)

PASS
100%

Every requested container and sample rate is honored exactly — a hard contract check across wav @24kHz, pcm @8kHz, and g711_ulaw @8kHz.

In plain terms: When you ask for 8 kHz μ-law for Twilio, you get exactly that — not a resampled surprise. 100% means the audio drops into your telephony pipeline without a conversion step.
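A contract check of this kind can be sketched for the WAV case with Python's standard `wave` module, which reads the sample rate straight from the container header. The helper names (`wav_sample_rate`, `make_silent_wav`, `honors_request`) are hypothetical, and the audio is generated locally as a stand-in for a real Speak response. Raw pcm and g711_ulaw carry no header, so they would need a different check (for example, byte count against expected duration), which is not shown here.

```python
import io
import wave


def wav_sample_rate(data: bytes) -> int:
    """Read the sample rate declared in a WAV container header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate()


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV; stands in for a synthesis response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buf.getvalue()


def honors_request(data: bytes, requested_rate: int) -> bool:
    """True only when the delivered rate matches the requested rate exactly."""
    return wav_sample_rate(data) == requested_rate


print(honors_request(make_silent_wav(24000), 24000))  # True
print(honors_request(make_silent_wav(16000), 24000))  # False
```

The comparison is exact on purpose: a "close enough" rate would still force a resampling step downstream.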

Omni

Realtime speech-to-speech agent

PASS

Turn latency p50 (utterance-end → first audio)

Live in-region, ua_first (p50: Omni 402 ms, Agents 386 ms), self-tracking

PASS
~390 ms
Band: good at or below 800 ms (lower is better)

Median ~390 ms from the moment the caller stops speaking to the first byte of the agent's reply, measured in-region. This is the felt turn latency — not session-anchored. Continuously tracked by pyai-latency-30m.

In plain terms: This is the number that defines how human the agent feels. ~390 ms sits inside the natural pause between people talking; above ~800 ms it starts to feel like a walkie-talkie. We measure utterance-end to first audio, which is the real metric — anchoring on session-open inflates it by the length of what the caller said.

VAQI (voice-agent quality index)

Fixture scorecard (composite)

PASS
94.3
Band: good at or above 70 (higher is better)

Composite 0–100: interruptions 40% + missed-response 40% + latency 20%. 94.3 reflects low false-barge and missed-response rates alongside the turn latency.

In plain terms: A single number for 'is this a good voice agent.' It penalizes talking over people (interruptions), failing to reply (missed responses), and lagging — the three things that make a call feel broken. Above 70 is good; we're at 94.

KB-grounded answer rate

Fixture scorecard

PASS
90%
Band: good at or above 85% (higher is better)

9 of 10 answers were both grounded in the bound knowledge base and keyword-correct.

In plain terms: When the agent answers, does it answer from your content (not made up), and is it right? 90% means that on 9 of the 10 fixture questions the agent pulled the answer from the knowledge base and got it correct.
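The rate itself is simple to compute once each answer has been graded. The sketch below assumes each answer has already been given two boolean verdicts; `GradedAnswer` and `kb_grounded_rate` are hypothetical names, and how the harness actually decides "grounded" or "keyword-correct" is not described on this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GradedAnswer:
    grounded: bool         # answer is supported by the bound knowledge base
    keyword_correct: bool  # answer contains the expected keywords


def kb_grounded_rate(answers: Sequence[GradedAnswer]) -> float:
    """Share of answers that are BOTH grounded and keyword-correct."""
    if not answers:
        raise ValueError("need at least one graded answer")
    passing = sum(a.grounded and a.keyword_correct for a in answers)
    return passing / len(answers)


# Illustrative data: nine passing answers and one grounded-but-wrong answer.
answers = [GradedAnswer(True, True)] * 9 + [GradedAnswer(True, False)]
print(kb_grounded_rate(answers))  # 0.9
```

An answer has to pass both checks to count, so a grounded but incorrect answer lowers the rate just as an ungrounded one does.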

In plain terms

What these metrics mean (and why they matter on a call)

WER

Word Error Rate = (substitutions + deletions + insertions) / reference words, over normalized tokens.
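The formula above can be sketched in Python as a word-level edit distance divided by the reference length. The normalization step here (lowercase, strip punctuation other than apostrophes) is an assumption for illustration; the page does not specify the normalizer the harness uses.

```python
import re


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, replace punctuation (except apostrophes)
    # with spaces, then split on whitespace.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("please hold the line", "please hold line"))       # 0.25
print(word_error_rate("Book a table for two.", "book the table for two"))  # 0.2
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.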

TTFB

Time-to-first-byte: wall time from request send to the first audio byte received. Reported on the warm, uncontended streaming path.
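A minimal way to time this, assuming the audio arrives as an iterable of byte chunks: start the clock when the request is issued and stop it at the first non-empty chunk. `time_to_first_byte` and `fake_synthesis` are hypothetical names, and the fake stream simulates latency with `time.sleep` rather than calling any real API.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_byte(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def fake_synthesis() -> Iterator[bytes]:
    # Stand-in for a streaming synthesis call; delays are simulated.
    time.sleep(0.05)       # about 50 ms until the first chunk
    yield b"\x00" * 320
    time.sleep(0.2)        # the rest of the clip arrives later
    yield b"\x00" * 320


ttfb = time_to_first_byte(fake_synthesis)
print(f"TTFB: {ttfb * 1000:.0f} ms")
```

The function returns as soon as the first chunk lands, so the time spent rendering the rest of the clip never enters the measurement.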

ua_first (turn latency)

Utterance-end (EOU / commit) to the first agent audio frame — the felt turn latency. NOT session-anchored, which would inflate it by the caller's speech length.
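The difference between the two anchors is easy to see with timestamps. In this sketch the `TurnEvents` fields and the sample values are illustrative assumptions, not harness output: a caller speaks for 2.6 seconds and the agent's first audio frame arrives 390 ms after they stop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnEvents:
    session_open_ms: float       # caller audio starts flowing
    utterance_end_ms: float      # end of utterance (EOU / commit)
    first_agent_audio_ms: float  # first agent audio frame


def ua_first_ms(turn: TurnEvents) -> float:
    """Felt turn latency: utterance-end to first agent audio."""
    return turn.first_agent_audio_ms - turn.utterance_end_ms


def session_anchored_ms(turn: TurnEvents) -> float:
    """Session-anchored latency: also counts the time the caller spent talking."""
    return turn.first_agent_audio_ms - turn.session_open_ms


turn = TurnEvents(session_open_ms=0.0, utterance_end_ms=2600.0, first_agent_audio_ms=2990.0)
print(ua_first_ms(turn))          # 390.0
print(session_anchored_ms(turn))  # 2990.0
```

The session-anchored figure grows with every extra second the caller speaks, which is why it says little about how quickly the agent responds.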

VAQI

Voice Agent Quality Index (0–100): interruptions 40% + missed-response 40% + latency 20%.
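The page gives the weights but not how each component is normalized, so the following is only one plausible reading: treat each component as a penalty between 0 and 1 and subtract the weighted sum from a perfect score. The function name `vaqi`, the penalty scale, and the sample inputs are all assumptions made for illustration; they are not PyAI's measured values.

```python
def vaqi(interruption_rate: float, missed_response_rate: float, latency_penalty: float) -> float:
    """Composite 0-100 score from three penalties, each assumed to lie in [0, 1]."""
    components = {
        "interruption_rate": interruption_rate,
        "missed_response_rate": missed_response_rate,
        "latency_penalty": latency_penalty,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Weights from the page: interruptions 40%, missed-response 40%, latency 20%.
    penalty = 0.40 * interruption_rate + 0.40 * missed_response_rate + 0.20 * latency_penalty
    return 100.0 * (1.0 - penalty)


# Illustrative inputs only.
print(round(vaqi(0.10, 0.05, 0.25), 1))  # 89.0
```

Under these weights, talking over the caller and failing to reply each cost twice as much as the same amount of latency penalty.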

Methodology

How we measure — and the honest caveats

Measured by

An in-repo, CI-gated harness (evals/) plus a self-tracking in-region latency profile (pyai-latency-30m). The same scorers grade recorded fixtures and live API calls, so a live number is directly comparable to a fixture number.

Honest caveats
  • Fixture metrics (WER corpus, VAQI, KB-grounded) are recorded golden cases scored offline; live numbers (turn latency, Speak TTFB, format correctness) are measured against api.pyai.com in-region.
  • Live Hear WER uses a Speak→Hear round-trip (synthesize, then transcribe), so it reflects cleaner-than-telephony audio; the fixture WER on real 8 kHz call audio is the more demanding number.
  • Turn latency is measured on the live hybrid path; the p99 tail on a shared key can be inflated by per-key rate limits — a dedicated monitoring key is the fix.

Verification

How these are measured — and how to verify them

We run an in-repo, CI-gated benchmark harness (offline scorer fixtures) plus a self-tracking in-region latency profile. The harness and a walk-through of the exact method are available to any team evaluating PyAI — just ask.

What we run, continuously

  • A scorer harness grading accuracy (WER), format correctness, grounding, and a VAQI composite — the same cases gate engine regressions in CI.
  • A self-tracking in-region latency profile (every 30 min) measuring ua_first — utterance-end to first agent audio — at p50/p95/p99.
  • Every metric on this page carries its source (fixture vs live) and the band it's graded against.
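The p50/p95/p99 figures in the list above can be computed from a window of latency samples. This sketch uses the nearest-rank method, which is one common choice; the page does not say which percentile method the latency profile uses, and the sample values are invented.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative ua_first samples in milliseconds, with one slow outlier.
latencies_ms = [350, 360, 370, 380, 385, 390, 395, 410, 450, 900]

print(percentile(latencies_ms, 50))  # 385
print(percentile(latencies_ms, 95))  # 900
print(percentile(latencies_ms, 99))  # 900
```

With only ten samples a single slow turn sets both p95 and p99, which is why tail percentiles need a large window before they mean much.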

Want the harness, or a live re-run?

The benchmark harness, the fixture corpus, and a live walk-through are available to any team evaluating PyAI for production — tell us what you're building and we'll share the method and help you re-run it against your own traffic.

One voice stack. Numbers with receipts.

Start with $50 in free credit and 10,000 free Hear minutes every month. No card.

No credit card - OpenAI-compatible - cancel anytime