The entire real-time voice infrastructure suite is now predictive, multilingual, and built for real-world production audio.
Krisp launched Krisp VIVA 2.0, the voice AI infrastructure layer for voice agents, IVRs, and conversational AI. The release introduces a new generation of small, real-time models that improve word error rate (WER), predict when users finish speaking, classify interruptions, and read perceptual signals such as synthetic speech, gender, and accent, setting a new benchmark for how voice agents handle audio in production.
Voice agent usage grew 9x in 2025, but most voice agents still fail in the same predictable ways the moment they leave a demo room. Background voices and noise push speech-to-text word error rates from 5% to over 30%. Voice activity detection misfires on background voices; bots can ignore real interruptions or hallucinate them. And on telephony, the agent’s own voice can loop back through the mic and trigger self-interruption.
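For readers unfamiliar with the metric, word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not Krisp's evaluation code; the function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: tokenization is a plain lowercase whitespace
    split, which real evaluations usually replace with a normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One wrong word out of four reference words gives a WER of 0.25.
example = word_error_rate("turn the lights off", "turn the light off")
```

On this definition, a move from 5% to over 30% means going from roughly one wrong word in twenty to roughly one in three.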
Voice AI systems today are built on STT, LLMs, and TTS. What’s been missing is a layer to handle real-world audio and conversational dynamics before those systems engage.
VIVA fills that gap to ensure AI agents function in messy, real-world environments.
Krisp’s VIVA SDK runs server-side directly in each customer’s audio pipeline before STT, improving reliability across the entire stack.
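To make the placement concrete, the sketch below shows an audio-processing stage sitting ahead of STT in a simple agent pipeline. It is an illustration under stated assumptions, not the VIVA SDK API: `VoicePipeline` and all four stage functions are hypothetical stand-ins, and the "noise" is just a marker byte so the example can run on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical agent pipeline; every stage is a plain callable."""
    preprocess: Callable[[bytes], bytes]  # audio layer that runs before STT
    transcribe: Callable[[bytes], str]    # STT
    respond: Callable[[str], str]         # LLM
    synthesize: Callable[[str], bytes]    # TTS

    def handle(self, audio: bytes) -> bytes:
        cleaned = self.preprocess(audio)  # STT only ever sees cleaned audio
        text = self.transcribe(cleaned)
        reply = self.respond(text)
        return self.synthesize(reply)


# Toy stand-ins so the sketch is runnable; none of these are real SDK calls.
def strip_noise(audio: bytes) -> bytes:
    return audio.replace(b"#", b"")  # pretend '#' bytes are background noise


def fake_stt(audio: bytes) -> str:
    return audio.decode("ascii")


def fake_llm(text: str) -> str:
    return f"you said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("ascii")


pipeline = VoicePipeline(strip_noise, fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.handle(b"he#llo")
```

Because the cleanup stage runs first, every later stage inherits its output, which is the sense in which a pre-STT layer affects the whole stack.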
What’s new in VIVA 2.0:
- Turn Prediction v3: A new multilingual model that predicts end-of-turn from audio alone, no transcription needed. It reacts quickly to real turn-ends while holding through mid-sentence pauses, giving low-latency responses without the agent cutting users off. Small enough to run on standard CPUs or locally, on-device for robotics and conversational toys.
- Interrupt Prediction v1: A first-of-its-kind audio-only classifier that predicts when a user intends to interrupt the agent (start-of-turn prediction). It distinguishes intent to take the floor from backchannel speech like “sure” or “mhm.” This differs from end-of-turn prediction, which detects when the user has finished speaking. Patent filed.
- Signal Detectors: A new class of real-time audio models that give voice AI the perceptual cues humans use without thinking. Three launch with VIVA 2.0:
  - TTS Detector: Detects synthetic speech in real time. Use case: an outbound voice AI agent calls a number and recognizes when an inbound voice AI agent or IVR picks up.
  - Accent Detector: Identifies the speaker’s accent so audio can be routed to the STT model best tuned for it, lifting transcription quality.
  - Gender Detector: Identifies speaker gender to enable personalized responses.
- Voice Isolation v3: The world’s most widely used voice isolation model has been upgraded to deliver measurable improvements in downstream WER.
All models run on standard server CPUs, operate on audio input alone with no transcription required, and are bundled into existing VIVA pricing at no additional cost.
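As a rough illustration of how an agent might consume these signals, the sketch below turns an end-of-turn score and an interruption score into a floor-control decision, and routes audio to an STT model by detected accent. Everything here is an assumption for the example: the function names, the score ranges, the thresholds, the accent labels, and the model names are hypothetical and do not come from the VIVA SDK.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                    # user may still be mid-sentence
    RESPOND = "respond"              # user finished; agent takes the turn
    KEEP_SPEAKING = "keep_speaking"  # backchannel or noise; agent continues
    YIELD_FLOOR = "yield_floor"      # real interruption; agent stops talking


def next_action(agent_speaking: bool,
                end_of_turn_prob: float,
                interrupt_prob: float,
                eot_threshold: float = 0.8,
                interrupt_threshold: float = 0.7) -> Action:
    """Pick a floor-control action from two hypothetical model scores in [0, 1]."""
    if agent_speaking:
        # Only the interruption score matters while the agent holds the floor.
        if interrupt_prob >= interrupt_threshold:
            return Action.YIELD_FLOOR
        return Action.KEEP_SPEAKING
    # While the user holds the floor, wait until the turn looks complete.
    if end_of_turn_prob >= eot_threshold:
        return Action.RESPOND
    return Action.WAIT


# Hypothetical accent labels mapped to hypothetical STT model names.
STT_BY_ACCENT = {"en-IN": "stt-en-in", "en-GB": "stt-en-gb"}


def pick_stt_model(accent: str, confidence: float,
                   min_confidence: float = 0.6,
                   default: str = "stt-general") -> str:
    """Route to an accent-tuned model only when the detection is confident."""
    if confidence < min_confidence:
        return default
    return STT_BY_ACCENT.get(accent, default)
```

Keeping the two scores separate mirrors the distinction drawn above: end-of-turn is consulted only while the user is speaking, and interruption only while the agent is. Falling back to a general model on low confidence or an unknown accent is one conservative design choice among several.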
Krisp has spent over eight years solving real-world voice in production, first for human-to-human conversations and now for human-to-AI. That experience gives VIVA a depth of training data and field-tested reliability that nothing else on the market can match.
The Krisp VIVA SDK processes more than 12 billion minutes of voice AI agent traffic a year and is embedded in over 130 voice AI products, including Daily, Vapi, LiveKit, Ultravox, Telnyx, the world’s leading AI labs, and the largest enterprise contact centers.
Platforms running VIVA report:
- 3.5x improvement in turn-taking accuracy
- 50% fewer dropped calls
- 30% higher customer satisfaction
“At scale, the biggest challenge in voice AI isn’t the model. It’s the quality of the signal going into it,” said David Casem, CEO of Telnyx. “Krisp addresses that at the source, which improves everything downstream from transcription to response.”
“Voice is becoming the primary interface between humans and AI,” said Robert Schoenfield, EVP of Licensing and Partnerships at Krisp. “These conversations don’t happen in clean environments. They happen in the real world, shaped by noise and subtle human cues. VIVA brings that layer into the system, so voice agents can operate the way people actually speak.”
