Work / VoiceDesk
Case study 02 · Voice AI
VoiceDesk
Live in production · Founder & full-stack AI engineer · voicedesk.app ↗
A multi-tenant SaaS that turns a company's documents into a live, always-on AI voice assistant. It answers customer calls - by phone or web - 24/7, responding in 1–3 seconds from the company's actual knowledge base.
"Never miss a customer call."
1–3s
response latency
<2s p95
KB load at call start
100+
concurrent calls / sec
24/7
uptime, zero hold time
The problem
Businesses lose customers to voicemail. Existing voice bots are slow and dumb. Chatbots only work on a website. The AI is the easy part of fixing this - the hard part is real-time voice.
Most "AI projects" in 2025 are a chat wrapper around a hosted LLM endpoint. VoiceDesk isn't that. It's a real-time, full-duplex voice pipeline running between a customer's browser (or phone) and a real-time voice server, with NAT traversal that actually works on mobile networks, an SDP proxy that injects per-tenant configuration into the handshake, and a pre-loaded-context architecture that beats RAG on latency.
Architecture
Phone via a telephony provider. Browser via WebRTC. The application backend stitches it together.
Engineering substance
The hard problem wasn't the AI. The hard problem was real-time voice.
Full-duplex WebRTC, browser & phone
Audio in both directions between the customer's browser/phone and a real-time voice server. The two transports look very different on the wire - phone via SIP through a telephony provider, browser via raw WebRTC - and converge on the same intelligence loop.
NAT traversal via a dedicated TURN service
TURN/STUN credentials served on-demand so calls actually work on mobile networks and behind corporate firewalls. The thing most "WebRTC demos" gloss over.
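A minimal sketch of what "credentials served on-demand" looks like on the client: the backend hands back short-lived TURN credentials and the browser folds them into its peer-connection config. The field names, hostnames, and credential shape here are illustrative assumptions, not VoiceDesk's actual API.

```typescript
// Short-lived TURN credentials as returned by a (hypothetical) backend endpoint.
interface TurnCredentials {
  username: string;    // time-limited, e.g. "1718000000:tenant42"
  credential: string;  // HMAC over the username and a shared secret
  urls: string[];      // e.g. ["turn:turn.example.com:3478?transport=udp"]
}

interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface RtcConfig {
  iceServers: IceServer[];
}

// Build the config passed to `new RTCPeerConnection(...)`: a cheap STUN
// entry for direct paths, plus the TURN relay fallback that makes calls
// survive carrier-grade NAT and corporate firewalls.
function buildRtcConfig(creds: TurnCredentials): RtcConfig {
  return {
    iceServers: [
      { urls: "stun:stun.example.com:3478" },
      { urls: creds.urls, username: creds.username, credential: creds.credential },
    ],
  };
}
```

The relay entry is the part the demos skip: without it, ICE has no server-reflexive fallback and calls on mobile networks simply never connect.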
Custom SDP offer/answer proxy
In the application backend: injects per-tenant configuration - system prompt, voice, persona, greeting, temperature - into the handshake. The browser never sees the AI configuration, so it can't be tampered with.
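The proxy idea can be sketched as a single server-side step: the browser posts only its SDP offer, and the backend attaches the tenant's AI settings before forwarding the session request upstream. The payload field names below are assumptions for illustration; the point is that the configuration never transits the client.

```typescript
// Per-tenant AI settings, looked up server-side by tenant ID.
interface TenantConfig {
  systemPrompt: string;
  voice: string;
  greeting: string;
  temperature: number;
}

// Merge the client's bare SDP offer with server-held tenant config into the
// upstream session request. The client sends only `sdpOffer`, so the prompt,
// persona, and temperature cannot be read or tampered with in the browser.
function buildUpstreamSession(sdpOffer: string, tenant: TenantConfig) {
  return {
    sdp: sdpOffer,
    instructions: tenant.systemPrompt,
    voice: tenant.voice,
    greeting: tenant.greeting,
    temperature: tenant.temperature,
  };
}
```

The backend then returns only the SDP answer to the browser; everything else stays server-side.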
No-trickle-ICE handling
The real-time voice backend doesn't support trickle ICE. The client gathers all candidates with a 3-second timeout and a "relay-ready" shortcut before sending the offer - without this, calls would either be slow to connect or fail outright on certain network topologies.
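The gather-then-send step can be sketched as follows: resolve as soon as gathering completes, as soon as a relay candidate appears (the relay-ready shortcut), or when the 3-second cap expires. The minimal `CandidateSource` interface stands in for `RTCPeerConnection`'s `onicecandidate` so the logic is testable outside a browser; it is a simplification, not the production code.

```typescript
// Abstraction over RTCPeerConnection candidate events:
// the callback receives each candidate string, then null when gathering ends.
interface CandidateSource {
  onCandidate(cb: (candidate: string | null) => void): void;
}

// Collect ICE candidates before sending the offer (no trickle ICE upstream).
// Resolves on: gathering complete, a relay candidate (good enough to connect),
// or the hard timeout - whichever comes first.
function waitForCandidates(pc: CandidateSource, timeoutMs = 3000): Promise<string[]> {
  return new Promise((resolve) => {
    const gathered: string[] = [];
    let done = false;
    const finish = () => {
      if (done) return;
      done = true;
      clearTimeout(timer);
      resolve(gathered);
    };
    const timer = setTimeout(finish, timeoutMs); // hard cap: send what we have
    pc.onCandidate((c) => {
      if (c === null) return finish();        // gathering complete
      gathered.push(c);
      if (c.includes("typ relay")) finish();  // relay-ready shortcut
    });
  });
}
```

Without the timeout, a network that never fires the end-of-candidates event stalls the offer forever; without the relay shortcut, well-connected clients wait the full window for nothing.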
Pre-loaded knowledge - not mid-call RAG
Instead of doing retrieval on every turn, the entire company knowledge base is loaded into the AI's context at call start (≈2s, p95). Result: 1–3s response latency for the whole call, vs. 5–8s for typical retrieval-per-turn approaches. Cache hit rate >80% on knowledge loads.
RTVI data channel with structured-tag protocol
LLM tokens stream over a data channel alongside the audio. The system prompt asks the LLM to emit hidden <filter_update> or <form_update> JSON tags before its spoken response. The frontend parses them out of the token stream and uses them to drive UI state - filtering a gallery, auto-filling a form - while the AI is talking.
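A sketch of the frontend side of that protocol, assuming a wire format of <filter_update>{…json…}</filter_update> blocks preceding the spoken text (the exact format is an assumption): run the extractor over the accumulated token buffer on each chunk, apply any complete tags, and display only the cleaned speech.

```typescript
interface TagUpdate {
  tag: string;       // "filter_update" | "form_update"
  payload: unknown;  // parsed JSON body
}

// Strip complete structured tags out of the accumulated LLM text and
// return the structured updates alongside the clean, speakable remainder.
function extractTagUpdates(text: string): { speech: string; updates: TagUpdate[] } {
  const updates: TagUpdate[] = [];
  const speech = text
    .replace(/<(filter_update|form_update)>([\s\S]*?)<\/\1>/g, (_m, tag: string, body: string) => {
      try {
        updates.push({ tag, payload: JSON.parse(body) });
      } catch {
        // Partial JSON mid-stream: leave it for a later pass over more tokens.
      }
      return "";
    })
    .trim();
  return { speech, updates };
}
```

Because the tags arrive in the same ordered stream as the speech tokens, the UI update (gallery filter, form field) lands at the moment the AI starts saying the corresponding sentence.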
Multi-tenant call analytics
Every call is recorded, transcribed, confidence-scored, and aggregated into daily dashboards. The voice library is switchable per tenant - self-hosted TTS for standard voices, premium TTS for the high end.
Embeddable widgets
Button, banner, floating styles - all WebRTC-based, drop into any site with one script tag.
Latency budget
Where the seconds go.
~500ms
STT
~1000ms
LLM
~500ms
TTS
The wow moments
The live demo wizard
Split-screen voice call where the AI interviews a prospect and the form populates in real time as they speak. The single best "this is real AI engineering" moment on the live site.
Voice-driven on-page filtering
Speak into the mic and the demo grid filters itself based on what you say. Powered by the <filter_update> tag protocol embedded in the LLM's response stream.
Industry demo cards
Pre-built voice demos for healthcare, legal, property, e-commerce, hospitality, financial services, hair & beauty, automotive, government, education, IT/SaaS - each a real working call you can place.
"WebRTC for the browser. Telephony for the phone. A speech-aware LLM for the brain. Premium synthetic voice for the output. The application backend stitches it all together."