Daz Williams


Case study 02 · Voice AI

VoiceDesk

Live in production · Founder & full-stack AI engineer · voicedesk.app ↗

A multi-tenant SaaS that turns a company's documents into a live, always-on AI voice assistant. It answers customer calls - by phone or web - 24/7, with 1–3 second responses drawn from the company's actual knowledge base.

"Never miss a customer call."

1–3s · response latency
<2s p95 · KB load at call start
100+ · concurrent calls
24/7 · uptime, zero hold time

The problem

Businesses lose customers to voicemail. Existing voice bots are slow and dumb. Chatbots only work on a website. The AI is the easy part of fixing this - the hard part is real-time voice.

Most "AI projects" in 2025 are a chat wrapper around a hosted LLM endpoint. VoiceDesk isn't that. It's a real-time, full-duplex voice pipeline running between a customer's browser (or phone) and a real-time voice server, with NAT traversal that actually works on mobile networks, an SDP proxy that injects per-tenant configuration into the handshake, and a pre-loaded-context architecture that beats RAG on latency.

Architecture

Phone via a telephony provider. Browser via WebRTC. The application backend stitches it together.

              Customer
                 │ voice
                 ▼
┌─────────────────┐           ┌──────────────────────────┐
│      Phone      │──────────▶│                          │
└─────────────────┘    SIP    │  Real-time Voice Server  │
                              │  ──────────────────────  │
┌─────────────────┐           │  • Speech-aware LLM      │
│ Browser (WebRTC)│◀─────────▶│    (STT + reasoning)     │◀──┐
└────────┬────────┘    RTC    │  • TTS (self + premium)  │   │
         │ SDP                └──────────────────────────┘   │
         │ offer                                             │
         ▼                                                   │
┌──────────────────────┐                                     │
│ Application Backend  │   per-tenant system prompt,         │
│  ─────────────────   │   voice, persona, greeting,         │
│  • SDP proxy         │───temperature, knowledge base───────┘
│  • TURN/ICE service  │
│  • Call analytics    │       ┌──────────────────┐
│  • Multi-tenant API  │──────▶│ Postgres + Redis │
└──────────────────────┘       └──────────────────┘

Engineering substance

The hard problem wasn't the AI. The hard problem was real-time voice.

  1. Full-duplex WebRTC, browser & phone

    Audio in both directions between the customer's browser/phone and a real-time voice server. The two transports look very different on the wire - phone via SIP through a telephony provider, browser via raw WebRTC - and converge on the same intelligence loop.

  2. NAT traversal via a dedicated TURN service

    TURN/STUN credentials served on-demand so calls actually work on mobile networks and behind corporate firewalls. The thing most "WebRTC demos" gloss over.
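A minimal sketch of minting those on-demand credentials, assuming the widely used TURN REST API convention (as implemented by e.g. coturn): the username carries an expiry timestamp, and the password is an HMAC over it with a secret shared only between the backend and the TURN server. Names and the TURN URL are illustrative, not the production values.

```python
import base64
import hashlib
import hmac
import time

def mint_turn_credentials(tenant_id: str, shared_secret: str, ttl: int = 3600) -> dict:
    """Ephemeral TURN credentials: self-expiring, no per-user state on the TURN server."""
    expiry = int(time.time()) + ttl
    username = f"{expiry}:{tenant_id}"  # the TURN server parses the expiry back out
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {
        "username": username,
        "credential": base64.b64encode(digest).decode(),
        "ttl": ttl,
        "urls": ["turn:turn.example.com:3478?transport=udp"],
    }
```

Because the credential is derived rather than stored, the backend can hand out a fresh pair at every call start without touching a database.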

  3. Custom SDP offer/answer proxy

    In the application backend: injects per-tenant configuration - system prompt, voice, persona, greeting, temperature - into the handshake. The browser never sees the AI configuration, so it can't be tampered with.
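The proxy step can be sketched like this - field names, the tenant record, and the voice-server payload shape are assumptions, not the production schema. The browser posts a bare SDP offer; the backend resolves the tenant server-side, attaches the AI configuration on the way to the voice server, and strips everything but the SDP answer on the way back.

```python
# Hypothetical per-tenant configuration, resolved server-side only.
TENANTS = {
    "acme": {
        "system_prompt": "You are Acme's receptionist...",
        "voice": "warm-uk-female",
        "persona": "receptionist",
        "greeting": "Thanks for calling Acme!",
        "temperature": 0.4,
    }
}

def build_voice_server_request(tenant_id: str, browser_sdp_offer: str) -> dict:
    """Forwarded to the voice server: browser's offer plus injected tenant config."""
    config = TENANTS[tenant_id]  # looked up server-side; never sent to the browser
    return {"sdp": browser_sdp_offer, **config}

def answer_for_browser(voice_server_response: dict) -> dict:
    """Only the SDP answer goes back to the client; config stays server-side."""
    return {"sdp": voice_server_response["sdp"]}
```

The tamper-resistance falls out of the data flow: the client only ever sees SDP, so there is nothing AI-related for it to modify.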

  4. No-trickle-ICE handling

    The real-time voice backend doesn't support trickle ICE. The client gathers all candidates with a 3-second timeout and a "relay-ready" shortcut before sending the offer - without this, calls would either be slow to connect or fail outright on certain network topologies.
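The production client is browser JavaScript, but the gathering policy itself is small enough to sketch in Python with asyncio: keep collecting candidates until gathering completes, a relay candidate arrives (the "relay-ready" shortcut - relay works on any network), or the 3-second timeout fires, then send the offer with whatever was gathered.

```python
import asyncio

async def gather_candidates(candidate_stream, timeout: float = 3.0) -> list:
    """Collect ICE candidates from an async stream; None signals gathering complete."""
    candidates = []

    async def consume():
        async for cand in candidate_stream:
            if cand is None:              # gathering complete: nothing more to wait for
                return
            candidates.append(cand)
            if cand["type"] == "relay":   # relay-ready shortcut: good enough everywhere
                return

    try:
        await asyncio.wait_for(consume(), timeout)
    except asyncio.TimeoutError:
        pass                              # send what we have rather than hang the call
    return candidates
```

The timeout is the key trade-off: without it, a network that never finishes gathering would stall the offer indefinitely; with it, the worst case is a 3-second connect delay.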

  5. Pre-loaded knowledge - not mid-call RAG

    Instead of doing retrieval on every turn, the entire company knowledge base is loaded into the AI's context at call-start (≈2s, p95). Result: 1–3s response latency for the whole call, vs. 5–8s for typical retrieval-per-turn approaches. Cache hit rate >80% on knowledge loads.
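The pattern, sketched with stand-ins for the real document store and Redis: fetch the whole tenant knowledge base once at call start, cache it, and put it straight into the system context, so every subsequent turn pays zero retrieval latency.

```python
import time

_CACHE: dict[str, tuple[float, str]] = {}  # stand-in for Redis
CACHE_TTL = 300  # seconds

def load_knowledge(tenant_id: str, fetch_documents) -> str:
    """Return the tenant's full KB as one string, served from cache when fresh."""
    now = time.monotonic()
    hit = _CACHE.get(tenant_id)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]                      # cache hit: no document fetch at all
    knowledge = "\n\n".join(fetch_documents(tenant_id))
    _CACHE[tenant_id] = (now, knowledge)
    return knowledge

def start_call(tenant_id: str, fetch_documents) -> list[dict]:
    # The entire KB goes into the system context once; per-turn RAG is skipped.
    return [{"role": "system", "content": load_knowledge(tenant_id, fetch_documents)}]
```

The trade-off is context size for latency: this only works when the KB fits in the model's context window, which is exactly the regime a per-company assistant lives in.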

  6. RTVI data channel with structured-tag protocol

    LLM tokens stream over a data channel alongside the audio. The system prompt asks the LLM to emit hidden <filter_update> or <form_update> JSON before its spoken response. The frontend parses them out of the token stream and uses them to drive UI state - filtering a gallery, auto-filling a form - while the AI is talking.
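A sketch of the tag extraction - the real parser runs incrementally in the browser over the data channel, and the exact tag grammar is an assumption here; this version works on the accumulated token text and splits the hidden UI directives from what gets voiced.

```python
import json
import re

# Hidden directive blocks the system prompt asks the LLM to emit before speech.
TAG_RE = re.compile(r"<(filter_update|form_update)>\s*(\{.*?\})\s*</\1>", re.DOTALL)

def split_response(text: str) -> tuple[list[dict], str]:
    """Return (UI events parsed from hidden tags, spoken text with tags removed)."""
    events = [
        {"tag": m.group(1), "payload": json.loads(m.group(2))}
        for m in TAG_RE.finditer(text)
    ]
    spoken = TAG_RE.sub("", text).strip()  # what actually gets voiced/displayed
    return events, spoken
```

The same channel therefore carries two protocols at once: structured JSON for the page, natural language for the caller.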

  7. Multi-tenant call analytics

    Every call recorded, transcribed, scored for confidence, aggregated into daily dashboards. Voice library switchable per-tenant - self-hosted TTS for standard voices, premium TTS for the high-end.
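The daily roll-up step might look like this - the record fields (timestamp, duration, confidence score) are illustrative, not the production schema: group call records by day, then emit the per-day counts the dashboard renders.

```python
from collections import defaultdict
from datetime import datetime

def daily_dashboard(calls: list[dict]) -> dict[str, dict]:
    """Aggregate per-call records into per-day dashboard rows."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for call in calls:
        day = datetime.fromisoformat(call["started_at"]).date().isoformat()
        buckets[day].append(call)
    return {
        day: {
            "calls": len(rows),
            "minutes": round(sum(r["duration_s"] for r in rows) / 60, 1),
            "avg_confidence": round(sum(r["confidence"] for r in rows) / len(rows), 2),
        }
        for day, rows in buckets.items()
    }
```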

  8. Embeddable widgets

    Button, banner, floating styles - all WebRTC-based, drop into any site with one script tag.

Latency budget

Where the seconds go.

~500ms · STT
~1000ms · LLM
~500ms · TTS
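Summed, those per-turn figures come to roughly two seconds, which is where the headline 1–3s response number sits:

```python
# Per-turn budget from the figures above, in milliseconds.
BUDGET_MS = {"STT": 500, "LLM": 1000, "TTS": 500}
total_ms = sum(BUDGET_MS.values())  # ~2000 ms per turn, inside the 1–3 s envelope
```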

Stack

Backend · Python 3.12
Database · PostgreSQL 17
Cache / sessions · Redis 7
Real-time voice · WebRTC + custom voice server
STT + LLM · Speech-aware LLM
TTS · Self-hosted + premium
Telephony · Telephony provider (numbers + TURN)
Frontend · Vanilla JS + utility-first CSS + Web Audio API
Auth · OAuth (social login)
Deploy · Containerised · production WSGI · nginx · tunnel ingress
Errors · Error tracking + alerting
Automation · Headless-browser capture (wizard)

The wow moments

The live demo wizard

Split-screen voice call where the AI interviews a prospect and the form populates in real time as they speak. The single best "this is real AI engineering" moment on the live site.

Voice-driven on-page filtering

Speak into the mic and the demo grid filters itself based on what you say. Powered by the <filter_update> tag protocol embedded in the LLM's response stream.

Industry demo cards

Pre-built voice demos for healthcare, legal, property, e-commerce, hospitality, financial services, hair & beauty, automotive, government, education, IT/SaaS - each a real working call you can place.

"WebRTC for the browser. Telephony for the phone. A speech-aware LLM for the brain. Premium synthetic voice for the output. The application backend stitches it all together."