
Voice Processing Architecture

This guide explains how voice interactions work in the ART Voice Agent Accelerator.


Three Patterns on the Managed Services Spectrum

There are three distinct patterns for building voice AI applications. The spectrum ranges from full control (you manage each component) to fully managed (the platform handles audio end-to-end):

Maximum Control

You independently configure and tune each layer of the voice pipeline with full flexibility over models, regions, and deployment options.

| Component | Options |
|---|---|
| Speech-to-Text | Azure Speech, Whisper (via Azure OpenAI), or custom fine-tuned STT models |
| Language Model | Foundry Models, swap models without code changes |
| Text-to-Speech | Azure Neural Voices, Custom Neural Voice, or fine-tuned TTS models |
| Telephony | Azure Communication Services |
| Deployment | Cloud, hybrid, or on-premises via Speech containers |

Best for:

  • Custom voice personas — Train Custom Neural Voice with your brand voice
  • Domain-specific recognition — Fine-tune STT with industry vocabulary
  • Data residency — Deploy Speech containers on-premises for air-gapped environments
  • Model flexibility — Use Whisper for multilingual STT, GPT-4o for reasoning
  • Per-component debugging — Isolate issues in STT, LLM, or TTS independently

Managed Pipeline with Full Customization

Azure-managed voice pipeline with extensive model and voice options. See VoiceLive language support for current availability.

| Component | Options |
|---|---|
| Speech-to-Text | Azure Speech: multilingual model (15 locales), single language, or up to 10 defined languages; phrase lists; Custom Speech models |
| Language Model | Foundry Models, or bring your own model |
| Text-to-Speech | Azure Neural Voices, HD voices with temperature control, Custom Neural Voice, custom lexicon |
| Telephony | Azure Communication Services |
| Audio Features | Noise suppression, echo cancellation, advanced end-of-turn detection, interruption handling |

Managed ≠ Locked-In

VoiceLive is Azure-managed, but you retain full control over STT and TTS customization. You can configure phrase lists, custom speech models, custom lexicons, and custom neural voices—the pipeline optimization is managed, not the models themselves.

Best for:

  • Managed infrastructure — Azure handles the pipeline optimization
  • Domain-specific accuracy — Phrase lists + Custom Speech for industry vocabulary
  • Brand voice — Custom Neural Voice or HD voices with temperature control
  • Production audio quality — Built-in noise suppression and echo cancellation
  • Custom LLM — Bring your own model for specialized reasoning or fine-tuned models

Fully Managed

Everything handled by a single API—you only define tools.

| Component | Service |
|---|---|
| Audio + STT + LLM + TTS | OpenAI Realtime API |
| Telephony | Optional (ACS or browser-only) |

Best for: Rapid prototyping, browser-based demos, scenarios where you don't need STT/TTS customization.
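In this pattern the only artifact you author is the tool list. As a minimal sketch, the payload below follows the Realtime-style `session.update` shape (tools declared as functions with JSON Schema parameters); the `check_balance` tool and its fields are hypothetical examples, so verify the exact schema against the OpenAI Realtime API reference.

```python
import json

def build_session_update(tools: list[dict]) -> str:
    """Build a Realtime-style session.update payload that only declares tools.

    The platform handles audio, STT, LLM, and TTS end-to-end; you author
    instructions and tools, nothing else.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": "You are a concise voice assistant.",
            "tool_choice": "auto",
            "tools": tools,
        },
    })

# Hypothetical tool definition -- the only piece you write in this pattern.
check_balance = {
    "type": "function",
    "name": "check_balance",
    "description": "Look up the caller's account balance.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

payload = build_session_update([check_balance])
```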


Comparison

| Capability | Cascade | VoiceLive | Direct Realtime |
|---|---|---|---|
| Control Level | Full | Managed | Minimal |
| STT Options | Azure Speech, Whisper, Custom Speech | Azure Speech (multilingual: 15 locales, or up to 10 defined), Custom Speech | OpenAI built-in only |
| TTS Options | Azure Neural Voices, Custom Neural Voice | Azure Neural Voices, HD voices, Custom Neural Voice | OpenAI voices only |
| LLM Options | Any Azure OpenAI model | Foundry Models or bring your own | GPT-4o Realtime only |
| Phrase Lists | ✅ Supported | ✅ Supported | Not supported |
| Custom Lexicon | ✅ Supported | ✅ Supported | Not supported |
| Audio Quality | Manual configuration | Noise suppression, echo cancellation | Basic |
| Component Swapping | ✅ Full control | STT/TTS configurable, pipeline managed | Managed pipeline |
| On-Premises Deployment | ✅ Speech containers + local SLMs/LLMs | Cloud only | Cloud only |
| Latency | Varies by configuration | Optimized pipeline | Optimized pipeline |
| Debugging | Per-component isolation | End-to-end tracing | End-to-end tracing |
| ACS Telephony | ✅ Supported | ✅ Supported | Optional |
| In This Accelerator | ✅ Implemented | ✅ Implemented | Not implemented |

Latency depends on many factors

Actual latency varies based on network conditions, model deployment region, utterance length, and configuration. Run your own benchmarks for production planning.

Why not Direct Realtime?

Direct Realtime can integrate with ACS for telephony, but offers no fine-tuning capabilities—no custom voices, no phrase lists, no per-component observability. This accelerator focuses on Cascade and VoiceLive because they provide the enterprise controls needed for production deployments.


What This Accelerator Showcases

Both implemented patterns use Azure Communication Services (ACS) for telephony:

  • Phone numbers & PSTN — Real phone calls, not just browser demos
  • Media streaming — Secure WebSocket bridge between calls and your backend
  • Shared infrastructure — Same agent registry, tools, handoffs, and session management across both modes

Shared Components

Both Cascade and VoiceLive share the same core infrastructure, making it easy to switch between them or run both in parallel:

```mermaid
flowchart TB
    subgraph Shared[Shared Infrastructure]
        Agents[Agent Registry]
        Tools[Tool Registry]
        Scenarios[Scenario Store]
        Handoff[HandoffService]
        Greeting[GreetingService]
        Session[Session Management]
    end
    subgraph Cascade[Cascade Mode]
        CH[Cascade Handler]
        CO[CascadeOrchestrator]
    end
    subgraph VoiceLive[VoiceLive Mode]
        VH[VoiceLive Handler]
        VO[LiveOrchestrator]
    end
    CH --> Shared
    CO --> Shared
    VH --> Shared
    VO --> Shared
    style Shared fill:#E6FFE6,stroke:#107C10
    style Cascade fill:#E6F3FF,stroke:#0078D4
    style VoiceLive fill:#FFF4CE,stroke:#FFB900
```
| Shared Component | Purpose |
|---|---|
| Agent Registry | YAML-defined agents with prompts, tools, voice settings |
| Tool Registry | Business logic tools (auth, fraud, account lookup) |
| Scenario Store | Industry scenarios with handoff graphs |
| HandoffService | Unified agent switching logic |
| GreetingService | Template-based greeting resolution |
| Session Management | Redis-backed call state |

This shared architecture means your agents, tools, and business logic work identically in both modes—only the voice pipeline differs.


Enterprise Considerations

Both Cascade and VoiceLive are well-suited for enterprise deployment because they offer granular control over the voice pipeline and support custom business logic:

Cascade: Maximum Control

Choose Cascade when you need control over:

| Capability | Options |
|---|---|
| STT models | Azure Speech, Whisper (via Azure OpenAI), or custom fine-tuned STT |
| TTS models | Azure Neural Voices, Custom Neural Voice, or fine-tuned voices |
| LLM selection | GPT-4, GPT-4o, GPT-4o-mini, or other Azure OpenAI models |
| Regional deployment | Deploy each service in specific Azure regions for data residency |
| On-premises | Speech containers for air-gapped or hybrid environments |

Custom model examples:

```yaml
# Use Whisper for multilingual STT
stt:
  type: whisper
  deployment_id: whisper-large

# Use Custom Neural Voice for brand persona
tts:
  type: custom
  endpoint: https://eastus.customvoice.api.speech.microsoft.com
  voice_id: contoso-brand-voice
```

📖 Learn more: Speech containers, Custom Speech fine-tuning, Custom Neural Voice

VoiceLive: Azure-Managed with Full Customization

VoiceLive provides STT/TTS customization within a managed pipeline. Based on Microsoft's VoiceLive language support:

Speech Input (STT) Language Options:

| Option | Description |
|---|---|
| Automatic Multilingual | Default model supporting 15 locales: zh-CN, en-AU, en-CA, en-IN, en-GB, en-US, fr-CA, fr-FR, de-DE, hi-IN, it-IT, ja-JP, ko-KR, es-MX, es-ES |
| Single Language | Configure one specific language for optimal accuracy |
| Multiple Languages | Up to 10 defined languages for broader coverage |
| Phrase List | Just-in-time vocabulary hints (product names, acronyms) |
| Custom Speech | Fine-tuned STT models trained on your domain data |

Speech Output (TTS) Customization:

| Option | Description |
|---|---|
| Azure Neural Voices | All supported voices via azure-standard type (monolingual, multilingual, HD) |
| Custom Lexicon | Pronunciation customization for terms |
| Custom Neural Voice | Brand voice trained on your audio via azure-custom type |
| Custom Avatar | Photorealistic video avatar (optional) |

Example configuration with custom models:

```json
// STT with phrase list and custom speech
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "phrase_list": ["Neo QLED TV", "AutoQuote Explorer"],
      "custom_speech": {
        "zh-CN": "847cb03d-7f22-4b11-xxx"  // Custom model ID
      }
    }
  }
}
```

```json
// TTS with Azure neural voice and custom lexicon
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8,
    "custom_lexicon_url": "<lexicon-url>"
  }
}
```

```json
// TTS with Custom Neural Voice
{
  "voice": {
    "name": "en-US-CustomNeural",
    "type": "azure-custom",
    "endpoint_id": "your-endpoint-id"
  }
}
```

📖 Learn more: VoiceLive customization, Custom Speech, Custom Neural Voice

When to use custom STT/TTS models

Both Cascade and VoiceLive support the same custom model options. Consider fine-tuning when you need:

  • Domain-specific vocabulary — Medical terminology, legal jargon, financial products
  • Locale-specific accents — Regional dialects, non-native speakers, industry slang
  • Brand voice consistency — Custom Neural Voice trained on your brand persona
  • Improved recognition accuracy — Proper nouns, product names, acronyms unique to your business

Choose based on:

  • Cascade — You need to swap LLM models, run on-premises, or debug each component independently
  • VoiceLive — You want a latency-optimized pipeline and managed infrastructure (same STT/TTS customization, simpler ops)

Bottom line: Both Cascade and VoiceLive support extensive STT/TTS customization (Custom Speech, Custom Neural Voice, phrase lists, custom lexicon). VoiceLive adds HD voices and built-in audio quality features. Choose Cascade when you need on-premises deployment, component swapping, or per-component debugging. Choose VoiceLive for simpler operations and production audio quality features.


Voice Call Lifecycle

Both modes share the same high-level call lifecycle:

```mermaid
sequenceDiagram
    actor Caller
    participant ACS
    participant Handler as Voice Handler
    participant Agent
    Caller->>ACS: Dial
    ACS->>Handler: CallConnected
    Handler->>Agent: Load Agent
    Agent-->>Handler: Greeting
    Handler-->>ACS: TTS Audio
    ACS-->>Caller: Hello!
    rect rgb(230, 243, 255)
        Note over Caller,Agent: Conversation Turn
        Caller->>ACS: Check my balance
        ACS->>Handler: Audio Stream
        Handler->>Handler: STT
        Handler->>Agent: Transcript
        Agent-->>Handler: Response
        Handler-->>ACS: TTS Audio
        ACS-->>Caller: Your balance is...
    end
    Caller->>ACS: Hangup
    ACS->>Handler: CallDisconnected
```

Mode Comparison

| Capability | Cascade | VoiceLive | Direct Realtime |
|---|---|---|---|
| Telephony (PSTN) | ✅ ACS | ✅ ACS | Optional (ACS) |
| STT Provider | Azure Speech / Whisper | Azure Speech (multilingual: 15 locales, or up to 10 defined) | OpenAI Realtime |
| Custom Speech (STT) | ✅ Supported | ✅ Supported | ❌ Not supported |
| LLM Provider | Azure OpenAI (any model) | Foundry Models or bring your own | GPT-4o Realtime |
| TTS Provider | Azure Neural Voices | Azure Neural Voices + HD + Custom | OpenAI voices |
| Voice Selection | Azure Neural Voices | Azure Neural Voices | OpenAI voices |
| HD Voices | ❌ Not supported | ✅ With temperature control | ❌ Not supported |
| Custom Neural Voice | ✅ Supported | ✅ Supported | ❌ Not supported |
| Phrase Lists | ✅ Supported | ✅ Supported | ❌ Not supported |
| Custom Lexicon | ✅ Supported | ✅ Supported | ❌ Not supported |
| Audio Quality | Manual configuration | ✅ Noise suppression, echo cancellation | Basic |
| Barge-in | Client-side VAD | Advanced end-of-turn detection | Server-side VAD |
| Latency | Varies by configuration | Optimized pipeline | Optimized pipeline |
| Debugging | Per-component isolation | End-to-end tracing | End-to-end tracing |
| In This Accelerator | ✅ Implemented | ✅ Implemented | ❌ Not implemented |

Cascade Architecture

Cascade orchestrates three Azure services with a three-thread design for low latency:

```mermaid
flowchart TB
    subgraph T1[Thread 1: Speech SDK]
        direction TB
        A[/Audio In/] --> B[Continuous Recognition]
        B --> C{Partial?}
        C -->|Yes| D[Barge-in Signal]
        C -->|No| E[Final Transcript]
    end
    subgraph Q[Speech Queue]
        F[(Events)]
    end
    subgraph T2[Thread 2: Route Turn]
        direction TB
        G[Process Turn] --> H[LLM Call]
        H --> I[TTS Response]
    end
    subgraph T3[Thread 3: Main Loop]
        J[Task Cancellation]
        K[WebSocket]
    end
    D -.->|interrupt| J
    E --> F
    F --> G
    style T1 fill:#E6F3FF,stroke:#0078D4
    style T2 fill:#E6FFE6,stroke:#107C10
    style T3 fill:#FFF4CE,stroke:#FFB900
    style Q fill:#F3F2F1,stroke:#605E5C
```

How it works:

  1. Thread 1 — Azure Speech SDK streams audio, emits partials (for barge-in) and finals
  2. Thread 2 — Processes complete utterances through Azure OpenAI, streams TTS per sentence
  3. Thread 3 — Handles WebSocket lifecycle and task cancellation
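The thread hand-offs above can be sketched with a queue of final transcripts and a cancellation event for barge-in. The names here (`route_one_turn`, the callbacks) are illustrative, not the accelerator's actual classes:

```python
import queue
import threading

speech_events: "queue.Queue[str]" = queue.Queue()   # Thread 1 -> Thread 2
current_turn_cancelled = threading.Event()          # barge-in signal

def on_partial(text: str) -> None:
    """Thread 1: a partial result means the caller is speaking -- barge in."""
    current_turn_cancelled.set()

def on_final(text: str) -> None:
    """Thread 1: a final transcript becomes a complete turn to process."""
    speech_events.put(text)

def route_one_turn(llm, tts) -> None:
    """Thread 2 body: take one utterance, call the LLM, stream TTS per
    sentence, and stop speaking as soon as a barge-in arrives.
    In the real handler this runs in a loop until the call ends."""
    utterance = speech_events.get()
    current_turn_cancelled.clear()
    for sentence in llm(utterance):             # stream sentence by sentence
        if current_turn_cancelled.is_set():     # caller interrupted: stop TTS
            break
        tts(sentence)
```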

Key files: voice/speech_cascade/handler.py, voice/speech_cascade/orchestrator.py


VoiceLive Architecture

VoiceLive uses the Azure VoiceLive SDK — a single WebSocket connection to Azure that handles audio in, speech recognition, LLM, and audio out:

```mermaid
flowchart LR
    subgraph Client[ACS / Browser]
        Audio[Audio Stream]
    end
    subgraph Handler[VoiceLive Handler]
        WS[WebSocket Bridge]
        Events[Event Router]
    end
    subgraph VoiceLiveAPI[Azure VoiceLive]
        VAD[Server VAD]
        STT[Transcription]
        LLM[Azure OpenAI]
        TTS[Voice Output]
    end
    subgraph Orchestrator[LiveOrchestrator]
        Tools[Tool Execution]
        Handoff[Agent Switching]
    end
    Audio <--> WS
    WS <--> VoiceLiveAPI
    VoiceLiveAPI --> Events
    Events --> Tools
    Tools --> Handoff
    style Client fill:#F3F2F1,stroke:#605E5C
    style Handler fill:#E6F3FF,stroke:#0078D4
    style VoiceLiveAPI fill:#FFF4CE,stroke:#FFB900
    style Orchestrator fill:#E6FFE6,stroke:#107C10
```

How it works:

  1. Audio streams to Azure VoiceLive over WebSocket
  2. Server-side VAD detects speech start/end automatically
  3. Transcription, LLM response, and TTS happen in one round-trip
  4. Handler routes events: tool calls → execute locally, audio deltas → stream to caller

Key files: voice/voicelive/handler.py, voice/voicelive/orchestrator.py


Handoff Flow

Both modes use the same HandoffService for agent switching:

```mermaid
sequenceDiagram
    participant C as Concierge
    participant HS as HandoffService
    participant F as FraudAgent
    C->>HS: handoff_fraud()
    rect rgb(230, 243, 255)
        HS->>HS: resolve_handoff()
        Note over HS: Find target agent,<br/>Build system_vars,<br/>Get greeting config
    end
    HS-->>C: HandoffResolution
    C->>F: switch_to()
    F->>HS: select_greeting()
    HS-->>F: I'm the Fraud specialist...
```

Handoff Types:

  • Announced — Target agent plays a greeting (default)
  • Discrete — Silent handoff, no greeting
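The announced/discrete distinction amounts to a flag on the resolution a handoff service returns. A minimal sketch, assuming a simplified `HandoffResolution` (field names and greeting text are illustrative, not the accelerator's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HandoffResolution:
    """Result of resolving a handoff to another agent."""
    target_agent: str
    announced: bool                 # True: target plays a greeting; False: silent
    greeting: Optional[str] = None  # only set for announced handoffs

def resolve_handoff(target_agent: str, announced: bool = True) -> HandoffResolution:
    """Announced handoffs (the default) carry a greeting; discrete ones don't."""
    greeting = (
        f"I'm the {target_agent} specialist, how can I help?" if announced else None
    )
    return HandoffResolution(target_agent, announced, greeting)
```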

Key Files

| Component | Cascade | VoiceLive |
|---|---|---|
| Handler | voice/speech_cascade/handler.py | voice/voicelive/handler.py |
| Orchestrator | voice/speech_cascade/orchestrator.py | voice/voicelive/orchestrator.py |
| Handoff | voice/shared/handoff_service.py | (same) |
| Greeting | voice/shared/greeting_service.py | (same) |

Audio Formats

| Transport | Sample Rate | Chunk Size | Notes |
|---|---|---|---|
| Browser (WebRTC) | 48 kHz | 9,600 bytes | Base64 over WebSocket |
| ACS Telephony | 16 kHz | 1,280 bytes | 40 ms pacing |
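The chunk sizes follow directly from 16-bit mono PCM arithmetic: sample rate × 2 bytes per sample × chunk duration. A quick check (the 100 ms browser chunk duration is inferred from the table, not stated in it):

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes per chunk for mono PCM audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000

acs = chunk_bytes(16_000, 40)    # ACS telephony: 40 ms at 16 kHz -> 1,280 bytes
web = chunk_bytes(48_000, 100)   # Browser: 100 ms at 48 kHz -> 9,600 bytes
```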

See Also