
Voice Processing Architecture

This guide explains how voice interactions work in the ART Voice Agent Accelerator.


Three Patterns on the Managed Services Spectrum

There are three distinct patterns for building voice AI applications. The spectrum ranges from full control (you manage each component) to fully managed (the platform handles audio end-to-end):

Maximum Control

You independently configure and tune each layer of the voice pipeline with full flexibility over models, regions, and deployment options.

| Component | Options |
|---|---|
| Speech-to-Text | Azure Speech, Whisper (via Azure OpenAI), or custom fine-tuned STT models |
| Language Model | Foundry Models, swap models without code changes |
| Text-to-Speech | Azure Neural Voices, Custom Neural Voice, or fine-tuned TTS models |
| Telephony | Azure Communication Services |
| Deployment | Cloud, hybrid, or on-premises via Speech containers |

Best for:

  • Custom voice personas — Train Custom Neural Voice with your brand voice
  • Domain-specific recognition — Fine-tune STT with industry vocabulary
  • Data residency — Deploy Speech containers on-premises for air-gapped environments
  • Model flexibility — Use Whisper for multilingual STT, GPT-4o for reasoning
  • Per-component debugging — Isolate issues in STT, LLM, or TTS independently

Managed Pipeline with Full Customization

Azure-managed voice pipeline with extensive model and voice options. See VoiceLive language support for current availability.

| Component | Options |
|---|---|
| Speech-to-Text | Azure Speech: multilingual model (15 locales), single language, or up to 10 defined languages; phrase lists; Custom Speech models |
| Language Model | Foundry Models, or bring your own model |
| Text-to-Speech | Azure Neural Voices, HD voices with temperature control, Custom Neural Voice, custom lexicon |
| Telephony | Azure Communication Services |
| Audio Features | Noise suppression, echo cancellation, advanced end-of-turn detection, interruption handling |

Managed ≠ Locked-In

VoiceLive is Azure-managed, but you retain full control over STT and TTS customization. You can configure phrase lists, custom speech models, custom lexicons, and custom neural voices—the pipeline optimization is managed, not the models themselves.

Best for:

  • Managed infrastructure — Azure handles the pipeline optimization
  • Domain-specific accuracy — Phrase lists + Custom Speech for industry vocabulary
  • Brand voice — Custom Neural Voice or HD voices with temperature control
  • Production audio quality — Built-in noise suppression and echo cancellation
  • Custom LLM — Bring your own model for specialized reasoning or fine-tuned models

Fully Managed

Everything handled by a single API—you only define tools.

| Component | Service |
|---|---|
| Audio + STT + LLM + TTS | OpenAI Realtime API |
| Telephony | Optional (ACS or browser-only) |

Best for: Rapid prototyping, browser-based demos, scenarios where you don't need STT/TTS customization.
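In this pattern the only artifact you author is the tool list. As a minimal sketch, the payload below follows the Realtime-style `session.update` shape (tools declared as functions with JSON Schema parameters); the `check_balance` tool and its fields are hypothetical examples, so verify the exact schema against the OpenAI Realtime API reference.

```python
import json

def build_session_update(tools: list[dict]) -> str:
    """Build a Realtime-style session.update payload that only declares tools.

    The platform handles audio, STT, LLM, and TTS end-to-end; you author
    instructions and tools, nothing else.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": "You are a concise voice assistant.",
            "tool_choice": "auto",
            "tools": tools,
        },
    })

# Hypothetical tool definition -- the only piece you write in this pattern.
check_balance = {
    "type": "function",
    "name": "check_balance",
    "description": "Look up the caller's account balance.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

payload = build_session_update([check_balance])
```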


Comparison

| Capability | Cascade | VoiceLive | Direct Realtime |
|---|---|---|---|
| Control Level | Full | Managed | Minimal |
| STT Options | Azure Speech, Whisper, Custom Speech | Azure Speech (multilingual: 15 locales, or up to 10 defined), Custom Speech | OpenAI built-in only |
| TTS Options | Azure Neural Voices, Custom Neural Voice | Azure Neural Voices, HD voices, Custom Neural Voice | OpenAI voices only |
| LLM Options | Any Azure OpenAI model | Foundry Models or bring your own | GPT-4o Realtime only |
| Phrase Lists | ✅ Supported | ✅ Supported | Not supported |
| Custom Lexicon | ✅ Supported | ✅ Supported | Not supported |
| Audio Quality | Manual configuration | Noise suppression, echo cancellation | Basic |
| Component Swapping | ✅ Full control | STT/TTS configurable, pipeline managed | Managed pipeline |
| On-Premises Deployment | ✅ Speech containers + local SLMs/LLMs | Cloud only | Cloud only |
| Latency | Varies by configuration | Optimized pipeline | Optimized pipeline |
| Debugging | Per-component isolation | End-to-end tracing | End-to-end tracing |
| ACS Telephony | ✅ Supported | ✅ Supported | Optional |
| In This Accelerator | ✅ Implemented | ✅ Implemented | Not implemented |

Latency depends on many factors

Actual latency varies based on network conditions, model deployment region, utterance length, and configuration. Run your own benchmarks for production planning.

Why not Direct Realtime?

Direct Realtime can integrate with ACS for telephony, but offers no fine-tuning capabilities—no custom voices, no phrase lists, no per-component observability. This accelerator focuses on Cascade and VoiceLive because they provide the enterprise controls needed for production deployments.


What This Accelerator Showcases

Both implemented patterns use Azure Communication Services (ACS) for telephony:

  • Phone numbers & PSTN — Real phone calls, not just browser demos
  • Media streaming — Secure WebSocket bridge between calls and your backend
  • Shared infrastructure — Same agent registry, tools, handoffs, and session management across both modes

Shared Components

Both Cascade and VoiceLive share the same core infrastructure, making it easy to switch between them or run both in parallel:

```mermaid
flowchart TB
    subgraph Shared[Shared Infrastructure]
        Agents[Agent Registry]
        Tools[Tool Registry]
        Scenarios[Scenario Store]
        Handoff[HandoffService]
        Greeting[GreetingService]
        Session[Session Management]
    end
    subgraph Cascade[Cascade Mode]
        CH[Cascade Handler]
        CO[CascadeOrchestrator]
    end
    subgraph VoiceLive[VoiceLive Mode]
        VH[VoiceLive Handler]
        VO[LiveOrchestrator]
    end
    CH --> Shared
    CO --> Shared
    VH --> Shared
    VO --> Shared
    style Shared fill:#E6FFE6,stroke:#107C10
    style Cascade fill:#E6F3FF,stroke:#0078D4
    style VoiceLive fill:#FFF4CE,stroke:#FFB900
```
| Shared Component | Purpose |
|---|---|
| Agent Registry | YAML-defined agents with prompts, tools, voice settings |
| Tool Registry | Business logic tools (auth, fraud, account lookup) |
| Scenario Store | Industry scenarios with handoff graphs |
| HandoffService | Unified agent switching logic |
| GreetingService | Template-based greeting resolution |
| Session Management | Redis-backed call state |

This shared architecture means your agents, tools, and business logic work identically in both modes—only the voice pipeline differs.


Enterprise Considerations

Both Cascade and VoiceLive are well-suited for enterprise deployment because they offer granular control over the voice pipeline and support custom business logic:

Cascade: Maximum Control

Choose Cascade when you need control over:

| Capability | Options |
|---|---|
| STT models | Azure Speech, Whisper (via Azure OpenAI), or custom fine-tuned STT |
| TTS models | Azure Neural Voices, Custom Neural Voice, or fine-tuned voices |
| LLM selection | GPT-4, GPT-4o, GPT-4o-mini, or other Azure OpenAI models |
| Regional deployment | Deploy each service in specific Azure regions for data residency |
| On-premises | Speech containers for air-gapped or hybrid environments |

Custom model examples:

```yaml
# Use Whisper for multilingual STT
stt:
  type: whisper
  deployment_id: whisper-large

# Use Custom Neural Voice for brand persona
tts:
  type: custom
  endpoint: https://eastus.customvoice.api.speech.microsoft.com
  voice_id: contoso-brand-voice
```

📖 Learn more: Speech containers, Custom Speech fine-tuning, Custom Neural Voice

VoiceLive: Azure-Managed with Full Customization

VoiceLive provides STT/TTS customization within a managed pipeline. Based on Microsoft's VoiceLive language support:

Speech Input (STT) Language Options:

| Option | Description |
|---|---|
| Automatic Multilingual | Default model supporting 15 locales: zh-CN, en-AU, en-CA, en-IN, en-GB, en-US, fr-CA, fr-FR, de-DE, hi-IN, it-IT, ja-JP, ko-KR, es-MX, es-ES |
| Single Language | Configure one specific language for optimal accuracy |
| Multiple Languages | Up to 10 defined languages for broader coverage |
| Phrase List | Just-in-time vocabulary hints (product names, acronyms) |
| Custom Speech | Fine-tuned STT models trained on your domain data |

Speech Output (TTS) Customization:

| Option | Description |
|---|---|
| Azure Neural Voices | All supported voices via azure-standard type (monolingual, multilingual, HD) |
| Custom Lexicon | Pronunciation customization for terms |
| Custom Neural Voice | Brand voice trained on your audio via azure-custom type |
| Custom Avatar | Photorealistic video avatar (optional) |

Example configuration with custom models:

```json
// STT with phrase list and custom speech
{
  "session": {
    "input_audio_transcription": {
      "model": "azure-speech",
      "phrase_list": ["Neo QLED TV", "AutoQuote Explorer"],
      "custom_speech": {
        "zh-CN": "847cb03d-7f22-4b11-xxx"  // Custom model ID
      }
    }
  }
}
```

```json
// TTS with Azure neural voice and custom lexicon
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8,
    "custom_lexicon_url": "<lexicon-url>"
  }
}
```

```json
// TTS with Custom Neural Voice
{
  "voice": {
    "name": "en-US-CustomNeural",
    "type": "azure-custom",
    "endpoint_id": "your-endpoint-id"
  }
}
```

📖 Learn more: VoiceLive customization, Custom Speech, Custom Neural Voice

When to use custom STT/TTS models

Both Cascade and VoiceLive support the same custom model options. Consider fine-tuning when you need:

  • Domain-specific vocabulary — Medical terminology, legal jargon, financial products
  • Locale-specific accents — Regional dialects, non-native speakers, industry slang
  • Brand voice consistency — Custom Neural Voice trained on your brand persona
  • Improved recognition accuracy — Proper nouns, product names, acronyms unique to your business

Choose based on:

  • Cascade — You need to swap LLM models, run on-premises, or debug each component independently
  • VoiceLive — You want a latency-optimized pipeline and managed infrastructure (same STT/TTS customization, simpler ops)

Bottom line: Both Cascade and VoiceLive support extensive STT/TTS customization (Custom Speech, Custom Neural Voice, phrase lists, custom lexicon). VoiceLive adds HD voices and built-in audio quality features. Choose Cascade when you need on-premises deployment, component swapping, or per-component debugging. Choose VoiceLive for simpler operations and production audio quality features.


Voice Call Lifecycle

Both modes share the same high-level call lifecycle:

```mermaid
sequenceDiagram
    actor Caller
    participant ACS
    participant Handler as Voice Handler
    participant Agent
    Caller->>ACS: Dial
    ACS->>Handler: CallConnected
    Handler->>Agent: Load Agent
    Agent-->>Handler: Greeting
    Handler-->>ACS: TTS Audio
    ACS-->>Caller: Hello!
    rect rgb(230, 243, 255)
        Note over Caller,Agent: Conversation Turn
        Caller->>ACS: Check my balance
        ACS->>Handler: Audio Stream
        Handler->>Handler: STT
        Handler->>Agent: Transcript
        Agent-->>Handler: Response
        Handler-->>ACS: TTS Audio
        ACS-->>Caller: Your balance is...
    end
    Caller->>ACS: Hangup
    ACS->>Handler: CallDisconnected
```

Mode Comparison

| Capability | Cascade | VoiceLive | Direct Realtime |
|---|---|---|---|
| Telephony (PSTN) | ✅ ACS | ✅ ACS | Optional (ACS) |
| STT Provider | Azure Speech / Whisper | Azure Speech (multilingual: 15 locales, or up to 10 defined) | OpenAI Realtime |
| Custom Speech (STT) | ✅ Supported | ✅ Supported | ❌ Not supported |
| LLM Provider | Azure OpenAI (any model) | Foundry Models or bring your own | GPT-4o Realtime |
| TTS Provider | Azure Neural Voices | Azure Neural Voices + HD + Custom | OpenAI voices |
| Voice Selection | Azure Neural Voices | Azure Neural Voices | OpenAI voices |
| HD Voices | ❌ Not supported | ✅ With temperature control | ❌ Not supported |
| Custom Neural Voice | ✅ Supported | ✅ Supported | ❌ Not supported |
| Phrase Lists | ✅ Supported | ✅ Supported | ❌ Not supported |
| Custom Lexicon | ✅ Supported | ✅ Supported | ❌ Not supported |
| Audio Quality | Manual configuration | ✅ Noise suppression, echo cancellation | Basic |
| Barge-in | Client-side VAD | Advanced end-of-turn detection | Server-side VAD |
| Latency | Varies by configuration | Optimized pipeline | Optimized pipeline |
| Debugging | Per-component isolation | End-to-end tracing | End-to-end tracing |
| In This Accelerator | ✅ Implemented | ✅ Implemented | ❌ Not implemented |

Cascade Architecture

Cascade orchestrates three Azure services with a three-thread design for low latency:

```mermaid
flowchart TB
    subgraph T1[Thread 1: Speech SDK]
        direction TB
        A[/Audio In/] --> B[Continuous Recognition]
        B --> C{Partial?}
        C -->|Yes| D[Barge-in Signal]
        C -->|No| E[Final Transcript]
    end
    subgraph Q[Speech Queue]
        F[(Events)]
    end
    subgraph T2[Thread 2: Route Turn]
        direction TB
        G[Process Turn] --> H[LLM Call]
        H --> I[TTS Response]
    end
    subgraph T3[Thread 3: Main Loop]
        J[Task Cancellation]
        K[WebSocket]
    end
    D -.->|interrupt| J
    E --> F
    F --> G
    style T1 fill:#E6F3FF,stroke:#0078D4
    style T2 fill:#E6FFE6,stroke:#107C10
    style T3 fill:#FFF4CE,stroke:#FFB900
    style Q fill:#F3F2F1,stroke:#605E5C
```

How it works:

  1. Thread 1 — Azure Speech SDK streams audio, emits partials (for barge-in) and finals
  2. Thread 2 — Processes complete utterances through Azure OpenAI, streams TTS per sentence
  3. Thread 3 — Handles WebSocket lifecycle and task cancellation
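The thread hand-offs above can be sketched with a queue of final transcripts and a cancellation event for barge-in. The names here (`route_one_turn`, the callbacks) are illustrative, not the accelerator's actual classes:

```python
import queue
import threading

speech_events: "queue.Queue[str]" = queue.Queue()   # Thread 1 -> Thread 2
current_turn_cancelled = threading.Event()          # barge-in signal

def on_partial(text: str) -> None:
    """Thread 1: a partial result means the caller is speaking -- barge in."""
    current_turn_cancelled.set()

def on_final(text: str) -> None:
    """Thread 1: a final transcript becomes a complete turn to process."""
    speech_events.put(text)

def route_one_turn(llm, tts) -> None:
    """Thread 2 body: take one utterance, call the LLM, stream TTS per
    sentence, and stop speaking as soon as a barge-in arrives.
    In the real handler this runs in a loop until the call ends."""
    utterance = speech_events.get()
    current_turn_cancelled.clear()
    for sentence in llm(utterance):             # stream sentence by sentence
        if current_turn_cancelled.is_set():     # caller interrupted: stop TTS
            break
        tts(sentence)
```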

Key files: voice/speech_cascade/handler.py, voice/speech_cascade/orchestrator.py


VoiceLive Architecture

VoiceLive uses the Azure VoiceLive SDK — a single WebSocket connection to Azure that handles audio in, speech recognition, LLM, and audio out:

```mermaid
flowchart LR
    subgraph Client[ACS / Browser]
        Audio[Audio Stream]
    end
    subgraph Handler[VoiceLive Handler]
        WS[WebSocket Bridge]
        Events[Event Router]
    end
    subgraph VoiceLiveAPI[Azure VoiceLive]
        VAD[Server VAD]
        STT[Transcription]
        LLM[Azure OpenAI]
        TTS[Voice Output]
    end
    subgraph Orchestrator[LiveOrchestrator]
        Tools[Tool Execution]
        Handoff[Agent Switching]
    end
    Audio <--> WS
    WS <--> VoiceLiveAPI
    VoiceLiveAPI --> Events
    Events --> Tools
    Tools --> Handoff
    style Client fill:#F3F2F1,stroke:#605E5C
    style Handler fill:#E6F3FF,stroke:#0078D4
    style VoiceLiveAPI fill:#FFF4CE,stroke:#FFB900
    style Orchestrator fill:#E6FFE6,stroke:#107C10
```

How it works:

  1. Audio streams to Azure VoiceLive over WebSocket
  2. Server-side VAD detects speech start/end automatically
  3. Transcription, LLM response, and TTS happen in one round-trip
  4. Handler routes events: tool calls → execute locally, audio deltas → stream to caller

Key files: voice/voicelive/handler.py, voice/voicelive/orchestrator.py


Handoff Flow

Both modes use the same HandoffService for agent switching:

```mermaid
sequenceDiagram
    participant C as Concierge
    participant HS as HandoffService
    participant F as FraudAgent
    C->>HS: handoff_fraud()
    rect rgb(230, 243, 255)
        HS->>HS: resolve_handoff()
        Note over HS: Find target agent,<br/>Build system_vars,<br/>Get greeting config
    end
    HS-->>C: HandoffResolution
    C->>F: switch_to()
    F->>HS: select_greeting()
    HS-->>F: I'm the Fraud specialist...
```

Handoff Types:

  • Announced — Target agent plays a greeting (default)
  • Discrete — Silent handoff, no greeting
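The announced/discrete distinction amounts to a flag on the resolution a handoff service returns. A minimal sketch, assuming a simplified `HandoffResolution` (field names and greeting text are illustrative, not the accelerator's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HandoffResolution:
    """Result of resolving a handoff to another agent."""
    target_agent: str
    announced: bool                 # True: target plays a greeting; False: silent
    greeting: Optional[str] = None  # only set for announced handoffs

def resolve_handoff(target_agent: str, announced: bool = True) -> HandoffResolution:
    """Announced handoffs (the default) carry a greeting; discrete ones don't."""
    greeting = (
        f"I'm the {target_agent} specialist, how can I help?" if announced else None
    )
    return HandoffResolution(target_agent, announced, greeting)
```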

Key Files

| Component | Cascade | VoiceLive |
|---|---|---|
| Handler | voice/speech_cascade/handler.py | voice/voicelive/handler.py |
| Orchestrator | voice/speech_cascade/orchestrator.py | voice/voicelive/orchestrator.py |
| Handoff | voice/shared/handoff_service.py | (same) |
| Greeting | voice/shared/greeting_service.py | (same) |

Audio Formats

| Transport | Sample Rate | Chunk Size | Notes |
|---|---|---|---|
| Browser (WebRTC) | 48 kHz | 9,600 bytes | Base64 over WebSocket |
| ACS Telephony | 16 kHz | 1,280 bytes | 40 ms pacing |
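The chunk sizes follow directly from 16-bit mono PCM arithmetic: sample rate × 2 bytes per sample × chunk duration. A quick check (the 100 ms browser chunk duration is inferred from the table, not stated in it):

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes per chunk for mono PCM audio."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000

acs = chunk_bytes(16_000, 40)    # ACS telephony: 40 ms at 16 kHz -> 1,280 bytes
web = chunk_bytes(48_000, 100)   # Browser: 100 ms at 48 kHz -> 9,600 bytes
```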

See Also