Speech Services Reference¶

Quick Reference: Azure Speech SDK integration for speech-to-text (STT) and text-to-speech (TTS)

Speech Recognition (STT)¶

Core Class¶

Source: src/speech/speech_recognizer.py

from src.speech.speech_recognizer import StreamingSpeechRecognizerFromBytes

recognizer = StreamingSpeechRecognizerFromBytes(
    speech_region="eastus",
    languages=["en-US", "es-ES"],
    enable_diarization=True,
    use_default_credential=True,  # Azure Entra ID
)

# Set callbacks
recognizer.on_partial_result = handle_partial  # Barge-in detection
recognizer.on_final_result = handle_final      # Complete utterance

Handler Integration¶

Endpoint	Handler	STT Pattern
`/api/v1/media/stream`	SpeechCascadeHandler	Three-thread architecture
`/api/v1/media/stream`	VoiceLiveHandler	OpenAI Realtime (internal)
`/api/v1/browser/conversation`	BrowserHandler	Per-connection pooled client

Three-Thread Architecture (SpeechCascade)¶

graph TB
    subgraph "Speech SDK Thread"
        A["on_partial → Barge-in"]
        B["on_final → Queue Result"]
    end
    subgraph "Route Turn Thread"
        C["await speech_queue.get()"]
        D["Orchestrator Processing"]
    end
    subgraph "Main Event Loop"
        E["WebSocket Handler"]
        F["Task Cancellation"]
    end
    A -.->|"run_coroutine_threadsafe"| F
    B -.->|"queue.put_nowait"| C
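The hand-off in the diagram can be sketched with the standard library alone. This is a minimal illustration, not the SpeechCascadeHandler implementation: `play_tts`, `cancel_playback`, `route_turn`, and `sdk_thread` are hypothetical names, and a plain thread stands in for the Speech SDK callback thread. Because `asyncio.Queue` is not thread-safe, the sketch marshals `put_nowait` onto the loop with `call_soon_threadsafe`.

```python
import asyncio
import threading


async def main() -> list[str]:
    loop = asyncio.get_running_loop()
    speech_queue: asyncio.Queue[str] = asyncio.Queue()
    events: list[str] = []

    async def play_tts() -> None:
        # Stand-in for audio playback that a barge-in should interrupt.
        try:
            await asyncio.sleep(10)
        except asyncio.CancelledError:
            events.append("playback cancelled")
            raise

    playback = asyncio.create_task(play_tts())

    async def cancel_playback() -> None:
        # Runs on the event loop, so cancelling the task here is safe.
        playback.cancel()

    def on_partial(text: str) -> None:
        # SDK thread: schedule the cancellation on the main event loop.
        asyncio.run_coroutine_threadsafe(cancel_playback(), loop).result(timeout=5)

    def on_final(text: str) -> None:
        # SDK thread: hand the finished utterance to the loop's queue.
        loop.call_soon_threadsafe(speech_queue.put_nowait, text)

    def sdk_thread() -> None:
        on_partial("what is")
        on_final("what is my balance")

    async def route_turn() -> None:
        # Route-turn side: wait for the next complete utterance.
        utterance = await speech_queue.get()
        events.append(f"routed: {utterance}")

    router = asyncio.create_task(route_turn())
    worker = threading.Thread(target=sdk_thread)
    worker.start()
    await router
    await asyncio.gather(playback, return_exceptions=True)
    worker.join()
    return events


events = asyncio.run(main())
```

Partial results only trigger cancellation, so barge-in stays fast; final results go through the queue, so orchestration never runs on the SDK thread.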

Audio Format¶

All endpoints expect 16 kHz, 16-bit, mono PCM audio:

SAMPLE_RATE = 16000
CHANNELS = 1
SAMPLE_WIDTH = 2  # 16-bit
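These three constants fix the byte rate of the stream, which is useful when sizing buffers or checking that a client sends the expected format. A small sketch of the arithmetic follows; the helper names `frame_bytes` and `pcm_duration_ms` are illustrative and not part of the codebase.

```python
SAMPLE_RATE = 16000   # samples per second
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)

BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH


def frame_bytes(duration_ms: int) -> int:
    """Bytes of raw PCM needed to hold duration_ms of audio."""
    return BYTES_PER_SECOND * duration_ms // 1000


def pcm_duration_ms(pcm: bytes) -> float:
    """Playback duration, in milliseconds, of a raw PCM buffer."""
    return len(pcm) * 1000 / BYTES_PER_SECOND
```

At this format one second of audio is 32,000 bytes, so a 20 ms frame is 640 bytes.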

Speech Synthesis (TTS)¶

Core Class¶

Source: src/speech/text_to_speech.py

from src.speech.text_to_speech import SpeechSynthesizer

synthesizer = SpeechSynthesizer(
    region="eastus",
    voice="en-US-JennyMultilingualNeural",
    enable_tracing=True,
    playback="never",  # Headless deployment
)

Basic Usage¶

# Synthesize to memory
audio_data = synthesizer.synthesize_speech(
    "Hello! How can I help?",
    style="chat",
    rate="+10%"
)

# Streaming for WebSocket
frames = synthesizer.synthesize_to_base64_frames(text, sample_rate=16000)
for frame in frames:
    websocket.send(frame)
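To show what a consumer of base64 frames can expect, here is a hypothetical stand-in that splits raw PCM into fixed-duration frames and encodes each one. It assumes 16-bit mono audio and a 20 ms frame; the internals of `synthesize_to_base64_frames` are not shown in this reference and may differ.

```python
import base64
from typing import Iterator


def pcm_to_base64_frames(
    pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20
) -> Iterator[str]:
    """Yield base64 text frames from raw 16-bit mono PCM (hypothetical helper)."""
    bytes_per_frame = sample_rate * 2 * frame_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_frame):
        chunk = pcm[start:start + bytes_per_frame]
        yield base64.b64encode(chunk).decode("ascii")


# 50 ms of silence becomes two full frames and one half frame.
frames = list(pcm_to_base64_frames(bytes(1600)))
```

The last frame can be shorter than the others, so receivers should not assume a fixed frame length.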

Voice Styling¶

audio = synthesizer.synthesize_speech(
    text,
    voice="en-US-EmmaNeural",
    style="cheerful",  # chat, news, excited, sad, angry
    rate="+5%",
    pitch="+2Hz",
    volume="+10dB"
)
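Styling parameters like these are typically expressed in Azure SSML as an `mstts:express-as` element wrapping a `prosody` element. The sketch below shows that mapping under the assumption that the synthesizer builds SSML in roughly this way; `build_ssml` is a hypothetical helper, not a method of `SpeechSynthesizer`.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str,
    style: Optional[str] = None,
    rate: Optional[str] = None,
    pitch: Optional[str] = None,
    volume: Optional[str] = None,
    lang: str = "en-US",
) -> str:
    """Build an SSML document for one utterance (illustrative only)."""
    body = escape(text)  # protect &, <, > in user-supplied text
    prosody = [(k, v) for k, v in (("rate", rate), ("pitch", pitch), ("volume", volume)) if v]
    if prosody:
        attrs = " ".join(f"{name}={quoteattr(value)}" for name, value in prosody)
        body = f"<prosody {attrs}>{body}</prosody>"
    if style:
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


ssml = build_ssml(
    "Tom & Jerry", voice="en-US-EmmaNeural", style="cheerful", rate="+5%", pitch="+2Hz"
)
```

Not every voice supports every style, so an unsupported `style` value may be ignored by the service.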

Configuration¶

Environment Variables¶

Variable	Purpose
`AZURE_SPEECH_REGION`	Speech service region
`AZURE_SPEECH_RESOURCE_ID`	Resource ID for managed identity
`STT_POOL_SIZE`	Concurrent recognizer pool (default: 10)
`TTS_POOL_SIZE`	Concurrent synthesizer pool (default: 10)
`TTS_ENABLE_LOCAL_PLAYBACK`	Enable speaker output (dev only)
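One way to read these variables at startup is a small settings object. This is a sketch only: `SpeechSettings` and `load_speech_settings` are hypothetical names, and the set of strings treated as true for `TTS_ENABLE_LOCAL_PLAYBACK` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SpeechSettings:
    region: Optional[str]
    resource_id: Optional[str]
    stt_pool_size: int
    tts_pool_size: int
    local_playback: bool


def load_speech_settings(env: Optional[Mapping[str, str]] = None) -> SpeechSettings:
    """Read speech settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    playback_flag = env.get("TTS_ENABLE_LOCAL_PLAYBACK", "false").strip().lower()
    return SpeechSettings(
        region=env.get("AZURE_SPEECH_REGION"),
        resource_id=env.get("AZURE_SPEECH_RESOURCE_ID"),
        stt_pool_size=int(env.get("STT_POOL_SIZE", "10")),
        tts_pool_size=int(env.get("TTS_POOL_SIZE", "10")),
        # Accepted truthy spellings are an assumption of this sketch.
        local_playback=playback_flag in {"1", "true", "yes"},
    )


settings = load_speech_settings({"AZURE_SPEECH_REGION": "eastus", "STT_POOL_SIZE": "4"})
```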

Authentication¶

import os

# Production: Azure Entra ID (recommended)
synthesizer = SpeechSynthesizer(
    region="eastus",
    use_default_credential=True,
)

# Development: API key read from the environment
synthesizer = SpeechSynthesizer(
    key=os.getenv("AZURE_SPEECH_KEY"),
    region="eastus",
)
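If one code path has to serve both environments, the choice can be made from the environment. The helper below is a hypothetical sketch: it assumes that a set `AZURE_SPEECH_KEY` means key-based auth is wanted and that everything else should fall back to Entra ID.

```python
import os
from typing import Any, Mapping, Optional


def auth_kwargs(env: Optional[Mapping[str, str]] = None) -> dict[str, Any]:
    """Return the constructor arguments for the available auth mode."""
    env = os.environ if env is None else env
    key = env.get("AZURE_SPEECH_KEY")
    if key:
        # Development: explicit API key.
        return {"key": key}
    # Production: Azure Entra ID via the default credential chain.
    return {"use_default_credential": True}


dev_kwargs = auth_kwargs({"AZURE_SPEECH_KEY": "example-key"})
prod_kwargs = auth_kwargs({})
```

The result can be unpacked into the constructor, for example `SpeechSynthesizer(region="eastus", **auth_kwargs())`.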

Resource Pooling¶

Both STT and TTS use connection pools managed by the application:

# Handlers automatically acquire/release pool resources
stt_client = await stt_pool.acquire()
try:
    # Process audio
    ...
finally:
    await stt_pool.release(stt_client)

Pool size is configured via environment variables (STT_POOL_SIZE, TTS_POOL_SIZE).
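For readers unfamiliar with the pattern, a fixed-size pool with this acquire/release shape can be built on `asyncio.Queue`. This is a minimal sketch and not the application's pool: `ClientPool`, `lease`, and `available` are hypothetical names.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Callable


class ClientPool:
    """Fixed-size pool; acquire() waits when every client is in use."""

    def __init__(self, factory: Callable[[], Any], size: int) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=size)
        for _ in range(size):
            self._queue.put_nowait(factory())

    @property
    def available(self) -> int:
        return self._queue.qsize()

    async def acquire(self) -> Any:
        return await self._queue.get()

    async def release(self, client: Any) -> None:
        self._queue.put_nowait(client)

    @asynccontextmanager
    async def lease(self) -> AsyncIterator[Any]:
        # Guarantees release even if the body raises.
        client = await self.acquire()
        try:
            yield client
        finally:
            await self.release(client)


async def demo() -> list[int]:
    pool = ClientPool(factory=object, size=2)
    seen = [pool.available]
    async with pool.lease():
        seen.append(pool.available)
    seen.append(pool.available)
    return seen


availability = asyncio.run(demo())
```

Wrapping acquire and release in a context manager removes the need to repeat the `try`/`finally` at every call site.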

OpenTelemetry Integration¶

Both services include built-in tracing:

synthesizer = SpeechSynthesizer(
    region="eastus",
    enable_tracing=True,
    call_connection_id="acs-call-123",  # Correlation ID
)

Traced operations include:

- Session lifecycle
- Audio processing latency
- Error conditions and recovery

Related Documentation¶

- Streaming Modes: mode selection and comparison
- Pricing & Deployment Options: cost analysis, containers, breakeven
- ACS Call Flows: telephony integration
- Session Management: state persistence
