nvidia-nemo/nemo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Orchestrates multimodal AI training and inference for speech, language, and conversational agents
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 10-component ML training framework. 1346 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization services; the transcribed text flows to the LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking.
- Receive WebSocket audio — FastAPI server accepts WebSocket connections from clients, establishes a Pipecat pipeline with WebSocketTransport, and begins streaming audio frame processing (a minimal sketch of this entry point follows the list)
- Stream audio chunks — NemoStreamingASRService processes incoming audio using CacheFeatureBufferer to manage sliding window features, runs RNNT model inference to generate transcription hypotheses [Audio frames → ASRResult] (config: asr_model, sample_rate)
- Identify speakers — NeMoStreamingDiarService runs Sortformer model on audio features with streaming state management, outputs speaker labels synchronized with transcriptions [Audio features → DiarResultFrame] (config: max_num_speakers, spkcache_len)
- Generate LLM response — HuggingFaceLLMService receives transcribed text, formats with conversation context from OpenAILLMContext, streams response from local or remote LLM with tool calling support [ASRResult → LLM responses] (config: model_path, temperature, max_tokens)
- Synthesize speech — NeMoFastPitchHiFiGANTTSService converts LLM response text to audio using FastPitch for mel spectrograms and HiFiGAN for waveform generation [LLM responses → Audio output] (config: tts_model, vocoder_model)
- Handle turn-taking — NeMoTurnTakingService monitors VAD signals and user speech activity, sends interruption signals to stop TTS playback when user starts speaking [VAD events → Turn-taking decisions] (config: vad_threshold)
- Stream response audio — RTVIObserver translates pipeline frames to RTVI protocol messages, WebSocketTransport sends synthesized audio chunks back to client for playback [Audio output]
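To make stage 1 concrete, here is a minimal sketch under stated assumptions: clients send raw audio chunks as binary WebSocket messages, and process_frame is a hypothetical stand-in for the rest of the pipeline. The real server.py wires frames through Pipecat transports and processors rather than handling them inline.

```python
import uvicorn
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def process_frame(frame: bytes) -> bytes | None:
    """Hypothetical stand-in for the ASR -> diarization -> LLM -> TTS pipeline."""
    return None  # a real server hands the frame to Pipecat instead

@app.websocket("/ws")
async def audio_endpoint(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            frame = await ws.receive_bytes()        # one audio chunk per message
            audio_out = await process_frame(frame)  # stages 2-6 happen here
            if audio_out is not None:
                await ws.send_bytes(audio_out)      # stage 7: stream audio back
    except WebSocketDisconnect:
        pass  # client disconnected; the pipeline would be torn down here

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```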
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ASRResult (nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py): dataclass with text: str, is_final: bool, eou_prob: Optional[float], eob_prob: Optional[float], eou_latency: Optional[float], eob_latency: Optional[float], processing_time: Optional[float]
Created from ASR model output with transcription text and confidence scores, consumed by voice agent pipeline for real-time speech processing
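From the field list above, the ASRResult contract can be reconstructed roughly as follows. This is a sketch derived from the description, not the verbatim source; the None defaults on the optional fields are an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ASRResult:
    """Per-chunk ASR output; fields taken from the description above."""
    text: str                                # partial or final transcription
    is_final: bool                           # True once the utterance is complete
    eou_prob: Optional[float] = None         # end-of-utterance probability
    eob_prob: Optional[float] = None         # end-of-block probability
    eou_latency: Optional[float] = None      # latency of the end-of-utterance decision
    eob_latency: Optional[float] = None      # latency of the end-of-block decision
    processing_time: Optional[float] = None  # wall-clock inference time per chunk
```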
DiarResultFrame (nemo/agents/voice_agent/pipecat/frames/frames.py): dataclass inheriting DataFrame with diar_result: np.ndarray | int, stream_id: str = 'default'
Generated by speaker diarization models to identify which speaker is talking, flows through Pipecat pipeline for multi-speaker conversation handling
DiarizationConfig (nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py): dataclass with model_path: str, device: str, log: bool, max_num_speakers: int, spkcache_len: int, spkcache_refresh_rate: int, fifo_len: int
Configuration object created at service startup, defines streaming diarization model parameters and buffer sizes for real-time speaker tracking
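A rough reconstruction of this config object from the field list, with the values quoted elsewhere in this analysis (device=cuda, max_num_speakers=4, spkcache_len=188) shown at the call site. The model path and the last two values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DiarizationConfig:
    """Sketch reconstructed from the field list above."""
    model_path: str
    device: str                  # "cuda" per the Control Points section
    log: bool
    max_num_speakers: int        # 4 in the shipped config (see Hidden Assumptions)
    spkcache_len: int            # 188 in the shipped config
    spkcache_refresh_rate: int   # speaker-cache refresh cadence
    fifo_len: int                # FIFO buffer length

# Hypothetical instantiation; the path, refresh rate, and FIFO length are invented.
cfg = DiarizationConfig(
    model_path="/models/sortformer.nemo",
    device="cuda",
    log=False,
    max_num_speakers=4,
    spkcache_len=188,
    spkcache_refresh_rate=144,
    fifo_len=188,
)
```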
Hypothesis (nemo/collections/asr/parts/utils/rnnt_utils.py): class representing an ASR decoding hypothesis with text tokens, scores, and alignment information
Created during RNNT beam search decoding, contains partial and complete transcription hypotheses with confidence scores
nemo/collections/asr/data/ssl_dataset.py: dataclass with audio: Union[Tensor, None], audio_len: Union[Tensor, None], noise: Union[Tensor, None], noise_len: Union[Tensor, None], noisy_audio: Union[Tensor, None], noisy_audio_len: Union[Tensor, None]
Batched audio tensors for SSL training, contains clean and noisy audio pairs with length information for contrastive learning
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Audio input tensors are assumed to have shape (batch_size, sequence_length) and a sample rate matching model training (16 kHz), but the service only validates tensor type, not shape or sample-rate compatibility
If this fails: A client sending 44.1 kHz audio or differently shaped tensors makes the ASR model produce garbled transcriptions or crash during feature extraction, with no clear error message
nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:NemoStreamingASRService
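A guard like the following would surface this failure at the boundary instead of deep inside feature extraction. This is a hypothetical check, not code that exists in the service:

```python
import torch

EXPECTED_SAMPLE_RATE = 16_000  # models are trained on 16 kHz audio

def validate_audio(audio: torch.Tensor, sample_rate: int) -> None:
    """Reject malformed input before feature extraction instead of failing silently."""
    if audio.dim() != 2:
        raise ValueError(f"expected (batch_size, sequence_length), got shape {tuple(audio.shape)}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLE_RATE} Hz input, got {sample_rate} Hz; resample before ASR"
        )
```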
Feature vectors fed to cache have consistent dimensionality (spkcache_len=188) across all audio chunks but never validates feature dimensions match cache buffer size
If this fails: Mismatched feature dimensions cause silent array indexing errors or memory corruption in circular buffer operations, leading to incorrect speaker predictions
nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:CacheFeatureBufferer
Audio frames arrive in temporal order from WebSocket clients and processing completes before next frame arrives, but no sequence numbering or timing validation exists
If this fails: Network reordering or processing delays cause ASR context windows to contain out-of-order audio, producing nonsensical transcriptions that appear valid to downstream LLM
examples/voice_agent/server/server.py:WebSocket handler
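One low-cost mitigation is explicit sequence numbering on audio frames. The sketch below is hypothetical on both ends; the protocol currently carries no sequence numbers:

```python
import itertools

class FrameSequencer:
    """Hypothetical client-side stamping plus server-side ordering check."""

    def __init__(self) -> None:
        self._counter = itertools.count()
        self._last_seen = -1

    def stamp(self, frame: bytes) -> bytes:
        """Client side: prepend a 4-byte big-endian sequence number."""
        return next(self._counter).to_bytes(4, "big") + frame

    def check(self, message: bytes) -> bytes:
        """Server side: reject out-of-order frames instead of feeding them to ASR."""
        seq = int.from_bytes(message[:4], "big")
        if seq <= self._last_seen:
            raise RuntimeError(f"out-of-order frame: {seq} after {self._last_seen}")
        self._last_seen = seq
        return message[4:]  # the raw audio payload
```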
Local LLM inference server remains responsive and has sufficient GPU memory for concurrent requests, but only checks initial connection without monitoring server health
If this fails: GPU OOM or server deadlock causes silent request failures, leaving users waiting indefinitely for responses while pipeline appears to be functioning normally
nemo/agents/voice_agent/pipecat/services/nemo/llm.py:HuggingFaceLLMService
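A periodic health probe would convert these silent failures into loud ones. The sketch below is hypothetical; the health-endpoint URL, the aiohttp dependency, and the polling interval are assumptions, not part of the service:

```python
import asyncio
import aiohttp

async def watch_llm_health(url: str, interval: float = 5.0) -> None:
    """Poll a health endpoint instead of trusting the initial connection.
    `url` is assumed, e.g. http://localhost:8000/health."""
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=2)) as resp:
                    if resp.status != 200:
                        raise RuntimeError(f"LLM server unhealthy: HTTP {resp.status}")
            except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                raise RuntimeError("LLM server unreachable") from exc
            await asyncio.sleep(interval)  # raising above ends the watchdog loudly
```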
End-of-utterance probability thresholds (eou_prob, eob_prob) are calibrated for English conversational speech patterns but applied to any language without validation
If this fails: Non-English languages or domain-specific speech (medical, legal) trigger premature utterance boundaries, causing conversation flow interruptions and truncated transcriptions
nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:ASRResult
A hardcoded maximum of 4 speakers (max_num_speakers=4) is assumed to cover all use cases, but speaker counts can exceed this in conference calls or group meetings
If this fails: With >4 speakers, diarization model assigns overlapping IDs to different people, causing conversation context to become confused and responses to be attributed to wrong participants
nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:DiarizationConfig
RTVI protocol messages can be serialized to protobuf and transmitted without size limits or network fragmentation handling
If this fails: Large conversation contexts or long transcriptions exceed WebSocket frame limits, causing connection drops that appear as random client disconnections
nemo/agents/voice_agent/pipecat/processors/frameworks/rtvi.py:RTVIObserver
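A fragmentation helper is one way to stay under transport limits. The 1 MiB ceiling below is an assumed value, not a limit taken from the code:

```python
MAX_WS_MESSAGE = 1 << 20  # 1 MiB; assumed ceiling, tune per deployment

def safe_chunks(payload: bytes, limit: int = MAX_WS_MESSAGE):
    """Split an oversized serialized message into transport-sized fragments
    instead of letting one giant frame drop the connection."""
    for offset in range(0, len(payload), limit):
        yield payload[offset:offset + limit]

# Usage: for part in safe_chunks(serialized): await ws.send_bytes(part)
```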
Pipeline processors are connected in fixed order (ASR -> LLM -> TTS) but async processing can cause frame reordering between pipeline stages
If this fails: TTS synthesis begins before LLM completes response generation, producing audio output for partial responses that get overwritten by final LLM output
examples/voice_agent/server/server.py:Pipeline construction
Log directory path exists and is writable when AudioLogger initializes, but file system permissions and disk space are never validated
If this fails: Insufficient disk space or permission changes cause silent logging failures, making debugging impossible when issues occur in production deployments
nemo/agents/voice_agent/pipecat/services/nemo/audio_logger.py:AudioLogger
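A startup probe could validate both preconditions before the first session. This is a hypothetical check; the 1 GiB free-space floor is an assumption:

```python
import os
import shutil

def check_log_dir(path: str, min_free_bytes: int = 1 << 30) -> None:
    """Verify the log directory is writable and has headroom before logging starts."""
    os.makedirs(path, exist_ok=True)
    probe = os.path.join(path, ".write_probe")
    with open(probe, "wb") as f:  # raises PermissionError if not writable
        f.write(b"ok")
    os.remove(probe)
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        raise OSError(f"only {free} bytes free in {path}; audio logging may fail silently")
```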
ASR hypothesis states remain valid across audio chunks without considering context window expiration or model state staleness
If this fails: Long pauses in conversation leave stale partial hypotheses in memory, causing old partial text to suddenly appear in new transcriptions after silence periods
nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:Hypothesis tracking
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Feature cache buffers: Circular buffers storing audio features for streaming models, enabling real-time processing with context windows
- Conversation context: Maintains dialogue history and system prompts for LLM context, tracks user and assistant messages
- Model checkpoints: Stores trained model weights and configurations for ASR, TTS, and diarization models
- Audio session logs: Session-organized WAV files and JSON metadata for debugging and analysis
Feedback Loops
- Real-time transcription loop (streaming, reinforcing) — Trigger: New audio chunk received. Action: NemoStreamingASRService processes features, updates hypothesis, generates partial transcription. Exit: User stops speaking or connection closes.
- Conversation context accumulation (memory-update, reinforcing) — Trigger: User message or assistant response. Action: OpenAILLMContext appends to message history, maintains conversation state. Exit: Session ends or context reset.
- Turn-taking interrupt loop (interrupt-handler, balancing) — Trigger: User starts speaking during TTS playback. Action: NeMoTurnTakingService stops TTS, clears audio buffer, switches to listening mode. Exit: User stops speaking.
- Feature buffer rotation (cache-refresh, balancing) — Trigger: New audio features arrive. Action: CacheFeatureBufferer shifts the buffer window, evicts old features, adds new ones. Exit: Continuous operation. (A toy version of this rotation is sketched below.)
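A toy version of the rotation, assuming fixed-size windows of per-frame feature vectors. The real CacheFeatureBufferer manages model-specific caches, but the eviction mechanics look like this:

```python
import numpy as np

class SlidingFeatureBuffer:
    """Fixed-length window; the oldest features are evicted as new ones arrive.
    Illustrative only, not the actual CacheFeatureBufferer."""

    def __init__(self, cache_len: int, feat_dim: int) -> None:
        self.buf = np.zeros((cache_len, feat_dim), dtype=np.float32)

    def push(self, feats: np.ndarray) -> np.ndarray:
        """Append new (n, feat_dim) features, evicting the n oldest rows.
        Assumes n <= cache_len."""
        n = feats.shape[0]
        self.buf = np.roll(self.buf, -n, axis=0)  # shift window left by n rows
        self.buf[-n:] = feats                     # overwrite the freed slots
        return self.buf                           # the full context window
```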
Delays
- ASR processing latency (computation-delay, ~80 ms per chunk) — Affects real-time transcription responsiveness; includes model inference and hypothesis generation
- LLM response generation (async-processing, variable with model size) — User waits for the AI response; streaming reduces perceived latency
- TTS synthesis (audio-generation, ~100 ms for the initial chunk) — Delay before audio playback starts; subsequent chunks stream in real time
- WebSocket frame transmission (network-latency, network dependent) — Affects end-to-end conversation latency; includes serialization overhead
Control Points
- VAD threshold (threshold) — Controls: Voice activity detection sensitivity for turn-taking decisions. Default: threshold config parameter
- Model device selection (device-selection) — Controls: Whether models run on CPU or GPU for inference. Default: cuda
- LLM temperature (hyperparameter) — Controls: Response creativity vs determinism in language generation. Default: temperature config parameter
- Buffer cache lengths (architecture-switch) — Controls: Memory usage vs context window size for streaming models. Default: spkcache_len=188
- Max speakers (threshold) — Controls: Maximum number of speakers the diarization system can track simultaneously. Default: max_num_speakers config parameter
Technology Stack
- PyTorch Lightning: Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training
- Pipecat: Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents
- FastAPI: HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions
- HuggingFace Transformers: Model loading and tokenization for language models, providing standardized interfaces for LLM inference
- Lhotse: Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation
- ONNX Runtime: Optimized inference engine for deployed models, providing cross-platform model execution
- WebSocket transport: Real-time audio streaming between browser clients and the voice agent server
- Hydra: Configuration management system enabling hierarchical configs and experiment composition
Key Components
- NemoStreamingASRService (processor) — Performs real-time speech recognition by processing streaming audio chunks and generating transcriptions with end-of-utterance detection
  nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py
- NeMoStreamingDiarService (processor) — Identifies speakers in real-time audio streams using Sortformer model with streaming state management and speaker caching
  nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py
- HuggingFaceLLMService (processor) — Manages LLM inference for conversational AI, handling model loading, prompt formatting, and streaming response generation with tool calling support
  nemo/agents/voice_agent/pipecat/services/nemo/llm.py
- RTVIObserver (monitor) — Observes frame flow in the Pipecat pipeline and translates events to RTVI protocol messages for WebSocket communication with clients
  nemo/agents/voice_agent/pipecat/processors/frameworks/rtvi.py
- CacheFeatureBufferer (store) — Manages circular buffers for audio features in streaming models, handling cache updates and feature extraction for real-time processing
  nemo/agents/voice_agent/pipecat/services/nemo/utils.py
- AudioLogger (monitor) — Records audio streams and transcriptions to disk for debugging and analysis, organizing sessions with metadata and timestamps
  nemo/agents/voice_agent/pipecat/services/nemo/audio_logger.py
- NeMoTurnTakingService (orchestrator) — Coordinates conversation flow by detecting when to interrupt TTS playback based on user speech patterns and VAD signals
  nemo/agents/voice_agent/pipecat/services/nemo/turn_taking.py
- SortformerEncLabelModel (processor) — Neural network model for speaker diarization using Sortformer architecture, predicts speaker labels from audio embeddings
  nemo/collections/asr/models/
- OpenAILLMContext (store) — Manages conversation history and context for LLM interactions, tracking messages and maintaining dialogue state
  pipecat.processors.aggregators.openai_llm_context
- PipelineRunner (orchestrator) — Executes Pipecat pipelines, managing frame flow between processors and handling async task coordination
  pipecat.pipeline.runner
Frequently Asked Questions
What is NeMo used for?
Orchestrates multimodal AI training and inference for speech, language, and conversational agents. nvidia-nemo/nemo is a 10-component ML training framework written in Python. Data flows through 7 distinct pipeline stages. The codebase contains 1346 files.
How is NeMo architected?
NeMo is organized into 4 architecture layers: Collections, Core Framework, Voice Agents, Utilities & Tools. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through NeMo?
Data moves through 7 stages: Receive WebSocket audio → Stream audio chunks → Identify speakers → Generate LLM response → Synthesize speech → Handle turn-taking → Stream response audio. Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization; the transcribed text flows to the LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking. This pipeline design reflects a multi-stage real-time processing system.
What technologies does NeMo use?
The core stack includes PyTorch Lightning (Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training), Pipecat (Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents), FastAPI (HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions), HuggingFace Transformers (Model loading and tokenization for language models, providing standardized interfaces for LLM inference), Lhotse (Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation), ONNX Runtime (Optimized inference engine for deployed models, providing cross-platform model execution), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does NeMo have?
NeMo exhibits 4 data pools (Feature cache buffers, Conversation context, Model checkpoints, Audio session logs), 4 feedback loops, 5 control points, and 4 delays. The feedback loops cover streaming transcription, memory update, interrupt handling, and cache refresh. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does NeMo use?
5 design patterns detected: Collection-based architecture, Streaming state management, Frame-based pipeline, Async service coordination, Model registry pattern.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.