nvidia-nemo/nemo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

17,099 stars · Python · 10 components

Orchestrates multimodal AI training and inference for speech, language, and conversational agents


Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 10-component ML training framework. 1,346 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization services. The transcribed text flows to an LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking. A simplified wiring sketch follows the numbered steps below.

  1. Receive WebSocket audio — FastAPI server receives WebSocket connections from clients, establishes Pipecat pipeline with WebSocketTransport, and begins streaming audio frame processing
  2. Stream audio chunks — NemoStreamingASRService processes incoming audio using CacheFeatureBufferer to manage sliding window features, runs RNNT model inference to generate transcription hypotheses [Audio frames → ASRResult] (config: asr_model, sample_rate)
  3. Identify speakers — NeMoStreamingDiarService runs Sortformer model on audio features with streaming state management, outputs speaker labels synchronized with transcriptions [Audio features → DiarResultFrame] (config: max_num_speakers, spkcache_len)
  4. Generate LLM response — HuggingFaceLLMService receives transcribed text, formats with conversation context from OpenAILLMContext, streams response from local or remote LLM with tool calling support [ASRResult → LLM responses] (config: model_path, temperature, max_tokens)
  5. Synthesize speech — NeMoFastPitchHiFiGANTTSService converts LLM response text to audio using FastPitch for mel spectrograms and HiFiGAN for waveform generation [LLM responses → Audio output] (config: tts_model, vocoder_model)
  6. Handle turn-taking — NeMoTurnTakingService monitors VAD signals and user speech activity, sends interruption signals to stop TTS playback when user starts speaking [VAD events → Turn-taking decisions] (config: vad_threshold)
  7. Stream response audio — RTVIObserver translates pipeline frames to RTVI protocol messages, WebSocketTransport sends synthesized audio chunks back to client for playback [Audio output]
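To make the hand-offs between these stages concrete, here is a minimal sketch of how they might be wired together. The service objects, method names, and the WebSocket iteration are illustrative stand-ins, not the actual NeMo or Pipecat APIs; only the ASRResult fields (text, is_final) come from the data models listed below, and stage 6 (turn-taking interruption) is omitted for brevity.

    # Hypothetical wiring of the stages above. Service classes and method
    # signatures are illustrative placeholders, not the real NeMo/Pipecat APIs.
    async def handle_session(websocket, asr, diar, llm, tts):
        """One voice-agent session: audio frames in, synthesized speech out."""
        context = []                                   # conversation history passed to the LLM
        async for chunk in websocket:                  # 1. receive WebSocket audio
            result = asr.process(chunk)                # 2. stream audio chunks -> ASRResult
            speaker = diar.process(chunk)              # 3. identify speakers
            if not result.is_final:
                continue                               # wait for an end-of-utterance decision
            context.append({"role": "user", "speaker": speaker, "text": result.text})
            reply = await llm.generate(context)        # 4. generate LLM response
            context.append({"role": "assistant", "text": reply})
            async for audio in tts.synthesize(reply):  # 5. synthesize speech
                await websocket.send(audio)            # 7. stream response audio back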

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ASRResult nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py
dataclass with text: str, is_final: bool, eou_prob: Optional[float], eob_prob: Optional[float], eou_latency: Optional[float], eob_latency: Optional[float], processing_time: Optional[float]
Created from ASR model output with transcription text and confidence scores, consumed by voice agent pipeline for real-time speech processing
DiarResultFrame nemo/agents/voice_agent/pipecat/frames/frames.py
dataclass inheriting DataFrame with diar_result: np.ndarray | int, stream_id: str = 'default'
Generated by speaker diarization models to identify which speaker is talking, flows through Pipecat pipeline for multi-speaker conversation handling
DiarizationConfig nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py
dataclass with model_path: str, device: str, log: bool, max_num_speakers: int, spkcache_len: int, spkcache_refresh_rate: int, fifo_len: int
Configuration object created at service startup, defines streaming diarization model parameters and buffer sizes for real-time speaker tracking
Hypothesis nemo/collections/asr/parts/utils/rnnt_utils.py
class representing ASR decoding hypothesis with text tokens, scores, and alignment information
Created during RNNT beam search decoding, contains partial and complete transcription hypotheses with confidence scores
BatchConfig nemo/collections/asr/data/ssl_dataset.py
dataclass with audio: Union[Tensor, None], audio_len: Union[Tensor, None], noise: Union[Tensor, None], noise_len: Union[Tensor, None], noisy_audio: Union[Tensor, None], noisy_audio_len: Union[Tensor, None]
Batched audio tensors for SSL training, contains clean and noisy audio pairs with length information for contrastive learning
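Reconstructed from the field lists above, the following sketch shows the two voice-agent contracts as plain dataclasses. The Pipecat DataFrame base class is replaced with an empty placeholder so the snippet stands alone, and the None defaults on the optional ASRResult fields are assumptions for readability, not taken from the repository.

    from dataclasses import dataclass
    from typing import Optional, Union

    import numpy as np


    @dataclass
    class DataFrame:
        """Placeholder for pipecat.frames.frames.DataFrame (assumed base class)."""


    @dataclass
    class ASRResult:
        """Streaming transcription result emitted by the ASR service."""
        text: str
        is_final: bool
        eou_prob: Optional[float] = None        # utterance-boundary probabilities
        eob_prob: Optional[float] = None
        eou_latency: Optional[float] = None
        eob_latency: Optional[float] = None
        processing_time: Optional[float] = None


    @dataclass
    class DiarResultFrame(DataFrame):
        """Speaker labels aligned with the current audio chunk."""
        diar_result: Union[np.ndarray, int]
        stream_id: str = "default"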

Hidden Assumptions

Assumptions this code relies on but never validates; these are what cause silent failures when the system changes.

critical Shape weakly guarded

Audio input tensors are expected to have shape (batch_size, sequence_length) at the sample rate the model was trained on (16 kHz), but the service only validates tensor type, not shape or sample-rate compatibility

If this fails: A client sending 44.1 kHz audio or differently shaped tensors causes the ASR model to produce garbled transcriptions or crash during feature extraction, without a clear error message

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:NemoStreamingASRService
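A hypothetical guard for this assumption: check sample rate and tensor shape before feature extraction instead of relying on the client to send 16 kHz audio. This is a sketch, not code from NemoStreamingASRService.

    import torch

    EXPECTED_SAMPLE_RATE = 16_000   # assumed model training sample rate (see above)


    def validate_audio(audio: torch.Tensor, sample_rate: int) -> torch.Tensor:
        """Reject audio the ASR model was not trained for, with an explicit error."""
        if sample_rate != EXPECTED_SAMPLE_RATE:
            raise ValueError(
                f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; "
                "resample before streaming to the ASR service"
            )
        if audio.dim() == 1:                 # allow (samples,) by adding a batch dim
            audio = audio.unsqueeze(0)
        if audio.dim() != 2:                 # expected shape: (batch_size, sequence_length)
            raise ValueError(f"expected (batch, samples), got shape {tuple(audio.shape)}")
        return audio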
critical Contract unguarded

Feature vectors fed to the cache are expected to have consistent dimensionality (spkcache_len=188) across all audio chunks, but the buffer never validates that feature dimensions match its size

If this fails: Mismatched feature dimensions cause silent array indexing errors or memory corruption in circular buffer operations, leading to incorrect speaker predictions

nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:CacheFeatureBufferer
critical Temporal unguarded

Audio frames are assumed to arrive in temporal order from WebSocket clients, with processing completing before the next frame arrives, but no sequence numbering or timing validation exists

If this fails: Network reordering or processing delays cause ASR context windows to contain out-of-order audio, producing nonsensical transcriptions that appear valid to downstream LLM

examples/voice_agent/server/server.py:WebSocket handler
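One hedged mitigation for this assumption would be to tag each audio frame with a sequence number at the transport boundary and gate out-of-order frames before they reach the ASR context window. The classes below are illustrative, not part of the server.

    from dataclasses import dataclass, field


    @dataclass
    class SequencedFrame:
        seq: int            # monotonically increasing counter assigned by the sender
        payload: bytes      # raw audio chunk


    @dataclass
    class OrderedFrameGate:
        """Forward frames strictly in sequence order; record anything late."""
        next_seq: int = 0
        late: list = field(default_factory=list)

        def accept(self, frame: SequencedFrame):
            if frame.seq == self.next_seq:
                self.next_seq += 1
                return frame                 # in order: safe to append to the ASR window
            self.late.append(frame.seq)      # out of order: log and drop (or re-buffer)
            return None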
critical Resource weakly guarded

The local LLM inference server is assumed to remain responsive and to have sufficient GPU memory for concurrent requests, but the service only checks the initial connection without monitoring server health

If this fails: GPU OOM or server deadlock causes silent request failures, leaving users waiting indefinitely for responses while pipeline appears to be functioning normally

nemo/agents/voice_agent/pipecat/services/nemo/llm.py:HuggingFaceLLMService
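A hedged safeguard for this assumption: bound each LLM request with a timeout so a hung or out-of-memory inference server surfaces as an explicit error rather than an indefinite wait. Here llm.generate is a hypothetical coroutine standing in for the actual service call.

    import asyncio


    async def generate_with_timeout(llm, context, timeout_s: float = 30.0) -> str:
        """Fail loudly if the LLM backend stops responding."""
        try:
            return await asyncio.wait_for(llm.generate(context), timeout=timeout_s)
        except asyncio.TimeoutError:
            raise RuntimeError(
                f"LLM server did not respond within {timeout_s}s; "
                "check GPU memory and inference-server health"
            )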
warning Domain unguarded

End-of-utterance probability thresholds (eou_prob, eob_prob) are calibrated for English conversational speech patterns but are applied to any language without validation

If this fails: Non-English languages or domain-specific speech (medical, legal) trigger premature utterance boundaries, causing conversation flow interruptions and truncated transcriptions

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:ASRResult
warning Scale unguarded

The hardcoded maximum of 4 speakers (max_num_speakers=4) is assumed to cover all use cases, but speaker count can exceed this in conference calls or group meetings

If this fails: With >4 speakers, diarization model assigns overlapping IDs to different people, causing conversation context to become confused and responses to be attributed to wrong participants

nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:DiarizationConfig
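An illustrative check for this assumption: warn when the diarizer's speaker slots are saturated instead of silently folding extra participants into existing IDs. The output layout assumed here (per-frame activity per speaker slot) is a guess for illustration, not the documented Sortformer format.

    import logging

    import numpy as np

    logger = logging.getLogger(__name__)


    def check_speaker_capacity(diar_result: np.ndarray, max_num_speakers: int = 4) -> None:
        """diar_result: assumed shape (frames, max_num_speakers) of speaker activity."""
        active = int((diar_result > 0.5).any(axis=0).sum())   # slots active in this window
        if active >= max_num_speakers:
            logger.warning(
                "Diarizer is saturated at %d speakers; additional participants "
                "will be merged into existing speaker IDs.",
                max_num_speakers,
            )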
warning Environment unguarded

RTVI protocol messages are assumed to serialize to protobuf and transmit within WebSocket frame limits, but no size limits or network-fragmentation handling exist

If this fails: Large conversation contexts or long transcriptions exceed WebSocket frame limits, causing connection drops that appear as random client disconnections

nemo/agents/voice_agent/pipecat/processors/frameworks/rtvi.py:RTVIObserver
warning Ordering weakly guarded

Pipeline processors are connected in a fixed order (ASR → LLM → TTS), but async processing can cause frame reordering between pipeline stages

If this fails: TTS synthesis begins before LLM completes response generation, producing audio output for partial responses that get overwritten by final LLM output

examples/voice_agent/server/server.py:Pipeline construction
warning Contract unguarded

The log directory is assumed to exist and be writable when AudioLogger initializes, but file-system permissions and disk space are never validated

If this fails: Insufficient disk space or permission changes cause silent logging failures, making debugging impossible when issues occur in production deployments

nemo/agents/voice_agent/pipecat/services/nemo/audio_logger.py:AudioLogger
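A hedged startup check for this assumption, using only the standard library: fail fast if the log directory is missing, unwritable, or the disk is nearly full. The headroom threshold is an assumption, not a value from the repository.

    import os
    import shutil

    MIN_FREE_BYTES = 1 * 1024**3    # assumed headroom (~1 GB) for session WAV files


    def check_log_dir(path: str) -> None:
        """Raise before the first session if audio logging cannot possibly succeed."""
        os.makedirs(path, exist_ok=True)
        if not os.access(path, os.W_OK):
            raise PermissionError(f"log directory {path!r} is not writable")
        free = shutil.disk_usage(path).free
        if free < MIN_FREE_BYTES:
            raise OSError(f"only {free / 1e9:.1f} GB free under {path!r}; audio logs may be dropped")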
info Temporal unguarded

ASR hypothesis states are assumed to remain valid across audio chunks, with no handling for context-window expiration or model-state staleness

If this fails: Long pauses in conversation leave stale partial hypotheses in memory, causing old partial text to suddenly appear in new transcriptions after silence periods

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:Hypothesis tracking
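A sketch of one possible mitigation: track the last time speech was observed and clear the streaming decoder state after a long pause, so stale partial hypotheses cannot leak into the next utterance. The threshold and reset hook are assumptions, not values from the repository.

    import time


    class HypothesisStateGuard:
        """Decide when streaming ASR state should be reset after silence."""

        def __init__(self, max_idle_s: float = 5.0):
            self.max_idle_s = max_idle_s            # assumed silence threshold
            self.last_speech_ts = time.monotonic()

        def on_speech(self) -> None:
            self.last_speech_ts = time.monotonic()

        def should_reset(self) -> bool:
            # True when the pause is long enough that cached partial hypotheses
            # are likely stale and the decoder state should be cleared.
            return time.monotonic() - self.last_speech_ts > self.max_idle_s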

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Feature cache buffers (buffer)
Circular buffers storing audio features for streaming models, enabling real-time processing with context windows
Conversation context (state-store)
Maintains dialogue history and system prompts for LLM context, tracks user and assistant messages
Model checkpoints (file-store)
Stores trained model weights and configurations for ASR, TTS, and diarization models
Audio logs (file-store)
Session-organized WAV files and JSON metadata for debugging and analysis
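To illustrate the first pool, here is a minimal ring buffer for streaming audio features. It is not the actual CacheFeatureBufferer implementation, only a sketch of the circular-buffer idea: a fixed-length context window in which new feature frames overwrite the oldest ones first.

    import numpy as np


    class FeatureRingBuffer:
        """Fixed-capacity circular buffer of feature vectors for streaming models."""

        def __init__(self, capacity: int, feature_dim: int):
            self.buf = np.zeros((capacity, feature_dim), dtype=np.float32)
            self.capacity = capacity
            self.count = 0                  # total frames ever written

        def push(self, frames: np.ndarray) -> None:
            """frames: (n, feature_dim) new feature vectors, oldest entries overwritten."""
            for row in frames:
                self.buf[self.count % self.capacity] = row
                self.count += 1

        def window(self) -> np.ndarray:
            """Return buffered frames in temporal order (oldest first)."""
            if self.count < self.capacity:
                return self.buf[: self.count]
            start = self.count % self.capacity
            return np.roll(self.buf, -start, axis=0)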


Technology Stack

PyTorch Lightning (framework)
Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training
Pipecat (framework)
Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents
FastAPI (framework)
HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions
HuggingFace Transformers (library)
Model loading and tokenization for language models, providing standardized interfaces for LLM inference
Lhotse (library)
Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation
ONNX Runtime (runtime)
Optimized inference engine for deployed models, providing cross-platform model execution
WebRTC/WebSocket (infra)
Real-time audio streaming between browser clients and voice agent server
Hydra/OmegaConf (library)
Configuration management system enabling hierarchical configs and experiment composition


Frequently Asked Questions

What is NeMo used for?

NeMo orchestrates multimodal AI training and inference for speech, language, and conversational agents. nvidia-nemo/nemo is a 10-component ML training framework written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 1,346 files.

How is NeMo architected?

NeMo is organized into 4 architecture layers: Collections, Core Framework, Voice Agents, Utilities & Tools. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through NeMo?

Data moves through 7 stages: Receive WebSocket audio → Stream audio chunks → Identify speakers → Generate LLM response → Synthesize speech → Handle turn-taking → Stream response audio. Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization services; the transcribed text flows to an LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking. This pipeline design reflects a complex multi-stage processing system.

What technologies does NeMo use?

The core stack includes PyTorch Lightning (Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training), Pipecat (Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents), FastAPI (HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions), HuggingFace Transformers (Model loading and tokenization for language models, providing standardized interfaces for LLM inference), Lhotse (Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation), ONNX Runtime (Optimized inference engine for deployed models, providing cross-platform model execution), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does NeMo have?

NeMo exhibits 4 data pools (Feature cache buffers, Conversation context), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle streaming and memory-update behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does NeMo use?

5 design patterns detected: Collection-based architecture, Streaming state management, Frame-based pipeline, Async service coordination, Model registry pattern.

Analyzed on April 20, 2026 by CodeSea.