nvidia-nemo/nemo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

17,099 stars · Python · 10 components

Orchestrates multimodal AI training and inference for speech, language, and conversational agents


Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 10-component ML training framework. 1,346 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization services. The transcribed text flows to an LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking. A simplified wiring sketch follows the numbered steps below.

  1. Receive WebSocket audio — FastAPI server receives WebSocket connections from clients, establishes Pipecat pipeline with WebSocketTransport, and begins streaming audio frame processing
  2. Stream audio chunks — NemoStreamingASRService processes incoming audio using CacheFeatureBufferer to manage sliding window features, runs RNNT model inference to generate transcription hypotheses [Audio frames → ASRResult] (config: asr_model, sample_rate)
  3. Identify speakers — NeMoStreamingDiarService runs Sortformer model on audio features with streaming state management, outputs speaker labels synchronized with transcriptions [Audio features → DiarResultFrame] (config: max_num_speakers, spkcache_len)
  4. Generate LLM response — HuggingFaceLLMService receives transcribed text, formats with conversation context from OpenAILLMContext, streams response from local or remote LLM with tool calling support [ASRResult → LLM responses] (config: model_path, temperature, max_tokens)
  5. Synthesize speech — NeMoFastPitchHiFiGANTTSService converts LLM response text to audio using FastPitch for mel spectrograms and HiFiGAN for waveform generation [LLM responses → Audio output] (config: tts_model, vocoder_model)
  6. Handle turn-taking — NeMoTurnTakingService monitors VAD signals and user speech activity, sends interruption signals to stop TTS playback when user starts speaking [VAD events → Turn-taking decisions] (config: vad_threshold)
  7. Stream response audio — RTVIObserver translates pipeline frames to RTVI protocol messages, WebSocketTransport sends synthesized audio chunks back to client for playback [Audio output]
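To make the hand-offs between these stages concrete, here is a minimal sketch of how they might be wired together. The service objects, method names, and the WebSocket iteration are illustrative stand-ins, not the actual NeMo or Pipecat APIs; only the ASRResult fields (text, is_final) come from the data models listed below, and stage 6 (turn-taking interruption) is omitted for brevity.

    # Hypothetical wiring of the stages above. Service classes and method
    # signatures are illustrative placeholders, not the real NeMo/Pipecat APIs.
    async def handle_session(websocket, asr, diar, llm, tts):
        """One voice-agent session: audio frames in, synthesized speech out."""
        context = []                                   # conversation history passed to the LLM
        async for chunk in websocket:                  # 1. receive WebSocket audio
            result = asr.process(chunk)                # 2. stream audio chunks -> ASRResult
            speaker = diar.process(chunk)              # 3. identify speakers
            if not result.is_final:
                continue                               # wait for an end-of-utterance decision
            context.append({"role": "user", "speaker": speaker, "text": result.text})
            reply = await llm.generate(context)        # 4. generate LLM response
            context.append({"role": "assistant", "text": reply})
            async for audio in tts.synthesize(reply):  # 5. synthesize speech
                await websocket.send(audio)            # 7. stream response audio back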

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ASRResult nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py
dataclass with text: str, is_final: bool, eou_prob: Optional[float], eob_prob: Optional[float], eou_latency: Optional[float], eob_latency: Optional[float], processing_time: Optional[float]
Created from ASR model output with transcription text and confidence scores, consumed by voice agent pipeline for real-time speech processing
DiarResultFrame nemo/agents/voice_agent/pipecat/frames/frames.py
dataclass inheriting DataFrame with diar_result: np.ndarray | int, stream_id: str = 'default'
Generated by speaker diarization models to identify which speaker is talking, flows through Pipecat pipeline for multi-speaker conversation handling
DiarizationConfig nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py
dataclass with model_path: str, device: str, log: bool, max_num_speakers: int, spkcache_len: int, spkcache_refresh_rate: int, fifo_len: int
Configuration object created at service startup, defines streaming diarization model parameters and buffer sizes for real-time speaker tracking
Hypothesis nemo/collections/asr/parts/utils/rnnt_utils.py
class representing ASR decoding hypothesis with text tokens, scores, and alignment information
Created during RNNT beam search decoding, contains partial and complete transcription hypotheses with confidence scores
BatchConfig nemo/collections/asr/data/ssl_dataset.py
dataclass with audio: Union[Tensor, None], audio_len: Union[Tensor, None], noise: Union[Tensor, None], noise_len: Union[Tensor, None], noisy_audio: Union[Tensor, None], noisy_audio_len: Union[Tensor, None]
Batched audio tensors for SSL training, contains clean and noisy audio pairs with length information for contrastive learning
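Reconstructed from the field lists above, the following sketch shows the two voice-agent contracts as plain dataclasses. The Pipecat DataFrame base class is replaced with an empty placeholder so the snippet stands alone, and the None defaults on the optional ASRResult fields are assumptions for readability, not taken from the repository.

    from dataclasses import dataclass
    from typing import Optional, Union

    import numpy as np


    @dataclass
    class DataFrame:
        """Placeholder for pipecat.frames.frames.DataFrame (assumed base class)."""


    @dataclass
    class ASRResult:
        """Streaming transcription result emitted by the ASR service."""
        text: str
        is_final: bool
        eou_prob: Optional[float] = None        # utterance-boundary probabilities
        eob_prob: Optional[float] = None
        eou_latency: Optional[float] = None
        eob_latency: Optional[float] = None
        processing_time: Optional[float] = None


    @dataclass
    class DiarResultFrame(DataFrame):
        """Speaker labels aligned with the current audio chunk."""
        diar_result: Union[np.ndarray, int]
        stream_id: str = "default"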

Hidden Assumptions

Assumptions this code relies on but never validates; these are what cause silent failures when the system changes.

critical Shape weakly guarded

Audio input tensors are expected to have shape (batch_size, sequence_length) at the sample rate the model was trained on (16 kHz), but the service only validates tensor type, not shape or sample-rate compatibility

If this fails: A client sending 44.1 kHz audio or differently shaped tensors causes the ASR model to produce garbled transcriptions or crash during feature extraction, without a clear error message

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:NemoStreamingASRService
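A hypothetical guard for this assumption: check sample rate and tensor shape before feature extraction instead of relying on the client to send 16 kHz audio. This is a sketch, not code from NemoStreamingASRService.

    import torch

    EXPECTED_SAMPLE_RATE = 16_000   # assumed model training sample rate (see above)


    def validate_audio(audio: torch.Tensor, sample_rate: int) -> torch.Tensor:
        """Reject audio the ASR model was not trained for, with an explicit error."""
        if sample_rate != EXPECTED_SAMPLE_RATE:
            raise ValueError(
                f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz; "
                "resample before streaming to the ASR service"
            )
        if audio.dim() == 1:                 # allow (samples,) by adding a batch dim
            audio = audio.unsqueeze(0)
        if audio.dim() != 2:                 # expected shape: (batch_size, sequence_length)
            raise ValueError(f"expected (batch, samples), got shape {tuple(audio.shape)}")
        return audio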
critical Contract unguarded

Feature vectors fed to the cache are expected to have consistent dimensionality (spkcache_len=188) across all audio chunks, but the buffer never validates that feature dimensions match its size

If this fails: Mismatched feature dimensions cause silent array indexing errors or memory corruption in circular buffer operations, leading to incorrect speaker predictions

nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:CacheFeatureBufferer
critical Temporal unguarded

Audio frames are assumed to arrive in temporal order from WebSocket clients, with processing completing before the next frame arrives, but no sequence numbering or timing validation exists

If this fails: Network reordering or processing delays cause ASR context windows to contain out-of-order audio, producing nonsensical transcriptions that appear valid to downstream LLM

examples/voice_agent/server/server.py:WebSocket handler
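One hedged mitigation for this assumption would be to tag each audio frame with a sequence number at the transport boundary and gate out-of-order frames before they reach the ASR context window. The classes below are illustrative, not part of the server.

    from dataclasses import dataclass, field


    @dataclass
    class SequencedFrame:
        seq: int            # monotonically increasing counter assigned by the sender
        payload: bytes      # raw audio chunk


    @dataclass
    class OrderedFrameGate:
        """Forward frames strictly in sequence order; record anything late."""
        next_seq: int = 0
        late: list = field(default_factory=list)

        def accept(self, frame: SequencedFrame):
            if frame.seq == self.next_seq:
                self.next_seq += 1
                return frame                 # in order: safe to append to the ASR window
            self.late.append(frame.seq)      # out of order: log and drop (or re-buffer)
            return None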
critical Resource weakly guarded

The local LLM inference server is assumed to remain responsive and to have sufficient GPU memory for concurrent requests, but the service only checks the initial connection without monitoring server health

If this fails: GPU OOM or server deadlock causes silent request failures, leaving users waiting indefinitely for responses while pipeline appears to be functioning normally

nemo/agents/voice_agent/pipecat/services/nemo/llm.py:HuggingFaceLLMService
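A hedged safeguard for this assumption: bound each LLM request with a timeout so a hung or out-of-memory inference server surfaces as an explicit error rather than an indefinite wait. Here llm.generate is a hypothetical coroutine standing in for the actual service call.

    import asyncio


    async def generate_with_timeout(llm, context, timeout_s: float = 30.0) -> str:
        """Fail loudly if the LLM backend stops responding."""
        try:
            return await asyncio.wait_for(llm.generate(context), timeout=timeout_s)
        except asyncio.TimeoutError:
            raise RuntimeError(
                f"LLM server did not respond within {timeout_s}s; "
                "check GPU memory and inference-server health"
            )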
warning Domain unguarded

End-of-utterance probability thresholds (eou_prob, eob_prob) are calibrated for English conversational speech patterns but are applied to any language without validation

If this fails: Non-English languages or domain-specific speech (medical, legal) trigger premature utterance boundaries, causing conversation flow interruptions and truncated transcriptions

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:ASRResult
warning Scale unguarded

The hardcoded maximum of 4 speakers (max_num_speakers=4) is assumed to cover all use cases, but speaker count can exceed this in conference calls or group meetings

If this fails: With >4 speakers, diarization model assigns overlapping IDs to different people, causing conversation context to become confused and responses to be attributed to wrong participants

nemo/agents/voice_agent/pipecat/services/nemo/streaming_diar.py:DiarizationConfig
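An illustrative check for this assumption: warn when the diarizer's speaker slots are saturated instead of silently folding extra participants into existing IDs. The output layout assumed here (per-frame activity per speaker slot) is a guess for illustration, not the documented Sortformer format.

    import logging

    import numpy as np

    logger = logging.getLogger(__name__)


    def check_speaker_capacity(diar_result: np.ndarray, max_num_speakers: int = 4) -> None:
        """diar_result: assumed shape (frames, max_num_speakers) of speaker activity."""
        active = int((diar_result > 0.5).any(axis=0).sum())   # slots active in this window
        if active >= max_num_speakers:
            logger.warning(
                "Diarizer is saturated at %d speakers; additional participants "
                "will be merged into existing speaker IDs.",
                max_num_speakers,
            )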
warning Environment unguarded

RTVI protocol messages are assumed to serialize to protobuf and transmit within WebSocket frame limits, but no size limits or network-fragmentation handling exist

If this fails: Large conversation contexts or long transcriptions exceed WebSocket frame limits, causing connection drops that appear as random client disconnections

nemo/agents/voice_agent/pipecat/processors/frameworks/rtvi.py:RTVIObserver
warning Ordering weakly guarded

Pipeline processors are connected in a fixed order (ASR → LLM → TTS), but async processing can cause frame reordering between pipeline stages

If this fails: TTS synthesis begins before LLM completes response generation, producing audio output for partial responses that get overwritten by final LLM output

examples/voice_agent/server/server.py:Pipeline construction
warning Contract unguarded

The log directory is assumed to exist and be writable when AudioLogger initializes, but file-system permissions and disk space are never validated

If this fails: Insufficient disk space or permission changes cause silent logging failures, making debugging impossible when issues occur in production deployments

nemo/agents/voice_agent/pipecat/services/nemo/audio_logger.py:AudioLogger
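A hedged startup check for this assumption, using only the standard library: fail fast if the log directory is missing, unwritable, or the disk is nearly full. The headroom threshold is an assumption, not a value from the repository.

    import os
    import shutil

    MIN_FREE_BYTES = 1 * 1024**3    # assumed headroom (~1 GB) for session WAV files


    def check_log_dir(path: str) -> None:
        """Raise before the first session if audio logging cannot possibly succeed."""
        os.makedirs(path, exist_ok=True)
        if not os.access(path, os.W_OK):
            raise PermissionError(f"log directory {path!r} is not writable")
        free = shutil.disk_usage(path).free
        if free < MIN_FREE_BYTES:
            raise OSError(f"only {free / 1e9:.1f} GB free under {path!r}; audio logs may be dropped")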
info Temporal unguarded

ASR hypothesis states are assumed to remain valid across audio chunks, with no handling for context-window expiration or model-state staleness

If this fails: Long pauses in conversation leave stale partial hypotheses in memory, causing old partial text to suddenly appear in new transcriptions after silence periods

nemo/agents/voice_agent/pipecat/services/nemo/streaming_asr.py:Hypothesis tracking
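A sketch of one possible mitigation: track the last time speech was observed and clear the streaming decoder state after a long pause, so stale partial hypotheses cannot leak into the next utterance. The threshold and reset hook are assumptions, not values from the repository.

    import time


    class HypothesisStateGuard:
        """Decide when streaming ASR state should be reset after silence."""

        def __init__(self, max_idle_s: float = 5.0):
            self.max_idle_s = max_idle_s            # assumed silence threshold
            self.last_speech_ts = time.monotonic()

        def on_speech(self) -> None:
            self.last_speech_ts = time.monotonic()

        def should_reset(self) -> bool:
            # True when the pause is long enough that cached partial hypotheses
            # are likely stale and the decoder state should be cleared.
            return time.monotonic() - self.last_speech_ts > self.max_idle_s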

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Feature cache buffers (buffer)
Circular buffers storing audio features for streaming models, enabling real-time processing with context windows
Conversation context (state-store)
Maintains dialogue history and system prompts for LLM context, tracks user and assistant messages
Model checkpoints (file-store)
Stores trained model weights and configurations for ASR, TTS, and diarization models
Audio logs (file-store)
Session-organized WAV files and JSON metadata for debugging and analysis
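To illustrate the first pool, here is a minimal ring buffer for streaming audio features. It is not the actual CacheFeatureBufferer implementation, only a sketch of the circular-buffer idea: a fixed-length context window in which new feature frames overwrite the oldest ones first.

    import numpy as np


    class FeatureRingBuffer:
        """Fixed-capacity circular buffer of feature vectors for streaming models."""

        def __init__(self, capacity: int, feature_dim: int):
            self.buf = np.zeros((capacity, feature_dim), dtype=np.float32)
            self.capacity = capacity
            self.count = 0                  # total frames ever written

        def push(self, frames: np.ndarray) -> None:
            """frames: (n, feature_dim) new feature vectors, oldest entries overwritten."""
            for row in frames:
                self.buf[self.count % self.capacity] = row
                self.count += 1

        def window(self) -> np.ndarray:
            """Return buffered frames in temporal order (oldest first)."""
            if self.count < self.capacity:
                return self.buf[: self.count]
            start = self.count % self.capacity
            return np.roll(self.buf, -start, axis=0)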


Technology Stack

PyTorch Lightning (framework)
Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training
Pipecat (framework)
Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents
FastAPI (framework)
HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions
HuggingFace Transformers (library)
Model loading and tokenization for language models, providing standardized interfaces for LLM inference
Lhotse (library)
Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation
ONNX Runtime (runtime)
Optimized inference engine for deployed models, providing cross-platform model execution
WebRTC/WebSocket (infra)
Real-time audio streaming between browser clients and voice agent server
Hydra/OmegaConf (library)
Configuration management system enabling hierarchical configs and experiment composition


Frequently Asked Questions

What is NeMo used for?

NeMo orchestrates multimodal AI training and inference for speech, language, and conversational agents. nvidia-nemo/nemo is a 10-component ML training framework written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 1,346 files.

How is NeMo architected?

NeMo is organized into 4 architecture layers: Collections, Core Framework, Voice Agents, Utilities & Tools. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through NeMo?

Data moves through 7 stages: Receive WebSocket audio → Stream audio chunks → Identify speakers → Generate LLM response → Synthesize speech → Handle turn-taking → Stream response audio. Audio enters the system through WebSocket connections and is processed simultaneously by streaming ASR and diarization services; the transcribed text flows to an LLM for response generation, and the response is synthesized to speech by TTS and streamed back to the client. The voice agent maintains conversation context while handling real-time interruptions and turn-taking. This pipeline design reflects a complex multi-stage processing system.

What technologies does NeMo use?

The core stack includes PyTorch Lightning (Provides distributed training infrastructure with automatic scaling, checkpointing, and logging for neural network training), Pipecat (Real-time conversational AI framework handling audio streaming, frame processing, and WebSocket communication for voice agents), FastAPI (HTTP server framework providing WebSocket endpoints and REST APIs for voice agent interactions), HuggingFace Transformers (Model loading and tokenization for language models, providing standardized interfaces for LLM inference), Lhotse (Audio dataset processing toolkit for speech data preparation, augmentation, and batch generation), ONNX Runtime (Optimized inference engine for deployed models, providing cross-platform model execution), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does NeMo have?

NeMo exhibits 4 data pools (Feature cache buffers, Conversation context), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle streaming and memory-update behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does NeMo use?

5 design patterns detected: Collection-based architecture, Streaming state management, Frame-based pipeline, Async service coordination, Model registry pattern.

Analyzed on April 20, 2026 by CodeSea.