neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
Accelerates sparse neural network inference on CPUs using specialized runtime optimizations
Under the hood, the system uses 2 feedback loops, 2 data pools, and 5 control points to manage its runtime behavior.
A 6-component ML inference system with 426 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
Raw inputs (text, images) enter through Pipeline.create() which handles task-specific preprocessing like tokenization. The EngineOperator converts preprocessed data into tensors, passes them to the compiled DeepSparse engine for sparse inference, then postprocessing operators transform raw outputs into structured results. Throughout this flow, LoggerManager captures performance metrics and request details.
- Pipeline creation and model compilation — Pipeline.create() loads the specified model path, then compile_model() analyzes the ONNX graph for sparsity patterns and generates optimized execution plans for CPU inference (config: model, batch_size)
- Input preprocessing — Task-specific operators tokenize text inputs, resize images, or apply other domain transformations to create standardized tensor inputs for the engine [Raw user inputs → PipelineInputs] (config: temperature, max_tokens)
- Sparse inference execution — EngineOperator validates tensor shapes and passes data to the DeepSparse engine, which executes the optimized sparse computation graph using CPU SIMD instructions [PipelineInputs → EngineOutputs]
- Output postprocessing — Task-specific operators convert raw engine tensors into structured results — sentiment labels with confidence scores, extracted answer spans, or detected bounding boxes [EngineOutputs → Task-specific results] (config: top_p, n)
- Performance logging — LoggerManager captures inference latency, resource utilization, and request metadata throughout the pipeline, routing events to configured destinations based on logger filters [LogEvent → Filtered log streams] (config: enable)
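Taken together, these stages map onto a few lines of user code. A minimal sketch, assuming the task's default SparseZoo sentiment-analysis model and the standard text-classification output schema:

```python
from deepsparse import Pipeline

# Stage 1: creating the pipeline loads the model and compiles the sparse engine
# (omitting model_path selects the task's default SparseZoo model).
pipeline = Pipeline.create(task="sentiment-analysis", batch_size=1)

# Stages 2-4 happen inside the call: tokenization, sparse inference on CPU,
# and postprocessing into structured labels with confidence scores.
result = pipeline(["The runtime was surprisingly fast on CPU."])
print(result.labels, result.scores)

# Stage 5 (performance logging) runs alongside when a logger is configured.
```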
Data Models
The data structures that flow between stages — the contracts that hold the system together.
src/deepsparse/operators/engine_operator.py — Pydantic model with engine_inputs: List[numpy.ndarray] containing tokenized and preprocessed tensors ready for engine execution
Created by preprocessing operators, consumed by EngineOperator.run(), then transformed into engine outputs
src/deepsparse/operators/engine_operator.py — Pydantic model with engine_outputs: List[numpy.ndarray] containing raw tensor outputs from the sparse inference engine
Generated by DeepSparse engine, passed through postprocessing operators to create final pipeline results
examples/openai-server/protocol.py — Pydantic model with model: str, messages: List[str], temperature: float=0.7, max_tokens: int=16, stream: bool=False, and 10 other OpenAI-compatible parameters
Received from HTTP clients, validated against OpenAI schema, then converted to pipeline inputs for text generation
src/deepsparse/benchmark/config.py — Configuration object capturing throughput (items/sec), latency percentiles, and system resource utilization during model benchmarking
Accumulated during benchmark runs, then serialized for CLI output or stored for analysis
src/deepsparse/loggers/ — Dictionary with timestamp, event_type (PREDICTION_LATENCY|REQUEST_DETAILS|RESOURCE_UTILIZATION), and event-specific payload data
Emitted during inference execution, filtered by logger configuration, then written to configured destinations
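These engine-facing contracts amount to thin Pydantic wrappers around lists of NumPy arrays. A minimal sketch of that shape (class names here are illustrative, not the exact ones in engine_operator.py):

```python
from typing import List

import numpy
from pydantic import BaseModel, ConfigDict

class EngineInputs(BaseModel):
    """Tokenized, preprocessed tensors ready for engine execution."""
    # arbitrary_types_allowed lets raw numpy arrays pass through validation.
    model_config = ConfigDict(arbitrary_types_allowed=True)
    engine_inputs: List[numpy.ndarray]

class EngineOutputs(BaseModel):
    """Raw tensor outputs returned by the sparse inference engine."""
    model_config = ConfigDict(arbitrary_types_allowed=True)
    engine_outputs: List[numpy.ndarray]

# Example round trip: preprocessing produces EngineInputs, the engine returns EngineOutputs.
inputs = EngineInputs(engine_inputs=[numpy.zeros((1, 128), dtype=numpy.int64)])
```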
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
INPUTS environment variable contains comma-separated strings with no embedded commas, quotes, or special characters that would break simple split() parsing
If this fails: If input strings contain commas (like 'Hello, world'), split(',') will break them into multiple inputs, causing silent data corruption where one logical input becomes multiple pipeline inputs with wrong sentiment analysis results
examples/aws-serverless/batch/app_inf/app.py:input_str
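A small reproduction of that failure mode, plus one defensive alternative (the JSON-encoded variable name is hypothetical, not what the example app uses):

```python
import json
import os

# Failure mode: naive comma splitting turns one logical input into two.
os.environ["INPUTS"] = "Hello, world"
print(os.environ["INPUTS"].split(","))   # ['Hello', ' world'] -- two pipeline inputs

# Defensive alternative: pass a JSON-encoded list so embedded commas survive.
os.environ["INPUTS_JSON"] = json.dumps(["Hello, world", "A second review"])
print(json.loads(os.environ["INPUTS_JSON"]))  # ['Hello, world', 'A second review']
```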
JSON payload in event['body'] contains exactly the parameter names that pipeline() expects (likely 'inputs' or similar) without validation of required keys
If this fails: If client sends wrong JSON structure, pipeline(**payload) will either crash with TypeError for missing required arguments or silently process with default values, returning meaningless results to the client
examples/aws-serverless/realtime/app/app.py:lambda_handler
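A defensive variant of the handler, sketched under assumptions: the required key name, the task, and the local model path are all illustrative, since the note above only says the expected key is "likely 'inputs' or similar":

```python
import json

from deepsparse import Pipeline

# Module-scope pipeline, mirroring the example app's pattern; the task and the
# local deployment path are illustrative.
pipeline = Pipeline.create(task="sentiment-analysis", model_path="./model/deployment")
REQUIRED_KEYS = ("sequences",)  # assumed key for this task; the real app may expect another name

def lambda_handler(event, context):
    # Validate the body before unpacking it into the pipeline instead of
    # trusting the client's JSON shape.
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": "body must be valid JSON"}

    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        return {"statusCode": 400, "body": f"missing required keys: {missing}"}

    result = pipeline(**payload)
    return {"statusCode": 200, "body": json.dumps({"labels": list(result.labels)})}
```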
Model files at './model/deployment' fit within Lambda's memory limits (up to 10GB) and the compiled engine doesn't exceed available RAM during batch processing
If this fails: Large sparse models or batch processing of many inputs simultaneously can cause out-of-memory kills in Lambda, terminating the function without error handling and losing all batch progress
examples/aws-serverless/batch/app_inf/app.py:pipeline
The model at the args.model path is actually compatible with the version_2_with_negative=True parameter (SQuAD 2.0 format); the code never verifies the model's capabilities
If this fails: If a SQuAD 1.0 model is passed, the pipeline will execute but may produce incorrect confidence scores or handle unanswerable questions wrong, silently degrading question-answering quality
examples/nlp-legal-cuad/main.py:pipeline
Model variants in feat.variants all return dictionaries with an 'answer' key; the code never validates the response structure before accessing answer['answer']
If this fails: If any model returns a different response format (list, string, or dict with different keys), the UI will crash with KeyError when displaying results, breaking the user experience
examples/sparseserver-ui/client/app.py:answer
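A small defensive accessor that would avoid the KeyError described above (illustrative helper, not the UI's current code):

```python
def extract_answer(response) -> str:
    """Pull the answer text out of a model response without assuming its shape."""
    if isinstance(response, dict) and "answer" in response:
        return str(response["answer"])
    # Fall back to something displayable instead of crashing the UI.
    return f"[unexpected response format: {type(response).__name__}]"

print(extract_answer({"answer": "42 days", "score": 0.91}))  # "42 days"
print(extract_answer(["not", "a", "dict"]))                  # fallback message
```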
AWS credentials are available in the Lambda environment (via IAM role, environment variables, or instance profile) and have sufficient permissions for S3 operations
If this fails: If credentials are missing or have insufficient permissions, boto3.client('s3') creation succeeds but upload_file_to_s3() will fail at runtime, losing all inference results without fallback storage
examples/aws-serverless/batch/app_inf/app.py:s3_client
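Because boto3 client creation is lazy about credentials, a cheap preflight call can surface this problem before any inference work is done. A sketch, with a hypothetical bucket name:

```python
import boto3
from botocore.exceptions import ClientError, NoCredentialsError

s3 = boto3.client("s3")  # succeeds even when no usable credentials are present

def check_s3_access(bucket: str) -> bool:
    """Exercise credentials and permissions before running the batch job."""
    try:
        s3.head_bucket(Bucket=bucket)
        return True
    except (ClientError, NoCredentialsError) as err:
        print(f"S3 access check failed: {err}")
        return False

# Bucket name is illustrative; fail fast instead of losing results at upload time.
if not check_s3_access("my-inference-results-bucket"):
    raise SystemExit("no writable results bucket; aborting batch inference")
```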
All zoo: model URLs remain accessible and the models haven't been moved, deleted, or changed format in the SparseZoo registry
If this fails: If any zoo URL becomes invalid, model loading will fail at runtime during benchmarking, causing the entire benchmark UI tab to become unusable without graceful degradation
examples/benchmark-ui/settings.py:models
Model inference completes quickly enough that perf_counter() timing is meaningful and doesn't overflow or lose precision during measurement
If this fails: For very fast inferences (microseconds) or very slow ones (hours), timing accuracy degrades and displayed performance metrics become misleading, confusing users about actual model speed
examples/sparseserver-ui/client/app.py:start,end
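One way to make the displayed numbers more robust is to report the median of several perf_counter() measurements rather than a single run (illustrative helper, not the client's current code):

```python
import statistics
import time

def median_latency(fn, *args, repeats: int = 5) -> float:
    """Median wall-clock latency over several runs, so one noisy measurement
    does not dominate what the UI displays."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```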
CUAD dataset test split contains at least one example and cuad[0] has the expected schema with 'question', 'context', and 'answers' keys
If this fails: If CUAD dataset is empty, restructured, or unavailable, cuad[0] will throw IndexError or KeyError, causing the example to crash without demonstrating the question-answering capability
examples/nlp-legal-cuad/main.py:cuad[0]
Sentiment analysis pipeline always returns an object with a 'labels' attribute that is iterable and contains serializable elements for CSV writing
If this fails: If pipeline returns different response format or labels contains complex objects, write_list_to_csv() will either crash or write meaningless string representations to the output file
examples/aws-serverless/batch/app_inf/app.py:inference.labels
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry Cache — Stores compiled engine instances keyed by model path and configuration to avoid recompilation across pipeline instantiations
- Logger Event Buffer — Accumulates performance metrics and request events before batch writing to configured destinations to minimize I/O overhead during inference
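The engine cache reduces to memoizing compilation on its configuration key. A minimal sketch, assuming deepsparse's Engine constructor and an illustrative cache rather than the actual registry:

```python
from functools import lru_cache
from typing import Optional

from deepsparse import Engine

@lru_cache(maxsize=None)
def get_engine(model_path: str, batch_size: int = 1, num_cores: Optional[int] = None) -> Engine:
    """Reuse a previously compiled engine for an identical
    (model_path, batch_size, num_cores) key instead of recompiling."""
    return Engine(model=model_path, batch_size=batch_size, num_cores=num_cores)

# First call compiles; repeated calls with the same arguments hit the cache.
# engine = get_engine("model.onnx", batch_size=1)
```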
Feedback Loops
- HTTP request retry with exponential backoff (retry, balancing) — Trigger: Server errors or timeouts in OpenAI API server. Action: Client automatically retries failed requests with increasing delay intervals. Exit: Successful response or maximum retry count reached.
- Benchmark warmup iterations (warmup, balancing) — Trigger: StreamBenchmarker initialization before performance measurement. Action: Executes model inference repeatedly to warm CPU caches and stabilize performance metrics. Exit: Specified warmup iteration count completed.
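The retry loop is a standard client-side exponential backoff. A sketch of what such a client looks like (endpoint URL, payload shape, and retry limits are assumptions):

```python
import time

import requests

def post_with_retry(url: str, payload: dict, max_retries: int = 5, base_delay: float = 0.5):
    """Retry server errors and timeouts with exponentially increasing delays;
    exit on success, a non-retryable status, or the retry limit."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code < 500:
                return resp  # success or a client error not worth retrying
        except requests.RequestException:
            pass  # connection errors and timeouts count as retryable failures
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"request to {url} failed after {max_retries} attempts")
```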
Delays
- Model compilation latency (compilation, ~1-10 seconds depending on model size) — First inference request experiences compilation overhead while subsequent requests use cached compiled model
- Batch processing window (batch-window, variable with batch_size configuration) — Engine waits to accumulate a full batch before processing, trading latency for higher throughput
Control Points
- batch_size (hyperparameter) — Controls: Number of inputs processed simultaneously, affecting memory usage and the throughput vs latency tradeoff. Default: 1 (single-input inference)
- num_cores (runtime-toggle) — Controls: CPU core allocation for parallel sparse computation, scaling inference throughput. Default: All available physical cores
- temperature (sampling-strategy) — Controls: Text generation randomness in language model outputs, affecting response creativity vs consistency. Default: 0.7 (moderate randomness)
- stream (architecture-switch) — Controls: Whether to return complete responses or streaming token-by-token output for real-time applications. Default: false (batch responses)
- logger.enable (feature-flag) — Controls: Activates performance monitoring and event logging throughout the inference pipeline. Default: Configuration-dependent
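The first two control points are fixed at compile time, while the sampling and streaming ones travel with each request. A sketch under assumptions (model file, server payload, and parameter values are illustrative):

```python
from deepsparse import compile_model

# Compile-time control points: batch_size trades latency for throughput,
# num_cores bounds how many CPU cores the sparse kernels may use.
engine = compile_model("model.onnx", batch_size=4, num_cores=4)

# Request-time control points for the OpenAI-compatible server, mirroring the
# protocol model fields described earlier (shown as a plain dict).
request = {
    "model": "deepsparse-served-model",
    "messages": ["Summarize sparse inference in one sentence."],
    "temperature": 0.7,   # sampling randomness
    "max_tokens": 16,
    "stream": False,      # True switches to token-by-token streaming
}
```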
Technology Stack
ONNX — Standard model format that DeepSparse compiles into sparse-optimized execution graphs
Pydantic — Validates API request/response schemas and internal data models throughout the inference pipeline
NumPy — Provides tensor data structures and array operations for model inputs, outputs, and internal computations
FastAPI — Powers HTTP server endpoints that expose DeepSparse inference through OpenAI-compatible REST APIs
Gradio — Creates interactive web UIs for benchmarking and testing sparse models in browser environments
AWS Boto3 — Integrates with AWS services for serverless deployment patterns including Lambda functions and S3 storage
Test framework for validating inference correctness and performance across different model configurations
Key Components
- Pipeline (orchestrator) — Main entry point that coordinates task-specific preprocessing, engine inference, and postprocessing for AI workflows like sentiment analysis and question answering (src/deepsparse/pipelines/)
- Engine (executor) — Low-level inference runtime that compiles ONNX models into sparse-optimized execution graphs and manages CPU scheduling to exploit zero-weight patterns (src/deepsparse/engine.py)
- EngineOperator (adapter) — Bridges high-level pipeline data to low-level engine tensors, handling tensor shape validation and memory management during inference (src/deepsparse/operators/engine_operator.py)
- compile_model (factory) — Analyzes ONNX model architecture and sparsity patterns to generate optimized execution plans for the DeepSparse engine (src/deepsparse/compile.py)
- LoggerManager (dispatcher) — Routes performance metrics and inference events to configured logging backends based on event type and severity filters (src/deepsparse/loggers/)
- StreamBenchmarker (monitor) — Measures inference throughput and latency under sustained load, capturing performance characteristics of sparse vs dense models (src/deepsparse/benchmark/stream_benchmark.py)
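Below the Pipeline abstraction, Engine and compile_model can be used directly with plain NumPy arrays. A minimal sketch, assuming a local ONNX file whose first input is a 1x3x224x224 float tensor:

```python
import numpy
from deepsparse import compile_model

# compile_model analyzes the ONNX graph and returns a ready-to-run engine
# ("model.onnx" is a placeholder path).
engine = compile_model("model.onnx", batch_size=1)

# The low-level API takes and returns lists of numpy arrays, mirroring the
# engine_inputs/engine_outputs contracts described in Data Models.
inputs = [numpy.random.rand(1, 3, 224, 224).astype(numpy.float32)]
outputs = engine.run(inputs)
print([out.shape for out in outputs])
```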
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Frequently Asked Questions
What is deepsparse used for?
neuralmagic/deepsparse accelerates sparse neural network inference on CPUs using specialized runtime optimizations. It is a 6-component ML inference system written in Python; data flows through 5 distinct pipeline stages, and the codebase contains 426 files.
How is deepsparse architected?
deepsparse is organized into 4 architecture layers: Pipeline Layer, Operator Layer, Engine Layer, Logging Layer. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through deepsparse?
Data moves through 5 stages: Pipeline creation and model compilation → Input preprocessing → Sparse inference execution → Output postprocessing → Performance logging. Raw inputs (text, images) enter through Pipeline.create() which handles task-specific preprocessing like tokenization. The EngineOperator converts preprocessed data into tensors, passes them to the compiled DeepSparse engine for sparse inference, then postprocessing operators transform raw outputs into structured results. Throughout this flow, LoggerManager captures performance metrics and request details. This pipeline design reflects a complex multi-stage processing system.
What technologies does deepsparse use?
The core stack includes ONNX (Standard model format that DeepSparse compiles into sparse-optimized execution graphs), Pydantic (Validates API request/response schemas and internal data models throughout the inference pipeline), NumPy (Provides tensor data structures and array operations for model inputs, outputs, and internal computations), FastAPI (Powers HTTP server endpoints that expose DeepSparse inference through OpenAI-compatible REST APIs), Gradio (Creates interactive web UIs for benchmarking and testing sparse models in browser environments), AWS Boto3 (Integrates with AWS services for serverless deployment patterns including Lambda functions and S3 storage), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does deepsparse have?
deepsparse exhibits 2 data pools (Model Registry Cache, Logger Event Buffer), 2 feedback loops, 5 control points, and 2 delays. The feedback loops handle retry and warmup. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does deepsparse use?
4 design patterns detected: Operator Pipeline Pattern, Engine Compilation Caching, OpenAI API Compatibility Layer, Multi-Engine Benchmarking.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.