microsoft/onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Executes machine learning model inference and training across platforms with hardware acceleration
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML inference system. 6806 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
The system loads an ONNX model file into memory, parses it into a computational graph, applies optimizations based on the target hardware, creates an execution plan that maps operators to execution providers, then runs inference by flowing input tensors through the optimized graph and returning output tensors. For web deployment, tensors are uploaded to GPU textures or WASM memory, processed through WebGL shaders or native operators, then downloaded back to JavaScript.
- Load ONNX model — InferenceSession.loadModel() reads the ONNX protobuf file, parses the computational graph structure, and validates operator types and tensor shapes (config: providers, graph_optimization_level)
- Optimize graph — GraphOptimizer applies transformations like operator fusion (Conv+BatchNorm+Relu → FusedConv), constant folding, and layout optimizations based on the selected execution providers [Raw ONNX graph → Optimized execution graph] (config: graph_optimization_level, disabled_optimizers)
- Create execution plan — ExecutionProviderManager assigns each operator to the best available execution provider (CUDA, DirectML, CPU) and generates memory allocation plan with buffer reuse [Optimized execution graph → ExecutionPlan] (config: providers, provider_options)
- Prepare input tensors — Input data is converted to OrtValue tensors with proper memory layout - for web backends, this uploads data to WebGL textures or WASM memory buffers [User data arrays → OrtValue] (config: memory.arena_extend_strategy)
- Execute operators — The execution engine runs operators in dependency order according to the ExecutionPlan, with each operator executed by its assigned provider using optimized kernels [OrtValue → OrtValue] (config: intra_op_num_threads, inter_op_num_threads)
- Return results — Output OrtValues are converted back to the appropriate format for each language binding - JavaScript Tensors for web, NumPy arrays for Python, managed arrays for C# [OrtValue → Tensor]
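A minimal usage sketch of this pipeline through the JavaScript binding (onnxruntime-web). The model path and the 'input'/'output' tensor names are placeholders; they depend on the specific ONNX model.

```typescript
import * as ort from 'onnxruntime-web';

async function main() {
  // Stages 1-3: create() loads the model, optimizes the graph, and builds
  // the execution plan for the requested providers.
  const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgl', 'wasm'], // tried in order, with fallback
    graphOptimizationLevel: 'all',
  });

  // Stage 4: wrap user data in a Tensor (data type, flat buffer, shape).
  const data = new Float32Array(1 * 3 * 224 * 224);
  const input = new ort.Tensor('float32', data, [1, 3, 224, 224]);

  // Stages 5-6: run the optimized graph and read outputs back as JS Tensors.
  const results = await session.run({ input });
  console.log(results.output.dims, results.output.data);
}

main().catch(console.error);
```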
Data Models
The data structures that flow between stages — the contracts that hold the system together.
onnxruntime/core/framework/ort_value.h
C++ union holding either Tensor (multi-dimensional array with typed data buffer), SparseTensor (indices + values arrays), Sequence (vector of OrtValues), or Map (key-value pairs)
Created from input tensors, flows through execution providers during operator execution, returned as output tensors
js/common/lib/tensor.ts
JavaScript object with data: TypedArray (Float32Array, Int32Array, etc), dims: number[] (shape), type: string (data type like 'float32', 'int64')
Constructed from user data arrays, converted to native tensors for execution, converted back to JavaScript tensors for output
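For illustration, a small sketch of this contract: the constructor takes the data type, a flat typed array, and the shape. The int64 example assumes the binding backs 64-bit integer tensors with BigInt64Array.

```typescript
import { Tensor } from 'onnxruntime-web';

// A 2x3 float32 tensor: dims describe the shape, data holds the flat buffer.
const t = new Tensor('float32', new Float32Array([1, 2, 3, 4, 5, 6]), [2, 3]);
console.log(t.type); // 'float32'
console.log(t.dims); // [2, 3]
console.log(t.data); // Float32Array(6) [1, 2, 3, 4, 5, 6]

// 64-bit integer data uses BigInt64Array on the JavaScript side.
const ids = new Tensor('int64', new BigInt64Array([1n, 2n, 3n]), [3]);
```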
js/common/lib/inference-session.ts
Interface with loadModel(): Promise<void>, run(feeds: Record<string, Tensor>): Promise<Record<string, Tensor>>, and configuration options like executionProviders, graphOptimizationLevel
Created with model path/buffer, loads and optimizes ONNX graph, maintains session state for repeated inference calls
onnxruntime/core/framework/execution_plan_base.h
C++ structure containing sequence of execution steps, each step mapping operators to execution providers with memory allocation plan
Generated during session initialization from optimized graph, used repeatedly during inference to coordinate operator execution
onnxruntime/core/session/run_options.h
C++ struct with run_log_verbosity_level: int, run_tag: string, terminate_on_timeout: bool, and profiling configuration
Created per inference run, passed to execution engine to control logging and profiling behavior
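On the JavaScript side, the analogous per-run settings are passed as a second argument to run(). A hedged sketch, assuming the RunOptions fields exposed by onnxruntime-common (logSeverityLevel, tag):

```typescript
import * as ort from 'onnxruntime-web';

async function runTagged(
  session: ort.InferenceSession,
  feeds: Record<string, ort.Tensor>,
) {
  // Per-run options mirror the native RunOptions: logging level plus a tag
  // that identifies this particular call in logs and profiles.
  const runOptions: ort.InferenceSession.RunOptions = {
    logSeverityLevel: 2, // 0 = verbose ... 4 = fatal
    tag: 'batch-42',
  };
  return session.run(feeds, runOptions);
}
```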
js/web/lib/onnxjs/backends/webgl/types.ts
Object extending TextureLayout with tensor: Tensor, texture: WebGLTexture, width/height: number, channels: 1|2|3|4, shape: readonly number[]
Created when tensors are uploaded to WebGL textures, used during GPU computation, disposed after results are downloaded
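A sketch of that shape as a TypeScript interface, reconstructed from the fields listed above rather than copied from the real types.ts:

```typescript
// Reconstructed sketch of the WebGL texture wrapper described above.
interface TextureDataSketch {
  tensor: unknown;           // the CPU-side tensor this texture mirrors
  texture: WebGLTexture;     // GPU storage holding the uploaded data
  width: number;             // texture dimensions in texels
  height: number;
  channels: 1 | 2 | 3 | 4;   // values packed per texel
  shape: readonly number[];  // logical tensor shape
}
```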
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Backend registration order determines execution priority, with the Node.js backend always registered at priority 100 for all discovered backends
If this fails: If multiple backends support the same operations and are registered with equal or higher priorities, the Node.js backend may be bypassed, causing fallback to less optimal execution paths without warning
js/node/lib/index.ts:registerBackend
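A self-contained toy sketch of priority-ordered backend resolution (not the real onnxruntime-common registry) showing how a backend registered earlier at an equal or higher priority can silently win over the Node.js backend:

```typescript
// Toy backend registry: higher priority wins, ties go to whichever backend
// registered first, so a later registration at the same priority is skipped.
interface BackendEntry { name: string; priority: number; order: number; }

const registry: BackendEntry[] = [];
let counter = 0;

function register(name: string, priority: number): void {
  registry.push({ name, priority, order: counter++ });
}

function resolve(): string {
  const sorted = [...registry].sort(
    (a, b) => b.priority - a.priority || a.order - b.order,
  );
  return sorted[0].name;
}

register('other-backend', 100); // registered first at the same priority
register('cpu', 100);           // Node.js native backend
console.log(resolve());         // 'other-backend': the Node.js backend is bypassed
```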
Build-time constants BUILD_DEFS.DISABLE_WEBGL, BUILD_DEFS.DISABLE_JSEP, BUILD_DEFS.DISABLE_WEBGPU are correctly set and consistent across the entire build process
If this fails: If these constants are undefined or have inconsistent values between modules, backends may be double-registered, missing, or throw configuration errors at runtime rather than build time
js/web/lib/index.ts:BUILD_DEFS
Dynamic requires inside conditional blocks will always resolve to modules that export the expected backend interfaces (onnxjsBackend, wasmBackend)
If this fails: If bundlers or runtime environments don't properly handle conditional requires, or if the required modules don't export the expected objects, the application fails at runtime with cryptic module loading errors
js/web/lib/index.ts:require('./backend-onnxjs')
The listSupportedBackends() function always returns an array of objects with a 'name' property that matches valid backend names expected by registerBackend()
If this fails: If listSupportedBackends() returns malformed objects, null values, or backend names that don't correspond to actual implementations, the registration loop silently registers invalid backends or crashes with property access errors
js/node/lib/index.ts:listSupportedBackends
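A hedged sketch of a defensive version of that registration loop; registerDiscoveredBackends and the SupportedBackend shape are hypothetical stand-ins for the binding's real exports:

```typescript
// Hypothetical defensive wrapper around the registration loop described above.
interface SupportedBackend { name: string; bundled?: boolean; }

function registerDiscoveredBackends(
  list: SupportedBackend[] | null | undefined,
  register: (name: string, priority: number) => void,
): void {
  for (const entry of list ?? []) {
    // Guard against malformed entries instead of letting a property access
    // or an unknown backend name fail later inside the real registration.
    if (!entry || typeof entry.name !== 'string' || entry.name.length === 0) {
      console.warn('Skipping malformed backend entry:', entry);
      continue;
    }
    register(entry.name, 100);
  }
}

// Example: only 'cpu' is registered; the empty name is skipped with a warning.
registerDiscoveredBackends(
  [{ name: 'cpu' }, { name: '' }],
  (name, priority) => console.log(`register(${name}, ${priority})`),
);
```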
Build configuration errors (JSEP + WebGPU enabled together, WebNN without JSEP/WebGPU) are the only invalid combinations that need to be checked
If this fails: Other invalid build configurations may pass validation but result in runtime failures or degraded performance when actual inference is attempted, with no clear indication of the root cause
js/web/lib/index.ts:configuration validation
Backend registration happens before any inference sessions are created, and the registerBackend() calls complete synchronously
If this fails: If inference sessions are created before all backends are registered (in async scenarios), they may fall back to suboptimal backends or fail to find required execution providers
js/react_native/lib/index.ts:backend registration timing
Backend priority values (-10 for webgl, 5 for webgpu/webnn, 10 for cpu/wasm) create the intended execution order and won't conflict with priorities set by user code
If this fails: If user code registers backends with intermediate priority values, or if the priority scale doesn't provide enough granularity, backends may execute in unexpected order leading to suboptimal performance
js/web/lib/index.ts:backend priority values
The env.versions object exists and can be safely extended with version information using Object.defineProperty
If this fails: If env.versions is frozen, sealed, or undefined, version injection fails silently or throws errors, potentially breaking version detection in downstream code
js/node/lib/index.ts:version object
The 'onnxruntime-common' module exports all necessary types and interfaces as named exports that can be safely re-exported
If this fails: If 'onnxruntime-common' has breaking changes in its export structure, the re-export statements fail and applications importing from this module lose access to core functionality
js/web/lib/index.ts:default export
React Native backends should have the lowest priority (1) among all platform-specific backends to act as fallbacks
If this fails: If this priority assumption is wrong and React Native backends should be preferred, optimal execution providers won't be selected on React Native platforms
js/react_native/lib/index.ts:backend priority 1
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model cache — Caches loaded ONNX models and their optimized graphs to avoid re-parsing on subsequent session creations
- Kernel registry — Global registry mapping (operator_type, provider, data_types) → kernel implementations, populated at initialization
- Memory arena — Pre-allocated memory pools for intermediate tensors during inference, with buffer reuse to minimize allocations
- WebGL texture cache — Caches GPU textures for tensor data to avoid repeated upload/download between CPU and GPU memory
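A toy sketch of the kernel-registry pool (illustrative names, not the real C++ KernelRegistry): a map from operator type, provider, and data type to a kernel implementation, filled once at initialization and read on every dispatch.

```typescript
// Toy kernel registry keyed by (operator type, provider, data type).
type Kernel = (inputs: Float32Array[]) => Float32Array[];

const kernels = new Map<string, Kernel>();

const keyOf = (opType: string, provider: string, dtype: string): string =>
  `${opType}|${provider}|${dtype}`;

function registerKernel(
  opType: string, provider: string, dtype: string, kernel: Kernel,
): void {
  kernels.set(keyOf(opType, provider, dtype), kernel);
}

function lookupKernel(
  opType: string, provider: string, dtype: string,
): Kernel | undefined {
  return kernels.get(keyOf(opType, provider, dtype));
}

// Populated at startup, e.g. a CPU float32 Relu kernel:
registerKernel('Relu', 'cpu', 'float32', ([x]) => [x.map(v => Math.max(v, 0))]);
console.log(lookupKernel('Relu', 'cpu', 'float32')?.([new Float32Array([-1, 2])]));
```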
Feedback Loops
- Dynamic memory growth (auto-scale, reinforcing) — Trigger: Arena runs out of pre-allocated memory during execution. Action: Allocate additional memory chunks and expand the arena size. Exit: Memory allocation succeeds.
- Provider fallback (circuit-breaker, balancing) — Trigger: Execution provider fails to execute an operator. Action: Fall back to CPU provider and retry the operator execution. Exit: Operator executes successfully on fallback provider.
- WebGL context recovery (retry, balancing) — Trigger: WebGL context is lost during GPU computation. Action: Recreate WebGL context, recompile shaders, and re-upload textures. Exit: Context restoration succeeds or max retries reached.
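The provider-fallback loop, sketched with illustrative names (the real fallback logic lives in the C++ execution engine): try the assigned provider's kernel, and on failure retry on the CPU provider.

```typescript
// Circuit-breaker style fallback: assigned provider first, then CPU.
type OpRunner = (op: string, inputs: Float32Array[]) => Float32Array[];

function runWithFallback(
  op: string,
  inputs: Float32Array[],
  assigned: OpRunner,   // e.g. a WebGL or CUDA implementation
  cpu: OpRunner,        // always-available fallback
): Float32Array[] {
  try {
    return assigned(op, inputs);
  } catch (err) {
    console.warn(`'${op}' failed on its assigned provider, retrying on CPU`, err);
    return cpu(op, inputs); // exit condition: CPU execution succeeds
  }
}
```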
Delays
- Model loading (async-processing, ~varies by model size) — Session creation blocks until ONNX model is fully parsed and optimized
- WebGL shader compilation (compilation, ~100-500ms per operator type) — First inference run includes shader compilation time for each unique operator
- CUDA stream synchronization (async-processing, ~microseconds) — GPU execution requires synchronization points for memory transfers and kernel launches
Control Points
- Graph optimization level (hyperparameter) — Controls: Aggressiveness of graph optimizations, from basic (level 1) through extended (level 2) to all optimizations (level 99). Default: ORT_ENABLE_ALL
- Execution providers (architecture-switch) — Controls: Which hardware backends to use and their priority order. Default: ['CUDAExecutionProvider', 'CPUExecutionProvider']
- Thread pool size (runtime-toggle) — Controls: Number of threads for intra-op and inter-op parallelism. Default: hardware_concurrency()
- Memory pattern optimization (feature-flag) — Controls: Whether to optimize memory allocation patterns and enable buffer reuse. Default: true
- WebGL texture format (precision-mode) — Controls: GPU texture precision (RGBA32F vs RGBA16F) trading accuracy for performance. Default: RGBA32F
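Several of these control points surface directly as SessionOptions in the JavaScript binding. A sketch assuming the option names exposed by onnxruntime-common; which options take effect depends on the backend and build:

```typescript
import * as ort from 'onnxruntime-web';

const options: ort.InferenceSession.SessionOptions = {
  graphOptimizationLevel: 'all',           // optimization aggressiveness
  executionProviders: ['webgl', 'wasm'],   // backends and their priority order
  intraOpNumThreads: 4,                    // intra-op thread pool size
  interOpNumThreads: 1,                    // inter-op thread pool size
  enableMemPattern: true,                  // memory pattern / buffer-reuse optimization
};
// Applied at session creation, e.g. ort.InferenceSession.create('model.onnx', options)
```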
Technology Stack
C++ — Core runtime engine implementation with operator kernels and execution providers
ONNX — Standard model format for representing neural networks with operators and computational graphs
WebAssembly — Enables running the C++ runtime in browsers with near-native performance
WebGL — Browser GPU computation using fragment shaders for accelerated inference in web environments
CUDA — NVIDIA GPU acceleration using CUDA kernels and cuDNN for high-performance neural network operations
DirectML — Microsoft's hardware-agnostic ML acceleration layer for Windows GPUs and NPUs
pybind11 — Creates Python bindings for the C++ runtime, enabling seamless integration with NumPy and PyTorch ecosystems
TypeScript — Type-safe JavaScript implementations for web and Node.js environments with tensor operations
CMake — Cross-platform build system managing dependencies and compilation across multiple target platforms
Protocol Buffers — ONNX model serialization format for representing graphs, operators, and metadata
Key Components
- InferenceSession (orchestrator) — Main execution coordinator that loads ONNX models, applies graph optimizations, creates execution plans, and manages inference runs across different execution providers
  onnxruntime/core/session/inference_session.cc
- ExecutionProviderManager (dispatcher) — Routes operators to appropriate execution providers (CPU, CUDA, DirectML, etc.) based on hardware capabilities and operator support, managing provider priority and fallback
  onnxruntime/core/framework/execution_provider.cc
- GraphOptimizer (optimizer) — Applies graph-level optimizations like operator fusion, constant folding, and dead code elimination to improve inference performance
  onnxruntime/core/optimizer/graph_optimizer.cc
- KernelRegistry (registry) — Maps ONNX operator types to their implementation functions across different execution providers and data types
  onnxruntime/core/framework/kernel_registry.cc
- SessionState (store) — Maintains runtime state for an inference session including loaded model, execution plan, memory pools, and provider contexts
  onnxruntime/core/session/session_state.cc
- WebGLBackend (adapter) — Implements browser-based inference using WebGL fragment shaders, managing texture memory and GPU program compilation
  js/web/lib/onnxjs/backends/webgl/backend-webgl.ts
- WasmBackend (adapter) — Provides WebAssembly-based execution in browsers and Node.js, bridging JavaScript tensors to the native C++ runtime
  js/web/lib/wasm/wasm-core-impl.ts
- CudaExecutionProvider (executor) — Executes operators on NVIDIA GPUs using CUDA and cuDNN libraries for high-performance neural network inference
  onnxruntime/core/providers/cuda/cuda_execution_provider.cc
- PyBindModule (adapter) — Exposes C++ inference session and training APIs to Python through pybind11 bindings
  onnxruntime/python/onnxruntime_pybind.cc
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Analyze on CodeSea
Frequently Asked Questions
What is onnxruntime used for?
Executes machine learning model inference and training across platforms with hardware acceleration. microsoft/onnxruntime is a 9-component ML inference system written in C++. Data flows through 6 distinct pipeline stages. The codebase contains 6806 files.
How is onnxruntime architected?
onnxruntime is organized into 4 architecture layers: Language Bindings, Core Runtime, Execution Providers, Operators. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through onnxruntime?
Data moves through 6 stages: Load ONNX model → Optimize graph → Create execution plan → Prepare input tensors → Execute operators → Return results. The system loads an ONNX model file into memory, parses it into a computational graph, applies optimizations based on the target hardware, creates an execution plan that maps operators to execution providers, then runs inference by flowing input tensors through the optimized graph and returning output tensors. For web deployment, tensors are uploaded to GPU textures or WASM memory, processed through WebGL shaders or native operators, then downloaded back to JavaScript. This pipeline design reflects a complex multi-stage processing system.
What technologies does onnxruntime use?
The core stack includes C++ (Core runtime engine implementation with operator kernels and execution providers), ONNX (Standard model format for representing neural networks with operators and computational graphs), WebAssembly (Enables running the C++ runtime in browsers with near-native performance), WebGL (Browser GPU computation using fragment shaders for accelerated inference in web environments), CUDA (NVIDIA GPU acceleration using CUDA kernels and cuDNN for high-performance neural network operations), DirectML (Microsoft's hardware-agnostic ML acceleration layer for Windows GPUs and NPUs), and 4 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does onnxruntime have?
onnxruntime exhibits 4 data pools (Model cache, Kernel registry), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle auto-scaling and circuit-breaking. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does onnxruntime use?
5 design patterns detected: Execution Provider Pattern, Memory Arena Pattern, Graph Optimization Pipeline, Language Binding Facade, Asynchronous Execution.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.