microsoft/onnxruntime
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Executes machine learning model inference and training across platforms with hardware acceleration
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML inference system. 6806 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
The system loads an ONNX model file into memory, parses it into a computational graph, applies optimizations based on the target hardware, creates an execution plan that maps operators to execution providers, then runs inference by flowing input tensors through the optimized graph and returning output tensors. For web deployment, tensors are uploaded to GPU textures or WASM memory, processed through WebGL shaders or native operators, then downloaded back to JavaScript.
- Load ONNX model — InferenceSession.loadModel() reads the ONNX protobuf file, parses the computational graph structure, and validates operator types and tensor shapes (config: providers, graph_optimization_level)
- Optimize graph — GraphOptimizer applies transformations like operator fusion (Conv+BatchNorm+Relu → FusedConv), constant folding, and layout optimizations based on the selected execution providers [Raw ONNX graph → Optimized execution graph] (config: graph_optimization_level, disabled_optimizers)
- Create execution plan — ExecutionProviderManager assigns each operator to the best available execution provider (CUDA, DirectML, CPU) and generates memory allocation plan with buffer reuse [Optimized execution graph → ExecutionPlan] (config: providers, provider_options)
- Prepare input tensors — Input data is converted to OrtValue tensors with proper memory layout - for web backends, this uploads data to WebGL textures or WASM memory buffers [User data arrays → OrtValue] (config: memory.arena_extend_strategy)
- Execute operators — The execution engine runs operators in dependency order according to the ExecutionPlan, with each operator executed by its assigned provider using optimized kernels [OrtValue → OrtValue] (config: intra_op_num_threads, inter_op_num_threads)
- Return results — Output OrtValues are converted back to the appropriate format for each language binding - JavaScript Tensors for web, NumPy arrays for Python, managed arrays for C# [OrtValue → Tensor]
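A minimal usage sketch of this pipeline through the JavaScript binding (onnxruntime-web). The model path and the 'input'/'output' tensor names are placeholders; they depend on the specific ONNX model.

```typescript
import * as ort from 'onnxruntime-web';

async function main() {
  // Stages 1-3: create() loads the model, optimizes the graph, and builds
  // the execution plan for the requested providers.
  const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgl', 'wasm'], // tried in order, with fallback
    graphOptimizationLevel: 'all',
  });

  // Stage 4: wrap user data in a Tensor (data type, flat buffer, shape).
  const data = new Float32Array(1 * 3 * 224 * 224);
  const input = new ort.Tensor('float32', data, [1, 3, 224, 224]);

  // Stages 5-6: run the optimized graph and read outputs back as JS Tensors.
  const results = await session.run({ input });
  console.log(results.output.dims, results.output.data);
}

main().catch(console.error);
```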
Data Models
The data structures that flow between stages — the contracts that hold the system together.
onnxruntime/core/framework/ort_value.h
C++ union holding either Tensor (multi-dimensional array with typed data buffer), SparseTensor (indices + values arrays), Sequence (vector of OrtValues), or Map (key-value pairs)
Created from input tensors, flows through execution providers during operator execution, returned as output tensors
js/common/lib/tensor.ts
JavaScript object with data: TypedArray (Float32Array, Int32Array, etc), dims: number[] (shape), type: string (data type like 'float32', 'int64')
Constructed from user data arrays, converted to native tensors for execution, converted back to JavaScript tensors for output
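For illustration, a small sketch of this contract: the constructor takes the data type, a flat typed array, and the shape. The int64 example assumes the binding backs 64-bit integer tensors with BigInt64Array.

```typescript
import { Tensor } from 'onnxruntime-web';

// A 2x3 float32 tensor: dims describe the shape, data holds the flat buffer.
const t = new Tensor('float32', new Float32Array([1, 2, 3, 4, 5, 6]), [2, 3]);
console.log(t.type); // 'float32'
console.log(t.dims); // [2, 3]
console.log(t.data); // Float32Array(6) [1, 2, 3, 4, 5, 6]

// 64-bit integer data uses BigInt64Array on the JavaScript side.
const ids = new Tensor('int64', new BigInt64Array([1n, 2n, 3n]), [3]);
```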
js/common/lib/inference-session.ts
Interface with loadModel(): Promise<void>, run(feeds: Record<string, Tensor>): Promise<Record<string, Tensor>>, and configuration options like executionProviders, graphOptimizationLevel
Created with model path/buffer, loads and optimizes ONNX graph, maintains session state for repeated inference calls
onnxruntime/core/framework/execution_plan_base.h
C++ structure containing sequence of execution steps, each step mapping operators to execution providers with memory allocation plan
Generated during session initialization from optimized graph, used repeatedly during inference to coordinate operator execution
onnxruntime/core/session/run_options.h
C++ struct with run_log_verbosity_level: int, run_tag: string, terminate_on_timeout: bool, and profiling configuration
Created per inference run, passed to execution engine to control logging and profiling behavior
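On the JavaScript side, the analogous per-run settings are passed as a second argument to run(). A hedged sketch, assuming the RunOptions fields exposed by onnxruntime-common (logSeverityLevel, tag):

```typescript
import * as ort from 'onnxruntime-web';

async function runTagged(
  session: ort.InferenceSession,
  feeds: Record<string, ort.Tensor>,
) {
  // Per-run options mirror the native RunOptions: logging level plus a tag
  // that identifies this particular call in logs and profiles.
  const runOptions: ort.InferenceSession.RunOptions = {
    logSeverityLevel: 2, // 0 = verbose ... 4 = fatal
    tag: 'batch-42',
  };
  return session.run(feeds, runOptions);
}
```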
js/web/lib/onnxjs/backends/webgl/types.ts
Object extending TextureLayout with tensor: Tensor, texture: WebGLTexture, width/height: number, channels: 1|2|3|4, shape: readonly number[]
Created when tensors are uploaded to WebGL textures, used during GPU computation, disposed after results are downloaded
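A sketch of that shape as a TypeScript interface, reconstructed from the fields listed above rather than copied from the real types.ts:

```typescript
// Reconstructed sketch of the WebGL texture wrapper described above.
interface TextureDataSketch {
  tensor: unknown;           // the CPU-side tensor this texture mirrors
  texture: WebGLTexture;     // GPU storage holding the uploaded data
  width: number;             // texture dimensions in texels
  height: number;
  channels: 1 | 2 | 3 | 4;   // values packed per texel
  shape: readonly number[];  // logical tensor shape
}
```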
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Backend registration order determines execution priority, with the Node.js backend always registered at priority 100 for all discovered backends
If this fails: If multiple backends support the same operations and are registered with equal or higher priorities, the Node.js backend may be bypassed, causing fallback to less optimal execution paths without warning
js/node/lib/index.ts:registerBackend
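A self-contained toy sketch of priority-ordered backend resolution (not the real onnxruntime-common registry) showing how a backend registered earlier at an equal or higher priority can silently win over the Node.js backend:

```typescript
// Toy backend registry: higher priority wins, ties go to whichever backend
// registered first, so a later registration at the same priority is skipped.
interface BackendEntry { name: string; priority: number; order: number; }

const registry: BackendEntry[] = [];
let counter = 0;

function register(name: string, priority: number): void {
  registry.push({ name, priority, order: counter++ });
}

function resolve(): string {
  const sorted = [...registry].sort(
    (a, b) => b.priority - a.priority || a.order - b.order,
  );
  return sorted[0].name;
}

register('other-backend', 100); // registered first at the same priority
register('cpu', 100);           // Node.js native backend
console.log(resolve());         // 'other-backend': the Node.js backend is bypassed
```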
Build-time constants BUILD_DEFS.DISABLE_WEBGL, BUILD_DEFS.DISABLE_JSEP, BUILD_DEFS.DISABLE_WEBGPU are correctly set and consistent across the entire build process
If this fails: If these constants are undefined or have inconsistent values between modules, backends may be double-registered, missing, or throw configuration errors at runtime rather than build time
js/web/lib/index.ts:BUILD_DEFS
Dynamic requires inside conditional blocks will always resolve to modules that export the expected backend interfaces (onnxjsBackend, wasmBackend)
If this fails: If bundlers or runtime environments don't properly handle conditional requires, or if the required modules don't export the expected objects, the application fails at runtime with cryptic module loading errors
js/web/lib/index.ts:require('./backend-onnxjs')
The listSupportedBackends() function always returns an array of objects with a 'name' property that matches valid backend names expected by registerBackend()
If this fails: If listSupportedBackends() returns malformed objects, null values, or backend names that don't correspond to actual implementations, the registration loop silently registers invalid backends or crashes with property access errors
js/node/lib/index.ts:listSupportedBackends
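A hedged sketch of a defensive version of that registration loop; registerDiscoveredBackends and the SupportedBackend shape are hypothetical stand-ins for the binding's real exports:

```typescript
// Hypothetical defensive wrapper around the registration loop described above.
interface SupportedBackend { name: string; bundled?: boolean; }

function registerDiscoveredBackends(
  list: SupportedBackend[] | null | undefined,
  register: (name: string, priority: number) => void,
): void {
  for (const entry of list ?? []) {
    // Guard against malformed entries instead of letting a property access
    // or an unknown backend name fail later inside the real registration.
    if (!entry || typeof entry.name !== 'string' || entry.name.length === 0) {
      console.warn('Skipping malformed backend entry:', entry);
      continue;
    }
    register(entry.name, 100);
  }
}

// Example: only 'cpu' is registered; the empty name is skipped with a warning.
registerDiscoveredBackends(
  [{ name: 'cpu' }, { name: '' }],
  (name, priority) => console.log(`register(${name}, ${priority})`),
);
```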
Build configuration errors (JSEP + WebGPU enabled together, WebNN without JSEP/WebGPU) are the only invalid combinations that need to be checked
If this fails: Other invalid build configurations may pass validation but result in runtime failures or degraded performance when actual inference is attempted, with no clear indication of the root cause
js/web/lib/index.ts:configuration validation
Backend registration happens before any inference sessions are created, and the registerBackend() calls complete synchronously
If this fails: If inference sessions are created before all backends are registered (in async scenarios), they may fall back to suboptimal backends or fail to find required execution providers
js/react_native/lib/index.ts:backend registration timing
Backend priority values (-10 for webgl, 5 for webgpu/webnn, 10 for cpu/wasm) create the intended execution order and won't conflict with priorities set by user code
If this fails: If user code registers backends with intermediate priority values, or if the priority scale doesn't provide enough granularity, backends may execute in unexpected order leading to suboptimal performance
js/web/lib/index.ts:backend priority values
The env.versions object exists and can be safely extended with version information using Object.defineProperty
If this fails: If env.versions is frozen, sealed, or undefined, version injection fails silently or throws errors, potentially breaking version detection in downstream code
js/node/lib/index.ts:version object
The 'onnxruntime-common' module exports all necessary types and interfaces as named exports that can be safely re-exported
If this fails: If 'onnxruntime-common' has breaking changes in its export structure, the re-export statements fail and applications importing from this module lose access to core functionality
js/web/lib/index.ts:default export
React Native backends should have the lowest priority (1) among all platform-specific backends to act as fallbacks
If this fails: If this priority assumption is wrong and React Native backends should be preferred, optimal execution providers won't be selected on React Native platforms
js/react_native/lib/index.ts:backend priority 1
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model cache — Caches loaded ONNX models and their optimized graphs to avoid re-parsing on subsequent session creations
- Kernel registry — Global registry mapping (operator_type, provider, data_types) → kernel implementations, populated at initialization
- Memory arena — Pre-allocated memory pools for intermediate tensors during inference, with buffer reuse to minimize allocations
- WebGL texture cache — Caches GPU textures for tensor data to avoid repeated upload/download between CPU and GPU memory
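A toy sketch of the kernel-registry pool (illustrative names, not the real C++ KernelRegistry): a map from operator type, provider, and data type to a kernel implementation, filled once at initialization and read on every dispatch.

```typescript
// Toy kernel registry keyed by (operator type, provider, data type).
type Kernel = (inputs: Float32Array[]) => Float32Array[];

const kernels = new Map<string, Kernel>();

const keyOf = (opType: string, provider: string, dtype: string): string =>
  `${opType}|${provider}|${dtype}`;

function registerKernel(
  opType: string, provider: string, dtype: string, kernel: Kernel,
): void {
  kernels.set(keyOf(opType, provider, dtype), kernel);
}

function lookupKernel(
  opType: string, provider: string, dtype: string,
): Kernel | undefined {
  return kernels.get(keyOf(opType, provider, dtype));
}

// Populated at startup, e.g. a CPU float32 Relu kernel:
registerKernel('Relu', 'cpu', 'float32', ([x]) => [x.map(v => Math.max(v, 0))]);
console.log(lookupKernel('Relu', 'cpu', 'float32')?.([new Float32Array([-1, 2])]));
```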
Feedback Loops
- Dynamic memory growth (auto-scale, reinforcing) — Trigger: Arena runs out of pre-allocated memory during execution. Action: Allocate additional memory chunks and expand the arena size. Exit: Memory allocation succeeds.
- Provider fallback (circuit-breaker, balancing) — Trigger: Execution provider fails to execute an operator. Action: Fall back to CPU provider and retry the operator execution. Exit: Operator executes successfully on fallback provider.
- WebGL context recovery (retry, balancing) — Trigger: WebGL context is lost during GPU computation. Action: Recreate WebGL context, recompile shaders, and re-upload textures. Exit: Context restoration succeeds or max retries reached.
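The provider-fallback loop, sketched with illustrative names (the real fallback logic lives in the C++ execution engine): try the assigned provider's kernel, and on failure retry on the CPU provider.

```typescript
// Circuit-breaker style fallback: assigned provider first, then CPU.
type OpRunner = (op: string, inputs: Float32Array[]) => Float32Array[];

function runWithFallback(
  op: string,
  inputs: Float32Array[],
  assigned: OpRunner,   // e.g. a WebGL or CUDA implementation
  cpu: OpRunner,        // always-available fallback
): Float32Array[] {
  try {
    return assigned(op, inputs);
  } catch (err) {
    console.warn(`'${op}' failed on its assigned provider, retrying on CPU`, err);
    return cpu(op, inputs); // exit condition: CPU execution succeeds
  }
}
```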
Delays
- Model loading (async-processing, ~varies by model size) — Session creation blocks until ONNX model is fully parsed and optimized
- WebGL shader compilation (compilation, ~100-500ms per operator type) — First inference run includes shader compilation time for each unique operator
- CUDA stream synchronization (async-processing, ~microseconds) — GPU execution requires synchronization points for memory transfers and kernel launches
Control Points
- Graph optimization level (hyperparameter) — Controls: Aggressiveness of graph optimizations, from basic (level 1) through extended (level 2) to all optimizations (level 99). Default: ORT_ENABLE_ALL
- Execution providers (architecture-switch) — Controls: Which hardware backends to use and their priority order. Default: ['CUDAExecutionProvider', 'CPUExecutionProvider']
- Thread pool size (runtime-toggle) — Controls: Number of threads for intra-op and inter-op parallelism. Default: hardware_concurrency()
- Memory pattern optimization (feature-flag) — Controls: Whether to optimize memory allocation patterns and enable buffer reuse. Default: true
- WebGL texture format (precision-mode) — Controls: GPU texture precision (RGBA32F vs RGBA16F) trading accuracy for performance. Default: RGBA32F
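Several of these control points surface directly as SessionOptions in the JavaScript binding. A sketch assuming the option names exposed by onnxruntime-common; which options take effect depends on the backend and build:

```typescript
import * as ort from 'onnxruntime-web';

const options: ort.InferenceSession.SessionOptions = {
  graphOptimizationLevel: 'all',           // optimization aggressiveness
  executionProviders: ['webgl', 'wasm'],   // backends and their priority order
  intraOpNumThreads: 4,                    // intra-op thread pool size
  interOpNumThreads: 1,                    // inter-op thread pool size
  enableMemPattern: true,                  // memory pattern / buffer-reuse optimization
};
// Applied at session creation, e.g. ort.InferenceSession.create('model.onnx', options)
```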
Technology Stack
C++ — Core runtime engine implementation with operator kernels and execution providers
ONNX — Standard model format for representing neural networks with operators and computational graphs
WebAssembly — Enables running the C++ runtime in browsers with near-native performance
WebGL — Browser GPU computation using fragment shaders for accelerated inference in web environments
CUDA — NVIDIA GPU acceleration using CUDA kernels and cuDNN for high-performance neural network operations
DirectML — Microsoft's hardware-agnostic ML acceleration layer for Windows GPUs and NPUs
pybind11 — Creates Python bindings for the C++ runtime, enabling seamless integration with NumPy and PyTorch ecosystems
TypeScript — Type-safe JavaScript implementations for web and Node.js environments with tensor operations
CMake — Cross-platform build system managing dependencies and compilation across multiple target platforms
Protocol Buffers — ONNX model serialization format for representing graphs, operators, and metadata
Key Components
- InferenceSession (orchestrator) — Main execution coordinator that loads ONNX models, applies graph optimizations, creates execution plans, and manages inference runs across different execution providers
  onnxruntime/core/session/inference_session.cc
- ExecutionProviderManager (dispatcher) — Routes operators to appropriate execution providers (CPU, CUDA, DirectML, etc.) based on hardware capabilities and operator support, managing provider priority and fallback
  onnxruntime/core/framework/execution_provider.cc
- GraphOptimizer (optimizer) — Applies graph-level optimizations like operator fusion, constant folding, and dead code elimination to improve inference performance
  onnxruntime/core/optimizer/graph_optimizer.cc
- KernelRegistry (registry) — Maps ONNX operator types to their implementation functions across different execution providers and data types
  onnxruntime/core/framework/kernel_registry.cc
- SessionState (store) — Maintains runtime state for an inference session including loaded model, execution plan, memory pools, and provider contexts
  onnxruntime/core/session/session_state.cc
- WebGLBackend (adapter) — Implements browser-based inference using WebGL fragment shaders, managing texture memory and GPU program compilation
  js/web/lib/onnxjs/backends/webgl/backend-webgl.ts
- WasmBackend (adapter) — Provides WebAssembly-based execution in browsers and Node.js, bridging JavaScript tensors to the native C++ runtime
  js/web/lib/wasm/wasm-core-impl.ts
- CudaExecutionProvider (executor) — Executes operators on NVIDIA GPUs using CUDA and cuDNN libraries for high-performance neural network inference
  onnxruntime/core/providers/cuda/cuda_execution_provider.cc
- PyBindModule (adapter) — Exposes C++ inference session and training APIs to Python through pybind11 bindings
  onnxruntime/python/onnxruntime_pybind.cc
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Analyze on CodeSea
Frequently Asked Questions
What is onnxruntime used for?
Executes machine learning model inference and training across platforms with hardware acceleration. microsoft/onnxruntime is a 9-component ML inference system written in C++. Data flows through 6 distinct pipeline stages. The codebase contains 6806 files.
How is onnxruntime architected?
onnxruntime is organized into 4 architecture layers: Language Bindings, Core Runtime, Execution Providers, Operators. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through onnxruntime?
Data moves through 6 stages: Load ONNX model → Optimize graph → Create execution plan → Prepare input tensors → Execute operators → Return results. The system loads an ONNX model file into memory, parses it into a computational graph, applies optimizations based on the target hardware, creates an execution plan that maps operators to execution providers, then runs inference by flowing input tensors through the optimized graph and returning output tensors. For web deployment, tensors are uploaded to GPU textures or WASM memory, processed through WebGL shaders or native operators, then downloaded back to JavaScript. This pipeline design reflects a complex multi-stage processing system.
What technologies does onnxruntime use?
The core stack includes C++ (Core runtime engine implementation with operator kernels and execution providers), ONNX (Standard model format for representing neural networks with operators and computational graphs), WebAssembly (Enables running the C++ runtime in browsers with near-native performance), WebGL (Browser GPU computation using fragment shaders for accelerated inference in web environments), CUDA (NVIDIA GPU acceleration using CUDA kernels and cuDNN for high-performance neural network operations), DirectML (Microsoft's hardware-agnostic ML acceleration layer for Windows GPUs and NPUs), and 4 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does onnxruntime have?
onnxruntime exhibits 4 data pools (Model cache, Kernel registry), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle auto-scaling and circuit-breaking. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does onnxruntime use?
5 design patterns detected: Execution Provider Pattern, Memory Arena Pattern, Graph Optimization Pipeline, Language Binding Facade, Asynchronous Execution.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.