jax-ml/jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

35,442 stars · Python · 10 components

Compiles NumPy programs with automatic differentiation to accelerated hardware via XLA


Under the hood, the system uses 2 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.

A 10-component machine-learning training library. 1,234 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

User Python functions are traced into Jaxpr intermediate representation, transformed by interpreters (differentiation, vectorization, compilation), then lowered to MLIR and compiled to XLA for execution on accelerated hardware. Each transformation creates new Jaxprs that preserve computation semantics while adding capabilities like gradients or batching.

  1. Function Tracing — JaxprTrace intercepts Python function execution by replacing concrete values with Tracer objects, recording each operation as JaxprEqn entries to build a Jaxpr computation graph [Python function → Jaxpr]
  2. Abstract Evaluation — Each primitive operation computes abstract values (ShapedArray) to track shapes and types through the computation without executing on concrete data [JaxprEqn → ShapedArray]
  3. Interpreter Transform — Specialized interpreters like VJPTrace, BatchTrace, or JVPTrace transform the input Jaxpr by adding new equations for gradients, vectorization, or other transformations [Jaxpr → Jaxpr]
  4. MLIR Lowering — lower_jaxpr_to_module converts Jaxpr equations to MLIR operations using MlirLoweringContext, mapping JAX primitives to platform-specific MLIR dialects [Jaxpr → MLIR module]
  5. XLA Compilation — MLIR module is compiled through XLA to generate optimized machine code for the target accelerator (CPU, GPU, TPU), producing an XlaExecutable [MLIR module → XlaExecutable]
  6. Hardware Execution — XlaExecutable.call() executes compiled code on the target device with concrete ArrayImpl data, managing memory transfers and device synchronization [XlaExecutable → ArrayImpl]
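
A minimal sketch of this pipeline using public entry points (jax.make_jaxpr, jax.grad, jax.vmap, jax.jit and its lower/compile hooks). The internal classes named above are not part of this user-facing surface, so this illustrates the stages rather than the implementation:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) ** 2 + x

# Stages 1-2: tracing and abstract evaluation build a Jaxpr without touching concrete data
print(jax.make_jaxpr(f)(1.0))

# Stage 3: interpreter transforms produce new Jaxprs (here differentiation, then batching)
df = jax.grad(f)
batched_df = jax.vmap(df)

# Stages 4-6: jit lowers to an MLIR (StableHLO) module, compiles it with XLA,
# and runs the resulting executable on the default device
print(jax.jit(batched_df)(jnp.arange(4.0)))

# The intermediate artifacts can also be inspected explicitly:
lowered = jax.jit(f).lower(1.0)    # Jaxpr -> MLIR module
print(lowered.as_text()[:300])     # StableHLO text
compiled = lowered.compile()       # MLIR module -> compiled executable
print(compiled(1.0))
```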

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Jaxpr (jax/_src/core.py)
Graph structure with constvars: list[Var], invars: list[Var], outvars: list[Atom], eqns: list[JaxprEqn] — captures a computation graph's input variables, output variables, and equations; a ClosedJaxpr pairs a Jaxpr with its constant values (consts)
Created during function tracing by converting Python operations into primitive equations, transformed by interpreters (differentiation, batching), then lowered to MLIR
JaxprEqn (jax/_src/core.py)
NamedTuple with primitive: Primitive, invars: list[Var], outvars: list[Var], params: dict[str, Any] — represents a single operation in the computation graph
Generated when tracing Python operations, processed sequentially by interpreters to create new equations with transformed variables
ShapedArray (jax/_src/abstract_arrays.py)
AbstractValue with shape: tuple[int, ...], dtype: np.dtype, weak_type: bool — abstract representation of array shape and type without data
Computed during tracing to track array shapes through transformations, used for type checking and memory allocation planning
ArrayImpl (jax/_src/array.py)
JAX array implementation with aval: ShapedArray, sharding: Sharding, _arrays: list[Array] — concrete array data distributed across devices
Created from NumPy arrays or computation results, manages data placement across devices, destroyed when computation completes
XlaExecutable (jax/_src/interpreters/pxla.py)
Compiled executable with xla_executable: Any, input_avals: Sequence[ShapedArray], output_avals: Sequence[ShapedArray], input_shardings: Sequence[JSharding] — ready-to-run compiled function
Created by compiling Jaxpr through MLIR to XLA, cached for repeated execution, invoked with concrete array data
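
A short sketch of how these structures can be inspected from Python. The field names below are the ones exposed through jax.make_jaxpr; exact class layouts vary across JAX versions:

```python
import jax
import jax.numpy as jnp

closed = jax.make_jaxpr(lambda x: jnp.sin(x) * 2.0)(1.0)  # ClosedJaxpr: jaxpr + consts
jaxpr = closed.jaxpr

print(jaxpr.invars)        # input variables
print(jaxpr.outvars)       # output variables
for eqn in jaxpr.eqns:     # one JaxprEqn per traced primitive
    print(eqn.primitive, eqn.invars, eqn.outvars, eqn.params)

# Every variable carries an abstract value (a ShapedArray) recording shape and dtype only
print(jaxpr.invars[0].aval)
```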

Hidden Assumptions

Things this code relies on but never validates: the kind of gaps that cause silent failures when the system changes.

critical · Environment · weakly guarded

Assumes jaxlib is properly installed and its version is compatible, but only checks version match after successful import — incompatible jaxlib versions that import successfully but have ABI mismatches will cause cryptic runtime errors in XLA compilation

If this fails: Silent failures or crashes deep in XLA when jaxlib C++ extensions have incompatible ABIs, making debugging extremely difficult

jax/_src/lib/__init__.py:jaxlib_import
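
A hypothetical pre-flight guard against this assumption; the major.minor matching rule below is an illustrative policy, not JAX's own compatibility check, and it assumes jaxlib exposes its version via jaxlib.version:

```python
import jax
import jaxlib.version

# Hypothetical pre-flight check: fail fast on a jax/jaxlib release mismatch instead
# of hitting an ABI error deep inside XLA compilation.
jax_mm = ".".join(jax.__version__.split(".")[:2])
jaxlib_mm = ".".join(jaxlib.version.__version__.split(".")[:2])
if jax_mm != jaxlib_mm:
    raise RuntimeError(
        f"jax {jax.__version__} and jaxlib {jaxlib.version.__version__} may be ABI-incompatible"
    )
```
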
critical · Environment · weakly guarded

Assumes exactly one of jaxlib.mosaic.gpu, jax_cuda12_plugin, or jax_cuda13_plugin will provide _mosaic_gpu_ext, but doesn't validate that the imported extension matches the actual CUDA runtime version or GPU architecture

If this fails: GPU kernels may fail at runtime with cryptic CUDA errors if the wrong plugin version is loaded for the actual hardware/driver combination

jax/_src/lib/mosaic_gpu.py:_mosaic_gpu_ext_import
critical · Environment · unguarded

Assumes lib_cuda_examples.so exists in the same directory as the Python file and has the expected FooFwd/FooBwd symbols, but never validates the shared library is compatible with current CUDA runtime or has correct symbol signatures

If this fails: ctypes.cdll.LoadLibrary succeeds but pycapsule creation fails with segfaults or undefined symbols at FFI call time

examples/ffi/src/jax_ffi_example/cuda_examples.py:library_loading
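
A hedged sketch of a stricter loading path, assuming the file name and FooFwd/FooBwd symbols described above:

```python
import ctypes
import os

# Hypothetical guard around the loading pattern described above: confirm the shared
# library exists and actually exports the expected entry points before building
# PyCapsules for the FFI.
path = os.path.join(os.path.dirname(__file__), "lib_cuda_examples.so")
if not os.path.exists(path):
    raise FileNotFoundError(f"expected CUDA example library at {path}")
lib = ctypes.cdll.LoadLibrary(path)
for symbol in ("FooFwd", "FooBwd"):
    if not hasattr(lib, symbol):  # attribute lookup on a CDLL performs dlsym
        raise RuntimeError(f"{path} does not export {symbol}")
```
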
critical · Shape · weakly guarded

Assumes input arrays a and b have identical shapes and dtype float32, but only validates dtype and shape equality without checking memory layout, strides, or contiguity that the underlying CUDA kernel expects

If this fails: CUDA kernel may read incorrect memory locations or crash if arrays have non-contiguous strides or unexpected memory layout

examples/ffi/src/jax_ffi_example/cuda_examples.py:foo_fwd
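
A hypothetical host-side guard for this contract, written against NumPy arrays before they reach the kernel:

```python
import numpy as np

# Hypothetical guard for the assumption above: require two float32 arrays of
# identical shape, laid out C-contiguously, before they are handed to the CUDA kernel.
def check_foo_inputs(a, b):
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    if a.dtype != np.float32 or b.dtype != np.float32:
        raise TypeError("foo expects float32 inputs")
    if not (a.flags["C_CONTIGUOUS"] and b.flags["C_CONTIGUOUS"]):
        raise ValueError("foo expects C-contiguous inputs")
    return a, b
```
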
critical · Resource · unguarded

Assumes cudaMemcpyAsync will complete before the CUDA stream is used by subsequent operations, but doesn't add explicit synchronization or check for CUDA errors from the memcpy operation

If this fails: Race conditions where downstream GPU operations read uninitialized data if memcpy hasn't completed, leading to wrong results

examples/ffi/src/jax_ffi_example/gpu_examples.cc:StateExecute
warning · Domain · unguarded

Assumes the input array attribute contains int32_t elements that can be safely summed into an int64_t without overflow, but never checks the array size or validates that sum won't exceed int64_t max value

If this fails: Integer overflow in the sum calculation leads to wrong results when processing large arrays or arrays with large values

examples/ffi/src/jax_ffi_example/cpu_examples.cc:ArrayAttrImpl
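
A back-of-the-envelope bound for this concern (illustrative only):

```python
import numpy as np

# The magnitude of a sum of n int32 values is at most n * 2**31, so an int64
# accumulator cannot overflow until n reaches 2**32 elements (about 16 GiB of
# int32 input). A hypothetical pre-flight check along these lines:
def int64_sum_is_safe(arr: np.ndarray) -> bool:
    return arr.dtype == np.int32 and arr.size < 2**32
```
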
warning · Ordering · unguarded

Assumes input and output arrays have the same memory layout and that it's safe to read from x and write to y in the same sequential order, but doesn't validate pointer alignment or check for memory overlap

If this fails: Memory corruption or undefined behavior if input and output arrays overlap in memory or have different alignment requirements

examples/ffi/src/jax_ffi_example/rms_norm.cc:ComputeRmsNorm
warning · Scale · unguarded

Assumes the size parameter fits in int64_t and that the variance calculation (sm / size) won't cause numerical instability, but doesn't validate size > 0 or handle the case where all elements are zero

If this fails: Division by zero or numerical overflow when size is 0 or when sum of squares is extremely large, leading to NaN or Inf results

examples/ffi/src/jax_ffi_example/rms_norm.cc:ComputeRmsNorm
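
A host-side reference for the computation with the guards the kernel skips; the eps value is an illustrative choice, not the example's:

```python
import numpy as np

# Reject empty input and add a small eps so an all-zero row cannot divide by zero.
def rms_norm_ref(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    if x.size == 0:
        raise ValueError("rms_norm: empty input")
    scale = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / scale
```
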
warning · Contract · unguarded

Assumes the num parameter is a positive integer that can be used to create a valid numpy array range, but never validates num >= 0 or checks that np.arange(num) won't exhaust memory

If this fails: Memory exhaustion or negative array indices when num is very large or negative, causing crashes or wrong results

examples/ffi/src/jax_ffi_example/cpu_examples.py:array_attr
warning · Contract · unguarded

Assumes the input x is a JAX array with a numeric dtype compatible with the C++ implementation, but never validates that x.dtype maps to a supported C++ type (T in the template)

If this fails: Type mismatch between Python array dtype and C++ template instantiation leads to incorrect memory interpretation and wrong results

examples/ffi/src/jax_ffi_example/rms_norm.py:rms_norm

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Compilation Cache (cache)
Caches XlaExecutable objects keyed by function signature and static arguments to avoid recompilation on repeated jit calls (see the sketch after this list)
Device Memory (buffer)
GPU/TPU memory pools managed by XLA for array storage and computation, handles allocation and deallocation of device buffers
Primitive Registry (registry)
Global registry mapping primitive names to Primitive objects with their transformation rules for each interpreter
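
A small sketch of the Compilation Cache's visible behavior from the user side, following jax.jit's documented semantics; internal cache structures are not shown:

```python
import functools
import jax
import jax.numpy as jnp

@jax.jit
def scale(x, factor):
    return x * factor

x = jnp.ones((8,))
scale(x, 2.0)                 # first call: trace, lower, compile; executable is cached
scale(x, 3.0)                 # same shapes/dtypes: cache hit, no recompilation
scale(jnp.ones((16,)), 2.0)   # new input shape: a second cache entry

# Static arguments become part of the cache key, so each distinct value compiles once.
@functools.partial(jax.jit, static_argnames="n")
def take_first(x, n):
    return x[:n]

take_first(x, 3)   # compiles for n=3
take_first(x, 4)   # compiles again for n=4
```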


Technology Stack

XLA (compute)
Compiles MLIR to optimized machine code for accelerators, handles memory management and device scheduling
MLIR (framework)
Multi-level intermediate representation for compiler transformations, bridges JAX primitives to hardware-specific operations
NumPy (library)
Provides array API compatibility and reference semantics for JAX array operations
jaxlib (runtime)
C++ extension providing XLA bindings, CUDA/TPU support, and low-level array operations


Frequently Asked Questions

What is jax used for?

jax-ml/jax compiles NumPy programs with automatic differentiation to accelerated hardware via XLA. It is a 10-component machine-learning training library written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 1,234 files.

How is jax architected?

jax is organized into 5 architecture layers: User API, Core Tracing, Interpreters, MLIR Lowering, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through jax?

Data moves through 6 stages: Function Tracing → Abstract Evaluation → Interpreter Transform → MLIR Lowering → XLA Compilation → Hardware Execution. User Python functions are traced into the Jaxpr intermediate representation, transformed by interpreters (differentiation, vectorization), then lowered to MLIR and compiled by XLA for execution on accelerated hardware. Each transformation produces a new Jaxpr that preserves the computation's semantics while adding capabilities such as gradients or batching.

What technologies does jax use?

The core stack includes XLA (compiles MLIR to optimized machine code for accelerators and handles memory management and device scheduling), MLIR (a multi-level intermediate representation for compiler transformations that bridges JAX primitives to hardware-specific operations), NumPy (provides array API compatibility and reference semantics for JAX array operations), and jaxlib (a C++ extension providing XLA bindings, CUDA/TPU support, and low-level array operations). The dependency footprint is lean.

What system dynamics does jax have?

jax exhibits 3 data pools (Compilation Cache, Device Memory, Primitive Registry), 2 feedback loops, 3 control points, and 2 delays. The feedback loops handle the training loop and cache invalidation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does jax use?

3 design patterns detected: Transformation Composition, Trace-Transform-Lower, Abstract Interpretation.
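
An illustration of the Transformation Composition pattern using public transforms: per-example gradients, vectorized over a batch, then compiled, all by stacking ordinary function transformations:

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

# Each transform returns an ordinary Python function, so they compose freely.
per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))

w = jnp.ones((3,))
xs = jnp.ones((8, 3))
print(per_example_grads(w, xs).shape)  # (8, 3): one gradient per batch element
```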

Analyzed on April 20, 2026 by CodeSea.