google/flax
Flax is a neural network library for JAX that is designed for flexibility.
Neural network library for JAX that simplifies model creation and training
Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 9-component library. 351 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically while Linen requires explicit parameter passing.
- Load and preprocess data — Dataset loaders in examples/ read from tensorflow_datasets or custom sources, apply tokenization for text or normalization for images, create batches with proper padding and masking (config: per_device_batch_size, dataset_name, max_corpus_chars)
- Initialize model and optimizer — Create Module instance with nnx.Rngs for parameter initialization, wrap with Optimizer class that holds optax gradient transformation and tracks parameter state (config: vocab_size)
- Forward pass through model — Module.__call__ processes input batch through neural network layers (conv, attention, MLP), each layer accesses its Variable parameters and produces intermediate activations [Batch → logits tensor]
- Compute loss and metrics — Loss functions like cross-entropy compare model logits against target labels, produce scalar loss value plus auxiliary metrics like accuracy for monitoring [logits tensor → loss scalar]
- Backward pass and parameter updates — nnx.grad automatically differentiates loss with respect to model parameters, optimizer applies gradient updates using momentum, weight decay, learning rate schedules [loss scalar → Updated Module]
- Checkpoint and evaluation — Training state gets saved to disk every N steps using orbax checkpointing, model runs on validation data to compute eval metrics, logs training progress [Updated Module → saved checkpoint] (config: eval_per_device_batch_size, eval_dataset_name, eval_split)
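The six stages above amount to one repeated training step. The following is a minimal NumPy sketch of that loop with illustrative names; real Flax code would use a Module, nnx.grad (or jax.grad), and an optax optimizer instead of the hand-written pieces here:

```python
import numpy as np

def forward(params, x):
    # Stage 3: forward pass — one linear layer standing in for a Module.
    return x @ params["w"] + params["b"]

def loss_and_grads(params, x, y):
    # Stage 4: mean-squared-error loss; Stage 5: analytic gradients
    # (in Flax these would come from automatic differentiation).
    err = forward(params, x) - y
    loss = float(np.mean(err ** 2))
    grads = {
        "w": 2.0 * x.T @ err / len(x),
        "b": 2.0 * err.mean(axis=0),
    }
    return loss, grads

# Stages 1-2: a toy batch and zero-initialized parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))
y = x @ np.array([[1.0], [2.0], [3.0]])
params = {"w": np.zeros((3, 1)), "b": np.zeros(1)}

for step in range(300):  # the training loop repeats stages 3-5
    loss, grads = loss_and_grads(params, x, y)
    params = {k: v - 0.1 * grads[k] for k, v in params.items()}

print(f"final loss: {loss:.2e}")  # near zero once the fit converges
```

Stage 6 (checkpointing and evaluation) would wrap this loop, saving `params` every N steps and computing the same loss on held-out data.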
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Module (flax/nnx/module.py) — Python class with nnx.Variable fields containing parameter tensors; inherits from the nnx.Module base class, whose __call__ method defines the forward pass. Created during model initialization, holds parameters as Variable objects, gets split into graph + state for JAX transformations, then merged back.
- Variable (flax/nnx/variables.py) — Container holding a jax.Array plus metadata such as the variable type (Param, BatchStat, etc.) and a unique identity for graph operations. Wraps parameter tensors; extracted during graph splits, modified by optimizers, and restored during merges.
- State (flax/nnx/state.py) — Dict-like structure mapping paths to Variables, created by splitting Module graphs for JAX function transformations. Extracted from Module graphs before JAX transforms, passed through pure functions, and used to update Module state afterward.
- GraphDef (flax/nnx/graph.py) — Immutable structure describing Module topology with node types, connections, and metadata: the "blueprint" without parameter values. Created during graph splits to preserve Module structure, used by JAX for compilation decisions, and combined with State to reconstruct Modules.
- Batch (examples/mnist/train.py) — Dict with image: jnp.ndarray[batch_size, 28, 28, 1] and label: jnp.ndarray[batch_size] for MNIST; the exact shape varies by example. Loaded from datasets, preprocessed with normalization and batching, fed to the model forward pass, and used for gradient computation.
- TrainState (flax/training/train_state.py) — Container with step: int, params: PyTree, opt_state: optax.OptState, and tx: optax.GradientTransformation for training-loop state. Initialized with model parameters and an optimizer, updated each training step with new gradients, and saved/restored for checkpointing.
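The TrainState contract above can be mimicked with a plain dataclass. This is a simplified stand-in, not the real flax.training.train_state.TrainState (which works on PyTrees and wraps a real optax.GradientTransformation):

```python
from dataclasses import dataclass, replace
from typing import Any, Callable

@dataclass(frozen=True)
class TrainState:
    # Simplified stand-in: an immutable record of everything
    # the training loop mutates, updated functionally.
    step: int
    params: Any              # a PyTree of parameters in real Flax
    opt_state: Any           # optax.OptState in real Flax
    apply_updates: Callable  # stands in for tx (the gradient transformation)

    def apply_gradients(self, grads):
        # Each training step returns a *new* state rather than
        # mutating the old one, matching JAX's functional style.
        new_params, new_opt_state = self.apply_updates(
            self.params, grads, self.opt_state)
        return replace(self, step=self.step + 1,
                       params=new_params, opt_state=new_opt_state)

def sgd(params, grads, opt_state, lr=0.1):
    # A trivial "gradient transformation": plain SGD with no extra state.
    return {k: v - lr * grads[k] for k, v in params.items()}, opt_state

state = TrainState(step=0, params={"w": 1.0}, opt_state=None,
                   apply_updates=sgd)
state = state.apply_gradients({"w": 0.5})
print(state.step, state.params["w"])  # 1 0.95
```

Because the state is a single immutable container, checkpointing reduces to serializing one object, which is exactly why the real TrainState pairs well with orbax.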
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
JAX distributed initialization will either succeed or throw a ValueError with the specific message 'coordinator_address should be defined'
If this fails: Any other exception type (NetworkError, TimeoutError, etc.) will crash the program instead of being handled gracefully for single-GPU setups
examples/gemma/main.py:jax.distributed.initialize
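The fragility here is the narrow except clause. A pure-Python sketch of the pattern (init_distributed and maybe_init are hypothetical stand-ins, not JAX functions; the real code calls jax.distributed.initialize):

```python
def init_distributed(configured: bool):
    # Hypothetical stand-in for jax.distributed.initialize: raises
    # ValueError when no coordinator address is configured.
    if not configured:
        raise ValueError("coordinator_address should be defined")

def maybe_init(configured: bool) -> bool:
    # The assumed pattern: treat *only* this ValueError as
    # "single-process setup, continue without distribution".
    try:
        init_distributed(configured)
        return True
    except ValueError as e:
        if "coordinator_address" not in str(e):
            raise  # any other error still crashes, as the assumption notes
        return False

print(maybe_init(True), maybe_init(False))  # True False
```

A NetworkError or TimeoutError raised by the real initializer would bypass the except clause entirely, which is exactly the silent-failure surface described above.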
TensorFlow GPU device hiding always succeeds and never raises an exception
If this fails: If TF device management fails (permissions, driver issues), the program crashes before JAX initialization, providing no fallback path
examples/imagenet/main.py:tf.config.experimental.set_visible_devices
jax.config.config_with_absl() must be called after flag definitions but before app.run() to properly integrate JAX flags with absl
If this fails: If called in wrong order or not at all, JAX configuration flags may not be parsed correctly, leading to silent configuration mismatches between intended and actual settings
examples/lm1b/main.py:jax.config.config_with_absl
The train.train_and_evaluate function expects FLAGS.config to be a valid ml_collections.ConfigDict with all required fields for the specific example
If this fails: Missing config fields cause AttributeError deep in training loop rather than early validation, wasting setup time and making debugging harder
examples/*/main.py:train.train_and_evaluate
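A cheap guard against this failure mode is to validate the config before training starts. A sketch, where the field names are illustrative rather than any real example's schema, and Cfg stands in for an ml_collections.ConfigDict:

```python
REQUIRED_FIELDS = ("learning_rate", "num_train_steps",
                   "per_device_batch_size")  # illustrative names only

def missing_fields(config):
    # Fields the training loop would eventually touch, checked up front.
    return [f for f in REQUIRED_FIELDS if not hasattr(config, f)]

def validate_config(config):
    # Fail fast at startup instead of deep inside the training loop.
    missing = missing_fields(config)
    if missing:
        raise ValueError(f"config is missing required fields: {missing}")

class Cfg:  # stand-in for an ml_collections.ConfigDict
    learning_rate = 1e-3
    num_train_steps = 1000

print(missing_fields(Cfg()))  # ['per_device_batch_size']
```

The error now names the missing field at startup, instead of surfacing as an AttributeError minutes into setup.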
JAX distributed environment is properly initialized before calling process_index() and process_count()
If this fails: In misconfigured multi-host setups, these functions may return incorrect values (every process thinks it is index 0), leading to data corruption when multiple processes write to the same checkpoint paths
examples/*/main.py:jax.process_index
The workdir path is writable and has sufficient disk space for checkpoints, logs, and temporary files
If this fails: Training runs for hours before failing when trying to save first checkpoint, losing all progress and potentially corrupting partial checkpoint files
examples/*/main.py:FLAGS.workdir
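A startup probe converts this hours-late failure into an immediate one. A sketch that checks writability only; a disk-space check could be added with shutil.disk_usage (check_workdir is a hypothetical helper, not part of Flax):

```python
import os
import tempfile
import uuid

def check_workdir(workdir: str) -> None:
    # Probe the directory once at startup instead of discovering a
    # permission problem hours later at the first checkpoint save.
    os.makedirs(workdir, exist_ok=True)
    probe = os.path.join(workdir, f".write_probe_{uuid.uuid4().hex}")
    with open(probe, "wb") as f:
        f.write(b"ok")
    os.remove(probe)  # leave no residue behind

workdir = os.path.join(tempfile.gettempdir(), "flax_demo_workdir")
check_workdir(workdir)  # raises OSError early if the path is unusable
print("workdir writable:", workdir)
```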
The platform work unit is available and writable at startup time
If this fails: In environments where work unit tracking is disabled or fails, the set_task_status call may silently fail or throw exceptions that aren't handled, potentially crashing the training job
examples/*/main.py:platform.work_unit().set_task_status
The config object contains all the required attributes (model_dir, experiment, batch_size, etc.) that train.py expects to find in FLAGS
If this fails: Missing config attributes cause AttributeError when accessing FLAGS.missing_field in train.py, but the error location is misleading since the issue is in config file structure
examples/nlp_seq/main.py:FLAGS assignment
TensorFlow GPU hiding must happen before any JAX device initialization to prevent memory conflicts
If this fails: If JAX initializes GPU memory before TF device hiding, both frameworks may compete for GPU memory leading to OOM errors or degraded performance that's hard to debug
examples/*/main.py:tf.config.experimental.set_visible_devices before JAX calls
The local_devices() list is small enough to fit in a single log line without truncation
If this fails: On large multi-GPU systems (8+ GPUs per host), device logging may be truncated or overflow log buffers, making device debugging harder
examples/*/main.py:jax.local_devices logging
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Module parameters — each Module holds Variable objects containing parameter tensors, accumulated gradients, and optimizer momentum terms
- Training checkpoints — periodic snapshots of the complete training state saved to disk for recovery and inference deployment
- Dataset cache — preprocessed and tokenized training data cached in memory to avoid repeated computation across training epochs
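The in-memory cache pool can be illustrated with functools.lru_cache. This is a conceptual sketch of "preprocess once, reuse across epochs", not Flax's actual data pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def preprocess(example: str) -> tuple:
    # Stand-in for tokenization/normalization; cached so repeated
    # epochs reuse the work instead of recomputing it.
    preprocess.calls += 1
    return tuple(ord(c) for c in example.lower())

preprocess.calls = 0
dataset = ["Hello", "World", "Hello"]  # "Hello" repeats across epochs

for epoch in range(3):
    batches = [preprocess(ex) for ex in dataset]

print(preprocess.calls)  # 2: each distinct example preprocessed once
```

Nine lookups hit the cache after only two real preprocessing calls, which is the trade the pool makes: memory for repeated compute.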
Feedback Loops
- Training loop (training-loop, reinforcing) — Trigger: gradient computation. Action: optimizer updates parameters based on loss gradients. Exit: max steps reached or convergence.
- Learning rate scheduling (convergence, balancing) — Trigger: training step count. Action: adjusts learning rate using warmup and decay schedules. Exit: training completion.
- Gradient accumulation (gradient-accumulation, reinforcing) — Trigger: small batch size. Action: accumulates gradients over multiple mini-batches before parameter update. Exit: target effective batch size reached.
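The gradient-accumulation loop above can be sketched in a few lines of NumPy (illustrative, not Flax's API):

```python
import numpy as np

def accumulate_and_step(params, micro_grads, accum_steps, lr=0.1):
    # Average gradients over `accum_steps` micro-batches, then apply
    # a single parameter update — simulating a larger effective batch.
    total = np.zeros_like(params)
    for g in micro_grads:
        total += g
    return params - lr * (total / accum_steps)

params = np.array([1.0, 1.0])
micro_grads = [np.array([0.2, 0.4]), np.array([0.4, 0.8]),
               np.array([0.6, 1.2]), np.array([0.8, 1.6])]  # 4 micro-batches
params = accumulate_and_step(params, micro_grads, accum_steps=4)
print(params)  # one update with the averaged gradient, ~[0.95, 0.90]
```

The exit condition in the loop description corresponds to `accum_steps`: once that many micro-batch gradients have been summed, the single averaged update fires.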
Delays
- JIT compilation (compilation, ~seconds on first call) — first training step takes much longer while JAX compiles the computation graph
- Checkpoint saving (checkpoint-save, ~seconds for large models) — training pauses while orbax writes parameter tensors to disk, can be async
- Data loading (batch-window, ~milliseconds per batch) — GPU may idle while waiting for next batch preprocessing and transfer
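The data-loading delay is typically hidden by prefetching: a producer thread preprocesses batches into a bounded queue ahead of the consumer. A pure-Python sketch of that pattern (not Flax's actual input pipeline):

```python
import queue
import threading
import time

def producer(batches, q):
    # Data-loading thread: preprocesses batches ahead of the consumer
    # so the "accelerator" below rarely waits on preprocessing.
    for b in batches:
        time.sleep(0.01)          # simulated preprocessing cost
        q.put(b * 2)              # the "preprocessed" batch
    q.put(None)                   # end-of-dataset sentinel

q = queue.Queue(maxsize=2)        # bounded buffer: at most 2 batches ahead
threading.Thread(target=producer, args=(range(5), q), daemon=True).start()

results = []
while (batch := q.get()) is not None:
    results.append(batch)         # the "training step" consumes the batch

print(results)  # [0, 2, 4, 6, 8]
```

The bounded queue is the key design choice: it caps memory while keeping a couple of batches staged, which is how the batch-window delay stops stalling the training step.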
Control Points
- per_device_batch_size (hyperparameter) — Controls: memory usage and gradient noise during training. Default: 32
- vocab_size (architecture-switch) — Controls: embedding layer dimensions and tokenizer vocabulary. Default: varies by model
- dataset_name (runtime-toggle) — Controls: which dataset pipeline to use for training and evaluation. Default: lm1b
- JAX_PLATFORMS (env-var) — Controls: whether JAX uses CPU, GPU, or TPU devices for computation
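How an environment-variable control point like JAX_PLATFORMS is typically consumed can be sketched as follows (select_platform is a hypothetical helper, not a JAX function; JAX itself reads the variable internally):

```python
import os

def select_platform(env=os.environ):
    # Mirrors the control-point semantics: an unset or empty variable
    # means "auto-detect"; otherwise the value pins the device platform
    # list (JAX accepts a comma-separated list like "cpu,gpu").
    value = env.get("JAX_PLATFORMS", "").strip()
    return value.split(",") if value else ["auto-detect"]

print(select_platform({}))                        # ['auto-detect']
print(select_platform({"JAX_PLATFORMS": "cpu"}))  # ['cpu']
```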
Technology Stack
- JAX — provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon
- optax — supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap
- orbax — handles efficient checkpointing and model serialization, especially for large models across multiple devices
- ml_collections — provides ConfigDict for managing hyperparameters and model configurations in examples
- tensorflow_datasets — loads common ML datasets such as ImageNet, MNIST, and LM1B for the training examples
- sentencepiece — tokenizes text data for NLP examples such as the Gemma and LM1B language models
- treescope — provides rich visualization and debugging tools for PyTree structures and model introspection
Key Components
- nnx.Module (factory, flax/nnx/module.py) — Base class for all NNX neural network modules; manages Variable instances and provides graph split/merge functionality for JAX transformations
- GraphDefState (adapter, flax/nnx/graph.py) — Splits NNX modules into pure structure (GraphDef) and mutable state (State) so they can pass through JAX transformations, then merges them back
- Optimizer (optimizer, flax/nnx/optim.py) — Wraps optax optimizers to work with NNX modules; applies gradient updates to module parameters while preserving graph structure
- transforms.jit (transformer, flax/nnx/transforms/compilation.py) — JIT compilation wrapper that automatically handles NNX module graph splitting before compilation and merging after execution
- transforms.grad (transformer, flax/nnx/transforms/autodiff.py) — Automatic differentiation that works with NNX modules by splitting them before taking gradients and reconstructing gradients in module form
- linen.Module (factory, flax/linen/module.py) — Base class for Linen API modules, which use a functional style where parameters are managed externally in a separate variables dict
- train_state.TrainState (store, flax/training/train_state.py) — Standard container for training-loop state: parameters, optimizer state, step counter, and learning rate schedule
- checkpoints.save_checkpoint (serializer, flax/training/checkpoints.py) — Saves training state to disk in orbax format; handles large-model sharding across devices and async writing for performance
- common_utils.shard (dispatcher, flax/training/common_utils.py) — Distributes data and model parameters across multiple devices for data-parallel training; handles device placement and synchronization
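The split/apply/merge pattern that several of these components share can be illustrated without JAX: separate an object into an immutable structural description and a plain state dict, transform the state as pure data, then merge back. A pure-Python sketch whose function names mirror, but are not, the real nnx.split/nnx.merge:

```python
class Layer:
    # A toy stateful "module": structure (class + field names)
    # plus state (the field values).
    def __init__(self, w, b):
        self.w, self.b = w, b

def split(module):
    # "GraphDef": the structure only; "State": a name -> value dict.
    graphdef = (type(module), tuple(vars(module)))
    state = dict(vars(module))
    return graphdef, state

def merge(graphdef, state):
    # Rebuild the module from structure + state, inverting split().
    cls, fields = graphdef
    module = cls.__new__(cls)
    for name in fields:
        setattr(module, name, state[name])
    return module

layer = Layer(w=2.0, b=0.5)
graphdef, state = split(layer)
state = {k: v * 10 for k, v in state.items()}  # pure transform on state only
layer = merge(graphdef, state)                 # reconstruct the module
print(layer.w, layer.b)  # 20.0 5.0
```

This is why the pattern matters for JAX: transformations like jit and grad only see pure functions over the state dict, while the graph definition carries the structure past the transformation untouched.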
Frequently Asked Questions
What is flax used for?
Flax is a neural network library for JAX that simplifies model creation and training. google/flax is a 9-component library written primarily in Python (the repository also contains Jupyter notebooks for docs and examples). Data flows through 6 distinct pipeline stages. The codebase contains 351 files.
How is flax architected?
flax is organized into 4 architecture layers: Core Library, Training Utilities, Example Applications, Testing & Benchmarks. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through flax?
Data moves through 6 stages: Load and preprocess data → Initialize model and optimizer → Forward pass through model → Compute loss and metrics → Backward pass and parameter updates → Checkpoint and evaluation. Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically, while Linen requires explicit parameter passing.
What technologies does flax use?
The core stack includes JAX (provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon), optax (supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap), orbax (handles efficient checkpointing and model serialization, especially for large models across multiple devices), ml_collections (provides ConfigDict for managing hyperparameters and model configurations in examples), tensorflow_datasets (loads common ML datasets like ImageNet, MNIST, LM1B for training examples), sentencepiece (tokenizes text data for NLP examples like Gemma and LM1B language models), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does flax have?
flax exhibits 3 data pools (Module parameters, Training checkpoints), 3 feedback loops, 4 control points, 3 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does flax use?
4 design patterns detected: Split-Apply-Merge, Variable Typing, Config-Driven Examples, Dual API Support.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.