google/flax
Flax is a neural network library for JAX that is designed for flexibility.
Neural network library for JAX that simplifies model creation and training
Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 9-component library. 351 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically while Linen requires explicit parameter passing.
- Load and preprocess data — Dataset loaders in examples/ read from tensorflow_datasets or custom sources, apply tokenization for text or normalization for images, create batches with proper padding and masking (config: per_device_batch_size, dataset_name, max_corpus_chars)
- Initialize model and optimizer — Create Module instance with nnx.Rngs for parameter initialization, wrap with Optimizer class that holds optax gradient transformation and tracks parameter state (config: vocab_size)
- Forward pass through model — Module.__call__ processes input batch through neural network layers (conv, attention, MLP), each layer accesses its Variable parameters and produces intermediate activations [Batch → logits tensor]
- Compute loss and metrics — Loss functions like cross-entropy compare model logits against target labels, produce scalar loss value plus auxiliary metrics like accuracy for monitoring [logits tensor → loss scalar]
- Backward pass and parameter updates — nnx.grad automatically differentiates loss with respect to model parameters, optimizer applies gradient updates using momentum, weight decay, learning rate schedules [loss scalar → Updated Module]
- Checkpoint and evaluation — Training state gets saved to disk every N steps using orbax checkpointing, model runs on validation data to compute eval metrics, logs training progress [Updated Module → saved checkpoint] (config: eval_per_device_batch_size, eval_dataset_name, eval_split)
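The six stages above amount to one repeated training step. The following is a minimal NumPy sketch of that loop with illustrative names; real Flax code would use a Module, nnx.grad (or jax.grad), and an optax optimizer instead of the hand-written pieces here:

```python
import numpy as np

def forward(params, x):
    # Stage 3: forward pass — one linear layer standing in for a Module.
    return x @ params["w"] + params["b"]

def loss_and_grads(params, x, y):
    # Stage 4: mean-squared-error loss; Stage 5: analytic gradients
    # (in Flax these would come from automatic differentiation).
    err = forward(params, x) - y
    loss = float(np.mean(err ** 2))
    grads = {
        "w": 2.0 * x.T @ err / len(x),
        "b": 2.0 * err.mean(axis=0),
    }
    return loss, grads

# Stages 1-2: a toy batch and zero-initialized parameters.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3))
y = x @ np.array([[1.0], [2.0], [3.0]])
params = {"w": np.zeros((3, 1)), "b": np.zeros(1)}

for step in range(300):  # the training loop repeats stages 3-5
    loss, grads = loss_and_grads(params, x, y)
    params = {k: v - 0.1 * grads[k] for k, v in params.items()}

print(f"final loss: {loss:.2e}")  # near zero once the fit converges
```

Stage 6 (checkpointing and evaluation) would wrap this loop, saving `params` every N steps and computing the same loss on held-out data.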
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Module (flax/nnx/module.py) — Python class with nnx.Variable fields containing parameter tensors; inherits from the nnx.Module base class, whose __call__ method defines the forward pass. Created during model initialization, holds parameters as Variable objects, gets split into graph + state for JAX transformations, then merged back.
- Variable (flax/nnx/variables.py) — Container holding a jax.Array plus metadata such as the variable type (Param, BatchStat, etc.) and a unique identity for graph operations. Wraps parameter tensors; extracted during graph splits, modified by optimizers, and restored during merges.
- State (flax/nnx/state.py) — Dict-like structure mapping paths to Variables, created by splitting Module graphs for JAX function transformations. Extracted from Module graphs before JAX transforms, passed through pure functions, and used to update Module state afterward.
- GraphDef (flax/nnx/graph.py) — Immutable structure describing Module topology with node types, connections, and metadata: the "blueprint" without parameter values. Created during graph splits to preserve Module structure, used by JAX for compilation decisions, and combined with State to reconstruct Modules.
- Batch (examples/mnist/train.py) — Dict with image: jnp.ndarray[batch_size, 28, 28, 1] and label: jnp.ndarray[batch_size] for MNIST; the exact shape varies by example. Loaded from datasets, preprocessed with normalization and batching, fed to the model forward pass, and used for gradient computation.
- TrainState (flax/training/train_state.py) — Container with step: int, params: PyTree, opt_state: optax.OptState, and tx: optax.GradientTransformation for training-loop state. Initialized with model parameters and an optimizer, updated each training step with new gradients, and saved/restored for checkpointing.
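The TrainState contract above can be mimicked with a plain dataclass. This is a simplified stand-in, not the real flax.training.train_state.TrainState (which works on PyTrees and wraps a real optax.GradientTransformation):

```python
from dataclasses import dataclass, replace
from typing import Any, Callable

@dataclass(frozen=True)
class TrainState:
    # Simplified stand-in: an immutable record of everything
    # the training loop mutates, updated functionally.
    step: int
    params: Any              # a PyTree of parameters in real Flax
    opt_state: Any           # optax.OptState in real Flax
    apply_updates: Callable  # stands in for tx (the gradient transformation)

    def apply_gradients(self, grads):
        # Each training step returns a *new* state rather than
        # mutating the old one, matching JAX's functional style.
        new_params, new_opt_state = self.apply_updates(
            self.params, grads, self.opt_state)
        return replace(self, step=self.step + 1,
                       params=new_params, opt_state=new_opt_state)

def sgd(params, grads, opt_state, lr=0.1):
    # A trivial "gradient transformation": plain SGD with no extra state.
    return {k: v - lr * grads[k] for k, v in params.items()}, opt_state

state = TrainState(step=0, params={"w": 1.0}, opt_state=None,
                   apply_updates=sgd)
state = state.apply_gradients({"w": 0.5})
print(state.step, state.params["w"])  # 1 0.95
```

Because the state is a single immutable container, checkpointing reduces to serializing one object, which is exactly why the real TrainState pairs well with orbax.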
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
JAX distributed initialization will either succeed or throw a ValueError with the specific message 'coordinator_address should be defined'
If this fails: Any other exception type (NetworkError, TimeoutError, etc.) will crash the program instead of being handled gracefully for single-GPU setups
examples/gemma/main.py:jax.distributed.initialize
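The fragility here is the narrow except clause. A pure-Python sketch of the pattern (init_distributed and maybe_init are hypothetical stand-ins, not JAX functions; the real code calls jax.distributed.initialize):

```python
def init_distributed(configured: bool):
    # Hypothetical stand-in for jax.distributed.initialize: raises
    # ValueError when no coordinator address is configured.
    if not configured:
        raise ValueError("coordinator_address should be defined")

def maybe_init(configured: bool) -> bool:
    # The assumed pattern: treat *only* this ValueError as
    # "single-process setup, continue without distribution".
    try:
        init_distributed(configured)
        return True
    except ValueError as e:
        if "coordinator_address" not in str(e):
            raise  # any other error still crashes, as the assumption notes
        return False

print(maybe_init(True), maybe_init(False))  # True False
```

A NetworkError or TimeoutError raised by the real initializer would bypass the except clause entirely, which is exactly the silent-failure surface described above.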
TensorFlow GPU device hiding always succeeds and never raises an exception
If this fails: If TF device management fails (permissions, driver issues), the program crashes before JAX initialization, providing no fallback path
examples/imagenet/main.py:tf.config.experimental.set_visible_devices
jax.config.config_with_absl() must be called after flag definitions but before app.run() to properly integrate JAX flags with absl
If this fails: If called in wrong order or not at all, JAX configuration flags may not be parsed correctly, leading to silent configuration mismatches between intended and actual settings
examples/lm1b/main.py:jax.config.config_with_absl
The train.train_and_evaluate function expects FLAGS.config to be a valid ml_collections.ConfigDict with all required fields for the specific example
If this fails: Missing config fields cause AttributeError deep in training loop rather than early validation, wasting setup time and making debugging harder
examples/*/main.py:train.train_and_evaluate
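A cheap guard against this failure mode is to validate the config before training starts. A sketch, where the field names are illustrative rather than any real example's schema, and Cfg stands in for an ml_collections.ConfigDict:

```python
REQUIRED_FIELDS = ("learning_rate", "num_train_steps",
                   "per_device_batch_size")  # illustrative names only

def missing_fields(config):
    # Fields the training loop would eventually touch, checked up front.
    return [f for f in REQUIRED_FIELDS if not hasattr(config, f)]

def validate_config(config):
    # Fail fast at startup instead of deep inside the training loop.
    missing = missing_fields(config)
    if missing:
        raise ValueError(f"config is missing required fields: {missing}")

class Cfg:  # stand-in for an ml_collections.ConfigDict
    learning_rate = 1e-3
    num_train_steps = 1000

print(missing_fields(Cfg()))  # ['per_device_batch_size']
```

The error now names the missing field at startup, instead of surfacing as an AttributeError minutes into setup.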
JAX distributed environment is properly initialized before calling process_index() and process_count()
If this fails: In misconfigured multi-host setups, these functions may return incorrect values (every process thinks it is index 0), leading to data corruption when multiple processes write to the same checkpoint paths
examples/*/main.py:jax.process_index
The workdir path is writable and has sufficient disk space for checkpoints, logs, and temporary files
If this fails: Training runs for hours before failing when trying to save first checkpoint, losing all progress and potentially corrupting partial checkpoint files
examples/*/main.py:FLAGS.workdir
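A startup probe converts this hours-late failure into an immediate one. A sketch that checks writability only; a disk-space check could be added with shutil.disk_usage (check_workdir is a hypothetical helper, not part of Flax):

```python
import os
import tempfile
import uuid

def check_workdir(workdir: str) -> None:
    # Probe the directory once at startup instead of discovering a
    # permission problem hours later at the first checkpoint save.
    os.makedirs(workdir, exist_ok=True)
    probe = os.path.join(workdir, f".write_probe_{uuid.uuid4().hex}")
    with open(probe, "wb") as f:
        f.write(b"ok")
    os.remove(probe)  # leave no residue behind

workdir = os.path.join(tempfile.gettempdir(), "flax_demo_workdir")
check_workdir(workdir)  # raises OSError early if the path is unusable
print("workdir writable:", workdir)
```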
The platform work unit is available and writable at startup time
If this fails: In environments where work unit tracking is disabled or fails, the set_task_status call may silently fail or throw exceptions that aren't handled, potentially crashing the training job
examples/*/main.py:platform.work_unit().set_task_status
The config object contains all the required attributes (model_dir, experiment, batch_size, etc.) that train.py expects to find in FLAGS
If this fails: Missing config attributes cause AttributeError when accessing FLAGS.missing_field in train.py, but the error location is misleading since the issue is in config file structure
examples/nlp_seq/main.py:FLAGS assignment
TensorFlow GPU hiding must happen before any JAX device initialization to prevent memory conflicts
If this fails: If JAX initializes GPU memory before TF device hiding, both frameworks may compete for GPU memory leading to OOM errors or degraded performance that's hard to debug
examples/*/main.py:tf.config.experimental.set_visible_devices before JAX calls
The local_devices() list is small enough to fit in a single log line without truncation
If this fails: On large multi-GPU systems (8+ GPUs per host), device logging may be truncated or overflow log buffers, making device debugging harder
examples/*/main.py:jax.local_devices logging
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Module parameters — each Module holds Variable objects containing parameter tensors, accumulated gradients, and optimizer momentum terms
- Training checkpoints — periodic snapshots of the complete training state saved to disk for recovery and inference deployment
- Dataset cache — preprocessed and tokenized training data cached in memory to avoid repeated computation across training epochs
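The in-memory cache pool can be illustrated with functools.lru_cache. This is a conceptual sketch of "preprocess once, reuse across epochs", not Flax's actual data pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def preprocess(example: str) -> tuple:
    # Stand-in for tokenization/normalization; cached so repeated
    # epochs reuse the work instead of recomputing it.
    preprocess.calls += 1
    return tuple(ord(c) for c in example.lower())

preprocess.calls = 0
dataset = ["Hello", "World", "Hello"]  # "Hello" repeats across epochs

for epoch in range(3):
    batches = [preprocess(ex) for ex in dataset]

print(preprocess.calls)  # 2: each distinct example preprocessed once
```

Nine lookups hit the cache after only two real preprocessing calls, which is the trade the pool makes: memory for repeated compute.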
Feedback Loops
- Training loop (training-loop, reinforcing) — Trigger: gradient computation. Action: optimizer updates parameters based on loss gradients. Exit: max steps reached or convergence.
- Learning rate scheduling (convergence, balancing) — Trigger: training step count. Action: adjusts learning rate using warmup and decay schedules. Exit: training completion.
- Gradient accumulation (gradient-accumulation, reinforcing) — Trigger: small batch size. Action: accumulates gradients over multiple mini-batches before parameter update. Exit: target effective batch size reached.
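The gradient-accumulation loop above can be sketched in a few lines of NumPy (illustrative, not Flax's API):

```python
import numpy as np

def accumulate_and_step(params, micro_grads, accum_steps, lr=0.1):
    # Average gradients over `accum_steps` micro-batches, then apply
    # a single parameter update — simulating a larger effective batch.
    total = np.zeros_like(params)
    for g in micro_grads:
        total += g
    return params - lr * (total / accum_steps)

params = np.array([1.0, 1.0])
micro_grads = [np.array([0.2, 0.4]), np.array([0.4, 0.8]),
               np.array([0.6, 1.2]), np.array([0.8, 1.6])]  # 4 micro-batches
params = accumulate_and_step(params, micro_grads, accum_steps=4)
print(params)  # one update with the averaged gradient, ~[0.95, 0.90]
```

The exit condition in the loop description corresponds to `accum_steps`: once that many micro-batch gradients have been summed, the single averaged update fires.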
Delays
- JIT compilation (compilation, ~seconds on first call) — first training step takes much longer while JAX compiles the computation graph
- Checkpoint saving (checkpoint-save, ~seconds for large models) — training pauses while orbax writes parameter tensors to disk, can be async
- Data loading (batch-window, ~milliseconds per batch) — GPU may idle while waiting for next batch preprocessing and transfer
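The data-loading delay is typically hidden by prefetching: a producer thread preprocesses batches into a bounded queue ahead of the consumer. A pure-Python sketch of that pattern (not Flax's actual input pipeline):

```python
import queue
import threading
import time

def producer(batches, q):
    # Data-loading thread: preprocesses batches ahead of the consumer
    # so the "accelerator" below rarely waits on preprocessing.
    for b in batches:
        time.sleep(0.01)          # simulated preprocessing cost
        q.put(b * 2)              # the "preprocessed" batch
    q.put(None)                   # end-of-dataset sentinel

q = queue.Queue(maxsize=2)        # bounded buffer: at most 2 batches ahead
threading.Thread(target=producer, args=(range(5), q), daemon=True).start()

results = []
while (batch := q.get()) is not None:
    results.append(batch)         # the "training step" consumes the batch

print(results)  # [0, 2, 4, 6, 8]
```

The bounded queue is the key design choice: it caps memory while keeping a couple of batches staged, which is how the batch-window delay stops stalling the training step.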
Control Points
- per_device_batch_size (hyperparameter) — Controls: memory usage and gradient noise during training. Default: 32
- vocab_size (architecture-switch) — Controls: embedding layer dimensions and tokenizer vocabulary. Default: varies by model
- dataset_name (runtime-toggle) — Controls: which dataset pipeline to use for training and evaluation. Default: lm1b
- JAX_PLATFORMS (env-var) — Controls: whether JAX uses CPU, GPU, or TPU devices for computation
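How an environment-variable control point like JAX_PLATFORMS is typically consumed can be sketched as follows (select_platform is a hypothetical helper, not a JAX function; JAX itself reads the variable internally):

```python
import os

def select_platform(env=os.environ):
    # Mirrors the control-point semantics: an unset or empty variable
    # means "auto-detect"; otherwise the value pins the device platform
    # list (JAX accepts a comma-separated list like "cpu,gpu").
    value = env.get("JAX_PLATFORMS", "").strip()
    return value.split(",") if value else ["auto-detect"]

print(select_platform({}))                        # ['auto-detect']
print(select_platform({"JAX_PLATFORMS": "cpu"}))  # ['cpu']
```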
Technology Stack
- JAX — provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon
- optax — supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap
- orbax — handles efficient checkpointing and model serialization, especially for large models across multiple devices
- ml_collections — provides ConfigDict for managing hyperparameters and model configurations in examples
- tensorflow_datasets — loads common ML datasets such as ImageNet, MNIST, and LM1B for the training examples
- sentencepiece — tokenizes text data for NLP examples such as the Gemma and LM1B language models
- treescope — provides rich visualization and debugging tools for PyTree structures and model introspection
Key Components
- nnx.Module (factory, flax/nnx/module.py) — Base class for all NNX neural network modules; manages Variable instances and provides graph split/merge functionality for JAX transformations
- GraphDefState (adapter, flax/nnx/graph.py) — Splits NNX modules into pure structure (GraphDef) and mutable state (State) so they can pass through JAX transformations, then merges them back
- Optimizer (optimizer, flax/nnx/optim.py) — Wraps optax optimizers to work with NNX modules; applies gradient updates to module parameters while preserving graph structure
- transforms.jit (transformer, flax/nnx/transforms/compilation.py) — JIT compilation wrapper that automatically handles NNX module graph splitting before compilation and merging after execution
- transforms.grad (transformer, flax/nnx/transforms/autodiff.py) — Automatic differentiation that works with NNX modules by splitting them before taking gradients and reconstructing gradients in module form
- linen.Module (factory, flax/linen/module.py) — Base class for Linen API modules, which use a functional style where parameters are managed externally in a separate variables dict
- train_state.TrainState (store, flax/training/train_state.py) — Standard container for training-loop state: parameters, optimizer state, step counter, and learning rate schedule
- checkpoints.save_checkpoint (serializer, flax/training/checkpoints.py) — Saves training state to disk in orbax format; handles large-model sharding across devices and async writing for performance
- common_utils.shard (dispatcher, flax/training/common_utils.py) — Distributes data and model parameters across multiple devices for data-parallel training; handles device placement and synchronization
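The split/apply/merge pattern that several of these components share can be illustrated without JAX: separate an object into an immutable structural description and a plain state dict, transform the state as pure data, then merge back. A pure-Python sketch whose function names mirror, but are not, the real nnx.split/nnx.merge:

```python
class Layer:
    # A toy stateful "module": structure (class + field names)
    # plus state (the field values).
    def __init__(self, w, b):
        self.w, self.b = w, b

def split(module):
    # "GraphDef": the structure only; "State": a name -> value dict.
    graphdef = (type(module), tuple(vars(module)))
    state = dict(vars(module))
    return graphdef, state

def merge(graphdef, state):
    # Rebuild the module from structure + state, inverting split().
    cls, fields = graphdef
    module = cls.__new__(cls)
    for name in fields:
        setattr(module, name, state[name])
    return module

layer = Layer(w=2.0, b=0.5)
graphdef, state = split(layer)
state = {k: v * 10 for k, v in state.items()}  # pure transform on state only
layer = merge(graphdef, state)                 # reconstruct the module
print(layer.w, layer.b)  # 20.0 5.0
```

This is why the pattern matters for JAX: transformations like jit and grad only see pure functions over the state dict, while the graph definition carries the structure past the transformation untouched.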
Frequently Asked Questions
What is flax used for?
Flax is a neural network library for JAX that simplifies model creation and training. google/flax is a 9-component library written primarily in Python (the repository also contains Jupyter notebooks for docs and examples). Data flows through 6 distinct pipeline stages. The codebase contains 351 files.
How is flax architected?
flax is organized into 4 architecture layers: Core Library, Training Utilities, Example Applications, Testing & Benchmarks. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through flax?
Data moves through 6 stages: Load and preprocess data → Initialize model and optimizer → Forward pass through model → Compute loss and metrics → Backward pass and parameter updates → Checkpoint and evaluation. Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically, while Linen requires explicit parameter passing.
What technologies does flax use?
The core stack includes JAX (provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon), optax (supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap), orbax (handles efficient checkpointing and model serialization, especially for large models across multiple devices), ml_collections (provides ConfigDict for managing hyperparameters and model configurations in examples), tensorflow_datasets (loads common ML datasets like ImageNet, MNIST, LM1B for training examples), sentencepiece (tokenizes text data for NLP examples like Gemma and LM1B language models), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does flax have?
flax exhibits 3 data pools (Module parameters, Training checkpoints), 3 feedback loops, 4 control points, 3 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does flax use?
4 design patterns detected: Split-Apply-Merge, Variable Typing, Config-Driven Examples, Dual API Support.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.