google/flax

Flax is a neural network library for JAX that is designed for flexibility.

7,164 stars · Jupyter Notebook · 9 components

Neural network library for JAX that simplifies model creation and training

Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.

A 9-component library. 351 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically while Linen requires explicit parameter passing.

  1. Load and preprocess data — Dataset loaders in examples/ read from tensorflow_datasets or custom sources, apply tokenization for text or normalization for images, create batches with proper padding and masking (config: per_device_batch_size, dataset_name, max_corpus_chars)
  2. Initialize model and optimizer — Create Module instance with nnx.Rngs for parameter initialization, wrap with Optimizer class that holds optax gradient transformation and tracks parameter state (config: vocab_size)
  3. Forward pass through model — Module.__call__ processes input batch through neural network layers (conv, attention, MLP), each layer accesses its Variable parameters and produces intermediate activations [Batch → logits tensor]
  4. Compute loss and metrics — Loss functions like cross-entropy compare model logits against target labels, produce scalar loss value plus auxiliary metrics like accuracy for monitoring [logits tensor → loss scalar]
  5. Backward pass and parameter updates — nnx.grad automatically differentiates loss with respect to model parameters, optimizer applies gradient updates using momentum, weight decay, learning rate schedules [loss scalar → Updated Module]
  6. Checkpoint and evaluation — Training state gets saved to disk every N steps using orbax checkpointing, model runs on validation data to compute eval metrics, logs training progress [Updated Module → saved checkpoint] (config: eval_per_device_batch_size, eval_dataset_name, eval_split)
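
Stages 2 through 5 collapse into a handful of NNX calls. The following is a minimal sketch, assuming the nnx.Optimizer / nnx.value_and_grad API used in recent Flax releases; the MLP shape, learning rate, and flattened 784-feature images are illustrative, not taken from the repository.

```python
import optax
from flax import nnx

class MLP(nnx.Module):
    def __init__(self, din: int, dmid: int, dout: int, *, rngs: nnx.Rngs):
        self.linear1 = nnx.Linear(din, dmid, rngs=rngs)
        self.linear2 = nnx.Linear(dmid, dout, rngs=rngs)

    def __call__(self, x):
        return self.linear2(nnx.relu(self.linear1(x)))

# Stage 2: initialize model and optimizer.
model = MLP(784, 128, 10, rngs=nnx.Rngs(0))
optimizer = nnx.Optimizer(model, optax.adam(1e-3))

@nnx.jit
def train_step(model, optimizer, batch):
    # Stages 3 and 4: forward pass over a flattened image batch, then the loss.
    def loss_fn(model):
        logits = model(batch['image'])
        return optax.softmax_cross_entropy_with_integer_labels(
            logits, batch['label']).mean()

    # Stage 5: differentiate with respect to the Module's Variables and
    # apply the optax update in place.
    loss, grads = nnx.value_and_grad(loss_fn)(model)
    optimizer.update(grads)
    return loss
```

In the Linen API the same step would thread a params PyTree explicitly through model.apply and jax.grad, which is the "explicit parameter passing" contrast noted above.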

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Module flax/nnx/module.py
Python class with nnx.Variable fields containing parameter tensors, inherits from nnx.Module base class with __call__ method for forward pass
Created during model initialization, holds parameters as Variable objects, gets split into graph+state for JAX transformations, then merged back
Variable flax/nnx/variables.py
Container holding jax.Array with metadata like variable type (Param, BatchStat, etc.) and unique identity for graph operations
Wraps parameter tensors, gets extracted during graph splits, modified by optimizers, and restored during merges
State flax/nnx/state.py
Dict-like structure mapping paths to Variables, created by splitting Module graphs for JAX function transformations
Extracted from Module graphs before JAX transforms, passed through pure functions, used to update Module state after transforms
GraphDef flax/nnx/graph.py
Immutable structure describing Module topology with node types, connections, and metadata - the 'blueprint' without parameter values
Created during graph splits to preserve Module structure, used by JAX for compilation decisions, combined with State to reconstruct Modules
Batch examples/mnist/train.py
Dict with image: jnp.ndarray[batch_size, 28, 28, 1] and label: jnp.ndarray[batch_size] for MNIST, varies by example
Loaded from datasets, preprocessed with normalization and batching, fed to model forward pass, used for gradient computation
TrainState flax/training/train_state.py
Container with step: int, params: PyTree, opt_state: optax.OptState, tx: optax.GradientTransformation for training loop state
Initialized with model parameters and optimizer, updated each training step with new gradients, saved/restored for checkpointing
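
The Module/State/GraphDef contract is easiest to see in the split-apply-merge idiom. A minimal sketch (the Linear dimensions here are arbitrary):

```python
from flax import nnx

model = nnx.Linear(4, 2, rngs=nnx.Rngs(0))

# Split the stateful Module into an immutable GraphDef (the topology "blueprint")
# and a State (a path -> Variable mapping) that JAX transforms treat as a pytree.
graphdef, state = nnx.split(model)

# ... pass `state` through pure, jit-compiled functions ...

# Merge reconstructs an equivalent Module from the blueprint plus the values.
restored = nnx.merge(graphdef, state)
```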

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment weakly guarded

JAX distributed initialization will either succeed or throw a ValueError with the specific message 'coordinator_address should be defined'

If this fails: Any other exception type (NetworkError, TimeoutError, etc.) will crash the program instead of being handled gracefully for single-GPU setups

examples/gemma/main.py:jax.distributed.initialize
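
A hedged sketch of what such a guard looks like; this is the shape of the assumption, not the actual code in examples/gemma/main.py:

```python
import jax

try:
    jax.distributed.initialize()
except ValueError as e:
    # Assumed single-process fallback: only the coordinator_address error is
    # swallowed; NetworkError, TimeoutError, etc. still crash the program.
    if 'coordinator_address' not in str(e):
        raise
```
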
critical Resource unguarded

TensorFlow GPU device hiding always succeeds and never raises an exception

If this fails: When TF device management fails (permissions, driver issues), the program crashes before JAX initialization, leaving no fallback path

examples/imagenet/main.py:tf.config.experimental.set_visible_devices
warning Ordering unguarded

jax.config.config_with_absl() must be called after flag definitions but before app.run() to properly integrate JAX flags with absl

If this fails: When the call happens in the wrong order or not at all, JAX configuration flags may not be parsed correctly, leading to silent mismatches between intended and actual settings

examples/lm1b/main.py:jax.config.config_with_absl
warning Contract unguarded

The train.train_and_evaluate function expects FLAGS.config to be a valid ml_collections.ConfigDict with all required fields for the specific example

If this fails: Missing config fields cause AttributeError deep in training loop rather than early validation, wasting setup time and making debugging harder

examples/*/main.py:train.train_and_evaluate
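
An early validation step would turn the deep AttributeError into an immediate failure. A hypothetical sketch, with the required field names invented for illustration:

```python
import ml_collections

REQUIRED_FIELDS = ('learning_rate', 'per_device_batch_size', 'num_train_steps')

def validate_config(config: ml_collections.ConfigDict) -> None:
    # Fail before any expensive setup if the config is structurally incomplete.
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        raise ValueError(f'config is missing required fields: {missing}')
```
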
critical Environment unguarded

JAX distributed environment is properly initialized before calling process_index() and process_count()

If this fails: In misconfigured multi-host setups, these functions may return incorrect values (all processes think they are index 0), leading to data corruption from multiple processes writing to the same checkpoint paths

examples/*/main.py:jax.process_index
critical Resource unguarded

The workdir path is writable and has sufficient disk space for checkpoints, logs, and temporary files

If this fails: Training runs for hours before failing when trying to save the first checkpoint, losing all progress and potentially corrupting partial checkpoint files

examples/*/main.py:FLAGS.workdir
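
A cheap pre-flight check could fail fast instead of hours later. A hypothetical sketch (it covers writability, not free disk space):

```python
import os
import tempfile

def check_workdir(workdir: str) -> None:
    # Create the directory up front and verify it is writable before training starts.
    os.makedirs(workdir, exist_ok=True)
    with tempfile.NamedTemporaryFile(dir=workdir):
        pass  # raises OSError immediately if the path is not writable
```
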
info Temporal unguarded

The platform work unit is available and writable at startup time

If this fails: In environments where work unit tracking is disabled or fails, the set_task_status call may silently fail or throw exceptions that aren't handled, potentially crashing the training job

examples/*/main.py:platform.work_unit().set_task_status
warning Contract unguarded

The config object contains all the required attributes (model_dir, experiment, batch_size, etc.) that train.py expects to find in FLAGS

If this fails: Missing config attributes cause AttributeError when accessing FLAGS.missing_field in train.py, but the error location is misleading since the issue is in config file structure

examples/nlp_seq/main.py:FLAGS assignment
warning Ordering weakly guarded

TensorFlow GPU hiding must happen before any JAX device initialization to prevent memory conflicts

If this fails: When JAX initializes GPU memory before TF device hiding, both frameworks compete for GPU memory, leading to OOM errors or degraded performance that is hard to debug

examples/*/main.py:tf.config.experimental.set_visible_devices before JAX calls
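
The ordering constraint itself is a one-liner; the fragile part is keeping it ahead of any call that touches the accelerator. A sketch of the pattern:

```python
import tensorflow as tf
import jax

# Must run before the first JAX operation touches the accelerator: hide all
# GPUs from TensorFlow so data loading stays on CPU and JAX keeps exclusive
# use of device memory.
tf.config.experimental.set_visible_devices([], 'GPU')

print(jax.local_devices())  # safe to initialize JAX devices only after this point
```
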
info Scale unguarded

The local_devices() list is small enough to fit in a single log line without truncation

If this fails: On large multi-GPU systems (8+ GPUs per host), device logging may be truncated or overflow log buffers, making device debugging harder

examples/*/main.py:jax.local_devices logging

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Module parameters (state-store)
Each Module holds Variable objects containing parameter tensors, accumulated gradients, and optimizer momentum terms
Training checkpoints (checkpoint)
Periodic snapshots of complete training state saved to disk for recovery and inference deployment
Dataset cache (cache)
Preprocessed and tokenized training data cached in memory to avoid repeated computation during training epochs

Technology Stack

JAX (compute)
provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon
optax (library)
supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap
orbax (serialization)
handles efficient checkpointing and model serialization, especially for large models across multiple devices
ml_collections (library)
provides ConfigDict for managing hyperparameters and model configurations in examples
tensorflow_datasets (library)
loads common ML datasets like ImageNet, MNIST, LM1B for training examples
sentencepiece (library)
tokenizes text data for NLP examples like Gemma and LM1B language models
treescope (library)
provides rich visualization and debugging tools for PyTree structures and model introspection
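
How two of those dependencies meet in practice: optax supplies the gradient transformation that nnx.Optimizer wraps, and Orbax persists the resulting State. A minimal sketch, assuming the StandardCheckpointer API and an absolute checkpoint path that does not already exist:

```python
import orbax.checkpoint as ocp
from flax import nnx

model = nnx.Linear(4, 2, rngs=nnx.Rngs(0))
_, state = nnx.split(model)              # State: the checkpointable pytree

# Stage 6 of the pipeline: snapshot the training state to disk.
checkpointer = ocp.StandardCheckpointer()
checkpointer.save('/tmp/flax_ckpt/state', state)
```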


Frequently Asked Questions

What is flax used for?

google/flax is a neural network library for JAX that simplifies model creation and training. It is a 9-component library written largely in Jupyter Notebook; the codebase contains 351 files, and data flows through 6 distinct pipeline stages.

How is flax architected?

flax is organized into 4 architecture layers: Core Library, Training Utilities, Example Applications, Testing & Benchmarks. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through flax?

Data moves through 6 stages: Load and preprocess data → Initialize model and optimizer → Forward pass through model → Compute loss and metrics → Backward pass and parameter updates → Checkpoint and evaluation. Data enters through dataset loaders that create batches of examples (images, text tokens, etc.), flows through Module forward passes to produce logits, gets compared against labels in loss functions, generates gradients via automatic differentiation, and updates model parameters through optimizers. The NNX API handles module state automatically while Linen requires explicit parameter passing. This pipeline design reflects a complex multi-stage processing system.

What technologies does flax use?

The core stack includes JAX (provides the underlying tensor operations, automatic differentiation, and JIT compilation that Flax builds upon), optax (supplies the gradient transformation algorithms (Adam, SGD, etc.) that Flax optimizers wrap), orbax (handles efficient checkpointing and model serialization, especially for large models across multiple devices), ml_collections (provides ConfigDict for managing hyperparameters and model configurations in examples), tensorflow_datasets (loads common ML datasets like ImageNet, MNIST, LM1B for training examples), sentencepiece (tokenizes text data for NLP examples like Gemma and LM1B language models), and treescope (rich visualization and debugging tools for PyTree structures). A focused set of dependencies that keeps the build manageable.

What system dynamics does flax have?

flax exhibits 3 data pools (Module parameters, Training checkpoints, Dataset cache), 3 feedback loops, 4 control points, and 3 delays. The feedback loops cover the training loop and convergence behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does flax use?

4 design patterns detected: Split-Apply-Merge, Variable Typing, Config-Driven Examples, Dual API Support.
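
The Dual API Support pattern is the contrast called out earlier: Linen keeps parameters in an explicit PyTree that the caller threads through every call, while NNX stores them on the Module. A minimal Linen sketch (the feature sizes are illustrative):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class MLP(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(128)(x))
        return nn.Dense(self.features)(x)

model = MLP(features=10)
x = jnp.ones((1, 784))
params = model.init(jax.random.PRNGKey(0), x)   # explicit parameter PyTree
logits = model.apply(params, x)                 # parameters passed back in by the caller
```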

Analyzed on April 20, 2026 by CodeSea.