google/flax
Flax is a neural network library for JAX, designed for flexibility, with two APIs: NNX and Linen.
Under the hood, the system uses 3 feedback loops, 4 data pools, 4 control points, and 3 delay sources to manage its runtime behavior.
Structural Verdict
A 13-component ML training system with 15 connections. 343 files analyzed. Well-connected — clear data flow between components.
How Data Flows Through the System
Training data flows through the model forward pass, loss computation, gradient calculation, and parameter updates, using JAX transformations.
- Data Loading — Load and preprocess training batches from datasets (config: per_device_batch_size, dataset_name)
- Forward Pass — Pass input through neural network layers to compute predictions
- Loss Calculation — Compute loss between predictions and ground truth labels
- Gradient Computation — Use JAX grad to compute gradients of loss with respect to parameters
- Parameter Update — Apply optimizer (AdamW, SGD) to update model parameters using gradients
- Checkpointing — Save model state and training progress to disk periodically
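The six stages above can be sketched end to end in plain NumPy — a toy linear model standing in for the network, an analytic MSE gradient in place of jax.grad, and bare SGD in place of an Optax optimizer. All names and hyperparameters here are illustrative, not taken from the Flax examples.

```python
import numpy as np

def load_batch(rng, batch_size=32, features=4):
    """Stage 1: draw a synthetic (inputs, targets) batch."""
    x = rng.standard_normal((batch_size, features))
    true_w = np.arange(1.0, features + 1.0)   # the weights we hope to recover
    y = x @ true_w
    return x, y

def train_step(w, x, y, lr=0.1):
    preds = x @ w                      # Stage 2: forward pass
    err = preds - y
    loss = np.mean(err ** 2)           # Stage 3: loss (MSE)
    grad = 2.0 * x.T @ err / len(y)    # Stage 4: analytic gradient of the loss
    w = w - lr * grad                  # Stage 5: SGD parameter update
    return w, loss

rng = np.random.default_rng(0)
w = np.zeros(4)
losses = []
for step in range(200):
    x, y = load_batch(rng)
    w, loss = train_step(w, x, y)
    losses.append(loss)
# Stage 6 (checkpointing) would periodically serialize `w` here,
# e.g. np.save("ckpt.npy", w); the real examples use Orbax instead.
```

After 200 steps the loss has collapsed and `w` is close to the true weights, mirroring the convergence exit condition of the training loop described below.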
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Parameters — Neural network weights and biases stored in Variable objects
- Optimizer State — Momentum, learning rate schedules, and other optimizer internal state
- Checkpoints — Serialized model and training state saved to disk
- Batch Norm Statistics — Running means and variances for batch normalization layers
Feedback Loops
- Training Loop (convergence, balancing) — Trigger: Training step. Action: Compute gradients and update parameters. Exit: Maximum steps reached or convergence.
- Learning Rate Schedule (training-loop, balancing) — Trigger: Optimizer step. Action: Decay learning rate based on schedule. Exit: Training completion.
- Batch Norm Updates (training-loop, balancing) — Trigger: Forward pass. Action: Update running statistics. Exit: Evaluation mode.
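The "Batch Norm Updates" loop can be illustrated with NumPy: every training-mode forward pass folds the current batch statistics into running estimates, and evaluation mode exits the loop and uses the frozen estimates instead. The momentum value and batch shapes here are assumptions for illustration, not Flax defaults.

```python
import numpy as np

momentum = 0.9
running_mean = np.zeros(3)
running_var = np.ones(3)

rng = np.random.default_rng(42)
for _ in range(500):                       # trigger: each training forward pass
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    # exponential moving average — the feedback that accumulates statistics
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var

# Exit: evaluation mode normalizes with the accumulated statistics,
# not the current batch's.
sample = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
normalized = (sample - running_mean) / np.sqrt(running_var + 1e-5)
```

The running estimates converge toward the data's true mean (5.0) and variance (4.0), which is why inference can safely skip per-batch statistics.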
Delays & Async Processing
- JIT Compilation (async-processing, first call only) — Initial training step is slower due to XLA compilation
- Checkpoint Saving (async-processing, duration varies with model size) — Training pauses periodically to save model state
- Data Loading (async-processing, per batch) — GPU may wait for the next batch if data loading is slow
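Why the first step pays the JIT cost only once: JIT systems compile per input signature (shape and dtype) and reuse the compiled artifact on later calls. The toy cache below mimics that behavior in plain Python; it is a conceptual stand-in, not JAX's actual implementation.

```python
import numpy as np

compile_count = 0   # tracks how many "expensive compiles" happened

def jit(fn):
    cache = {}
    def wrapper(x):
        global compile_count
        signature = (x.shape, x.dtype.name)   # what triggers recompilation
        if signature not in cache:
            compile_count += 1                # stand-in for a slow XLA compile
            cache[signature] = fn             # stand-in for the compiled artifact
        return cache[signature](x)
    return wrapper

@jit
def step(x):
    return x * 2.0

step(np.ones((8, 4)))
step(np.ones((8, 4)))      # same signature: cache hit, no recompile
step(np.ones((16, 4)))     # new shape: triggers a second compile
```

This is also why changing batch shapes mid-training is costly: each new shape pays the compilation delay again.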
Control Points
- Learning Rate (runtime-toggle) — Controls: Parameter update step size. Default: varies by model
- Batch Size (runtime-toggle) — Controls: Number of samples per training step. Default: per_device_batch_size
- Model Architecture (feature-flag) — Controls: Layer sizes, attention heads, transformer blocks. Default: config-dependent
- Dataset Selection (env-var) — Controls: Which dataset to train on. Default: dataset_name
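The four control points above can be sketched as a plain dataclass. The real Gemma example uses ml_collections-style configs; the field names `per_device_batch_size` and `dataset_name` mirror the analyzed configs, while the defaults, the `num_layers` field, and the `DATASET_NAME` environment variable are assumptions for illustration.

```python
import os
from dataclasses import dataclass

@dataclass
class TrainConfig:
    learning_rate: float = 1e-3          # runtime-toggle: parameter update step size
    per_device_batch_size: int = 32      # runtime-toggle: samples per training step
    num_layers: int = 4                  # feature-flag: model architecture knob
    dataset_name: str = "lm1b"           # env-var: dataset selection

def load_config() -> TrainConfig:
    cfg = TrainConfig()
    # Dataset selection overridden from the environment (variable name assumed).
    cfg.dataset_name = os.environ.get("DATASET_NAME", cfg.dataset_name)
    return cfg

os.environ["DATASET_NAME"] = "c4"
cfg = load_config()
```

Keeping architecture knobs in config rather than code is what lets the same training script drive the tiny, small, and 4B Gemma variants listed in the Configuration section.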
Technology Stack
- JAX — Core computation framework for automatic differentiation and JIT compilation
- NumPy — Numerical computing foundation and array operations
- Optax — Gradient-based optimization algorithms like Adam and SGD
- Orbax — Checkpointing and model serialization
- TensorStore — Efficient storage and loading of large arrays
- ml_collections — Configuration management for hyperparameters
- Unit testing framework
- Package building and distribution
Key Components
- Module (class) — Base class for NNX neural network modules with mutable state and reference semantics — flax/nnx/module.py
- Module (class) — Base class for Linen modules using functional programming and immutable state — flax/linen/module.py
- Variable (class) — Container for mutable state in NNX modules like parameters and batch norm statistics — flax/nnx/variablelib.py
- GraphDef (class) — Graph definition for splitting NNX modules into static structure and dynamic state — flax/nnx/graphlib.py
- Scope (class) — Manages variable collections and transformations in Linen's functional API — flax/core/scope.py
- FrozenDict (class) — Immutable dictionary implementation for parameter storage in Linen — flax/core/frozen_dict.py
- Linear (class) — NNX linear/dense layer implementation with weight and bias parameters — flax/nnx/nn/linear.py
- Dense (class) — Linen dense layer implementation with functional parameter management — flax/linen/linear.py
- train_and_evaluate (function) — Main training loop for Gemma language model with checkpointing and evaluation — examples/gemma/train.py
- Transformer (class) — Gemma transformer architecture implementation with attention and feed-forward layers — examples/gemma/transformer.py
- jit (function) — JIT compilation wrapper for NNX functions with graph splitting support — flax/nnx/transforms/compilation.py
- grad (function) — Automatic differentiation for NNX modules with gradient computation — flax/nnx/transforms/autodiff.py
- +1 more component
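The GraphDef split/merge idea — separating a module into a static, hashable structure (safe to pass through jax.jit) and a pytree of dynamic arrays — can be sketched with plain dicts. This toy is a conceptual illustration, not the real flax.nnx.split / flax.nnx.merge API.

```python
import numpy as np

class ToyLinear:
    def __init__(self, din, dout):
        self.kernel = np.zeros((din, dout))
        self.bias = np.zeros(dout)

def split(module):
    # static structure: the class plus its attribute names (hashable)
    graphdef = (type(module), tuple(sorted(vars(module))))
    # dynamic state: the actual arrays, as a flat mapping
    state = dict(vars(module))
    return graphdef, state

def merge(graphdef, state):
    cls, _ = graphdef
    module = cls.__new__(cls)          # rebuild without re-running __init__
    module.__dict__.update(state)
    return module

m = ToyLinear(4, 2)
graphdef, state = split(m)
m2 = merge(graphdef, state)            # round-trips back to an equivalent module
```

The payoff of this separation is that JAX transformations only ever see the array state, while the object structure travels alongside as compile-time metadata.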
Sub-Modules
- flax/nnx — Object-oriented neural network library with mutable state and Python reference semantics
- flax/linen — Functional neural network library with immutable state and explicit parameter threading
- examples — Complete training scripts demonstrating Flax usage across different ML domains
- benchmarks — Performance measurement and comparison tools for different training scenarios
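The duality between the two sub-modules comes down to where state lives. A minimal sketch in plain Python, with toy stand-ins rather than the real Flax APIs: the NNX style mutates the module's own state in place, while the Linen style threads immutable state through pure functions.

```python
import numpy as np

# NNX style: the object holds mutable state, updated by reference.
class CounterNNX:
    def __init__(self):
        self.count = np.array(0)
    def __call__(self):
        self.count = self.count + 1   # mutation visible to every holder of self

# Linen style: state lives outside the module and is threaded explicitly.
def counter_linen_apply(params):
    return {"count": params["count"] + 1}   # returns fresh state, input untouched

c = CounterNNX()
c()
c()

params = {"count": np.array(0)}
params = counter_linen_apply(params)
params = counter_linen_apply(params)
```

Both end at the same count; the difference is that the Linen-style caller must capture and pass the returned state, which is exactly the "explicit parameter threading" the sub-module description refers to.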
Configuration
examples/gemma/configs/default.py (python-dataclass)
- vocab_size (int) — default: 35_000 (lm1b dataset vocab size: 35913; Gemma expected vocab size: 262_144)
- max_corpus_chars (int) — default: 10**7
- dataset_name (str) — default: 'lm1b'
- eval_dataset_name (str) — default: 'lm1b'
- eval_split (str) — default: 'test'
- per_device_batch_size (int) — default: 32
- eval_per_device_batch_size (int) — default: 32
examples/gemma/configs/gemma3_4b.py (python-dataclass)
- vocab_size (int) — default: 35_000 (lm1b dataset vocab size: 35913; Gemma expected vocab size: 262_144)
- max_corpus_chars (int) — default: 10**7
- dataset_name (str) — default: 'lm1b'
- eval_dataset_name (str) — default: 'lm1b'
- eval_split (str) — default: 'test'
- per_device_batch_size (int) — default: 32
- eval_per_device_batch_size (int) — default: 32
examples/gemma/configs/small.py (python-dataclass)
- vocab_size (int) — default: 35_000 (lm1b dataset vocab size: 35913; Gemma expected vocab size: 262_144)
- max_corpus_chars (int) — default: 10**7
- dataset_name (str) — default: 'lm1b'
- eval_dataset_name (str) — default: 'lm1b'
- eval_split (str) — default: 'test'
- per_device_batch_size (int) — default: 32
- eval_per_device_batch_size (int) — default: 32
examples/gemma/configs/tiny.py (python-dataclass)
- vocab_size (int) — default: 35_000 (lm1b dataset vocab size: 35913; Gemma expected vocab size: 262_144)
- max_corpus_chars (int) — default: 10**7
- dataset_name (str) — default: 'lm1b'
- eval_dataset_name (str) — default: 'lm1b'
- eval_split (str) — default: 'test'
- per_device_batch_size (int) — default: 32
- eval_per_device_batch_size (int) — default: 32
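The four config files share the same field set. A plain-Python equivalent of those lm1b defaults would look like the following (the real files live under examples/gemma/configs/ and use the project's own config machinery; this dataclass is just an illustrative mirror).

```python
from dataclasses import dataclass

@dataclass
class GemmaExampleConfig:
    vocab_size: int = 35_000            # lm1b vocab is 35913; Gemma expects 262_144
    max_corpus_chars: int = 10**7
    dataset_name: str = "lm1b"
    eval_dataset_name: str = "lm1b"
    eval_split: str = "test"
    per_device_batch_size: int = 32
    eval_per_device_batch_size: int = 32

cfg = GemmaExampleConfig()
```

Since the field set is identical across default, gemma3_4b, small, and tiny, the variants presumably differ only in the architecture fields not shown in this extract.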
Science Pipeline
- Input Preprocessing — Tokenization and sequence padding for text, normalization for images [variable batch sizes → (batch_size, sequence_length) or (batch_size, height, width, channels)] — examples/*/input_pipeline.py
- Embedding Lookup — Convert token IDs to dense embeddings [(batch_size, sequence_length) → (batch_size, sequence_length, embed_dim)] — examples/gemma/transformer.py
- Transformer Forward — Apply attention layers and feed-forward networks [(batch_size, sequence_length, embed_dim) → (batch_size, sequence_length, embed_dim)] — examples/gemma/transformer.py
- Output Projection — Project hidden states to vocabulary logits [(batch_size, sequence_length, embed_dim) → (batch_size, sequence_length, vocab_size)] — examples/gemma/transformer.py
- Loss Computation — Cross-entropy loss between logits and target tokens [(batch_size, sequence_length, vocab_size) → scalar loss] — examples/*/train.py
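The shape transitions in the pipeline above can be traced with NumPy stand-ins (toy sizes; the actual attention and feed-forward math is replaced with a placeholder nonlinearity):

```python
import numpy as np

batch_size, seq_len, embed_dim, vocab_size = 2, 8, 16, 100

token_ids = np.random.default_rng(0).integers(0, vocab_size, (batch_size, seq_len))

# Embedding lookup: (2, 8) -> (2, 8, 16)
embedding_table = np.random.default_rng(1).standard_normal((vocab_size, embed_dim))
embeddings = embedding_table[token_ids]

# Transformer forward (shape-preserving; tanh stands in for the real blocks)
hidden = np.tanh(embeddings)

# Output projection: (2, 8, 16) -> (2, 8, 100)
output_proj = np.random.default_rng(2).standard_normal((embed_dim, vocab_size))
logits = hidden @ output_proj

# Cross-entropy against next-token targets reduces everything to a scalar.
targets = np.roll(token_ids, -1, axis=1)
log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
loss = -np.mean(np.take_along_axis(log_probs, targets[..., None], axis=-1))
```

Note that only two stages change the trailing dimension (embedding lookup and output projection); everything in between is shape-preserving, which is what makes transformer blocks freely stackable.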
Assumptions & Constraints
- [warning] Assumes input tensor last dimension matches layer input features but no runtime shape check enforced (shape)
- [info] Assumes all tensors use same dtype (float32/bfloat16) but mixing dtypes could cause silent precision loss (dtype)
- [critical] Assumes batch dimension is first axis for computing statistics but no assertion validates this (shape)
- [warning] Assumes input images are in [0,255] range but normalization may fail if inputs are already normalized (value-range)
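The shape and dtype assumptions flagged above could be made explicit with a small validation helper at the model boundary. This is a suggested guard written for this report, not something the library does today; all names are illustrative.

```python
import numpy as np

def validate_batch(x, *, expected_features, expected_dtype=np.float32):
    """Fail fast on the unchecked assumptions: batch-first layout,
    matching feature dimension, and a single consistent dtype."""
    if x.ndim < 2:
        raise ValueError(f"expected batch dim first, got shape {x.shape}")
    if x.shape[-1] != expected_features:
        raise ValueError(
            f"last dim {x.shape[-1]} does not match layer features {expected_features}"
        )
    if x.dtype != expected_dtype:
        raise TypeError(f"dtype {x.dtype} != expected {expected_dtype}")
    return x

good = np.zeros((32, 128), dtype=np.float32)
validate_batch(good, expected_features=128)   # passes silently

try:
    validate_batch(np.zeros((32, 64), dtype=np.float32), expected_features=128)
except ValueError as e:
    caught = str(e)                           # the [warning] shape case, caught early
```

Catching these at the boundary turns silent precision loss or garbage statistics into an immediate, readable error.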
Frequently Asked Questions
What is flax used for?
flax is a neural network library for JAX with two APIs: NNX and Linen. google/flax is a 13-component ML training system; the codebase contains 343 files and is well-connected, with clear data flow between components.
How is flax architected?
flax is organized into 5 architecture layers: Core, NNX API, Linen API, Examples, and 1 more. Well-connected — clear data flow between components. This layered structure enables tight integration between components.
How does data flow through flax?
Data moves through 6 stages: Data Loading → Forward Pass → Loss Calculation → Gradient Computation → Parameter Update → Checkpointing. Training data flows through the model forward pass, loss computation, gradient calculation, and parameter updates using JAX transformations. This pipeline design reflects a complex multi-stage processing system.
What technologies does flax use?
The core stack includes JAX (Core computation framework for automatic differentiation and JIT compilation), NumPy (Numerical computing foundation and array operations), Optax (Gradient-based optimization algorithms like Adam and SGD), Orbax (Checkpointing and model serialization), TensorStore (Efficient storage and loading of large arrays), ml_collections (Configuration management for hyperparameters), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does flax have?
flax exhibits 4 data pools (Model Parameters, Optimizer State, Checkpoints, Batch Norm Statistics), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle convergence and balancing within the training loop. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does flax use?
5 design patterns detected: Module System Duality, Graph Splitting, Transform Wrappers, Variable Collections, Config-Driven Examples.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.