determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.

3,220 stars · Go · 7 components

Schedules and tracks machine learning training jobs across distributed compute clusters

Experiments begin when users submit YAML configs and training code to the master via the CLI. The master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, manages distributed coordination if needed, and periodically saves checkpoints and reports metrics back to the master. The master aggregates results and can spawn additional trials for hyperparameter tuning.

Under the hood, the system uses 4 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.

A 7-component ML training platform. 2,136 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

  1. Submit experiment configuration — User runs 'det experiment create config.yaml .' which submits experiment config and training code to master via REST API
  2. Schedule trials on agents — Master validates experiment config, creates trial instances based on hyperparameter search settings, and assigns trials to available agents based on resource requirements [ExperimentConfig → TrialInfo]
  3. Launch training containers — Agents receive trial assignments and launch Docker containers with user code, harness library, and environment info serialized to filesystem paths like /run/determined/info/ [TrialInfo → RendezvousInfo]
  4. Initialize trial context — Harness loads TrialInfo, RendezvousInfo, and ClusterInfo from filesystem, creates EnvContext and TrialContext objects, and initializes user's Trial class or Core API hooks [RendezvousInfo → EnvContext]
  5. Execute training loop — TrialController manages training loop execution, calling user's train_batch() and evaluate() methods, handling distributed synchronization via RendezvousInfo, and periodically saving model state [EnvContext → Checkpoint]
  6. Report metrics and checkpoints — Harness sends training metrics, validation metrics, and checkpoint metadata to master via gRPC API. Master stores metrics in database and checkpoint paths in configured storage [Checkpoint]
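The first two stages can be sketched in miniature. The config sections below follow the ExperimentConfig description in the Data Models section (resources, hyperparameters, searcher); the field values and the validate helper are illustrative, not the master's actual schema or validation logic:

```python
# Hypothetical minimal experiment config, mirroring the sections the
# master is described as validating before scheduling trials.
minimal_config = {
    "name": "mnist-demo",  # illustrative experiment name
    "resources": {"slots_per_trial": 1},
    "hyperparameters": {"learning_rate": 0.01},
    "searcher": {"name": "single", "metric": "validation_loss"},
}

REQUIRED_SECTIONS = ("resources", "hyperparameters", "searcher")

def validate(config: dict) -> list:
    """Return the required top-level sections missing from a config dict."""
    return [s for s in REQUIRED_SECTIONS if s not in config]

print(validate(minimal_config))  # []
```

A config passing this kind of check would then be submitted with `det experiment create config.yaml .`, as in stage 1 above.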

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ExperimentConfig harness/determined/_experiment_config.py
Dict[str, Any] with nested config sections like resources: {slots_per_trial: int}, hyperparameters: Dict, searcher: {name: str, metric: str}, optimizations: Dict
Created from YAML config files, validated by master, passed to harness for trial execution configuration
TrialInfo harness/determined/_info.py
Object with trial_id: str, experiment_id: str, trial_seed: int, hparams: Dict[str, Any], latest_checkpoint: Optional[str], steps_completed: int
Created by master when scheduling a trial, serialized to container filesystem, loaded by harness during trial initialization
RendezvousInfo harness/determined/_info.py
Object with container_addrs: List[str], container_rank: int, container_slot_counts: List[int] for distributed training coordination
Generated by master for multi-slot trials, written to container filesystem, consumed by harness to establish distributed training communication
Checkpoint harness/determined/common/experimental/checkpoint/_checkpoint.py
Dataclass with experiment_config: Dict[str, Any], experiment_id: int, trial_id: int, hparams: Dict[str, Any], validation_metrics: Dict[str, Any]
Created during trial execution, stored in configured checkpoint storage, tracked in master database, restored for trial resumption or inference
ClusterInfo harness/determined/_info.py
Object with cluster_id: str, cluster_name: Optional[str], master_url: str, master_cert_name: Optional[str] identifying the cluster
Set by master during container launch, used by harness to establish authenticated connections back to master
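The three info objects above can be pictured as plain dataclasses. This is a hedged sketch based on the field lists just described; the real classes live in harness/determined/_info.py and may differ in structure and behavior:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class TrialInfo:
    trial_id: str
    experiment_id: str
    trial_seed: int
    hparams: Dict[str, Any]
    latest_checkpoint: Optional[str] = None
    steps_completed: int = 0

@dataclass
class RendezvousInfo:
    container_addrs: List[str]
    container_rank: int
    container_slot_counts: List[int]

@dataclass
class ClusterInfo:
    cluster_id: str
    master_url: str
    cluster_name: Optional[str] = None
    master_cert_name: Optional[str] = None

# The master serializes objects like these to JSON files under
# /run/determined/info/ for the harness to load at container startup.
rdzv = RendezvousInfo(["10.0.0.1", "10.0.0.2"], container_rank=0,
                      container_slot_counts=[4, 4])
print(rdzv.container_addrs[rdzv.container_rank])  # 10.0.0.1
```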

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment unguarded

User training code is importable from filesystem paths without validating module structure or Python path setup

If this fails: If user code has import dependencies not available in container PYTHONPATH or circular imports, the trial fails with confusing ImportError during initialization

harness/determined/__init__.py:import_from_path
critical Contract weakly guarded

The experiment_config Dict[str, Any] passed to EnvContext contains all required config sections like 'resources', 'hyperparameters', and 'searcher' with correct nested structure

If this fails: If master sends malformed config (missing required keys or wrong types), det.ExperimentConfig() construction silently succeeds but trials crash when accessing expected config values like experiment_config['resources']['slots_per_trial']

harness/determined/_env_context.py:EnvContext.__init__
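The guard that assumption lacks could look like the following sketch: walk the nested keys up front and fail with the full dotted path, instead of crashing later on a bare KeyError deep in the training loop. The require helper is hypothetical, not Determined API:

```python
from typing import Any, Dict

def require(config: Dict[str, Any], *keys: str) -> Any:
    """Walk nested config keys, raising a descriptive error on the first miss."""
    node: Any = config
    seen = []
    for key in keys:
        seen.append(key)
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"experiment config missing {'.'.join(seen)}")
        node = node[key]
    return node

config = {"resources": {"slots_per_trial": 2}}
print(require(config, "resources", "slots_per_trial"))  # 2
```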
warning Temporal unguarded

Authentication credentials remain valid for the entire duration of a test session and master certificate hasn't changed

If this fails: Long-running e2e tests fail mid-execution with authentication errors if master rotates certificates or credentials expire, causing flaky test results

e2e_tests/tests/api_utils.py:make_session
warning Resource unguarded

The 'det agent list --json' command completes within default subprocess timeout and produces valid JSON output

If this fails: If cluster has hundreds of agents or network latency is high, the command times out and test setup fails without indicating whether it's a performance or connectivity issue

e2e_tests/tests/cluster/managed_cluster.py:get_agent_data
warning Environment weakly guarded

When stdin is not a character device, it contains valid YAML data that can be unmarshaled into map[string]interface{}

If this fails: If piped input contains malformed YAML or binary data, the template processing fails with log.Fatal() terminating the entire process instead of graceful error handling

master/cmd/determined-gotmpl/main.go:stdinData
warning Ordering unguarded

The steps_completed value from TrialInfo represents the exact number of training steps completed and checkpoints are saved synchronously

If this fails: If trial crashes between completing a training step and saving checkpoint, resume from latest_checkpoint will repeat the last step, potentially causing incorrect learning rate scheduling or data shuffling

harness/determined/_env_context.py:steps_completed initialization
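One common mitigation for that ordering hazard, not necessarily what the harness does, is to record steps_completed inside the checkpoint payload and publish it with an atomic rename, so a resume always sees a step count consistent with the saved state. A minimal sketch using local JSON files (real checkpoint storage is S3, GCS, or a shared filesystem):

```python
import json
import os
import tempfile

def save_checkpoint(dirpath: str, state: dict, steps_completed: int) -> str:
    """Write model state plus step count, then atomically rename into place."""
    payload = {"steps_completed": steps_completed, "state": state}
    fd, tmp = tempfile.mkstemp(dir=dirpath, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    final = os.path.join(dirpath, f"ckpt-{steps_completed}.json")
    os.replace(tmp, final)  # atomic on POSIX: readers never see a partial file
    return final

def resume_step(path: str) -> int:
    """Recover the step count recorded with, not alongside, the state."""
    with open(path) as f:
        return json.load(f)["steps_completed"]
```

Because the step count travels inside the same atomically published file as the state, a crash between step completion and checkpointing loses at most the unsaved step; it can never leave a checkpoint whose recorded step count disagrees with its contents.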
critical Domain unguarded

slot_ids array indices correspond to container_gpus array indices, establishing GPU device mapping for distributed training

If this fails: If master provides mismatched array lengths or out-of-order mappings, workers in distributed training bind to wrong GPU devices, causing CUDA errors or suboptimal performance

harness/determined/_env_context.py:slot_ids and container_gpus
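The missing guard here is a length check before zipping the two arrays. A sketch (field names follow the description above; the real EnvContext may structure this differently):

```python
from typing import Dict, List

def bind_gpus(slot_ids: List[int], container_gpus: List[str]) -> Dict[int, str]:
    """Pair slot ids with GPU device ids, refusing mismatched lengths.

    zip() would silently truncate the longer list; failing loudly here
    turns a wrong-device CUDA error into a clear scheduling bug report.
    """
    if len(slot_ids) != len(container_gpus):
        raise ValueError(
            f"slot/GPU mismatch: {len(slot_ids)} slots vs "
            f"{len(container_gpus)} GPUs"
        )
    return dict(zip(slot_ids, container_gpus))

print(bind_gpus([0, 1], ["GPU-aaa", "GPU-bbb"]))  # {0: 'GPU-aaa', 1: 'GPU-bbb'}
```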
info Scale unguarded

Go source files being parsed fit entirely in memory and don't exceed parser's internal limits

If this fails: For very large generated code files (hundreds of thousands of lines), the AST parser runs out of memory or fails, breaking the build process

master/cmd/stream-gen/main.go:parser.ParseFiles
warning Contract weakly guarded

Command line arguments os.Args always contains at least one element (the program name) before manipulation

If this fails: If agent is invoked through unusual process spawning that provides empty os.Args, the index access os.Args[0] panics with index out of range

agent/cmd/determined-agent/main.go:maybeInjectRootAlias
warning Environment unguarded

The session singleton can be started successfully and external dependencies for performance testing are available

If this fails: If performance testing infrastructure (databases, monitoring tools) is unavailable, TestProgramWithConfig.pre() fails but error handling uses sys.exit() without cleanup, leaving partial test state

performance/daist/daist/framework/main.py:session.start()

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Master Database (database)
PostgreSQL database storing experiment metadata, trial results, user accounts, and checkpoint references
Checkpoint Storage (file-store)
Configured storage backend (S3, GCS, shared filesystem) where model checkpoints and training artifacts are persistently stored
Container Filesystem Info (file-store)
JSON files written by master containing trial configuration, rendezvous info, and cluster details for harness consumption

Technology Stack

Go (runtime)
Implementation language for master service and agents, providing concurrent request handling and efficient resource management
Python (runtime)
Implementation language for harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows
PostgreSQL (database)
Primary database for storing experiment metadata, trial results, user accounts, and cluster state
Docker (runtime)
Container runtime for isolating and executing user training workloads on agent nodes
gRPC/Protocol Buffers (framework)
API protocol for communication between harness and master, and between master and CLI/web UI
Echo (framework)
HTTP web framework in the master service providing REST APIs and WebSocket connections for web UI
Kubernetes (infra)
Optional container orchestration platform that can serve as the resource manager backend instead of native agents

Key Components

Package Structure

master (app)
The central orchestrator service that manages experiments, resource allocation, and cluster state.
harness (library)
Python library that integrates with user training code to enable distributed training, checkpointing, and experiment tracking.
e2e_tests (tooling)
End-to-end tests that validate the platform by running real experiments through the complete system.

Frequently Asked Questions

What is determined used for?

determined-ai/determined schedules and tracks machine learning training jobs across distributed compute clusters. It is a 7-component ML training platform written primarily in Go, with a Python harness library. Data flows through 6 distinct pipeline stages, and the codebase contains 2,136 files.

How is determined architected?

determined is organized into 4 architecture layers: Cluster Management, Experiment Runtime, User Integration, Validation & Testing. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through determined?

Data moves through 6 stages: Submit experiment configuration → Schedule trials on agents → Launch training containers → Initialize trial context → Execute training loop → Report metrics and checkpoints. Users submit YAML configs and training code to the master via the CLI; the master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, coordinates distributed work where needed, and periodically saves checkpoints and reports metrics back to the master, which aggregates results and can spawn additional trials for hyperparameter tuning.

What technologies does determined use?

The core stack includes Go (implementation language for the master service and agents, providing concurrent request handling and efficient resource management), Python (implementation language for the harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows), PostgreSQL (primary database for experiment metadata, trial results, user accounts, and cluster state), Docker (container runtime for isolating and executing user training workloads on agent nodes), gRPC/Protocol Buffers (API protocol between harness and master, and between master and the CLI/web UI), Echo (HTTP web framework in the master providing REST APIs and WebSocket connections for the web UI), and Kubernetes (optional container orchestration backend that can replace native agents). A focused set of dependencies that keeps the build manageable.

What system dynamics does determined have?

determined exhibits 3 data pools (Master Database, Checkpoint Storage, Container Filesystem Info), 4 feedback loops, 4 control points, and 3 delays. The feedback loops handle retry and the training loop. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does determined use?

4 design patterns detected: Master-Agent Coordination, Plugin Architecture, Configuration-Driven Behavior, Filesystem-Based IPC.

Analyzed on April 20, 2026 by CodeSea.