determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.
Schedules and tracks machine learning training jobs across distributed compute clusters
Under the hood, the system uses 4 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 7-component ML training platform. 2,136 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Experiments begin when users submit YAML configs and training code to the master via the CLI. The master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, manages distributed coordination if needed, periodically saves checkpoints, and reports metrics back to the master. The master aggregates results and can spawn additional trials for hyperparameter tuning.
- Submit experiment configuration — User runs 'det experiment create config.yaml .', which submits the experiment config and training code to the master via the REST API (see the submission sketch after this list)
- Schedule trials on agents — Master validates experiment config, creates trial instances based on hyperparameter search settings, and assigns trials to available agents based on resource requirements [ExperimentConfig → TrialInfo]
- Launch training containers — Agents receive trial assignments and launch Docker containers with user code, harness library, and environment info serialized to filesystem paths like /run/determined/info/ [TrialInfo → RendezvousInfo]
- Initialize trial context — Harness loads TrialInfo, RendezvousInfo, and ClusterInfo from filesystem, creates EnvContext and TrialContext objects, and initializes user's Trial class or Core API hooks [RendezvousInfo → EnvContext]
- Execute training loop — TrialController manages training loop execution, calling user's train_batch() and evaluate() methods, handling distributed synchronization via RendezvousInfo, and periodically saving model state [EnvContext → Checkpoint]
- Report metrics and checkpoints — Harness sends training metrics, validation metrics, and checkpoint metadata to master via gRPC API. Master stores metrics in database and checkpoint paths in configured storage [Checkpoint]
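The submission step in the first stage can also be driven from Python. The sketch below assumes the determined Python SDK is installed and already pointed at a master; the config fields mirror the ExperimentConfig sections described under Data Models, `train.py` is a placeholder entrypoint, and exact required fields vary by Determined version.

```python
from determined.experimental import client

# Minimal single-trial experiment config; section names mirror the
# ExperimentConfig structure described below (resources, hyperparameters, searcher).
# Exact required fields depend on the Determined version in use.
config = {
    "name": "example-single-trial",
    "entrypoint": "python3 train.py",
    "resources": {"slots_per_trial": 1},
    "hyperparameters": {"lr": 0.001},
    "searcher": {"name": "single", "metric": "val_loss", "smaller_is_better": True},
}

# Roughly equivalent to `det experiment create config.yaml .`: the config plus the
# model directory are uploaded to the master, which then schedules the trials.
experiment = client.create_experiment(config=config, model_dir=".")
print(f"created experiment {experiment.id}")
```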
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ExperimentConfig (harness/determined/_experiment_config.py) — Dict[str, Any] with nested config sections like resources: {slots_per_trial: int}, hyperparameters: Dict, searcher: {name: str, metric: str}, optimizations: Dict
Created from YAML config files, validated by master, passed to harness for trial execution configuration
TrialInfo (harness/determined/_info.py) — Object with trial_id: str, experiment_id: str, trial_seed: int, hparams: Dict[str, Any], latest_checkpoint: Optional[str], steps_completed: int
Created by master when scheduling a trial, serialized to container filesystem, loaded by harness during trial initialization
RendezvousInfo (harness/determined/_info.py) — Object with container_addrs: List[str], container_rank: int, container_slot_counts: List[int] for distributed training coordination
Generated by master for multi-slot trials, written to container filesystem, consumed by harness to establish distributed training communication
Checkpoint (harness/determined/common/experimental/checkpoint/_checkpoint.py) — Dataclass with experiment_config: Dict[str, Any], experiment_id: int, trial_id: int, hparams: Dict[str, Any], validation_metrics: Dict[str, Any]
Created during trial execution, stored in configured checkpoint storage, tracked in master database, restored for trial resumption or inference
ClusterInfo (harness/determined/_info.py) — Object with cluster_id: str, cluster_name: Optional[str], master_url: str, master_cert_name: Optional[str] identifying the cluster
Set by master during container launch, used by harness to establish authenticated connections back to master
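On the harness side, these objects are exposed to training code through det.get_cluster_info(). The sketch below shows the kind of access the pipeline stages above describe; attribute names follow the data-model descriptions here and the public ClusterInfo API, and should be checked against the installed version.

```python
import determined as det

# Returns None when not running inside a Determined-launched container;
# otherwise it is populated from the JSON files under /run/determined/info/.
info = det.get_cluster_info()
assert info is not None, "not running on a Determined cluster"

# ClusterInfo: identity of the cluster and how to reach the master.
print(info.cluster_id, info.master_url)

# RendezvousInfo-style fields used for distributed training coordination.
print(info.container_addrs, info.container_rank)

# TrialInfo: per-trial identity, hyperparameters, and resume point.
trial = info.trial
print(trial.trial_id, trial.experiment_id, trial.hparams, trial.latest_checkpoint)
```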
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
User training code is importable from filesystem paths without validating module structure or Python path setup
If this fails: If user code has import dependencies not available in container PYTHONPATH or circular imports, the trial fails with confusing ImportError during initialization
harness/determined/__init__.py:import_from_path
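This is not the harness's import_from_path implementation, but a minimal illustration of the same pattern and where it breaks: loading a module from an explicit file path fails at exec time if the module's own imports are not on the container's PYTHONPATH.

```python
import importlib.util
import sys

def import_module_from_path(name: str, path: str):
    # Build an import spec directly from a filesystem path rather than sys.path.
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # This is the point where missing dependencies or circular imports inside the
    # user's code surface as a confusing ImportError during trial initialization.
    spec.loader.exec_module(module)
    return module
```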
The experiment_config Dict[str, Any] passed to EnvContext contains all required config sections like 'resources', 'hyperparameters', and 'searcher' with correct nested structure
If this fails: If master sends malformed config (missing required keys or wrong types), det.ExperimentConfig() construction silently succeeds but trials crash when accessing expected config values like experiment_config['resources']['slots_per_trial']
harness/determined/_env_context.py:EnvContext.__init__
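A small guard like the following (a hypothetical helper, not part of the harness) turns a late KeyError deep in the training loop into an immediate, descriptive failure at trial start-up:

```python
from typing import Any, Dict

def require(config: Dict[str, Any], *keys: str) -> Any:
    """Walk a nested config dict and fail loudly if a section is missing."""
    node: Any = config
    for i, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            missing = ".".join(keys[: i + 1])
            raise ValueError(f"experiment config is missing required key: {missing}")
        node = node[key]
    return node

# Example: validate the sections this assumption depends on, up front.
# slots = require(experiment_config, "resources", "slots_per_trial")
# metric = require(experiment_config, "searcher", "metric")
```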
Authentication credentials remain valid for the entire duration of a test session and master certificate hasn't changed
If this fails: Long-running e2e tests fail mid-execution with authentication errors if master rotates certificates or credentials expire, causing flaky test results
e2e_tests/tests/api_utils.py:make_session
The 'det agent list --json' command completes within default subprocess timeout and produces valid JSON output
If this fails: If cluster has hundreds of agents or network latency is high, the command times out and test setup fails without indicating whether it's a performance or connectivity issue
e2e_tests/tests/cluster/managed_cluster.py:get_agent_data
When stdin is not a character device, it contains valid YAML data that can be unmarshaled into map[string]interface{}
If this fails: If piped input contains malformed YAML or binary data, the template processing fails with log.Fatal() terminating the entire process instead of graceful error handling
master/cmd/determined-gotmpl/main.go:stdinData
The steps_completed value from TrialInfo represents the exact number of training steps completed and checkpoints are saved synchronously
If this fails: If trial crashes between completing a training step and saving checkpoint, resume from latest_checkpoint will repeat the last step, potentially causing incorrect learning rate scheduling or data shuffling
harness/determined/_env_context.py:steps_completed initialization
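One way to make the resume semantics explicit is to store the step counter inside the checkpoint and restore both together, so a crash between finishing a step and saving a checkpoint can only replay work, never skip it. A minimal Core API sketch, assuming the latest_checkpoint attribute described under Data Models and a hypothetical load_state helper:

```python
import json
import determined as det

def load_state(ckpt_dir):
    # Hypothetical helper: read back whatever the training loop saved alongside
    # the model weights, including the step counter.
    with open(ckpt_dir / "state.json") as f:
        return json.load(f)

info = det.get_cluster_info()
assert info is not None, "not running on a Determined cluster"
latest = info.trial.latest_checkpoint  # None on the first run of a trial

with det.core.init() as core_context:
    steps_completed = 0
    if latest is not None:
        # restore_path fetches the checkpoint and yields a local directory.
        with core_context.checkpoint.restore_path(latest) as ckpt_dir:
            steps_completed = load_state(ckpt_dir)["steps_completed"]
    # ... resume the training loop from steps_completed ...
```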
slot_ids array indices correspond to container_gpus array indices, establishing GPU device mapping for distributed training
If this fails: If master provides mismatched array lengths or out-of-order mappings, workers in distributed training bind to wrong GPU devices, causing CUDA errors or suboptimal performance
harness/determined/_env_context.py:slot_ids and container_gpus
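This invariant is easy to assert explicitly. The helper below is not harness code; it simply makes the positional slot-to-GPU contract visible, using the attribute names cited above:

```python
from typing import Dict, List

def slot_to_gpu_mapping(slot_ids: List[str], container_gpus: List[str]) -> Dict[str, str]:
    # The contract is purely positional: slot_ids[i] is assumed to own container_gpus[i].
    if len(slot_ids) != len(container_gpus):
        raise ValueError(
            f"slot_ids has {len(slot_ids)} entries but container_gpus has "
            f"{len(container_gpus)}; the index-based GPU mapping would be wrong"
        )
    return dict(zip(slot_ids, container_gpus))
```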
Go source files being parsed fit entirely in memory and don't exceed parser's internal limits
If this fails: For very large generated code files (hundreds of thousands of lines), the AST parser runs out of memory or fails, breaking the build process
master/cmd/stream-gen/main.go:parser.ParseFiles
The command-line argument slice os.Args always contains at least one element (the program name) before it is manipulated
If this fails: If agent is invoked through unusual process spawning that provides empty os.Args, the index access os.Args[0] panics with index out of range
agent/cmd/determined-agent/main.go:maybeInjectRootAlias
The session singleton can be started successfully and external dependencies for performance testing are available
If this fails: If performance testing infrastructure (databases, monitoring tools) is unavailable, TestProgramWithConfig.pre() fails but error handling uses sys.exit() without cleanup, leaving partial test state
performance/daist/daist/framework/main.py:session.start()
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Master Database — PostgreSQL database storing experiment metadata, trial results, user accounts, and checkpoint references
- Checkpoint Storage — Configured storage backend (S3, GCS, shared filesystem) where model checkpoints and training artifacts are persistently stored
- Container Info Files — JSON files written by the master containing trial configuration, rendezvous info, and cluster details for harness consumption
Feedback Loops
- Trial Retry Loop (retry, balancing) — Trigger: Container failure or trial error reported to master. Action: Master reschedules failed trial on different agent with exponential backoff. Exit: Trial succeeds or maximum retry count exceeded.
- Hyperparameter Search Loop (training-loop, reinforcing) — Trigger: Trial completion with validation metrics. Action: Master's searcher algorithm generates new hyperparameter combinations and spawns additional trials. Exit: Search algorithm converges or maximum trials reached.
- Agent Health Check Loop (polling, balancing) — Trigger: Master periodically pings agents. Action: Agents send heartbeat responses with resource availability and container status. Exit: Never; runs continuously while the cluster is active.
- Metrics Reporting Loop (training-loop, reinforcing) — Trigger: Training batch completion in harness. Action: Harness sends training metrics to master, master updates database and notifies web UI via websockets. Exit: Trial completion or termination.
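From the harness side, the metrics reporting loop is driven by explicit report calls. A minimal sketch using the Core API (train_one_batch and validate are hypothetical stand-ins for user code; reporting signatures should be checked against the installed SDK):

```python
import determined as det

def train_one_batch() -> float:
    return 0.1  # hypothetical user training step

def validate() -> float:
    return 0.2  # hypothetical validation pass

with det.core.init() as core_context:
    for batch in range(100):
        loss = train_one_batch()
        if (batch + 1) % 10 == 0:
            # Each report feeds the loop described above: the master stores the
            # metrics in Postgres and pushes updates to the web UI.
            core_context.train.report_training_metrics(
                steps_completed=batch + 1, metrics={"loss": loss}
            )
    # Validation metrics also drive the hyperparameter search loop.
    core_context.train.report_validation_metrics(
        steps_completed=100, metrics={"val_loss": validate()}
    )
```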
Delays
- Container Startup Delay (async-processing, ~10-60 seconds) — Time between trial scheduling and actual training start due to image pulling and container initialization
- Checkpoint Save Delay (async-processing, duration varies by model size) — Training pauses while model state is serialized and uploaded to checkpoint storage
- Distributed Training Synchronization (async-processing, ~milliseconds to seconds) — Workers wait for gradient synchronization before proceeding to next training step
Control Points
- Resource Pool Configuration (architecture-switch) — Controls: Which agents and GPU types are available for different experiment types. Default: defined in master config YAML
- Checkpoint Frequency (hyperparameter) — Controls: How often trials save model checkpoints during training. Default: min_validation_period config setting
- Distributed Training Backend (architecture-switch) — Controls: Whether to use Horovod, NCCL, or other communication backend for multi-GPU training. Default: detected based on framework and cluster setup
- Profiling Mode (feature-flag) — Controls: Whether to collect detailed performance profiling data during training. Default: profiling.enabled config flag
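Most of these control points map to fields in the experiment or master configuration. An illustrative fragment follows, expressed as a Python dict for readability; key names follow Determined's config schema as described above but exact spellings vary across versions.

```python
# Illustrative only: a fragment of an experiment config exercising the control
# points listed above. Validate key names against the Determined version in use.
experiment_config_fragment = {
    "resources": {"resource_pool": "default"},     # resource pool selection
    "min_validation_period": {"batches": 1000},    # validation/checkpoint cadence
    "profiling": {"enabled": True},                # profiling feature flag
    # The distributed training backend is typically selected automatically from
    # the framework and slots_per_trial rather than set by a single key.
}
```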
Technology Stack
- Go — Implementation language for the master service and agents, providing concurrent request handling and efficient resource management
- Python — Implementation language for the harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows
- PostgreSQL — Primary database for storing experiment metadata, trial results, user accounts, and cluster state
- Docker — Container runtime for isolating and executing user training workloads on agent nodes
- gRPC/Protocol Buffers — API protocol for communication between harness and master, and between master and CLI/web UI
- Echo — HTTP web framework in the master service providing REST APIs and WebSocket connections for the web UI
- Kubernetes — Optional container orchestration platform that can serve as the resource manager backend instead of native agents
Key Components
- determined-master (orchestrator) — Central coordinator that schedules experiments, manages resource pools, tracks agent health, and provides APIs for web UI and CLI interactions
  master/cmd/determined-master/main.go
- determined-agent (executor) — Worker daemon that runs on compute nodes, manages containers, reports resource availability, and executes training workloads assigned by master
  agent/cmd/determined-agent/main.go
- TrialController (orchestrator) — Coordinates the execution lifecycle of a single trial within a container, managing training loops, checkpointing, metrics reporting, and distributed training synchronization
  harness/determined/_trial_controller.py
- EnvContext (adapter) — Encapsulates all environment information needed by a trial including cluster connection details, resource allocation, and experiment configuration
  harness/determined/_env_context.py
- Core API (adapter) — Minimal integration layer that allows users to add Determined functionality to existing training code without full framework adoption
  harness/determined/core/__init__.py
- ManagedCluster (orchestrator) — Test harness that programmatically starts, stops, and manages Determined clusters for end-to-end testing scenarios
  e2e_tests/tests/cluster/managed_cluster.py
- Session (adapter) — Manages authenticated API connections to Determined master for test scenarios, handling login and certificate validation (see the session sketch after this list)
  e2e_tests/tests/api_utils.py
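The Session helper used by the e2e tests has a public counterpart in the Python SDK. A hedged sketch of establishing an authenticated connection and inspecting an experiment; the master URL, credentials, and experiment ID are placeholders, and the call names are taken from the SDK as commonly documented, so verify them against the installed version.

```python
from determined.experimental import client

# Placeholder master URL and credentials; in the e2e tests this is handled by
# api_utils.make_session against a managed cluster.
client.login(master="http://localhost:8080", user="determined", password="")

experiment = client.get_experiment(42)   # placeholder experiment ID
best = experiment.top_checkpoint()       # best checkpoint by the searcher metric
print(best.uuid)
```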
Package Structure
- master — The central orchestrator service that manages experiments, resource allocation, and cluster state
- harness — Python library that integrates with user training code to enable distributed training, checkpointing, and experiment tracking
- e2e_tests — End-to-end tests that validate the platform by running real experiments through the complete system
Frequently Asked Questions
What is determined used for?
Determined schedules and tracks machine learning training jobs across distributed compute clusters. determined-ai/determined is a 7-component ML training platform written mainly in Go with a Python harness library, and data flows through 6 distinct pipeline stages. The codebase contains 2,136 files.
How is determined architected?
determined is organized into 4 architecture layers: Cluster Management, Experiment Runtime, User Integration, Validation & Testing. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through determined?
Data moves through 6 stages: Submit experiment configuration → Schedule trials on agents → Launch training containers → Initialize trial context → Execute training loop → Report metrics and checkpoints. Experiments begin when users submit YAML configs and training code to the master via the CLI. The master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, manages distributed coordination if needed, periodically saves checkpoints, and reports metrics back to the master. The master aggregates results and can spawn additional trials for hyperparameter tuning. The pipeline keeps submission, scheduling, container launch, trial execution, and reporting as distinct stages.
What technologies does determined use?
The core stack includes Go (Implementation language for master service and agents, providing concurrent request handling and efficient resource management), Python (Implementation language for harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows), PostgreSQL (Primary database for storing experiment metadata, trial results, user accounts, and cluster state), Docker (Container runtime for isolating and executing user training workloads on agent nodes), gRPC/Protocol Buffers (API protocol for communication between harness and master, and between master and CLI/web UI), Echo (HTTP web framework in the master service providing REST APIs and WebSocket connections for web UI), and Kubernetes (optional container orchestration backend). A focused set of dependencies that keeps the build manageable.
What system dynamics does determined have?
determined exhibits 3 data pools (Master Database, Checkpoint Storage, Container Info Files), 4 feedback loops, 4 control points, and 3 delays. The feedback loops cover retry, polling, and training-loop behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does determined use?
4 design patterns detected: Master-Agent Coordination, Plugin Architecture, Configuration-Driven Behavior, Filesystem-Based IPC.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.