determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Open-source distributed ML platform for training, hyperparameter tuning, and resource management
Under the hood, the system uses 3 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.
Structural Verdict
An 8-component ML training system with 4 connections. 2136 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
Experiments flow from user submission through master scheduling to agent execution with continuous metrics reporting back to master
- Experiment Submission — User submits experiment config via CLI or API to master
- Trial Scheduling — Master creates trials and schedules them on available agents based on resource requirements
- Trial Execution — Agent receives trial, starts container, and executes user training code with harness integration
- Metrics Collection — Training metrics and checkpoints stream back to master during execution
- Resource Management — Master reallocates resources based on scheduler policy and trial progress
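The scheduling stage can be illustrated with a minimal sketch. It assumes a simple first-fit placement of trials onto agents by free slot count. The `Agent`, `Trial`, and `schedule` names are hypothetical and are not taken from the Determined codebase, whose scheduler is more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    """Hypothetical compute node with a fixed number of slots."""
    name: str
    total_slots: int
    used_slots: int = 0

    @property
    def free_slots(self) -> int:
        return self.total_slots - self.used_slots


@dataclass
class Trial:
    """Hypothetical trial with a slot requirement."""
    trial_id: int
    slots_needed: int
    agent: Optional[str] = None


def schedule(trials: List[Trial], agents: List[Agent]) -> List[Trial]:
    """First-fit placement; returns the trials that are still waiting."""
    waiting = []
    for trial in trials:
        target = next(
            (a for a in agents if a.free_slots >= trial.slots_needed), None
        )
        if target is None:
            waiting.append(trial)  # no capacity yet, so the trial stays queued
            continue
        target.used_slots += trial.slots_needed
        trial.agent = target.name
    return waiting


agents = [Agent("agent-a", total_slots=2), Agent("agent-b", total_slots=4)]
trials = [Trial(1, 2), Trial(2, 4), Trial(3, 1)]
waiting = schedule(trials, agents)
```

In this example the first two trials fill both agents, and the third trial remains in the waiting list until slots are released.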
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- PostgreSQL storing experiment configs, trial states, metrics, and checkpoints
- Real-time metrics flowing from trials to master for progress tracking
- Current allocation of compute resources across agents and trials
Feedback Loops
- Trial Progress Monitoring (polling, balancing) — Trigger: Active trials. Action: Collect metrics and update progress. Exit: Trial completion or failure.
- Resource Reallocation (auto-scale, balancing) — Trigger: Resource demand changes. Action: Reschedule trials based on availability. Exit: Stable allocation.
- Agent Heartbeat (polling, balancing) — Trigger: Agent startup. Action: Send status updates to master. Exit: Agent shutdown.
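The heartbeat loop can be sketched from the master's side as a table of last-seen timestamps. This is an illustration only: `HeartbeatTracker`, its methods, and the timeout value are assumptions made for the example, not Determined's implementation. Timestamps are passed in explicitly so the behavior is deterministic.

```python
from typing import Dict, List


class HeartbeatTracker:
    """Hypothetical master-side record of agent status updates."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def beat(self, agent_id: str, now: float) -> None:
        """Record a status update from an agent (the loop's action)."""
        self.last_seen[agent_id] = now

    def shutdown(self, agent_id: str) -> None:
        """Forget an agent that shut down cleanly (the loop's exit)."""
        self.last_seen.pop(agent_id, None)

    def stale(self, now: float) -> List[str]:
        """Agents whose last update is older than the timeout."""
        return sorted(
            agent for agent, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=10.0)
tracker.beat("agent-a", now=0.0)
tracker.beat("agent-b", now=5.0)
stale_at_12 = tracker.stale(now=12.0)   # only agent-a has exceeded the timeout
tracker.shutdown("agent-a")
stale_after_shutdown = tracker.stale(now=12.0)
```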
Delays & Async Processing
- Trial Queue Wait (queue-drain, ~Variable based on resources) — Experiments wait for available compute slots
- Container Startup (async-processing, ~Image pull + initialization time) — Delay before trial execution begins
- Checkpoint Persistence (async-processing) — Background checkpoint saving doesn't block training
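Background checkpoint saving is commonly built on a worker thread fed by a queue, so the training loop only pays for enqueueing a snapshot. The sketch below assumes that design; `BackgroundCheckpointer` and the JSON file layout are hypothetical and do not describe how Determined stores checkpoints.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path


class BackgroundCheckpointer:
    """Hypothetical writer that persists snapshots off the training thread."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def save(self, step: int, state: dict) -> None:
        # Shallow-copy so later updates by the training loop do not
        # change the snapshot that is waiting to be written.
        self._queue.put((step, dict(state)))

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, state = item
            (self.directory / f"step-{step}.json").write_text(json.dumps(state))

    def close(self) -> None:
        """Flush pending checkpoints and stop the worker."""
        self._queue.put(None)
        self._worker.join()


with tempfile.TemporaryDirectory() as tmp:
    ckpt = BackgroundCheckpointer(Path(tmp))
    state = {"loss": 1.0}
    for step in range(3):
        state["loss"] /= 2          # stand-in for a training step
        ckpt.save(step, state)      # returns immediately
    ckpt.close()
    saved = sorted(p.name for p in Path(tmp).iterdir())
    first_loss = json.loads((Path(tmp) / "step-0.json").read_text())["loss"]
```

Copying the state at enqueue time is the design choice that matters here: without it, a slow writer could persist values from a later step.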
Control Points
- Scheduler Type (env-var) — Controls: Fair share vs priority vs round-robin scheduling algorithm
- Resource Pool Limits (runtime-toggle) — Controls: Maximum containers per agent and resource allocation
- Debug Mode (feature-flag) — Controls: Enhanced logging and debugging features
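As a rough illustration of the scheduler-type control point, the sketch below selects an ordering policy from an environment variable. The variable name `SKETCH_SCHEDULER_TYPE`, the `Job` fields, and the three deliberately crude policies are all assumptions for this example; they are not Determined's configuration keys or algorithms.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping


@dataclass
class Job:
    name: str
    priority: int    # assumption: a lower number runs first
    slots_held: int  # slots the job's owner already holds


def by_priority(jobs: List[Job]) -> List[Job]:
    return sorted(jobs, key=lambda j: j.priority)


def by_fair_share(jobs: List[Job]) -> List[Job]:
    # Crude fairness: whoever holds the fewest slots goes first.
    return sorted(jobs, key=lambda j: j.slots_held)


def by_round_robin(jobs: List[Job]) -> List[Job]:
    # Simplification: serve jobs in submission order.
    return list(jobs)


SCHEDULERS: Dict[str, Callable[[List[Job]], List[Job]]] = {
    "priority": by_priority,
    "fair_share": by_fair_share,
    "round_robin": by_round_robin,
}


def pick_scheduler(env: Mapping[str, str] = os.environ):
    """Look up the policy named by the (hypothetical) environment variable."""
    name = env.get("SKETCH_SCHEDULER_TYPE", "fair_share")
    if name not in SCHEDULERS:
        raise ValueError(f"unknown scheduler type: {name!r}")
    return SCHEDULERS[name]


jobs = [
    Job("a", priority=50, slots_held=4),
    Job("b", priority=10, slots_held=2),
    Job("c", priority=30, slots_held=0),
]
priority_order = [
    j.name for j in pick_scheduler({"SKETCH_SCHEDULER_TYPE": "priority"})(jobs)
]
fair_order = [
    j.name for j in pick_scheduler({"SKETCH_SCHEDULER_TYPE": "fair_share"})(jobs)
]
```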
Package Structure
This monorepo contains 5 packages:
- agent — Resource management agent that runs on compute nodes to execute ML workloads and communicate with the master.
- master — Central coordination service that manages experiments, schedules work, handles authentication, and provides the API/web interface.
- harness — Python library that provides APIs for users to integrate their ML code with Determined's training and experiment management.
- e2e_tests — End-to-end integration tests that validate the entire platform across different deployment scenarios.
- webui — Web dashboard for visualizing experiments, managing models, viewing metrics, and cluster administration.
Technology Stack
- Go — Backend services (master/agent)
- Python — ML integration library and user APIs
- TypeScript/React — Web dashboard
- PostgreSQL — Experiment metadata and state
- Docker — Container runtime for trials
- gRPC/Protobuf — Service communication
- Container orchestration
- Testing framework
Key Components
- determined-agent main (cli-command) — Entry point for the agent service that manages compute resources and executes trials
  agent/cmd/determined-agent/main.go
- determined-master main (cli-command) — Entry point for the master service that coordinates experiments and manages the cluster
  master/cmd/determined-master/main.go
- ExperimentConfig (class) — Configuration management for ML experiments including hyperparameters and resource settings
  harness/determined/_experiment_config.py
- TrialContext (class) — Runtime context for individual training trials providing metrics reporting and checkpointing
  harness/determined/_trial_context.py
- EnvContext (class) — Environment context containing cluster information, GPUs, and trial metadata
  harness/determined/_env_context.py
- ManagedCluster (class) — Test harness for managing local development clusters during integration testing
  e2e_tests/tests/cluster/managed_cluster.py
- api_utils (module) — Utility functions for API authentication and session management in tests
  e2e_tests/tests/api_utils.py
- stream-gen (cli-command) — Code generation tool for streaming API bindings between Go backend and TypeScript frontend
  master/cmd/stream-gen/main.go
Configuration
codecov.yml (yaml)
- codecov.require_ci_to_pass (boolean, unknown) — default: true
- codecov.notify.wait_for_ci (boolean, unknown) — default: false
- coverage.status.project.default.informational (boolean, unknown) — default: true
- coverage.status.project.backend.target (string, unknown) — default: 45%
- coverage.status.project.backend.threshold (string, unknown) — default: 3%
- coverage.status.project.backend.flags (array, unknown) — default: backend
- coverage.status.project.backend.informational (boolean, unknown) — default: false
- coverage.status.patch.default.informational (boolean, unknown) — default: true
- +11 more parameters
bindings/generate_bindings_ts.py (python-dataclass)
- commenting (bool, unknown) — default: field(default=False, init=False)
- indent_level (int, unknown) — default: field(default=0)
- tab_char (str, unknown) — default: field(default=" ")
bindings/swagger_parser.py (python-dataclass)
- name (str, unknown)
- type (TypeAnno, unknown)
- required (bool, unknown)
bindings/swagger_parser.py (python-dataclass)
- name (str, unknown)
Science Pipeline
- Load Experiment Config — Parse YAML config into ExperimentConfig dict
  harness/determined/_experiment_config.py
- Initialize Trial Context — Create runtime context with cluster info and hyperparameters [ExperimentConfig + cluster metadata → TrialContext object]
  harness/determined/_trial_context.py
- Execute Training Loop — User-defined training with metrics reporting hooks [Model + training data → Metrics + checkpoints]
  harness/determined/_trial.py
- Collect Metrics — Stream training/validation metrics to master [Dict[str, float] → Stored metrics in database]
  harness/determined/_trial_context.py
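The four pipeline steps can be traced end to end in a small, self-contained sketch. It is not the Determined harness API: `SketchTrialContext`, `init_context`, and `train` are hypothetical names, a dict literal stands in for the parsed YAML config, and an in-memory list stands in for the stream of metrics to the master.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchTrialContext:
    """Hypothetical stand-in for a trial's runtime context."""
    hparams: Dict[str, float]
    slots: int
    reported: List[Dict[str, float]] = field(default_factory=list)

    def report_metrics(self, step: int, metrics: Dict[str, float]) -> None:
        # A real system would send this to the master; here it is just recorded.
        self.reported.append({"step": float(step), **metrics})


def init_context(config: dict, cluster: dict) -> SketchTrialContext:
    """Combine the experiment config with cluster metadata."""
    return SketchTrialContext(
        hparams=dict(config["hyperparameters"]), slots=cluster["slots"]
    )


def train(ctx: SketchTrialContext, steps: int) -> float:
    """Toy training loop: gradient descent on (w - 3)^2."""
    w = 0.0
    lr = ctx.hparams["learning_rate"]
    for step in range(1, steps + 1):
        grad = 2 * (w - 3.0)
        w -= lr * grad
        ctx.report_metrics(step, {"loss": (w - 3.0) ** 2})
    return w


# Step 1: the parsed config (a dict literal in place of a YAML file).
config = {"hyperparameters": {"learning_rate": 0.25}}
# Step 2: build the context from config plus cluster metadata.
ctx = init_context(config, cluster={"slots": 1})
# Steps 3 and 4: run the loop, which reports metrics as it goes.
final_w = train(ctx, steps=3)
losses = [m["loss"] for m in ctx.reported]
```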
Assumptions & Constraints
- [info] Assumes NVIDIA GPUs are available and accessible via UUID, but falls back gracefully to CPU-only (device)
- [warning] Assumes slots_per_trial is positive integer but no validation enforces this constraint (value-range)
- [critical] Assumes container_gpus list matches slot_ids list length for GPU assignment (format)
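The two unchecked assumptions about `slots_per_trial` and the `container_gpus`/`slot_ids` lists could be guarded with explicit checks along the following lines. This is a sketch of one possible validation, not code from the repository; the function names and the example GPU identifiers are hypothetical.

```python
from typing import List, Sequence, Tuple


def validate_slots_per_trial(value: object) -> int:
    """Reject anything that is not a positive integer."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(
            f"slots_per_trial must be a positive integer, got {value!r}"
        )
    return value


def pair_gpus_with_slots(
    container_gpus: Sequence[str], slot_ids: Sequence[int]
) -> List[Tuple[int, str]]:
    """Pair each slot with a GPU, failing loudly on a length mismatch."""
    if len(container_gpus) != len(slot_ids):
        raise ValueError(
            f"{len(container_gpus)} GPUs for {len(slot_ids)} slots; "
            "lengths must match"
        )
    return list(zip(slot_ids, container_gpus))


pairs = pair_gpus_with_slots(["GPU-aaa", "GPU-bbb"], [0, 1])
```

Failing early matters most for the length check, because `zip` on its own silently truncates to the shorter list and would leave some slots without a GPU.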
Frequently Asked Questions
What is determined used for?
determined is an open-source distributed ML platform for training, hyperparameter tuning, and resource management. determined-ai/determined is an 8-component ML training system written in Go. Loosely coupled — components are relatively independent. The codebase contains 2136 files.
How is determined architected?
determined is organized into 4 architecture layers: User Interface, Coordination Layer, Execution Layer, Infrastructure. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through determined?
Data moves through 5 stages: Experiment Submission → Trial Scheduling → Trial Execution → Metrics Collection → Resource Management. Experiments flow from user submission through master scheduling to agent execution, with continuous metrics reporting back to master. This pipeline design reflects a complex multi-stage processing system.
What technologies does determined use?
The core stack includes Go (Backend services (master/agent)), Python (ML integration library and user APIs), TypeScript/React (Web dashboard), PostgreSQL (Experiment metadata and state), Docker (Container runtime for trials), gRPC/Protobuf (Service communication), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does determined have?
determined exhibits 3 data pools (including Experiment Database and Trial Metrics Stream), 3 feedback loops, 3 control points, and 3 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does determined use?
4 design patterns detected: Master-Agent Architecture, Trial-Based Execution, Context Injection, Plugin Architecture.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.