determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.
Schedules and tracks machine learning training jobs across distributed compute clusters
Under the hood, the system uses 4 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 7-component ML training platform. 2,136 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Experiments begin when users submit YAML configs and training code to the master via the CLI. The master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, manages distributed coordination if needed, periodically saves checkpoints, and reports metrics back to the master. The master aggregates results and can spawn additional trials for hyperparameter tuning.
- Submit experiment configuration — User runs 'det experiment create config.yaml .', which submits the experiment config and training code to the master via the REST API (see the submission sketch after this list)
- Schedule trials on agents — Master validates experiment config, creates trial instances based on hyperparameter search settings, and assigns trials to available agents based on resource requirements [ExperimentConfig → TrialInfo]
- Launch training containers — Agents receive trial assignments and launch Docker containers with user code, harness library, and environment info serialized to filesystem paths like /run/determined/info/ [TrialInfo → RendezvousInfo]
- Initialize trial context — Harness loads TrialInfo, RendezvousInfo, and ClusterInfo from filesystem, creates EnvContext and TrialContext objects, and initializes user's Trial class or Core API hooks [RendezvousInfo → EnvContext]
- Execute training loop — TrialController manages training loop execution, calling user's train_batch() and evaluate() methods, handling distributed synchronization via RendezvousInfo, and periodically saving model state [EnvContext → Checkpoint]
- Report metrics and checkpoints — Harness sends training metrics, validation metrics, and checkpoint metadata to master via gRPC API. Master stores metrics in database and checkpoint paths in configured storage [Checkpoint]
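The submission step in the first stage can also be driven from Python. The sketch below assumes the determined Python SDK is installed and already pointed at a master; the config fields mirror the ExperimentConfig sections described under Data Models, `train.py` is a placeholder entrypoint, and exact required fields vary by Determined version.

```python
from determined.experimental import client

# Minimal single-trial experiment config; section names mirror the
# ExperimentConfig structure described below (resources, hyperparameters, searcher).
# Exact required fields depend on the Determined version in use.
config = {
    "name": "example-single-trial",
    "entrypoint": "python3 train.py",
    "resources": {"slots_per_trial": 1},
    "hyperparameters": {"lr": 0.001},
    "searcher": {"name": "single", "metric": "val_loss", "smaller_is_better": True},
}

# Roughly equivalent to `det experiment create config.yaml .`: the config plus the
# model directory are uploaded to the master, which then schedules the trials.
experiment = client.create_experiment(config=config, model_dir=".")
print(f"created experiment {experiment.id}")
```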
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ExperimentConfig (harness/determined/_experiment_config.py) — Dict[str, Any] with nested config sections like resources: {slots_per_trial: int}, hyperparameters: Dict, searcher: {name: str, metric: str}, optimizations: Dict
Created from YAML config files, validated by master, passed to harness for trial execution configuration
TrialInfo (harness/determined/_info.py) — Object with trial_id: str, experiment_id: str, trial_seed: int, hparams: Dict[str, Any], latest_checkpoint: Optional[str], steps_completed: int
Created by master when scheduling a trial, serialized to container filesystem, loaded by harness during trial initialization
RendezvousInfo (harness/determined/_info.py) — Object with container_addrs: List[str], container_rank: int, container_slot_counts: List[int] for distributed training coordination
Generated by master for multi-slot trials, written to container filesystem, consumed by harness to establish distributed training communication
Checkpoint (harness/determined/common/experimental/checkpoint/_checkpoint.py) — Dataclass with experiment_config: Dict[str, Any], experiment_id: int, trial_id: int, hparams: Dict[str, Any], validation_metrics: Dict[str, Any]
Created during trial execution, stored in configured checkpoint storage, tracked in master database, restored for trial resumption or inference
ClusterInfo (harness/determined/_info.py) — Object with cluster_id: str, cluster_name: Optional[str], master_url: str, master_cert_name: Optional[str] identifying the cluster
Set by master during container launch, used by harness to establish authenticated connections back to master
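On the harness side, these objects are exposed to training code through det.get_cluster_info(). The sketch below shows the kind of access the pipeline stages above describe; attribute names follow the data-model descriptions here and the public ClusterInfo API, and should be checked against the installed version.

```python
import determined as det

# Returns None when not running inside a Determined-launched container;
# otherwise it is populated from the JSON files under /run/determined/info/.
info = det.get_cluster_info()
assert info is not None, "not running on a Determined cluster"

# ClusterInfo: identity of the cluster and how to reach the master.
print(info.cluster_id, info.master_url)

# RendezvousInfo-style fields used for distributed training coordination.
print(info.container_addrs, info.container_rank)

# TrialInfo: per-trial identity, hyperparameters, and resume point.
trial = info.trial
print(trial.trial_id, trial.experiment_id, trial.hparams, trial.latest_checkpoint)
```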
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
User training code is importable from filesystem paths without validating module structure or Python path setup
If this fails: If user code has import dependencies not available in container PYTHONPATH or circular imports, the trial fails with confusing ImportError during initialization
harness/determined/__init__.py:import_from_path
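This is not the harness's import_from_path implementation, but a minimal illustration of the same pattern and where it breaks: loading a module from an explicit file path fails at exec time if the module's own imports are not on the container's PYTHONPATH.

```python
import importlib.util
import sys

def import_module_from_path(name: str, path: str):
    # Build an import spec directly from a filesystem path rather than sys.path.
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # This is the point where missing dependencies or circular imports inside the
    # user's code surface as a confusing ImportError during trial initialization.
    spec.loader.exec_module(module)
    return module
```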
The experiment_config Dict[str, Any] passed to EnvContext contains all required config sections like 'resources', 'hyperparameters', and 'searcher' with correct nested structure
If this fails: If master sends malformed config (missing required keys or wrong types), det.ExperimentConfig() construction silently succeeds but trials crash when accessing expected config values like experiment_config['resources']['slots_per_trial']
harness/determined/_env_context.py:EnvContext.__init__
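A small guard like the following (a hypothetical helper, not part of the harness) turns a late KeyError deep in the training loop into an immediate, descriptive failure at trial start-up:

```python
from typing import Any, Dict

def require(config: Dict[str, Any], *keys: str) -> Any:
    """Walk a nested config dict and fail loudly if a section is missing."""
    node: Any = config
    for i, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            missing = ".".join(keys[: i + 1])
            raise ValueError(f"experiment config is missing required key: {missing}")
        node = node[key]
    return node

# Example: validate the sections this assumption depends on, up front.
# slots = require(experiment_config, "resources", "slots_per_trial")
# metric = require(experiment_config, "searcher", "metric")
```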
Authentication credentials remain valid for the entire duration of a test session and master certificate hasn't changed
If this fails: Long-running e2e tests fail mid-execution with authentication errors if master rotates certificates or credentials expire, causing flaky test results
e2e_tests/tests/api_utils.py:make_session
The 'det agent list --json' command completes within default subprocess timeout and produces valid JSON output
If this fails: If cluster has hundreds of agents or network latency is high, the command times out and test setup fails without indicating whether it's a performance or connectivity issue
e2e_tests/tests/cluster/managed_cluster.py:get_agent_data
When stdin is not a character device, it contains valid YAML data that can be unmarshaled into map[string]interface{}
If this fails: If piped input contains malformed YAML or binary data, the template processing fails with log.Fatal() terminating the entire process instead of graceful error handling
master/cmd/determined-gotmpl/main.go:stdinData
The steps_completed value from TrialInfo represents the exact number of training steps completed and checkpoints are saved synchronously
If this fails: If trial crashes between completing a training step and saving checkpoint, resume from latest_checkpoint will repeat the last step, potentially causing incorrect learning rate scheduling or data shuffling
harness/determined/_env_context.py:steps_completed initialization
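One way to make the resume semantics explicit is to store the step counter inside the checkpoint and restore both together, so a crash between finishing a step and saving a checkpoint can only replay work, never skip it. A minimal Core API sketch, assuming the latest_checkpoint attribute described under Data Models and a hypothetical load_state helper:

```python
import json
import determined as det

def load_state(ckpt_dir):
    # Hypothetical helper: read back whatever the training loop saved alongside
    # the model weights, including the step counter.
    with open(ckpt_dir / "state.json") as f:
        return json.load(f)

info = det.get_cluster_info()
assert info is not None, "not running on a Determined cluster"
latest = info.trial.latest_checkpoint  # None on the first run of a trial

with det.core.init() as core_context:
    steps_completed = 0
    if latest is not None:
        # restore_path fetches the checkpoint and yields a local directory.
        with core_context.checkpoint.restore_path(latest) as ckpt_dir:
            steps_completed = load_state(ckpt_dir)["steps_completed"]
    # ... resume the training loop from steps_completed ...
```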
slot_ids array indices correspond to container_gpus array indices, establishing GPU device mapping for distributed training
If this fails: If master provides mismatched array lengths or out-of-order mappings, workers in distributed training bind to wrong GPU devices, causing CUDA errors or suboptimal performance
harness/determined/_env_context.py:slot_ids and container_gpus
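This invariant is easy to assert explicitly. The helper below is not harness code; it simply makes the positional slot-to-GPU contract visible, using the attribute names cited above:

```python
from typing import Dict, List

def slot_to_gpu_mapping(slot_ids: List[str], container_gpus: List[str]) -> Dict[str, str]:
    # The contract is purely positional: slot_ids[i] is assumed to own container_gpus[i].
    if len(slot_ids) != len(container_gpus):
        raise ValueError(
            f"slot_ids has {len(slot_ids)} entries but container_gpus has "
            f"{len(container_gpus)}; the index-based GPU mapping would be wrong"
        )
    return dict(zip(slot_ids, container_gpus))
```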
Go source files being parsed fit entirely in memory and don't exceed parser's internal limits
If this fails: For very large generated code files (hundreds of thousands of lines), the AST parser runs out of memory or fails, breaking the build process
master/cmd/stream-gen/main.go:parser.ParseFiles
The command-line argument slice os.Args always contains at least one element (the program name) before it is manipulated
If this fails: If agent is invoked through unusual process spawning that provides empty os.Args, the index access os.Args[0] panics with index out of range
agent/cmd/determined-agent/main.go:maybeInjectRootAlias
The session singleton can be started successfully and external dependencies for performance testing are available
If this fails: If performance testing infrastructure (databases, monitoring tools) is unavailable, TestProgramWithConfig.pre() fails but error handling uses sys.exit() without cleanup, leaving partial test state
performance/daist/daist/framework/main.py:session.start()
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Master Database — PostgreSQL database storing experiment metadata, trial results, user accounts, and checkpoint references
- Checkpoint Storage — Configured storage backend (S3, GCS, shared filesystem) where model checkpoints and training artifacts are persistently stored
- Container Info Files — JSON files written by the master containing trial configuration, rendezvous info, and cluster details for harness consumption
Feedback Loops
- Trial Retry Loop (retry, balancing) — Trigger: Container failure or trial error reported to master. Action: Master reschedules failed trial on different agent with exponential backoff. Exit: Trial succeeds or maximum retry count exceeded.
- Hyperparameter Search Loop (training-loop, reinforcing) — Trigger: Trial completion with validation metrics. Action: Master's searcher algorithm generates new hyperparameter combinations and spawns additional trials. Exit: Search algorithm converges or maximum trials reached.
- Agent Health Check Loop (polling, balancing) — Trigger: Master periodically pings agents. Action: Agents send heartbeat responses with resource availability and container status. Exit: Never; runs continuously while the cluster is active.
- Metrics Reporting Loop (training-loop, reinforcing) — Trigger: Training batch completion in harness. Action: Harness sends training metrics to master, master updates database and notifies web UI via websockets. Exit: Trial completion or termination.
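From the harness side, the metrics reporting loop is driven by explicit report calls. A minimal sketch using the Core API (train_one_batch and validate are hypothetical stand-ins for user code; reporting signatures should be checked against the installed SDK):

```python
import determined as det

def train_one_batch() -> float:
    return 0.1  # hypothetical user training step

def validate() -> float:
    return 0.2  # hypothetical validation pass

with det.core.init() as core_context:
    for batch in range(100):
        loss = train_one_batch()
        if (batch + 1) % 10 == 0:
            # Each report feeds the loop described above: the master stores the
            # metrics in Postgres and pushes updates to the web UI.
            core_context.train.report_training_metrics(
                steps_completed=batch + 1, metrics={"loss": loss}
            )
    # Validation metrics also drive the hyperparameter search loop.
    core_context.train.report_validation_metrics(
        steps_completed=100, metrics={"val_loss": validate()}
    )
```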
Delays
- Container Startup Delay (async-processing, ~10-60 seconds) — Time between trial scheduling and actual training start due to image pulling and container initialization
- Checkpoint Save Delay (async-processing, duration varies by model size) — Training pauses while model state is serialized and uploaded to checkpoint storage
- Distributed Training Synchronization (async-processing, ~milliseconds to seconds) — Workers wait for gradient synchronization before proceeding to next training step
Control Points
- Resource Pool Configuration (architecture-switch) — Controls: Which agents and GPU types are available for different experiment types. Default: defined in master config YAML
- Checkpoint Frequency (hyperparameter) — Controls: How often trials save model checkpoints during training. Default: min_validation_period config setting
- Distributed Training Backend (architecture-switch) — Controls: Whether to use Horovod, NCCL, or other communication backend for multi-GPU training. Default: detected based on framework and cluster setup
- Profiling Mode (feature-flag) — Controls: Whether to collect detailed performance profiling data during training. Default: profiling.enabled config flag
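Most of these control points map to fields in the experiment or master configuration. An illustrative fragment follows, expressed as a Python dict for readability; key names follow Determined's config schema as described above but exact spellings vary across versions.

```python
# Illustrative only: a fragment of an experiment config exercising the control
# points listed above. Validate key names against the Determined version in use.
experiment_config_fragment = {
    "resources": {"resource_pool": "default"},     # resource pool selection
    "min_validation_period": {"batches": 1000},    # validation/checkpoint cadence
    "profiling": {"enabled": True},                # profiling feature flag
    # The distributed training backend is typically selected automatically from
    # the framework and slots_per_trial rather than set by a single key.
}
```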
Technology Stack
- Go — Implementation language for the master service and agents, providing concurrent request handling and efficient resource management
- Python — Implementation language for the harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows
- PostgreSQL — Primary database for storing experiment metadata, trial results, user accounts, and cluster state
- Docker — Container runtime for isolating and executing user training workloads on agent nodes
- gRPC/Protocol Buffers — API protocol for communication between harness and master, and between master and CLI/web UI
- Echo — HTTP web framework in the master service providing REST APIs and WebSocket connections for the web UI
- Kubernetes — Optional container orchestration platform that can serve as the resource manager backend instead of native agents
Key Components
- determined-master (orchestrator) — Central coordinator that schedules experiments, manages resource pools, tracks agent health, and provides APIs for web UI and CLI interactions
  master/cmd/determined-master/main.go
- determined-agent (executor) — Worker daemon that runs on compute nodes, manages containers, reports resource availability, and executes training workloads assigned by master
  agent/cmd/determined-agent/main.go
- TrialController (orchestrator) — Coordinates the execution lifecycle of a single trial within a container, managing training loops, checkpointing, metrics reporting, and distributed training synchronization
  harness/determined/_trial_controller.py
- EnvContext (adapter) — Encapsulates all environment information needed by a trial including cluster connection details, resource allocation, and experiment configuration
  harness/determined/_env_context.py
- Core API (adapter) — Minimal integration layer that allows users to add Determined functionality to existing training code without full framework adoption
  harness/determined/core/__init__.py
- ManagedCluster (orchestrator) — Test harness that programmatically starts, stops, and manages Determined clusters for end-to-end testing scenarios
  e2e_tests/tests/cluster/managed_cluster.py
- Session (adapter) — Manages authenticated API connections to Determined master for test scenarios, handling login and certificate validation (see the session sketch after this list)
  e2e_tests/tests/api_utils.py
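The Session helper used by the e2e tests has a public counterpart in the Python SDK. A hedged sketch of establishing an authenticated connection and inspecting an experiment; the master URL, credentials, and experiment ID are placeholders, and the call names are taken from the SDK as commonly documented, so verify them against the installed version.

```python
from determined.experimental import client

# Placeholder master URL and credentials; in the e2e tests this is handled by
# api_utils.make_session against a managed cluster.
client.login(master="http://localhost:8080", user="determined", password="")

experiment = client.get_experiment(42)   # placeholder experiment ID
best = experiment.top_checkpoint()       # best checkpoint by the searcher metric
print(best.uuid)
```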
Package Structure
- master — The central orchestrator service that manages experiments, resource allocation, and cluster state
- harness — Python library that integrates with user training code to enable distributed training, checkpointing, and experiment tracking
- e2e_tests — End-to-end tests that validate the platform by running real experiments through the complete system
Frequently Asked Questions
What is determined used for?
Determined schedules and tracks machine learning training jobs across distributed compute clusters. determined-ai/determined is a 7-component ML training platform written mainly in Go with a Python harness library, and data flows through 6 distinct pipeline stages. The codebase contains 2,136 files.
How is determined architected?
determined is organized into 4 architecture layers: Cluster Management, Experiment Runtime, User Integration, Validation & Testing. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through determined?
Data moves through 6 stages: Submit experiment configuration → Schedule trials on agents → Launch training containers → Initialize trial context → Execute training loop → Report metrics and checkpoints. Experiments begin when users submit YAML configs and training code to the master via the CLI. The master validates the config, schedules trials across available agents, and launches containers with the harness library. The harness initializes the user's training code, manages distributed coordination if needed, periodically saves checkpoints, and reports metrics back to the master. The master aggregates results and can spawn additional trials for hyperparameter tuning. The pipeline keeps submission, scheduling, container launch, trial execution, and reporting as distinct stages.
What technologies does determined use?
The core stack includes Go (Implementation language for master service and agents, providing concurrent request handling and efficient resource management), Python (Implementation language for harness library that integrates with user ML training code and existing PyTorch/TensorFlow workflows), PostgreSQL (Primary database for storing experiment metadata, trial results, user accounts, and cluster state), Docker (Container runtime for isolating and executing user training workloads on agent nodes), gRPC/Protocol Buffers (API protocol for communication between harness and master, and between master and CLI/web UI), Echo (HTTP web framework in the master service providing REST APIs and WebSocket connections for web UI), and Kubernetes (optional container orchestration backend). A focused set of dependencies that keeps the build manageable.
What system dynamics does determined have?
determined exhibits 3 data pools (Master Database, Checkpoint Storage, Container Info Files), 4 feedback loops, 4 control points, and 3 delays. The feedback loops cover retry, polling, and training-loop behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does determined use?
4 design patterns detected: Master-Agent Coordination, Plugin Architecture, Configuration-Driven Behavior, Filesystem-Based IPC.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.