determined-ai/determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.

3,216 stars · Go · 8 components · 4 connections

Open-source distributed ML platform for training, hyperparameter tuning, and resource management

Experiments flow from user submission through master scheduling to agent execution, with metrics reported continuously back to the master.

Under the hood, the system uses 3 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.

Structural Verdict

An 8-component ML training platform with 4 connections; 2,136 files analyzed. Loosely coupled: components are relatively independent.

How Data Flows Through the System

  1. Experiment Submission — User submits experiment config via CLI or API to master
  2. Trial Scheduling — Master creates trials and schedules them on available agents based on resource requirements
  3. Trial Execution — Agent receives trial, starts container, and executes user training code with harness integration
  4. Metrics Collection — Training metrics and checkpoints stream back to master during execution
  5. Resource Management — Master reallocates resources based on scheduler policy and trial progress
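
The five stages above can be sketched as a small in-memory simulation. This is a toy model under stated assumptions, not Determined's implementation: the `Master`, `Agent`, and `Trial` classes, the first-fit placement rule, and the config keys `num_trials` and `slots_per_trial` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for a compute node with a fixed number of slots."""
    name: str
    free_slots: int


@dataclass
class Trial:
    """Hypothetical stand-in for one trial of an experiment."""
    trial_id: int
    slots_needed: int
    state: str = "PENDING"
    metrics: list = field(default_factory=list)


class Master:
    """Toy coordinator: accepts experiments, schedules trials, stores metrics."""

    def __init__(self, agents):
        self.agents = agents
        self.trials = []

    def submit_experiment(self, config):
        # Stage 1: the user submits a config and the master creates trials from it.
        for i in range(config["num_trials"]):
            self.trials.append(Trial(trial_id=i, slots_needed=config["slots_per_trial"]))

    def schedule(self):
        # Stage 2: first-fit placement of pending trials onto agents with free slots.
        placed = []
        for trial in self.trials:
            if trial.state != "PENDING":
                continue
            for agent in self.agents:
                if agent.free_slots >= trial.slots_needed:
                    agent.free_slots -= trial.slots_needed
                    trial.state = "RUNNING"
                    placed.append((trial, agent))
                    break
        return placed

    def report_metrics(self, trial, metrics):
        # Stage 4: metrics flow back from the running trial to the master.
        trial.metrics.append(metrics)

    def release(self, trial, agent):
        # Stage 5: a finished trial returns its slots so waiting trials can run.
        trial.state = "COMPLETED"
        agent.free_slots += trial.slots_needed


def run_trial(master, trial, steps=3):
    # Stage 3: placeholder for an agent running user training code in a container.
    for step in range(1, steps + 1):
        master.report_metrics(trial, {"step": step, "loss": 1.0 / step})


def run_to_completion(master):
    # Alternate scheduling and execution until every trial has finished.
    while any(t.state != "COMPLETED" for t in master.trials):
        placed = master.schedule()
        if not placed:
            raise RuntimeError("no agent can fit the remaining trials")
        for trial, agent in placed:
            run_trial(master, trial)
            master.release(trial, agent)


master = Master([Agent("agent-0", free_slots=2)])
master.submit_experiment({"num_trials": 3, "slots_per_trial": 2})
run_to_completion(master)
```

With one two-slot agent and three two-slot trials, the trials run one after another: each scheduling pass places a single trial, and releasing its slots lets the next pending trial start.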

System Behavior

How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Experiment Database (database)
PostgreSQL storing experiment configs, trial states, metrics, and checkpoints
Trial Metrics Stream (buffer)
Real-time metrics flowing from trials to master for progress tracking
Resource Pool State (state-store)
Current allocation of compute resources across agents and trials
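
As a rough illustration of how the three pools differ, the sketch below models each with a standard-library structure. It is an assumption-laden stand-in rather than the project's code: SQLite replaces PostgreSQL, and the table layout, the `allocate` and `flush_stream` helpers, and the slot counts are hypothetical.

```python
import sqlite3
from collections import deque

# Experiment Database: SQLite stands in for PostgreSQL in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics (trial_id INTEGER, step INTEGER, name TEXT, value REAL)")

# Trial Metrics Stream: a bounded buffer between running trials and the master.
metrics_stream = deque(maxlen=1000)

# Resource Pool State: how many slots each trial currently holds on each agent.
resource_pool = {"agent-0": {"total_slots": 4, "allocations": {}}}


def allocate(agent, trial_id, slots):
    """Record an allocation if the agent has enough free slots; return success."""
    state = resource_pool[agent]
    used = sum(state["allocations"].values())
    if used + slots > state["total_slots"]:
        return False
    state["allocations"][trial_id] = slots
    return True


def flush_stream():
    """Drain buffered metrics into the database and return the row count written."""
    rows = []
    while metrics_stream:
        rows.append(metrics_stream.popleft())
    db.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


allocate("agent-0", trial_id=1, slots=2)
metrics_stream.append((1, 1, "loss", 0.9))
metrics_stream.append((1, 2, "loss", 0.7))
written = flush_stream()
```

The buffer holds metrics only until they are flushed, the database keeps them durably, and the resource pool is mutable state that changes whenever a trial is placed or finishes.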

Feedback Loops

Delays & Async Processing

Control Points

Package Structure

This monorepo contains 5 packages:

agent (app)
Resource management agent that runs on compute nodes to execute ML workloads and communicate with the master.
master (app)
Central coordination service that manages experiments, schedules work, handles authentication, and provides the API/web interface.
harness (library)
Python library that provides APIs for users to integrate their ML code with Determined's training and experiment management.
e2e_tests (tooling)
End-to-end integration tests that validate the entire platform across different deployment scenarios.
webui (app)
Web dashboard for visualizing experiments, managing models, viewing metrics, and cluster administration.

Technology Stack

Go (framework)
Backend services (master/agent)
Python (framework)
ML integration library and user APIs
TypeScript/React (framework)
Web dashboard
PostgreSQL (database)
Experiment metadata and state
Docker (infra)
Container runtime for trials
gRPC/Protobuf (framework)
Service communication
Kubernetes (infra)
Container orchestration
pytest (testing)
Testing framework

Key Components

Configuration

codecov.yml (yaml)

bindings/generate_bindings_ts.py (python-dataclass)

bindings/swagger_parser.py (python-dataclass)

Science Pipeline

  1. Load Experiment Config — Parse YAML config into ExperimentConfig dict (harness/determined/_experiment_config.py)
  2. Initialize Trial Context — Create runtime context with cluster info and hyperparameters [ExperimentConfig + cluster metadata → TrialContext object] (harness/determined/_trial_context.py)
  3. Execute Training Loop — User-defined training with metrics reporting hooks [Model + training data → Metrics + checkpoints] (harness/determined/_trial.py)
  4. Collect Metrics — Stream training/validation metrics to master [Dict[str, float] → Stored metrics in database] (harness/determined/_trial_context.py)
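
The four pipeline steps can be sketched end to end in plain Python. This is a minimal illustration, not the harness API: `TrialContext` here is a hypothetical dataclass that borrows only the name, `load_experiment_config`, `init_trial_context`, and `train` are invented helpers, the config is passed as an already-parsed dict so the example needs no YAML library, and the "model" is a single parameter fitted by gradient descent.

```python
from dataclasses import dataclass, field


@dataclass
class TrialContext:
    """Hypothetical runtime context: hyperparameters plus cluster metadata."""
    hparams: dict
    cluster: dict
    reported: list = field(default_factory=list)

    def report_metrics(self, step, metrics):
        # Step 4: a real system would send these to the master; here they are kept locally.
        self.reported.append({"step": step, **metrics})


def load_experiment_config(raw):
    # Step 1: validate an already-parsed config (YAML parsing is omitted on purpose).
    if "hyperparameters" not in raw:
        raise ValueError("config must define hyperparameters")
    return dict(raw)


def init_trial_context(config, cluster):
    # Step 2: combine the experiment config with cluster metadata.
    return TrialContext(hparams=config["hyperparameters"], cluster=cluster)


def train(ctx, steps):
    # Step 3: gradient descent on f(w) = (w - 3)^2, reporting the loss after each step.
    w = 0.0
    lr = ctx.hparams["learning_rate"]
    for step in range(1, steps + 1):
        grad = 2.0 * (w - 3.0)
        w -= lr * grad
        ctx.report_metrics(step, {"loss": (w - 3.0) ** 2})
    return w


config = load_experiment_config({"hyperparameters": {"learning_rate": 0.1}})
ctx = init_trial_context(config, cluster={"trial_id": 1, "slots": 1})
final_w = train(ctx, steps=20)
```

Passing the context into the training function, rather than having the function read global state, mirrors the "Context Injection" idea: the training code only sees hyperparameters and a reporting hook.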

Assumptions & Constraints

Frequently Asked Questions

What is determined used for?

Determined is an open-source distributed ML platform for training, hyperparameter tuning, and resource management. determined-ai/determined is an 8-component ML training platform written in Go. Its components are loosely coupled and relatively independent. The codebase contains 2,136 files.

How is determined architected?

determined is organized into 4 architecture layers: User Interface, Coordination Layer, Execution Layer, and Infrastructure. Components are loosely coupled and relatively independent, and the layered structure keeps concerns separated.

How does data flow through determined?

Data moves through 5 stages: Experiment Submission → Trial Scheduling → Trial Execution → Metrics Collection → Resource Management. Experiments flow from user submission through master scheduling to agent execution, with metrics reported continuously back to the master. This pipeline design reflects a multi-stage processing system.

What technologies does determined use?

The core stack includes Go (backend services: master and agent), Python (ML integration library and user APIs), TypeScript/React (web dashboard), PostgreSQL (experiment metadata and state), Docker (container runtime for trials), gRPC/Protobuf (service communication), Kubernetes (container orchestration), and pytest (testing framework). This is a focused set of dependencies that keeps the build manageable.

What system dynamics does determined have?

determined exhibits 3 data pools (Experiment Database, Trial Metrics Stream, Resource Pool State), 3 feedback loops, 3 control points, and 3 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does determined use?

4 design patterns detected: Master-Agent Architecture, Trial-Based Execution, Context Injection, Plugin Architecture.

Analyzed on March 31, 2026 by CodeSea.