Hidden Assumptions in determined

12 assumptions this code never checks · 4 critical · spanning Environment, Contract, Temporal, Resource, Ordering, Domain, Scale

Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at determined-ai/determined and picked out the few most likely to cause trouble. The full list is just below.

The 3 below are the ones most likely to cause trouble here. The rest are minor and are listed under "Show everything".

Worth your attention first

If user code has import dependencies that are not available on the container's PYTHONPATH, or contains circular imports, the trial fails during initialization with a confusing ImportError.
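One way to make this failure easier to read is to import the entrypoint module up front and report which module was actually missing. The sketch below is illustrative only: preflight_import and its parameters are hypothetical names, not part of determined.

```python
import importlib
import sys
from typing import List, Optional


def preflight_import(module_name: str, extra_paths: Optional[List[str]] = None):
    """Import module_name early and fail with a readable message.

    Hypothetical helper: neither the name nor the parameters come from determined.
    """
    for path in extra_paths or []:
        if path not in sys.path:
            sys.path.insert(0, path)
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # exc.name is the module that failed to load; it may be a transitive
        # dependency rather than the entrypoint itself.
        missing = exc.name or "<unknown>"
        raise RuntimeError(
            f"Could not import entrypoint module {module_name!r}: "
            f"{missing!r} failed to import ({exc}). sys.path = {sys.path}"
        ) from exc
```

Calling this before the trial starts surfaces the missing module and the search path in a single message instead of a bare traceback.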

If the master sends a malformed config (missing required keys or wrong types), det.ExperimentConfig() construction silently succeeds, but trials crash later when they access expected values such as experiment_config['resources']['slots_per_trial'].
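A check at construction time could catch this before any trial runs. The following is a minimal sketch, assuming the config arrives as a nested dict. REQUIRED_FIELDS and validate_config are hypothetical, and only the resources.slots_per_trial path is taken from the text above.

```python
from collections.abc import Mapping
from typing import Sequence, Tuple

# Hypothetical schema of (key path, expected type) pairs. Only the first
# entry comes from the text above; the real required keys are an assumption.
REQUIRED_FIELDS: Sequence[Tuple[Tuple[str, ...], type]] = (
    (("resources", "slots_per_trial"), int),
)


def validate_config(config, required=REQUIRED_FIELDS) -> None:
    """Raise ValueError listing every missing key or wrongly typed value."""
    problems = []
    for path, expected_type in required:
        node = config
        for depth, key in enumerate(path, start=1):
            if not isinstance(node, Mapping) or key not in node:
                problems.append("missing key: " + ".".join(path[:depth]))
                break
            node = node[key]
        else:
            # bool is a subclass of int, so reject it explicitly for int fields.
            wrong_bool = expected_type is int and isinstance(node, bool)
            if wrong_bool or not isinstance(node, expected_type):
                problems.append(
                    f"{'.'.join(path)}: expected {expected_type.__name__}, "
                    f"got {type(node).__name__}"
                )
    if problems:
        raise ValueError("invalid experiment config: " + "; ".join(problems))
```

Collecting all problems before raising means a single failed run reports every bad field at once.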

If the master provides mismatched array lengths or out-of-order mappings, workers in distributed training bind to the wrong GPU devices, causing CUDA errors or suboptimal performance.
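The text does not name the arrays involved, so the sketch below invents its own: gpu_ids, local_rank, and local_size are hypothetical parameters. It shows only the general shape of a guard that rejects inconsistent input before a worker binds to a device.

```python
from typing import Sequence


def gpu_for_rank(gpu_ids: Sequence[str], local_rank: int, local_size: int) -> str:
    """Return the GPU ID for one local worker after checking the mapping.

    All three parameters are hypothetical stand-ins for whatever the master sends.
    """
    if len(gpu_ids) != local_size:
        raise ValueError(
            f"got {len(gpu_ids)} GPU IDs for {local_size} local workers: {list(gpu_ids)}"
        )
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError(f"duplicate GPU IDs in mapping: {list(gpu_ids)}")
    if not 0 <= local_rank < local_size:
        raise ValueError(f"local_rank {local_rank} is outside 0..{local_size - 1}")
    return gpu_ids[local_rank]
```

A length or duplicate check cannot detect every out-of-order mapping, but it turns the most common mismatches into an immediate, descriptive error.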

Show everything (9 more)

Temporal

Authentication credentials remain valid for the entire duration of a test session, and the master certificate does not change

If this fails: Long-running e2e tests fail mid-execution with authentication errors if master rotates certificates or credentials expire, causing flaky test results

e2e_tests/tests/api_utils.py:make_session

Resource

The 'det agent list --json' command completes within default subprocess timeout and produces valid JSON output

If this fails: If cluster has hundreds of agents or network latency is high, the command times out and test setup fails without indicating whether it's a performance or connectivity issue

e2e_tests/tests/cluster/managed_cluster.py:get_agent_data
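Python's subprocess.run waits indefinitely unless a timeout is passed, so a wrapper can set one explicitly and report slow commands, failed commands, and bad output as three distinct errors. This is a sketch under that assumption: run_json_command and the 30-second default are hypothetical and not taken from managed_cluster.py.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json_command(cmd: Sequence[str], timeout_s: float = 30.0) -> Any:
    """Run cmd and parse its stdout as JSON, with one error message per failure mode.

    Hypothetical helper; the timeout default is an arbitrary illustrative value.
    """
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(
            f"{list(cmd)} did not finish within {timeout_s}s (slow cluster or network?)"
        ) from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"{list(cmd)} exited with status {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            f"{list(cmd)} did not print valid JSON: {proc.stdout[:200]!r}"
        ) from exc
```

With this wrapper, the agent listing would be fetched as run_json_command(["det", "agent", "list", "--json"]), and a timeout would no longer look the same as a connectivity failure.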

Environment

When stdin is not a character device, it contains valid YAML data that can be unmarshaled into map[string]interface{}

If this fails: If piped input contains malformed YAML or binary data, the template processing fails with log.Fatal() terminating the entire process instead of graceful error handling

master/cmd/determined-gotmpl/main.go:stdinData

Ordering

The steps_completed value from TrialInfo represents the exact number of training steps completed and checkpoints are saved synchronously

If this fails: If trial crashes between completing a training step and saving checkpoint, resume from latest_checkpoint will repeat the last step, potentially causing incorrect learning rate scheduling or data shuffling

harness/determined/_env_context.py:steps_completed initialization
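One common way to keep the step count and the checkpoint from drifting apart is to store the count inside the checkpoint and write the file atomically. The sketch below illustrates that general technique with a JSON file. It is not how determined stores checkpoints, and save_checkpoint, load_checkpoint, and checkpoint.json are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Tuple


def save_checkpoint(ckpt_dir: Path, state: dict, steps_completed: int) -> None:
    """Write state and its step count together, replacing the old file atomically."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    payload = {"steps_completed": steps_completed, "state": state}
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem: readers see either the old
    # checkpoint or the new one, never a partially written file.
    os.replace(tmp_name, ckpt_dir / "checkpoint.json")


def load_checkpoint(ckpt_dir: Path) -> Tuple[dict, int]:
    """Return (state, steps_completed); the count always matches the saved state."""
    path = ckpt_dir / "checkpoint.json"
    if not path.exists():
        return {}, 0
    payload = json.loads(path.read_text())
    return payload["state"], payload["steps_completed"]
```

On resume, the learning rate schedule and data shuffling can then be driven from the step count read out of the checkpoint rather than from a separately tracked value.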

Scale

Go source files being parsed fit entirely in memory and don't exceed parser's internal limits

If this fails: For very large generated code files (hundreds of thousands of lines), the AST parser runs out of memory or fails, breaking the build process

master/cmd/stream-gen/main.go:parser.ParseFiles

Contract

The command-line argument slice os.Args always contains at least one element (the program name) before it is manipulated

If this fails: If agent is invoked through unusual process spawning that provides empty os.Args, the index access os.Args[0] panics with index out of range

agent/cmd/determined-agent/main.go:maybeInjectRootAlias

Environment

The session singleton can be started successfully and external dependencies for performance testing are available

If this fails: If performance testing infrastructure (databases, monitoring tools) is unavailable, TestProgramWithConfig.pre() fails, and the error handling calls sys.exit() without cleanup, leaving partial test state

performance/daist/daist/framework/main.py:session.start()
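Partial state of this kind can be avoided by registering a teardown for each setup step as soon as it succeeds. The sketch below uses contextlib.ExitStack for that. start_all and the (setup, teardown) pairs are hypothetical and do not reflect how daist is structured.

```python
from contextlib import ExitStack
from typing import Any, Callable, Iterable, Tuple

Step = Tuple[Callable[[], Any], Callable[[Any], None]]


def start_all(steps: Iterable[Step]) -> ExitStack:
    """Run each setup in order; if one fails, undo the ones that succeeded.

    Hypothetical helper: each step is a (setup, teardown) pair of callables.
    """
    stack = ExitStack()
    try:
        for setup, teardown in steps:
            resource = setup()
            stack.callback(teardown, resource)
    except BaseException:
        # sys.exit() raises SystemExit, a BaseException, so it also lands here.
        stack.close()  # runs registered teardowns in reverse order
        raise
    return stack
```

On success the returned stack can be used as a context manager (with start_all(steps): ...), so the same teardowns run when the test program finishes.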

Resource

GPU device IDs in container_gpus list are valid CUDA device identifiers that exist and are accessible within the container

If this fails: If master assigns non-existent GPU IDs or container lacks proper device permissions, CUDA initialization fails with cryptic device errors rather than clear resource allocation messages

harness/determined/_env_context.py:container_gpus List[str]

Temporal

API sessions cached by lru_cache remain valid for the entire test process lifetime and don't need refreshing

If this fails: If master restarts or network connectivity is lost during test execution, cached sessions become stale but continue to be returned by lru_cache, causing subsequent API calls to fail with authentication errors

e2e_tests/tests/api_utils.py:@functools.lru_cache decorators
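Functions wrapped with functools.lru_cache expose cache_clear(), which allows a stale session to be dropped and rebuilt once on an authentication failure. The sketch below assumes a zero-argument cached factory. call_with_fresh_session and is_auth_error are hypothetical names, not part of api_utils.py.

```python
import functools
from typing import Any, Callable


def call_with_fresh_session(
    get_session: Callable[[], Any],
    call: Callable[[Any], Any],
    is_auth_error: Callable[[BaseException], bool],
) -> Any:
    """Run call(session); on an auth error, clear the cache and retry once.

    get_session must be wrapped with functools.lru_cache so cache_clear() exists.
    """
    try:
        return call(get_session())
    except Exception as exc:
        if not is_auth_error(exc):
            raise
        get_session.cache_clear()  # drop the stale cached session
        return call(get_session())  # a second failure propagates to the caller


@functools.lru_cache(maxsize=None)
def example_session() -> dict:
    """Stand-in factory for illustration; a real one would log in to the master."""
    return {"token": "placeholder"}
```

Retrying only once keeps a genuinely broken login from looping forever while still recovering from a master restart.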

See the full structural analysis of determined: the pipeline, data models, and system behavior that put these assumptions in context.


Frequently Asked Questions

What does determined assume that could break in production?

The one most likely to cause trouble: user training code is assumed to be importable from filesystem paths, with no validation of module structure or Python path setup. If this fails: if user code has import dependencies that are not available on the container's PYTHONPATH, or contains circular imports, the trial fails during initialization with a confusing ImportError.

How many hidden assumptions does determined have?

CodeSea found 12 assumptions that determined relies on but never validates, 4 of them critical, spanning Environment, Contract, Temporal, Resource, Ordering, Domain, and Scale. Most are routine; the analysis flags the two or three most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails, often without a clear error.