Hidden Assumptions in kohya_ss

15 assumptions this code never checks · 3 critical · spanning Environment, Contract, Resource, Domain, Temporal, Scale, Ordering, Shape

Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at bmaltais/kohya_ss and picked out the few most likely to cause trouble. The full list is just below.

These 3 are the ones most likely to cause trouble here. The rest are minor and appear under "Show everything".

Worth your attention first

Training fails silently or with cryptic errors if sd-scripts is missing, is installed elsewhere, or cannot be executed because of file permissions. The user sees 'command not found' with no indication that an external dependency is missing.
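
A pre-flight check along these lines could turn that failure into an explicit message. This is a minimal sketch, not kohya_ss code: the helper name find_missing_scripts and its sd_scripts_dir argument are hypothetical, the script names are the ones listed in the FAQ of this report, and the directory layout is an assumption.

```python
from pathlib import Path

# Script names as listed in this report; where they live is an assumption.
EXPECTED_SCRIPTS = ("train_network.py", "train_db.py", "fine_tune.py")


def find_missing_scripts(sd_scripts_dir, names=EXPECTED_SCRIPTS):
    """Return the expected script names that are not regular files in sd_scripts_dir."""
    base = Path(sd_scripts_dir)
    return [name for name in names if not (base / name).is_file()]
```

A caller could refuse to launch training and show the returned names whenever the list is non-empty.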

If the training script changes the expected prompt file location or format, sample image generation fails silently during training with no notification to the user, and validation images never appear.

The training process crashes with CUDA out-of-memory errors or hangs indefinitely if GPU resources are insufficient; nothing validates available resources before launch.

Show everything (12 more)
Contract

The TOML config file contains only valid parameter keys that match the expected schema — any typos or deprecated keys are ignored silently

If this fails: Invalid config keys get silently dropped, causing user settings to revert to defaults without warning — user thinks their custom settings are applied but training uses different values

kohya_gui/class_gui_config.py:load_config
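
One way to surface dropped keys is to compare the parsed config against a set of known keys and suggest the nearest match for each stranger. The sketch below assumes the config has already been parsed into a dict; find_unknown_keys and the allowed_keys argument are hypothetical names, not part of kohya_ss.

```python
import difflib


def find_unknown_keys(config, allowed_keys):
    """Map each unrecognised key to its closest allowed key, or None if nothing is close."""
    unknown = {}
    for key in config:
        if key not in allowed_keys:
            matches = difflib.get_close_matches(key, sorted(allowed_keys), n=1)
            unknown[key] = matches[0] if matches else None
    return unknown
```

Logging the result at load time would tell the user that a misspelled key was ignored instead of silently falling back to a default.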
Domain

Default learning rate of '1e-6' is appropriate for all model types (SD 1.5, SDXL, SD3, Flux.1) and training approaches (LoRA, Dreambooth, fine-tuning)

If this fails: Training converges extremely slowly or fails to learn with inappropriate learning rates — SDXL may need 1e-5, LoRA may need 1e-4, but system uses same default for all

kohya_gui/class_basic_training.py:learning_rate_value
Temporal

Only one training process should run at a time, but process state tracking relies on a single instance variable that could become stale if the process crashes or is killed externally

If this fails: If the training process dies unexpectedly, the GUI still thinks it is running and blocks new training starts; the user must restart the entire GUI to recover

kohya_gui/class_command_executor.py:process
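
A stale flag can be avoided by asking the operating system whether the child is still alive each time the question comes up. The class below is a hypothetical sketch built on the standard library's subprocess.Popen; it is not the executor class in kohya_ss.

```python
import subprocess


class TrainingRunner:
    """Hypothetical wrapper that derives 'running' from the child process, not a stored flag."""

    def __init__(self):
        self.process = None

    def start(self, cmd):
        if self.is_running():
            raise RuntimeError("a training process is already running")
        self.process = subprocess.Popen(cmd)

    def is_running(self):
        # poll() returns None only while the child is still alive, so a
        # crashed or externally killed process is reported as not running.
        return self.process is not None and self.process.poll() is None
```

With this shape, a new run can start as soon as the previous child has exited, whatever the reason for the exit.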
Environment

The user running the GUI has write permissions to create directories in scriptdir/outputs, scriptdir/logs, and scriptdir/reg paths

If this fails: Directory creation fails silently or with permission denied errors, but training continues and then fails when trying to write outputs — confusing delayed failure mode

kohya_gui/class_folders.py:create_directory_if_not_exists
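
The delayed failure can be moved to the front by proving the directory is writable when it is created. This is a sketch under the assumption that raising an OSError before launch is acceptable; ensure_writable_dir is a hypothetical helper name.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create path if needed and prove it is writable, raising OSError otherwise."""
    os.makedirs(path, exist_ok=True)
    # Creating and deleting a scratch file is a more direct test than
    # inspecting permission bits, which can miss ACLs or read-only mounts.
    with tempfile.NamedTemporaryFile(dir=path):
        pass
    return path
```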
Scale

The selected GPU architecture supports the chosen mixed precision mode — fp16 requires compute capability 7.0+, bf16 requires Ampere+, fp8 requires Hopper+

If this fails: Training fails with cryptic CUDA errors or falls back to slower fp32 without notification — user expects performance benefits but gets degraded training speed

kohya_gui/class_accelerate_launch.py:mixed_precision
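
A check of this kind can be written as a small lookup. The sketch below encodes only the thresholds stated in the assumption above (Ampere is compute capability 8.0, Hopper is 9.0); the table and the function name precision_supported are illustrative and not taken from kohya_ss.

```python
# Minimum (major, minor) compute capability per mode, copied from the
# thresholds stated in the text above. Treat them as assumptions to verify.
MIN_CAPABILITY = {"fp16": (7, 0), "bf16": (8, 0), "fp8": (9, 0)}


def precision_supported(mode, capability):
    """capability is a (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    return tuple(capability) >= MIN_CAPABILITY[mode]
```

For example, a card reporting (7, 5) would pass the fp16 check and fail the bf16 check, which could be shown as a warning before launch.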
Ordering

Sample prompts are written to the prompt file before training starts, but the training script may read this file at initialization — race condition if file is created after script launch

If this fails: Sample image generation uses empty or default prompts instead of user-specified ones if timing is wrong — validation images don't match user expectations

kohya_gui/class_sample_images.py:create_prompt_file
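
The ordering can be made explicit by finishing the write before the child process exists. The sketch below assumes a one-prompt-per-line file, which may not match the real prompt format, and write_prompts_then_launch is a hypothetical helper.

```python
import os
import subprocess


def write_prompts_then_launch(prompt_path, prompts, cmd):
    """Write and flush the prompt file, and only then start the child process."""
    with open(prompt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(prompts) + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    # The file is closed and complete before the child can open it.
    return subprocess.Popen(cmd)
```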
Contract

SDXL parameters like cache_text_encoder_outputs and no_half_vae are only relevant when SDXL mode is enabled, but parameter validation doesn't enforce this constraint

If this fails: Non-SDXL training may receive SDXL-specific flags that are ignored or cause errors — confusing parameter interaction without clear error messages

kohya_gui/class_sdxl_parameters.py:initialize_accordion
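
The missing constraint is small enough to state directly in code. In this sketch the two flag names come from the text above, while the function check_sdxl_flags and the plain-dict params argument are hypothetical.

```python
# Flag names come from the text above; treating exactly these two as
# SDXL-only is an assumption for illustration.
SDXL_ONLY_FLAGS = {"cache_text_encoder_outputs", "no_half_vae"}


def check_sdxl_flags(params, sdxl_enabled):
    """Return the SDXL-only flags that are set while SDXL mode is off."""
    if sdxl_enabled:
        return []
    return sorted(k for k in SDXL_ONLY_FLAGS if params.get(k))
```

A non-empty result could be reported to the user, or the listed flags could be dropped before the command line is built.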
Domain

Gradient accumulation steps and batch size are configured independently, but effective batch size = batch_size * gradient_accumulation_steps must fit within GPU memory constraints

If this fails: User can configure valid individual values that combine to exceed VRAM limits — training crashes with memory errors despite valid individual parameters

kohya_gui/class_advanced_training.py:gradient_accumulation_steps
Temporal

Config file writes are atomic and won't be interrupted — partial writes could corrupt the TOML file if system crashes during save

If this fails: Corrupted config files cause startup failures or silent loss of user settings — no backup or recovery mechanism for configuration data

kohya_gui/class_gui_config.py:save_config
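
A common way to avoid partial writes is to write to a temporary file in the same directory and then rename it over the target. The sketch below uses only the standard library; atomic_write_text is a hypothetical helper, and the atomicity of os.replace holds only when both paths are on the same filesystem.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the scratch file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```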
Resource

The output directory has sufficient disk space for model checkpoints, logs, and sample images — typical LoRA training can generate several GB of data

If this fails: Training fails mid-process when disk fills up, potentially corrupting checkpoint files — no pre-flight disk space validation

kohya_gui/class_folders.py:current_output_dir
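
A pre-flight disk check needs only the standard library. In this sketch the required_bytes threshold is left to the caller because the report gives no exact figure; has_free_space is a hypothetical name.

```python
import shutil


def has_free_space(path, required_bytes):
    """True if the filesystem containing path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes
```

The check is a snapshot, so other processes can still fill the disk during a long run; it catches only the case where space is already short at launch.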
Environment

The subprocess environment inherits all necessary environment variables (CUDA_VISIBLE_DEVICES, LD_LIBRARY_PATH) required by the training scripts

If this fails: Training script can't access GPUs or required libraries despite them being available to the GUI process — mysterious 'no GPU found' errors

kohya_gui/class_command_executor.py:execute_command
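
Passing an explicit environment to the child makes the inheritance visible and checkable. The helper below is a hypothetical sketch; which variables are truly required depends on the machine, and CUDA_VISIBLE_DEVICES in particular is often legitimately unset.

```python
import os


def build_child_env(required, base_env=None):
    """Copy the parent environment and fail early if a required variable is unset."""
    env = dict(os.environ if base_env is None else base_env)
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise EnvironmentError("missing environment variables: " + ", ".join(missing))
    return env
```

The returned dict can be handed to subprocess.Popen through its env parameter so the child sees exactly what was checked.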
Shape

Sample prompts are plain text strings that can be written to a file with UTF-8 encoding, but prompts containing special characters or null bytes could corrupt the file

If this fails: Training script fails to parse malformed prompt file or generates unexpected results — sample images don't match intended prompts

kohya_gui/class_sample_images.py:create_prompt_file
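
Validating each prompt before it reaches the file keeps malformed input out of the training script. This sketch assumes one prompt per line, which is why it also rejects line breaks; validate_prompt is a hypothetical helper.

```python
def validate_prompt(prompt):
    """Return prompt unchanged, or raise ValueError if it cannot be stored safely."""
    if "\x00" in prompt:
        raise ValueError("prompt contains a null byte")
    if "\n" in prompt or "\r" in prompt:
        # Assumes a one-prompt-per-line file; an embedded newline would split it.
        raise ValueError("prompt contains a line break")
    prompt.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
    return prompt
```

UnicodeEncodeError is a subclass of ValueError, so a caller can handle all three rejections with a single except ValueError clause.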

See the full structural analysis of kohya_ss: the pipeline, data models, and system behavior that put these assumptions in context.


Frequently Asked Questions

What does kohya_ss assume that could break in production?

The one most likely to cause trouble: the sd-scripts training modules (train_network.py, train_db.py, fine_tune.py) are assumed to exist at predictable paths relative to the kohya_ss installation and to be executable. If this fails, training fails silently or with cryptic errors when sd-scripts is missing, installed elsewhere, or blocked by file permissions, and the user sees 'command not found' with no indication that an external dependency is missing.

How many hidden assumptions does kohya_ss have?

CodeSea found 15 assumptions kohya_ss relies on but never validates, 3 of them critical, spanning Environment, Contract, Resource, Domain, Temporal, Scale, Ordering, Shape. Most are routine — the analysis flags the two or three most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.