Hidden Assumptions in graphcast

14 assumptions this code never checks · 4 critical · spanning Scale, Contract, Domain, Temporal, Ordering, Environment, Resource

Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at google-deepmind/graphcast and picked out the 3 most likely to cause trouble, explained in plain terms with what to do about each. The rest are minor and are listed after those three.

Worth your attention first

This forecasting system does not estimate the starting state of the atmosphere on its own. It depends on being handed starting conditions produced by a traditional physics-based weather analysis system, including a spread of slightly different starting states. If you feed it inputs that lack this kind of carefully prepared ensemble of starting conditions, the uncertainty estimates the paper reports will not hold.

What to do: Make sure your inputs come from a proper ensemble data assimilation analysis as the paper used, and do not trust the ensemble spread if you initialize from a single or improvised starting state.

From the paper

“GenCast also relies on initial conditions from a traditional NWP ensemble data assimilation system, and therefore for operational use those systems must still be available.”

Source: paper, Conclusions section

Worth a check

The authors tested this system only at a coarse global grid of roughly one degree. They explicitly say that for real operational use it should be trained and tested at finer resolution, and that finer resolution would likely give better results. The outputs you get are at the coarse scale the paper validated, not at the fine scale operational forecasts use.

What to do: Treat outputs as coarse-resolution forecasts and avoid presenting them as equivalent to finer-resolution operational ensemble products without re-validating at the resolution you need.

From the paper

“For operational use, GenCast should be trained and tested at higher resolution, and would likely yield better results.”

Source: paper, Conclusions section

Worth a check

The authors deliberately left precipitation out of their headline accuracy results because they are not fully confident in the quality of the training data for rainfall, and they did not tailor their evaluation to it. So any rain-related output should be treated with extra caution; it was not held to the same standard as the other weather variables.

What to do: Treat precipitation forecasts as provisional and check the paper's separate precipitation appendix and metrics before relying on rain predictions.

From the paper

“we lack full confidence in the quality of ERA5 precipitation data, and that we have not tailored our evaluation to precipitation specifically”

Source: paper, Results section

The other 11 assumptions

Temporal

The system was built around a twelve-hour forecast step on purpose, because the underlying historical data mixes two different kinds of transitions depending on the timing window. By stepping twelve hours it always jumps cleanly between windows. If your data or step timing does not respect this structure, the model is being used outside the regime it was designed for.

What to do: Keep the twelve-hour step and align inputs to the same time-of-day convention the authors used, rather than re-timing steps in ways that break the assimilation-window structure.

“By choosing a 12 hour time step we avoid training on this bimodal distribution, and ensure that our model always predicts a target from the next assimilation window.”

Source: paper, Methods section
autoregressive.py rollout window and 12h step configuration
Contract

The way the authors measured forecast reliability compares the spread of forecasts against a single best-guess past state, not against a spread of plausible past states. They warn that this scoring method unfairly favors forecasts that are too confident at short lead times, and can make a properly-spread forecast look overconfident. So short-range spread numbers should be read with that caveat in mind.

What to do: When judging short-lead-time spread or reliability, remember the paper's caution that the scoring method itself biases these numbers and do not over-interpret apparent over- or under-dispersion early in the forecast.

“this rewards under-dispersion at short lead times”

Source: paper, Verification section
Verification / evaluation harness (latitude-weighted metrics against deterministic analysis)
Domain

The model relies on a separate statistics file to scale its inputs and outputs correctly. If that file doesn't exactly match the model you loaded — for example, if you downloaded the wrong statistics file, or mixed files from different model versions — the model will still run without any complaint, but every number it produces will be quietly wrong. The forecast will look like a forecast, but it won't be correct.

What to do: Before running, double-check that your statistics file, your model checkpoint, and your task configuration all come from the same model version — mismatching these is the single easiest way to get plausible-looking but incorrect forecasts.

graphcast/normalization.py:Normalizer
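
One way to catch a mismatched statistics file before it skews a forecast is to compare the variables it covers with the variables the task expects, then spot-check that normalized inputs land near zero mean and unit spread. The sketch below is a minimal illustration in plain Python. The helpers `check_stats_match` and `looks_normalized`, their arguments, and the thresholds are hypothetical and are not part of graphcast.

```python
import statistics


def check_stats_match(task_variables, mean_by_var, std_by_var):
    """Return a list of problems; an empty list means the names line up.

    Hypothetical helper: the arguments are plain lists and dicts that the
    caller extracts from the task configuration and the statistics file.
    """
    problems = []
    for name in task_variables:
        if name not in mean_by_var or name not in std_by_var:
            problems.append(f"no statistics for variable {name!r}")
        elif std_by_var[name] <= 0:
            problems.append(f"non-positive std for variable {name!r}")
    return problems


def looks_normalized(samples, mean, std, max_abs_mean=1.0, max_std_ratio=3.0):
    """Spot-check that (x - mean) / std gives roughly standardized values.

    The thresholds are loose, assumed values. A single snapshot need not be
    exactly zero-mean, but it should not be off by orders of magnitude.
    """
    scaled = [(x - mean) / std for x in samples]
    centre = statistics.fmean(scaled)
    spread = statistics.pstdev(scaled)
    return abs(centre) <= max_abs_mean and 1 / max_std_ratio <= spread <= max_std_ratio
```

A name check alone will not catch statistics from a different model version that happen to cover the same variables, which is why the second helper looks at the values themselves.
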
Domain

The model expects your weather data to be formatted in a very specific way: particular variable names, latitude running from the North Pole down to the South Pole, and a specific set of pressure levels. If you get data from any source other than ERA5, or even ERA5 downloaded through a tool that reorders dimensions, the model will either crash or, more dangerously, silently use the wrong data in the wrong place and produce convincing-looking but wrong forecasts.

What to do: Before feeding in any data, confirm it uses ERA5 variable names, that latitude runs from 90 down to -90, longitude from 0 to 359.75, and that your pressure levels exactly match the list the model was configured with.

graphcast/graphcast.py:GraphCast.__init__
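
The coordinate checks described above can be made mechanical by comparing each axis, value by value and in order, with a reference taken from the data the checkpoint was trained on. The following is a minimal sketch under that assumption. `check_coordinate` and `check_grid` are hypothetical helpers, and where the reference lists come from is left to the caller.

```python
def check_coordinate(name, actual, expected, tol=1e-6):
    """Raise ValueError unless `actual` matches `expected` in value and order."""
    if len(actual) != len(expected):
        raise ValueError(f"{name}: expected {len(expected)} values, got {len(actual)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        if abs(a - e) > tol:
            raise ValueError(f"{name}[{i}]: expected {e}, got {a}")


def check_grid(lat, lon, level, expected):
    """Compare each coordinate with the reference the checkpoint expects.

    `expected` is a dict with "lat", "lon" and "level" lists (an assumed
    layout chosen for this sketch).
    """
    check_coordinate("lat", lat, expected["lat"])
    check_coordinate("lon", lon, expected["lon"])
    check_coordinate("level", level, expected["level"])
```

Because the comparison is ordered, a latitude axis that holds the right values in reversed order is rejected rather than silently accepted.
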
Scale

The model's internal map between the weather grid and its working graph was built assuming a specific grid resolution. If you feed it data at a different resolution — even a commonly available coarser resolution — the map is rebuilt with the wrong density, and the model silently operates on a completely different structure than it was trained on, producing bad forecasts.

What to do: Always use data at exactly the resolution the checkpoint was trained on; for the standard GraphCast model that means 0.25-degree global ERA5 data.

graphcast/grid_mesh_connectivity.py:radius_query_indices
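
A resolution mismatch is easy to detect from the latitude axis alone. The sketch below infers the grid spacing and refuses to continue when it differs from the resolution the checkpoint was trained on. Both function names are hypothetical, and the trained resolution is assumed to be supplied by the caller.

```python
def infer_resolution(lat, tol=1e-6):
    """Return the uniform spacing of a latitude axis in degrees, or raise."""
    if len(lat) < 2:
        raise ValueError("need at least two latitude values")
    steps = [abs(b - a) for a, b in zip(lat, lat[1:])]
    if max(steps) - min(steps) > tol:
        raise ValueError("latitude spacing is not uniform")
    return steps[0]


def require_resolution(lat, trained_resolution, tol=1e-6):
    """Refuse to continue when the data grid differs from the trained grid."""
    found = infer_resolution(lat, tol)
    if abs(found - trained_resolution) > tol:
        raise ValueError(
            f"data is on a {found}-degree grid but the checkpoint expects "
            f"{trained_resolution} degrees"
        )
```

For example, `require_resolution(lat, 0.25)` passes for a 721-point global axis at 0.25 degrees and raises for a 181-point axis at 1 degree.
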
Contract

When you load a saved model file, the code trusts that the file and the software version you're running are perfectly in sync. If the model file was saved with a newer or older version of the code, extra settings can be silently discarded or substituted with wrong defaults, causing the model to run differently than intended with no warning.

What to do: Always use the same version of the code to load a checkpoint as was used to save it, and keep track of which code version produced each checkpoint file.

graphcast/checkpoint.py:load
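
The failure mode described above, where extra settings are dropped or missing ones are filled with defaults, can be turned into a loud error by loading configuration strictly. The sketch below illustrates the idea with a stand-in dataclass. `ExampleConfig` and `strict_config_from_dict` are hypothetical and do not mirror graphcast's actual checkpoint format.

```python
import dataclasses


@dataclasses.dataclass
class ExampleConfig:
    """Hypothetical stand-in for a model configuration stored in a checkpoint."""
    resolution: float
    mesh_size: int
    latent_size: int = 512


def strict_config_from_dict(cls, saved):
    """Build `cls` from a saved dict, rejecting unknown or absent fields.

    Fields with defaults must also be present, so an older checkpoint can
    never be completed silently with defaults from newer code.
    """
    known = {f.name for f in dataclasses.fields(cls)}
    unknown = set(saved) - known
    missing = known - set(saved)
    if unknown:
        raise ValueError(f"checkpoint has fields this code does not know: {sorted(unknown)}")
    if missing:
        raise ValueError(f"checkpoint lacks fields this code expects: {sorted(missing)}")
    return cls(**saved)
```

Recording the code version next to each checkpoint file remains the simpler safeguard. A strict loader only catches the cases where the set of fields has changed.
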
Ordering

During multi-step forecasting, the model feeds its own predictions back as inputs for the next step. This only works correctly if the predictions come out in exactly the same format the model expects as input. If anything about the time ordering or variable structure is slightly off, either the model crashes part-way through a long forecast, or all steps after the first use misaligned data with no error shown.

What to do: When customizing inputs or variables, verify that every predicted output variable exactly matches the expected input structure before attempting a multi-step rollout.

graphcast/autoregressive.py:Predictor.__call__
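
Before starting a long rollout, the structural match between predictions and inputs can be checked once on a single step. The sketch below assumes each dataset is a plain dict that maps a variable name to its dimension names. That layout, the function names, and the `forced_variables` argument (inputs supplied externally rather than predicted) are assumptions made for illustration.

```python
def describe(dataset):
    """Map each variable name to its tuple of dimension names.

    `dataset` is assumed to be a dict of {variable: {"dims": (...)}}; a real
    pipeline would read the same information from its own data type.
    """
    return {name: tuple(info["dims"]) for name, info in dataset.items()}


def check_feedback_compatible(predictions, inputs, forced_variables=()):
    """Raise if predictions cannot be fed back as the next step's inputs."""
    pred, inp = describe(predictions), describe(inputs)
    needed = {k: v for k, v in inp.items() if k not in forced_variables}
    missing = sorted(set(needed) - set(pred))
    if missing:
        raise ValueError(f"inputs not produced by the model: {missing}")
    for name, dims in needed.items():
        if pred[name] != dims:
            raise ValueError(f"{name}: predicted dims {pred[name]} != input dims {dims}")
```

Dimension order is compared as well as dimension names, since a transposed variable is one of the ways misaligned data can pass through without an error.
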
Environment

The geometric structure of the model's internal graph is built using high-precision math on the CPU, then handed off to the accelerator. On certain hardware setups, that precision is silently cut in half before the model ever sees it, slightly but persistently corrupting the spatial encoding in every single forecast, with no warning.

What to do: If running on TPU or in a restricted-precision environment, verify that spatial edge features are explicitly cast to the intended precision before the first forward pass.

graphcast/graphcast.py:GraphCast.__init__
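
To get a feel for how much a drop from 64-bit to 32-bit floats changes a spatial feature, the rounding can be reproduced with the standard library. The sketch below is only an illustration of the size of the effect. The cosine-of-latitude feature and the helper names are assumptions and do not reproduce graphcast's actual edge features.

```python
import math
import struct


def as_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def max_rounding_error(values):
    """Largest absolute change caused by storing `values` in 32 bits."""
    return max(abs(v - as_float32(v)) for v in values)


# Example feature: cosine of latitude on a 0.25-degree global axis.
cos_lat = [math.cos(math.radians(90 - 0.25 * i)) for i in range(721)]
error = max_rounding_error(cos_lat)
```

Running the same comparison on the features the model actually receives, on the target hardware, shows whether the precision in use is the one intended.
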
Domain

The training loss weights each location on the globe by its true surface area, which requires the latitude values to be in degrees. If a data pipeline inadvertently converts latitudes to radians before they reach the loss function, every location is weighted almost equally, the model gets the wrong training signal, and the resulting model performs worse — with no error or warning during training.

What to do: When writing custom data loaders, make sure latitude coordinates are kept in degrees all the way through to the loss function.

graphcast/losses.py:weighted_mse_per_level
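
The effect of a degrees-versus-radians mix-up is easy to see with a simplified area weighting. The sketch below uses weights proportional to the cosine of latitude, normalized to a mean of 1. This is an assumed simplification and may differ from what `weighted_mse_per_level` actually computes. The guard `require_degrees` is a hypothetical heuristic that only makes sense for a global latitude axis.

```python
import math


def area_weights(lat_degrees):
    """Weights proportional to cos(latitude), normalized to have mean 1."""
    raw = [math.cos(math.radians(lat)) for lat in lat_degrees]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def require_degrees(lat):
    """Heuristic guard: a global axis in degrees reaches far beyond pi/2."""
    if max(abs(v) for v in lat) <= math.pi / 2 + 1e-6:
        raise ValueError("latitudes look like radians; expected degrees")


lat_deg = [-60.0, -30.0, 0.0, 30.0, 60.0]
lat_rad = [math.radians(v) for v in lat_deg]

correct = area_weights(lat_deg)   # high latitudes count for less
mistaken = area_weights(lat_rad)  # radians treated as degrees: nearly uniform
```

With radians passed in, the largest "latitude" is about 1 degree, so every cosine is close to 1 and the weights are almost flat. That matches the nearly equal weighting described above.
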
Temporal

The model always predicts one fixed time-step ahead regardless of what time labels you attach to your output template. If you accidentally ask for outputs at hourly or daily intervals instead of the model's native 6-hour step, the model runs without complaint but the time labels on the outputs are wrong — the forecast data corresponds to different times than indicated.

What to do: Make sure your output template uses time steps that exactly match the model's native forecast interval — 6 hours for GraphCast, 12 hours for GenCast.

graphcast/autoregressive.py:Predictor.__call__
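
Since the model does not check the time labels it is handed, the output template can be checked before the rollout. The sketch below verifies that each target time is exactly one native step after the previous one, starting from the last input time. The function name and its arguments are hypothetical.

```python
from datetime import datetime, timedelta


def check_template_times(last_input_time, target_times, native_step):
    """Raise unless every target time is one `native_step` after the previous."""
    previous = last_input_time
    for current in target_times:
        if current - previous != native_step:
            raise ValueError(
                f"step from {previous} to {current} is {current - previous}, "
                f"expected {native_step}"
            )
        previous = current
```

For a 6-hour model, `check_template_times(t0, [t0 + timedelta(hours=6), t0 + timedelta(hours=12)], timedelta(hours=6))` passes, while an hourly template raises on its first step.
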
Resource

Training the model over multiple forecast steps at once requires storing a large amount of intermediate data in memory — one copy for each time step in the rollout. By default, a setting that would cut this memory roughly in half is turned off. Trying to train over many steps without turning it on will simply crash with a confusing out-of-memory error.

What to do: When training with rollouts longer than a few steps, turn on the gradient checkpointing option to avoid running out of accelerator memory.

graphcast/autoregressive.py:Predictor.loss

Frequently Asked Questions

What does graphcast assume that could break in production?

The one most likely to cause trouble: the system does not estimate the starting state of the atmosphere on its own. It depends on starting conditions from a traditional physics-based ensemble data assimilation system, and the uncertainty estimates the paper reports will not hold without them. Use inputs from a proper ensemble data assimilation analysis, and do not trust the ensemble spread if you initialize from a single or improvised starting state.

How many hidden assumptions does graphcast have?

CodeSea found 14 assumptions graphcast relies on but never validates, 4 of them critical, spanning Scale, Contract, Domain, Temporal, Ordering, Environment, and Resource. Most are routine; the analysis highlights the three most likely to cause trouble.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.