scitools/iris
A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
Loads, analyses, and visualises multi-dimensional Earth science datasets from a variety of formats
Under the hood, the system uses 2 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.
A 9-component library. 721 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters Iris through format-specific loaders that detect file types and create data proxies for lazy loading. These proxies are wrapped in Cube objects with coordinates extracted from file metadata. Analysis operations transform the cubes by manipulating their data arrays and coordinate systems, often triggering lazy evaluation only when results are needed. Finally, data exits through format-specific writers or visualization functions that render the multi-dimensional arrays with proper geospatial context.
- Format detection and file parsing — The load function examines file extensions and headers to determine the appropriate format loader (NetCDF, GRIB, UM, etc.), then delegates to format-specific parsers that extract metadata without loading large data arrays
- Data proxy creation — Format loaders create data proxy objects (like NetCDFDataProxy) that provide array-like interfaces to file data without loading it into memory, enabling lazy evaluation throughout the pipeline [File metadata → LazyArray]
- Coordinate extraction and standardization — Coordinate variables from files are converted into DimCoord and AuxCoord objects with proper units, standard names, and CF metadata compliance, establishing the dimensional structure of the dataset [File metadata → DimCoord]
- Cube construction and metadata enrichment — The Cube constructor combines data arrays, coordinates, and metadata into a unified object, applying CF conventions and resolving coordinate systems for proper geospatial reference [LazyArray → Cube]
- Analysis operations and transformations — User-requested operations like aggregation, interpolation, or regridding are applied to Cube objects, manipulating data arrays and coordinate systems while maintaining metadata consistency and lazy evaluation where possible [Cube → Cube]
- Data realization and output — When concrete results are needed (for saving or plotting), lazy arrays are computed through dask, data is written to output formats with appropriate metadata, or passed to matplotlib/cartopy for visualization [Cube]
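The six stages above can be condensed into a minimal, iris-free sketch. All class and function names here are illustrative stand-ins, not the real iris API; the point is the shape of the pipeline: a proxy defers reading, a container bundles data with coordinates, and realization happens only when a result is demanded.

```python
import numpy as np

class DataProxy:
    """Array-like handle that defers reading until realised (stage 2)."""
    def __init__(self, shape, dtype, reader):
        self.shape, self.dtype = shape, dtype
        self._reader = reader          # called only on first access
        self._reads = 0
    def realise(self):
        self._reads += 1
        return self._reader()

class Cube:
    """Data + coordinates + metadata container (stage 4)."""
    def __init__(self, proxy, coords, attributes):
        self._proxy = proxy
        self.coords = coords
        self.attributes = attributes
    @property
    def data(self):                    # stage 6: realisation on demand
        return self._proxy.realise()

# Stages 1-3: "load" a file by building a proxy plus coordinates from metadata.
proxy = DataProxy((3,), np.float32, lambda: np.array([1.0, 2.0, 3.0]))
cube = Cube(proxy, {"time": np.arange(3)}, {"source": "toy"})

assert proxy._reads == 0    # nothing read yet: loading was lazy
total = cube.data.sum()     # stages 5-6: analysis triggers the read
assert proxy._reads == 1 and total == 6.0
```

The real system swaps the toy reader for dask graphs and NetCDF/GRIB/UM parsers, but the deferral contract is the same.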
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Cube (lib/iris/cube.py) — Container with data: np.ndarray or dask.array (shape varies), dim_coords: list[DimCoord], aux_coords: list[AuxCoord], cell_measures: list[CellMeasure], ancillary_variables: list[AncillaryVariable], attributes: dict, cell_methods: list[CellMethod]
Created by file loaders from raw data, enriched with coordinates and metadata, passed through analysis operations that may transform dimensions or data, and serialized back to files or rendered as plots
- DimCoord (lib/iris/coords.py) — Coordinate with points: np.ndarray (1-D, length matches cube dimension), bounds: np.ndarray or None (2-D for intervals), standard_name: str, units: cf_units.Unit, attributes: dict
Created during file loading from dimension variables, used to define cube structure and enable coordinate-based indexing and operations like aggregation along axes
- AuxCoord (lib/iris/coords.py) — Coordinate with points: np.ndarray (N-D, arbitrary shape), bounds: np.ndarray or None, standard_name: str or None, long_name: str or None, units: cf_units.Unit, attributes: dict
Created for auxiliary coordinate variables that don't define cube dimensions but provide additional spatial or temporal reference information
- Mesh (lib/iris/experimental/ugrid/mesh.py) — UGRID mesh with topology_dimension: int, node_coords_and_axes: list[tuple[Coord, str]], face_node_connectivity: Connectivity, face_coords_and_axes: list[tuple[Coord, str]], edge_node_connectivity: Connectivity or None
Created from UGRID mesh files defining connectivity between nodes, edges, and faces, used to create MeshCoords for unstructured data analysis
- LazyArray (lib/iris/_lazy_data.py) — Dask array wrapper with array: dask.array.Array, dtype: np.dtype, shape: tuple[int], chunks: tuple[tuple[int]]
Created during lazy loading to defer data reading from disk, manipulated through dask operations during analysis, and realized to numpy arrays only when needed
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The DATA_GEN_PYTHON environment variable points to a Python executable that has all required dependencies (this repo, Mule, test modules) installed in its environment
If this fails: Data generation functions will fail with import errors or missing dependencies at runtime, corrupting benchmark datasets or causing benchmark failures
benchmarks/benchmarks/generate_data/__init__.py:DATA_GEN_PYTHON
The external Python environment remains stable and available throughout the entire benchmarking run sequence, which may span multiple commits and hours
If this fails: Mid-benchmark failures when the external environment becomes unavailable, requiring full benchmark restart and invalidating timing comparisons across commits
benchmarks/benchmarks/generate_data/__init__.py:run_function_elsewhere
System has sufficient memory to load UM files with shape (1920, 2560) of float32 data (~19MB per cube) plus coordinate arrays without memory pressure
If this fails: Silent memory swapping causes benchmark timing to include disk I/O, making results unreliable, or OOM kills benchmark process
benchmarks/benchmarks/cperf/__init__.py:_UM_DIMS_YX
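The ~19 MB figure quoted above can be checked directly from the stated shape and dtype:

```python
import numpy as np

shape = (1920, 2560)  # _UM_DIMS_YX from the benchmarks
nbytes = np.prod(shape) * np.dtype(np.float32).itemsize
assert nbytes == 19_660_800  # roughly 19.7 MB per field, before coordinates
```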
UM files always load longitude/latitude as DimCoords (which are always realized) while LFRic files load them as MeshCoords (which are lazy by default)
If this fails: Benchmark assertions fail if file format behavior changes, and timing comparisons become invalid if coordinate realization strategy differs between formats
benchmarks/benchmarks/cperf/load.py:time_load
The iris.tests.stock.netcdf module exists and contains the expected functions at the time run_function_elsewhere executes
If this fails: Data generation fails with AttributeError when checking out older Iris commits that don't have expected stock functions, breaking benchmark runs across commit history
benchmarks/benchmarks/generate_data/stock.py:_create_file__xios_common
Coordinate dimensions returned by c.cube_dims(source_cube) remain stable throughout the lifetime of the benchmark setup and match the source_cube's dimensional structure
If this fails: Cube construction fails with dimension mismatch errors if source cube's coordinate mapping changes between setup and benchmark execution
benchmarks/benchmarks/cube.py:setup
Previously generated benchmark data files remain valid and compatible with current Iris version when REUSE_DATA is enabled
If this fails: Benchmarks use stale data that doesn't match current Iris behavior, producing misleading performance measurements or silent failures due to format incompatibilities
benchmarks/benchmarks/generate_data/__init__.py:REUSE_DATA
The cubesphere size calculation int(np.sqrt(np.prod(_UM_DIMS_YX) / 6)) produces a valid cubesphere dimension that can be handled by LFRic mesh generation
If this fails: Mesh generation fails when calculated cubesphere size exceeds implementation limits or produces invalid mesh topology, causing benchmark crashes
benchmarks/benchmarks/cperf/__init__.py:_N_CUBESPHERE_UM_EQUIVALENT
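Working the cubesphere size calculation through shows why it is plausible but unvalidated: a cubesphere has 6 faces of n×n cells, so the benchmark solves 6·n² ≈ total UM grid points and truncates.

```python
import numpy as np

_UM_DIMS_YX = (1920, 2560)
# Solve 6 * n**2 ~= 1920 * 2560 = 4_915_200 and truncate to int.
n = int(np.sqrt(np.prod(_UM_DIMS_YX) / 6))
assert n == 905  # 6 * 905**2 = 4_914_150, a close but inexact match
```

Nothing checks that 905 is within the mesh generator's supported range, which is exactly the hidden assumption described above.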
Object persistence between ASV repeat runs behaves consistently: objects created in setup() remain modified after the first benchmark run
If this fails: Subsequent benchmark runs operate on already-modified objects, producing invalid timing measurements that don't reflect real-world performance
benchmarks/benchmarks/aggregate_collapse.py:disable_repeat_between_setup
The checked-out commit of Iris contains parseable setup.py/pyproject.toml with standard Python packaging metadata for dependency extraction
If this fails: Environment preparation fails when checking out commits with non-standard build configurations, breaking benchmark runs for historical commits
benchmarks/asv_delegated.py:_prep_env_override
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Lazy data arrays — Dask arrays that represent file data without loading it into memory, enabling scalable processing of large Earth science datasets
- Coordinate caches — Cached coordinate values and bounds to avoid repeated computation during coordinate operations and cube manipulations
- Format registry — Registry mapping file extensions and format signatures to appropriate loader classes for automatic format detection
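The format registry pool works roughly like a signature-to-loader mapping. This is an illustrative sketch, not iris's actual format-agent API; the signatures shown (classic NetCDF files begin with `CDF`, NetCDF4/HDF5 files with `\x89HDF`) are real magic bytes.

```python
# Toy format registry: map magic-byte signatures to loader callables.
FORMAT_REGISTRY = {}

def register_format(name, signature, loader):
    FORMAT_REGISTRY[name] = (signature, loader)

def detect_and_load(header_bytes):
    """Pick the first loader whose signature matches the file header."""
    for name, (signature, loader) in FORMAT_REGISTRY.items():
        if header_bytes.startswith(signature):
            return name, loader(header_bytes)
    raise ValueError("no registered loader matches this file")

register_format("NetCDF", b"CDF", lambda h: "netcdf-cube")
register_format("NetCDF4/HDF5", b"\x89HDF", lambda h: "netcdf4-cube")

name, cube = detect_and_load(b"\x89HDF\r\n...")
assert (name, cube) == ("NetCDF4/HDF5", "netcdf4-cube")
```

Registering loaders rather than hard-coding them is what makes the file-format layer a plugin architecture.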
Feedback Loops
- Lazy evaluation deferral (recursive, balancing) — Trigger: Analysis operations on large datasets. Action: Operations create new dask computation graphs without triggering computation, allowing chaining of operations. Exit: When concrete data is requested through .data property or save/plot operations.
- Coordinate alignment during merge (convergence, balancing) — Trigger: CubeList.merge() with misaligned coordinates. Action: Iteratively compares coordinate metadata and values to find common dimensional structure. Exit: When all cubes align on common coordinates or merge fails due to incompatibility.
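The lazy-evaluation loop can be modelled as deferred operation chaining. This is a sketch of the pattern, not dask's real graph machinery: each operation extends a computation graph without doing work, and the loop exits only when `.compute()` forces realisation.

```python
class Lazy:
    """Chainable deferred computation: each op extends the graph;
    nothing runs until .compute() is called (the loop's exit point)."""
    def __init__(self, thunk):
        self._thunk = thunk
    def map(self, fn):
        return Lazy(lambda: fn(self._thunk()))  # extend graph, no work yet
    def compute(self):
        return self._thunk()

calls = []
source = Lazy(lambda: (calls.append("read"), [1, 2, 3])[1])
pipeline = source.map(lambda xs: [x * 2 for x in xs]).map(sum)

assert calls == []      # chaining triggered no reads
result = pipeline.compute()
assert result == 12     # realisation happens once, on demand
assert calls == ["read"]
```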
Delays
- Lazy data loading (async-processing; latency varies with file size and disk I/O) — File metadata is available immediately, but large data arrays are only loaded when accessed
- Dask computation scheduling (batch-window; latency varies with computation graph complexity) — Multiple operations are batched together and executed when compute() is called
- Coordinate bound calculation (cache-ttl) — Coordinate bounds are calculated on first access and cached for subsequent operations
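The compute-on-first-access delay for coordinate bounds is the standard cached-property pattern. A minimal sketch (the Coord class and midpoint bounds here are illustrative, not iris's implementation):

```python
import functools

class Coord:
    def __init__(self, points):
        self.points = points
        self.bound_calls = 0

    @functools.cached_property
    def bounds(self):
        """Computed on first access, cached for all later accesses."""
        self.bound_calls += 1
        mids = [(a + b) / 2 for a, b in zip(self.points, self.points[1:])]
        return list(zip([self.points[0]] + mids, mids + [self.points[-1]]))

c = Coord([0.0, 1.0, 2.0])
_ = c.bounds
_ = c.bounds
assert c.bound_calls == 1  # second access hit the cache
```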
Control Points
- IRIS_FUTURE (feature-flag) — Controls: whether experimental features and forthcoming API changes are enabled, via opt-in flags
- Dask chunk size (runtime-toggle) — Controls: Memory usage and parallelization strategy for large array operations
- NetCDF engine selection (architecture-switch) — Controls: Which NetCDF library backend to use (netCDF4-python, h5netcdf)
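A feature-flag control point of this kind typically reads a flag list once and gates behaviour on it. The following is a generic sketch of environment-variable flagging, not iris's actual FUTURE mechanism; the variable name TOY_FUTURE and flag names are made up.

```python
import os

def feature_enabled(name, env="TOY_FUTURE"):
    """Parse a comma-separated flag list from the environment,
    e.g. TOY_FUTURE=lazy_save,strict_units (a generic sketch)."""
    flags = os.environ.get(env, "")
    return name in {f.strip() for f in flags.split(",") if f.strip()}

os.environ["TOY_FUTURE"] = "lazy_save, strict_units"
assert feature_enabled("lazy_save")
assert not feature_enabled("datum_support")
```

Flags like these let behaviour change per-deployment without code changes, which is what makes them control points rather than configuration constants.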
Technology Stack
- Dask — Provides lazy array operations and parallel computation for scalable processing of large Earth science datasets
- NumPy — Core array processing and mathematical operations, serving as the foundation for all numerical computations
- NetCDF4 — Primary interface for reading and writing NetCDF files, the most common format in Earth sciences
- CF-Units — Handles unit conversion and validation according to Climate and Forecast metadata conventions
- Matplotlib — Plotting and visualization backend for creating scientific plots and charts from Cube data
- Cartopy — Geospatial visualization with map projections and coordinate system transformations for Earth science data
- SciPy — Scientific computing algorithms including interpolation and statistical operations used in the analysis module
- pytest — Testing framework for comprehensive unit and integration tests across the codebase
Key Components
- Cube (registry, lib/iris/cube.py) — Central data container that holds n-dimensional arrays with associated coordinates, metadata, and lazy evaluation support, providing the primary interface for all data operations
- load (orchestrator, lib/iris/__init__.py) — Coordinates the loading process by detecting file formats, calling appropriate format-specific loaders, and assembling the results into Cube objects with proper metadata
- NetCDFDataProxy (adapter, lib/iris/fileformats/_nc_load_rules/engine.py) — Provides lazy access to NetCDF data arrays without loading them into memory, implementing the array interface for seamless integration with numpy/dask operations
- CubeList (processor, lib/iris/cube.py) — Manages collections of Cube objects and provides operations like merge and concatenate that combine multiple cubes based on coordinate alignment and metadata compatibility
- analysis (transformer, lib/iris/analysis/) — Implements mathematical operations on Cube data including aggregation (mean, sum), interpolation, regridding, and statistical analysis while preserving coordinate relationships
- CoordSystem (resolver, lib/iris/coord_systems.py) — Defines coordinate reference systems and map projections, enabling transformation between different spatial coordinate systems and proper geospatial visualization
- MeshCoord (adapter, lib/iris/experimental/ugrid/mesh.py) — Provides coordinate interface for unstructured mesh data, mapping between mesh topology and coordinate values for UGRID-compliant datasets
- FieldsFileVariant (adapter, lib/iris/fileformats/um/_ff_replacement.py) — Handles UK Met Office UM fieldsfile format by parsing binary headers and data fields, converting them into Iris-compatible data structures
- as_lazy_data (factory, lib/iris/_lazy_data.py) — Creates dask arrays from various input sources including numpy arrays and data proxies, enabling consistent lazy evaluation throughout the system
Frequently Asked Questions
What is iris used for?
scitools/iris is a 9-component library written in Python that loads, analyses, and visualises multi-dimensional Earth science datasets from a variety of formats. Data flows through 6 distinct pipeline stages, and the codebase contains 721 files.
How is iris architected?
iris is organized into 5 architecture layers: File Format Layer, Core Data Model, Analysis Operations, Mesh Support, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through iris?
Data moves through 6 stages: Format detection and file parsing → Data proxy creation → Coordinate extraction and standardization → Cube construction and metadata enrichment → Analysis operations and transformations → .... Data enters through format-specific loaders that build lazy data proxies, flows through Cube objects that carry coordinates and metadata, and exits through format-specific writers or visualization functions. This pipeline design reflects a multi-stage processing system built around deferred evaluation.
What technologies does iris use?
The core stack includes Dask (Provides lazy array operations and parallel computation for scalable processing of large Earth science datasets), NumPy (Core array processing and mathematical operations, serving as the foundation for all numerical computations), NetCDF4 (Primary interface for reading and writing NetCDF files, the most common format in Earth sciences), CF-Units (Handles unit conversion and validation according to Climate and Forecast metadata conventions), Matplotlib (Plotting and visualization backend for creating scientific plots and charts from Cube data), Cartopy (Geospatial visualization with map projections and coordinate system transformations for Earth science data), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does iris have?
iris exhibits 3 data pools (lazy data arrays, coordinate caches, and a format registry), 2 feedback loops, 3 control points, and 3 delays. The feedback loops handle recursive lazy-evaluation deferral and convergent coordinate alignment during merges. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does iris use?
4 design patterns detected: Lazy Loading with Proxies, Plugin Architecture for File Formats, Metadata-Rich Data Containers, Coordinate System Abstraction.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.