xarray-contrib/xarray-spatial
Spatial analysis algorithms for xarray implemented in numba
High-performance spatial analysis algorithms for raster data in Python
Under the hood, the system uses 2 feedback loops, 2 data pools, and 2 control points to manage its runtime behavior.
A 10-component library of 292 analyzed files, with data flowing through 6 distinct pipeline stages.
How Data Flows Through the System
Raster data enters as xarray.DataArray objects from files or in-memory arrays. The dispatch system examines the underlying data type (numpy, dask, cupy) and selects the appropriate backend implementation. Spatial algorithms apply domain-specific transformations using numba-compiled kernels, preserving coordinate reference systems and spatial metadata. Results return as new DataArrays that can be chained through additional operations or exported to standard geospatial formats.
- Load raster data into xarray DataArray — GeoTIFF reader parses file headers to extract spatial metadata (CRS, geotransform), loads pixel data into numpy/dask/cupy arrays, creates DataArray with proper coordinate dimensions
- Backend dispatch based on array type — dispatch_backend function inspects DataArray.data type - numpy arrays use CPU kernels, dask arrays enable distributed processing, cupy arrays trigger GPU computation paths [SpatialDataArray]
- Execute spatial algorithm with selected backend — Algorithm functions like flow_direction or reclassify call backend-specific implementations - numba.jit kernels for numpy, distributed tasks for dask, CUDA kernels for cupy [SpatialDataArray → SpatialDataArray]
- Apply focal operations and convolutions — apply_kernel slides ConvolutionKernel over raster using optimized boundary handling, computing statistics or filtering within neighborhood windows [ConvolutionKernel → Filtered Raster Results]
- Generate derived products — Hydrology algorithms chain flow_direction→flow_accumulation→watershed, terrain analysis computes slope→aspect→curvature, multispectral indices combine bands using standard formulas [FlowDirectionGrid → Derived Analysis Products]
- Export results to geospatial formats — GeoTiffWriter serializes final DataArrays with preserved spatial metadata, coordinate systems, and compression options for interoperability with GIS software [SpatialDataArray]
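The backend-dispatch stage above can be sketched in a few lines. This is an illustrative stand-in for the library's dispatch_backend, not its actual implementation: it keys off the array type's module name, which is how numpy/dask/cupy arrays are commonly told apart.

```python
import numpy as np

def dispatch_backend(data):
    """Pick a backend from the array type, mirroring the
    numpy/dask/cupy dispatch described above (simplified sketch)."""
    module = type(data).__module__.split(".")[0]
    if module == "numpy":
        return "cpu"          # numba.jit kernels
    if module == "dask":
        return "distributed"  # dask task graphs
    if module == "cupy":
        return "gpu"          # CUDA kernels
    raise TypeError(f"unsupported array type: {type(data)!r}")

print(dispatch_backend(np.zeros((2, 2))))  # cpu
```

Because an xarray.DataArray wraps any of these array types in its .data attribute, a single check like this lets every algorithm route to the right kernel without the caller doing anything.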
Data Models
The data structures that flow between stages — the contracts that hold the system together.
xrspatial/_common.py — xarray.DataArray with dims ['y', 'x'], coords containing spatial references (CRS, geotransform), and .data attribute containing numpy.ndarray, dask.array.Array, or cupy.ndarray
Created from raster files or in-memory arrays, passed through dispatch system to select backend, processed by numba-compiled kernels, returned as new DataArray with preserved spatial metadata
xrspatial/convolution.py — 2D numpy array with shape (height, width) containing filter weights, wrapped in a kernel object with metadata such as divisor and offset
Defined as numpy arrays, validated for odd dimensions, applied to raster data through sliding window operations
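The sliding-window application described above can be illustrated with a minimal focal-mean sketch. This is not xrspatial's apply_kernel (which uses numba-compiled kernels and configurable boundary handling); it is a plain-numpy version showing the odd-dimension check and edge-padded window sweep.

```python
import numpy as np

def focal_mean(raster, kernel):
    """Slide an odd-sized kernel over the raster with edge padding,
    computing a weighted mean per window (simplified sketch)."""
    kh, kw = kernel.shape
    assert kh % 2 == 1 and kw % 2 == 1, "kernel dims must be odd"
    padded = np.pad(raster, ((kh // 2, kh // 2), (kw // 2, kw // 2)),
                    mode="edge")
    out = np.zeros_like(raster, dtype=float)
    for i in range(raster.shape[0]):
        for j in range(raster.shape[1]):
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel) / kernel.sum()
    return out

r = np.arange(9, dtype=float).reshape(3, 3)
print(focal_mean(r, np.ones((3, 3)))[1, 1])  # 4.0 (mean of all 9 cells)
```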
xrspatial/hydro/flow_direction.py — Integer-coded DataArray where each cell contains a D8 direction value (1, 2, 4, 8, 16, 32, 64, 128) indicating flow toward one of 8 neighbors
Generated from elevation DEM using steepest descent, consumed by downstream hydrology algorithms for watershed delineation and flow routing
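The steepest-descent D8 encoding above can be sketched directly. This is a simplified illustration (interior cells only, borders left as 0), not the library's numba kernel; the code-to-offset pairing follows the conventional ESRI-style D8 numbering listed in the data model.

```python
import numpy as np

# (drow, dcol) neighbor offsets paired with D8 direction codes
D8 = [(0, 1, 1), (1, 1, 2), (1, 0, 4), (1, -1, 8),
      (0, -1, 16), (-1, -1, 32), (-1, 0, 64), (-1, 1, 128)]

def flow_direction(dem):
    """Assign each interior cell the D8 code of its steepest
    downhill neighbor, normalizing drop by neighbor distance."""
    out = np.zeros(dem.shape, dtype=np.int32)
    for i in range(1, dem.shape[0] - 1):
        for j in range(1, dem.shape[1] - 1):
            best_drop, best_code = 0.0, 0
            for di, dj, code in D8:
                dist = np.hypot(di, dj)  # 1 or sqrt(2)
                drop = (dem[i, j] - dem[i + di, j + dj]) / dist
                if drop > best_drop:
                    best_drop, best_code = drop, code
            out[i, j] = best_code
    return out

dem = np.array([[3., 3., 3.],
                [3., 2., 3.],
                [3., 3., 1.]])
print(flow_direction(dem)[1, 1])  # 2 (flows SE toward the 1.0 cell)
```

Note how the hidden assumption below about physical elevation values shows up here: if the DEM were inverted, every `drop` would flip sign and the codes would point uphill.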
xrspatial/classify.py — numpy array of break values for reclassification, paired with a new_values array of the same length
Created from statistical analysis (quantiles, natural breaks, equal intervals), used to map continuous raster values to discrete classes
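The break-point mapping can be sketched with numpy's searchsorted. This is a hedged stand-in for xrspatial.classify.reclassify, assuming the "monotonically increasing bins, equal-length new_values" contract noted in the Hidden Assumptions section; the exact boundary semantics of the real function may differ.

```python
import numpy as np

def reclassify(values, bins, new_values):
    """Map continuous values to discrete classes: a value gets the
    class of the first break point >= it (simplified sketch)."""
    bins = np.asarray(bins)
    new_values = np.asarray(new_values)
    assert len(bins) == len(new_values), "bins/new_values must match"
    assert np.all(np.diff(bins) > 0), "bins must be strictly increasing"
    idx = np.searchsorted(bins, values, side="left")
    idx = np.clip(idx, 0, len(bins) - 1)  # values past the last break
    return new_values[idx]

v = np.array([0.1, 0.5, 0.9])
print(reclassify(v, bins=[0.3, 0.6, 1.0], new_values=[1, 2, 3]))  # [1 2 3]
```

The two asserts make the hidden assumptions explicit: unordered bins or mismatched lengths fail loudly here instead of silently misclassifying.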
xrspatial/geotiff/_geotags.py — dataclass with a 6-parameter affine transformation (scale_x, shear_x, offset_x, shear_y, scale_y, offset_y) plus CRS information
Extracted from raster file headers, used to convert between pixel coordinates and real-world geographic coordinates
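The pixel-to-world conversion is a direct application of the 6-parameter affine transform above. The parameter ordering here follows the tuple listed in the data model, but the actual dataclass layout in _geotags.py may differ; treat this as a worked example of the math, not the file's API.

```python
def pixel_to_world(col, row, gt):
    """Convert pixel indices to world coordinates with a
    6-parameter affine geotransform (illustrative ordering)."""
    scale_x, shear_x, offset_x, shear_y, scale_y, offset_y = gt
    x = offset_x + col * scale_x + row * shear_x
    y = offset_y + col * shear_y + row * scale_y
    return x, y

# 30 m pixels, origin at (500000, 4200000), north-up grid
# (negative y scale, since row indices increase downward)
gt = (30.0, 0.0, 500000.0, 0.0, -30.0, 4200000.0)
print(pixel_to_world(10, 5, gt))  # (500300.0, 4199850.0)
```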
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Input raster DataArray has exactly 2 spatial dimensions named 'y' and 'x' in that order, with y representing rows and x representing columns in a regular grid
If this fails: If input has different dimension names ('lat'/'lon', 'row'/'col') or wrong order ('x','y'), the polygonize algorithm will silently process wrong axes or crash with confusing dimension errors
xrspatial/polygonize.py:polygonize
Input elevation DataArray contains physical elevation values where higher numbers mean higher terrain - algorithm computes steepest descent assuming this relationship
If this fails: If input contains inverted values (depth instead of elevation) or non-physical data (categorical codes), flow direction vectors will point uphill instead of downhill, producing completely wrong watershed boundaries
xrspatial/hydro/flow_direction.py:flow_direction
Source raster contains sparse seed points (non-zero values) with most cells being zero - algorithm efficiency depends on having few source locations to expand from
If this fails: If source raster is dense with many non-zero values, the priority queue-based expansion will consume excessive memory and run orders of magnitude slower than expected
xrspatial/cost_distance.py:cost_distance
All spatial algorithm implementations across numpy/dask/cupy backends produce numerically identical results for the same input data
If this fails: Users switching backends expecting consistent outputs may get different results due to floating-point precision differences or algorithmic variations between CPU and GPU implementations
xrspatial/utils.py:dispatch_backend
Memory can hold N_sources cost surfaces simultaneously, each the same size as input raster
If this fails: For large rasters with many source points, memory usage scales as (raster_size × N_sources), causing OOM crashes on systems with insufficient RAM
xrspatial/balanced_allocation.py:balanced_allocation
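The (raster_size × N_sources) scaling above can be sanity-checked with a quick back-of-envelope estimate. The helper name and the float64 bytes-per-cell default are illustrative assumptions, not part of the library.

```python
def cost_surface_memory_gb(rows, cols, n_sources, bytes_per_cell=8):
    """Worst-case memory if n_sources cost surfaces (float64 by
    default) are held simultaneously, per the scaling noted above."""
    return rows * cols * n_sources * bytes_per_cell / 1e9

# A 20000x20000 raster with 16 source points already needs ~51 GB
print(cost_surface_memory_gb(20000, 20000, 16))  # 51.2
```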
RTX triangulation cache remains valid between calls with identical input point coordinates - reuses cached triangulation for performance
If this fails: If point coordinates are numerically identical but values changed, cached triangulation produces stale interpolation results instead of recomputing with new values
xrspatial/gpu_rtx/interpolate.py:RTXTriangulator
Input bins array is monotonically increasing and new_values array has same length as bins - uses bins as break points for reclassification
If this fails: If bins are unordered or arrays have mismatched lengths, reclassification assigns wrong class values or crashes with index errors during value lookup
xrspatial/classify.py:reclassify
Input red and near-infrared band DataArrays represent surface reflectance values in range [0,1] or digital numbers that can be directly used in vegetation index formulas
If this fails: If bands contain top-of-atmosphere radiance or atmospherically uncorrected values, computed NDVI values will be biased and not comparable to standard vegetation indices
xrspatial/multispectral.py:ndvi
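The NDVI formula itself is standard, (NIR − Red) / (NIR + Red), so the reflectance assumption above is about the inputs, not the arithmetic. A minimal numpy sketch (the real xrspatial.multispectral.ndvi operates on DataArrays and preserves spatial metadata):

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red), assuming surface
    reflectance inputs; zero denominators yield NaN."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom == 0, np.nan, (nir - red) / denom)

# Healthy vegetation reflects strongly in NIR, so NDVI is high
print(ndvi([0.5, 0.4], [0.1, 0.4]))  # [0.666..., 0.0]
```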
When type='cupy' is requested, CUDA GPU is available and CuPy is properly installed with compatible CUDA version
If this fails: Benchmarks fail with NotImplementedError if CUDA environment is misconfigured, but error message doesn't specify which CUDA/CuPy version mismatch caused the failure
benchmarks/benchmarks/common.py:get_xr_dataarray
Erosion simulation iterations parameter is reasonable (hundreds to low thousands) - each iteration processes entire raster for thermal and hydraulic erosion
If this fails: Large iteration counts on big rasters make runtime grow in proportion to iterations × raster size, so runs can take hours with no progress indication or early-termination option
xrspatial/erosion.py:erode
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Raster Cache — Intermediate raster results cached during multi-step algorithms to avoid recomputation
- Kernel Registry — Pre-compiled numba kernels indexed by operation type and backend for fast dispatch
Feedback Loops
- Iterative Flow Accumulation (convergence, reinforcing) — Trigger: Flow direction calculation identifies pour points. Action: Traverse upstream cells accumulating drainage area until reaching watershed boundaries. Exit: All cells processed or no more upstream contributors.
- Erosion Simulation Steps (training-loop, balancing) — Trigger: User-specified iteration count and random seed. Action: Apply thermal and hydraulic erosion operators to elevation surface, update heights based on sediment transport. Exit: Iteration count reached.
Delays
- Numba JIT Compilation (compilation, first call only, ~1-5 seconds) — The initial call to a function triggers kernel compilation; subsequent calls execute at native speed
- Dask Task Scheduling (async-processing, duration varies with cluster size) — Large raster operations are split into tasks; results materialize only when .compute() is called
Control Points
- Backend Selection (architecture-switch) — Controls: Whether operations run on CPU (numpy), distributed (dask), or GPU (cupy/rtx). Default: Runtime detection
- Chunk Size Configuration (hyperparameter) — Controls: Memory usage and parallelization granularity for large raster operations. Default: max(1, dimension//2)
Technology Stack
- xarray — Provides labeled multi-dimensional arrays with coordinate systems for representing georeferenced raster data
- numba — JIT compiles Python functions to native code for high-performance spatial algorithm kernels
- dask — Enables distributed and out-of-core processing of large raster datasets through lazy evaluation and task graphs
- cupy — GPU-accelerated array operations providing a CUDA backend for spatial algorithms on NVIDIA hardware
- numpy — Foundation array library providing CPU-based numerical computations and memory layout for raster data
- rtx — Hardware-accelerated ray tracing for advanced spatial interpolation and rendering on RTX GPUs
Key Components
- dispatch_backend (dispatcher, xrspatial/utils.py) — Inspects input DataArray.data type and routes computation to numpy, dask, cupy, or rtx backend implementations
- apply_kernel (executor, xrspatial/focal.py) — Performs sliding window convolution operations on raster data using numba-compiled kernels for focal statistics and filtering
- flow_direction (processor, xrspatial/hydro/flow_direction.py) — Computes D8 flow direction from digital elevation model using steepest descent algorithm
- flow_accumulation (processor, xrspatial/hydro/flow_accumulation.py) — Calculates upstream contributing area for each cell by following flow direction paths and accumulating drainage area
- reclassify (transformer, xrspatial/classify.py) — Maps continuous raster values to discrete classes using user-defined or statistically-derived break points
- GeoTiffWriter (serializer, xrspatial/geotiff/writer.py) — Writes xarray DataArrays to GeoTIFF format with proper spatial metadata, tags, and compression without requiring GDAL
- multispectral_indices (processor, xrspatial/multispectral.py) — Computes vegetation and land cover indices (NDVI, EVI, ARVI, etc.) from multispectral satellite imagery bands
- polygonize (transformer, xrspatial/polygonize.py) — Converts raster regions with identical values into vector polygon geometries using contour tracing algorithms
- surface_curvature (processor, xrspatial/terrain.py) — Calculates terrain curvature measures (plan, profile, mean) from elevation surfaces using second derivatives
- RTXTriangulator (accelerator, xrspatial/gpu_rtx/interpolate.py) — Uses RTX GPU hardware-accelerated triangulation for fast spatial interpolation of scattered point data
Frequently Asked Questions
What is xarray-spatial used for?
xarray-contrib/xarray-spatial provides high-performance spatial analysis algorithms for raster data in Python. It is a 10-component library written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 292 files.
How is xarray-spatial architected?
xarray-spatial is organized into 4 architecture layers: Algorithm Modules, Backend Dispatch, Kernel Definitions, I/O and Format Support. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through xarray-spatial?
Data moves through 6 stages: Load raster data into xarray DataArray → Backend dispatch based on array type → Execute spatial algorithm with selected backend → Apply focal operations and convolutions → Generate derived products → Export results to geospatial formats. Raster data enters as xarray.DataArray objects from files or in-memory arrays. The dispatch system examines the underlying data type (numpy, dask, cupy) and selects the appropriate backend implementation. Spatial algorithms apply domain-specific transformations using numba-compiled kernels, preserving coordinate reference systems and spatial metadata. Results return as new DataArrays that can be chained through additional operations or exported to standard geospatial formats. This pipeline design reflects a multi-stage processing system.
What technologies does xarray-spatial use?
The core stack includes xarray (Provides labeled multi-dimensional arrays with coordinate systems for representing georeferenced raster data), numba (JIT compiles Python functions to native code for high-performance spatial algorithm kernels), dask (Enables distributed and out-of-core processing of large raster datasets through lazy evaluation and task graphs), cupy (GPU-accelerated array operations providing CUDA backend for spatial algorithms on NVIDIA hardware), numpy (Foundation array library providing CPU-based numerical computations and memory layout for raster data), rtx (Hardware-accelerated ray tracing operations for advanced spatial interpolation and rendering on RTX GPUs). A focused set of dependencies that keeps the build manageable.
What system dynamics does xarray-spatial have?
xarray-spatial exhibits 2 data pools (Raster Cache, Kernel Registry), 2 feedback loops, 2 control points, and 2 delays. The feedback loops handle convergence (flow accumulation) and an iteration-bounded simulation loop (erosion). These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does xarray-spatial use?
4 design patterns detected: Backend Polymorphism, Numba Acceleration, Spatial Metadata Preservation, Comprehensive Benchmarking.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.