unidata/netcdf4-python
netcdf4-python: python/numpy interface to the netCDF C library
Wraps netCDF C library to read/write scientific array data with numpy integration
Under the hood, the system uses 2 feedback loops, 2 data pools, and 4 control points to manage its runtime behavior.
A 6-component library. 83 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters through the Dataset constructor, which opens netCDF files via the C library and creates Python proxy objects for the file's hierarchical structure. Variables act as lazy array proxies: reads trigger netCDF C library calls that return raw data, which is converted to numpy arrays with proper dtype mapping and missing value handling. Writes flow the opposite direction: numpy arrays get validated, optionally compressed/chunked, and passed to the C library for storage. Complex numbers get special handling through the nc_complex extension, either as compound types or dimensional encoding. The six stages below trace this cycle; a minimal usage sketch follows the list.
- File Open and Introspection — Dataset.__init__ calls nc_open from the netCDF C library, then queries the file structure using nc_inq_* functions to populate Python objects for groups, dimensions, and variables without loading actual array data [file path → Dataset] (config: format, diskless, persist +1)
- Variable Access Setup — When accessing dataset.variables['name'], a Variable proxy object is created that holds dimension info, chunking parameters, and compression settings from the netCDF file metadata [Dataset → Variable]
- Array Data Reading — Variable.__getitem__ with slice notation triggers nc_get_vars calls to the C library, followed by automatic conversion to numpy arrays with proper dtype mapping and masked array creation for missing values [Variable → numpy.ndarray] (config: auto_mask, auto_scale)
- Array Data Writing — Variable.__setitem__ validates numpy array shape/dtype compatibility, optionally applies compression (zlib, quantization), then calls nc_put_vars to write data through the C library [numpy.ndarray] (config: zlib, complevel, shuffle +1)
- Complex Number Encoding — When auto_complex=True, the nc_complex extension detects complex arrays and either encodes them as compound types with r/i fields or as real arrays with an extra 'complex' dimension for storage compatibility [numpy.ndarray → CompoundType] (config: auto_complex)
- File Close and Sync — Dataset.close() or context manager exit calls nc_close to flush pending writes and release the file handle, ensuring data integrity [Dataset]
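To make these stages concrete, here is a minimal end-to-end sketch using the public netCDF4 API (file and variable names are illustrative):

```python
import numpy as np
from netCDF4 import Dataset

# File Open: nc_create/nc_open runs under the hood and the file's
# hierarchical structure is mirrored as Python proxy objects.
with Dataset("example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("time", None)   # unlimited dimension
    ds.createDimension("lat", 73)
    # Variable Access Setup: the returned Variable is a lazy proxy
    # holding dimension, chunking, and compression metadata.
    v = ds.createVariable("temp", "f4", ("time", "lat"),
                          zlib=True, complevel=4, shuffle=True)
    v.units = "K"
    # Array Data Writing: the numpy array is validated, compressed,
    # and handed to the C library for storage.
    v[0, :] = np.random.uniform(250, 300, size=73).astype("f4")

# Array Data Reading: slicing triggers C library reads and returns a
# (possibly masked) numpy array with the mapped dtype.
with Dataset("example.nc", "r") as ds:
    temp = ds["temp"][0, :]
    print(temp.dtype, temp.shape)      # float32 (73,)
# File Close and Sync: the context manager calls close(), flushing
# pending writes and releasing the file handle.
```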
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Dataset (src/netCDF4) — Python object wrapping a netCDF file handle with attributes: groups (dict), dimensions (dict), variables (dict), plus netCDF global attributes. Created when opening a netCDF file, provides hierarchical access to all file contents, closed to flush writes and release the file handle.
- Variable (src/netCDF4) — Array-like object with numpy array data plus netCDF metadata: dimensions (tuple), dtype, attributes (dict), chunking/compression parameters. Defined with dimensions and a data type, accepts numpy array assignments, returns numpy arrays on read.
- numpy.ndarray (numpy) — N-dimensional array with dtype (float32/64, int32/64, etc.), shape tuple, and optional mask for missing values. Returned from Variable reads, passed to Variable writes, automatically converted between netCDF and numpy data types.
- CompoundType (src/netCDF4) — Structured data type with named fields, each having a numpy dtype; used for complex numbers as {r: float64, i: float64}. Created to define custom data structures, used in Variable creation, automatically converted to/from numpy structured arrays (see the sketch after this list).
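As a quick illustration of the CompoundType contract, this sketch registers the {r, i} structure and writes a structured numpy array through it (file and type names are illustrative):

```python
import numpy as np
from netCDF4 import Dataset

complex128 = np.dtype([("r", np.float64), ("i", np.float64)])

with Dataset("cmplx.nc", "w") as ds:
    # Register the structured dtype with the file; reads and writes
    # then convert between numpy structured arrays and the compound type.
    ctype = ds.createCompoundType(complex128, "complex128")
    ds.createDimension("x", 3)
    v = ds.createVariable("data", ctype, ("x",))
    buf = np.zeros(3, dtype=complex128)
    buf["r"] = [1.0, 2.0, 3.0]
    buf["i"] = [0.5, -0.5, 0.0]
    v[:] = buf
```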
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The subprocess.run() call to nc-config or pkg-config will always succeed and produce valid stdout - no timeout, permission checks, or command existence validation
If this fails: Build process crashes with subprocess.CalledProcessError or AttributeError on flags.stdout if nc-config is missing, corrupted, or hangs indefinitely
_build/utils.py:get_config_flags
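A hardened variant of that call might look like the sketch below; get_config_flags_safe is hypothetical and not the project's code:

```python
import shutil
import subprocess

def get_config_flags_safe(tool: str = "nc-config", args=("--cflags",)) -> str:
    # Hypothetical hardening: verify the tool exists, bound its runtime,
    # and fail loudly instead of assuming valid stdout.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH")
    result = subprocess.run(
        [tool, *args], capture_output=True, text=True,
        timeout=30, check=True,
    )
    return result.stdout.strip()
```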
The presence of string 'nc_inq_compound' anywhere in netcdf.h indicates a valid netCDF4 installation - but this could match in comments, string literals, or documentation sections that don't represent actual API availability
If this fails: Build system incorrectly detects netCDF4 support when only netCDF3 is available, leading to link-time failures or runtime crashes when netCDF4-specific functions are called
_build/utils.py:is_netcdf4_include_dir
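The check described above boils down to a substring test, roughly like this simplified sketch (not the project's exact code):

```python
from pathlib import Path

def is_netcdf4_include_dir_naive(inc_dir: str) -> bool:
    # A bare substring search matches comments and documentation text
    # too, not just an actual nc_inq_compound declaration.
    header = Path(inc_dir) / "netcdf.h"
    return "nc_inq_compound" in header.read_text(encoding="utf-8")
```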
MPI.COMM_WORLD and MPI.Info() objects remain valid throughout the Dataset lifetime and all MPI processes can simultaneously open the same file path without filesystem conflicts
If this fails: Parallel writes fail with EACCES or deadlock if filesystem doesn't support concurrent access, or segfault if MPI communicator becomes invalid during file operations
examples/mpi_example.py:Dataset constructor
All MPI processes call set_collective(True) in the same order and before any collective write operations - no synchronization barrier enforces this ordering
If this fails: Collective I/O operations deadlock or produce corrupted data if processes don't coordinate their collective mode transitions, especially with processes joining at different times
examples/mpi_example.py:v.set_collective
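The pattern in question, condensed from the shape of examples/mpi_example.py (a sketch assuming an MPI-enabled netCDF build, mpi4py installed, and 4 ranks, e.g. mpirun -np 4):

```python
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

rank = MPI.COMM_WORLD.rank

# All ranks open the same file; communicator validity and concurrent
# filesystem access are assumed, never checked.
nc = Dataset("parallel.nc", "w", parallel=True,
             comm=MPI.COMM_WORLD, info=MPI.Info())
nc.createDimension("dim", 4)
v = nc.createVariable("var", np.int64, ("dim",))
# Every rank must switch to collective mode in the same order before
# collective writes; no barrier enforces this.
v.set_collective(True)
v[rank] = rank
nc.close()
```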
NC_MAX_VAR_DIMS (currently 1024) static array size is sufficient for all netCDF variable dimensionalities that will ever be encountered
If this fails: Buffer overflow and memory corruption when processing variables with more dimensions than NC_MAX_VAR_DIMS, potentially allowing arbitrary code execution
external/nc_complex/src/nc_complex.c:coord_one
Complex number detection relies on dimension names exactly matching '_pfnc_complex', 'complex', or 'ri' - case sensitive string comparison with no normalization or fuzzy matching
If this fails: Complex arrays with dimensions named 'Complex', 'COMPLEX', or 'real_imag' get treated as regular float arrays, silently losing complex number semantics and producing wrong mathematical results
external/nc_complex/src/nc_complex.c:known_dim_names
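For contrast, the supported path relies on the auto_complex flag rather than user-chosen dimension names; a minimal sketch (requires a netCDF4-python build with nc_complex support):

```python
import numpy as np
from netCDF4 import Dataset

data = np.array([1 + 2j, 3 - 1j, 0.5j], dtype=np.complex128)

with Dataset("complex.nc", "w", auto_complex=True) as ds:
    ds.createDimension("x", len(data))
    # Stored either as a compound {r, i} type or as a real array with
    # an extra 'complex' dimension, depending on the file format.
    v = ds.createVariable("data", np.complex128, ("x",))
    v[:] = data
```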
All three build functions (wheel, sdist, editable) have identical dependency requirements and netcdf4_has_parallel_support() returns consistent results across multiple calls during the same build
If this fails: Inconsistent builds where some build artifacts include mpi4py dependency while others don't, causing import errors when parallel-enabled wheels are installed in environments without mpi4py
_build/backend.py:get_requires_for_build_*
All files opened with utf-8 encoding are actually UTF-8 encoded - no BOM detection, encoding validation, or fallback for files that might be ASCII, Latin-1, or other encodings
If this fails: UnicodeDecodeError when processing netcdf.h files or setup.cfg files that contain non-UTF-8 characters, breaking the build process on systems with different default encodings
_build/utils.py:OPEN_KWARGS
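A tolerant read with an encoding fallback would close this gap; the helper below is hypothetical hardening, not the project's code:

```python
def read_text_tolerant(path: str) -> str:
    # Try UTF-8 first, then fall back to Latin-1, which decodes any
    # byte sequence rather than raising UnicodeDecodeError.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, encoding="latin-1") as f:
            return f.read()
```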
File close/reopen sequence assumes all MPI processes complete their writes and close the file before any process attempts to reopen it - no explicit barrier synchronization
If this fails: Race condition where some processes try to read while others are still writing, leading to incomplete data reads, file corruption, or 'file locked' errors
examples/mpi_example.py:nc.close() and reopen
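The missing synchronization is a single barrier between close and reopen; a sketch assuming mpi4py:

```python
from mpi4py import MPI
from netCDF4 import Dataset

comm = MPI.COMM_WORLD

nc = Dataset("parallel.nc", "w", parallel=True,
             comm=comm, info=MPI.Info())
# ... parallel writes happen here ...
nc.close()      # each rank flushes and releases its handle

comm.Barrier()  # absent in the example: no rank may reopen early

nc = Dataset("parallel.nc", "r", parallel=True,
             comm=comm, info=MPI.Info())
```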
NC_COMPLEX_GIT_VERSION macro is always defined at compile time and contains a valid version string - no fallback version or validation of version format
If this fails: Compilation fails with undefined macro error if built outside git repository or with incomplete build configuration, making it impossible to build the extension
external/nc_complex/src/nc_complex.c:pfnc_libvers
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- netCDF file — Self-describing binary files containing hierarchical scientific data with metadata following CF conventions
- numpy array cache — Temporary storage for array data during read/write operations before conversion between numpy and netCDF formats
Feedback Loops
- parallel write coordination (polling, balancing) — Trigger: MPI collective write mode. Action: Each process polls for completion of other processes' writes before proceeding. Exit: All processes complete their portion.
- compression optimization (benchmarking, reinforcing) — Trigger: benchmark script execution. Action: Tests different compression levels and algorithms, measures performance. Exit: All parameter combinations tested.
Delays
- build-time feature detection (compilation, ~seconds) — Build process pauses to run nc-config and inspect headers to determine available netCDF features
- array chunking (batch-window, ~variable) — Large array operations get broken into chunks based on memory constraints and file chunk size
- diskless buffering (cache-ttl, ~until close) — Diskless files accumulate all changes in memory before writing to disk on close
Control Points
- parallel support detection (architecture-switch) — Controls: Whether mpi4py gets added as build dependency based on netCDF C library parallel support. Default: runtime detection
- compression level (hyperparameter) — Controls: Trade-off between file size and write/read performance. Range: 0-9; createVariable defaults to complevel=4 (see the sketch after this list)
- auto_complex (feature-flag) — Controls: Whether to automatically detect and convert complex number representations. Default: False
- diskless mode (runtime-toggle) — Controls: Whether file operations happen in memory with optional persistence to disk. Default: False
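These control points map directly onto Dataset and createVariable keywords; a brief sketch with illustrative values:

```python
from netCDF4 import Dataset

# diskless=True would keep the file in memory (persist=True to flush
# on close); here it is written straight to disk.
with Dataset("tuned.nc", "w", diskless=False) as ds:
    ds.createDimension("x", 1000)
    # zlib toggles deflate compression; complevel (0-9) trades file
    # size against speed; shuffle reorders bytes to improve ratios.
    ds.createVariable("v", "f8", ("x",),
                      zlib=True, complevel=4, shuffle=True)
```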
Technology Stack
- netCDF C library — Provides the actual file I/O, data format handling, and compression algorithms that Python code calls through Cython bindings
- numpy — Handles all array data representation and mathematical operations, providing the primary data container for scientific arrays
- Cython — Compiles Python-like code to C extensions that can directly call netCDF C library functions with minimal overhead
- mpi4py — Enables parallel I/O operations by providing Python bindings to MPI for coordinating multiple processes writing to the same file
- HDF5 — Serves as the underlying storage layer for netCDF4 format files, providing chunking, compression, and hierarchical organization
- setuptools — Manages package building and distribution, extended with a custom build backend to handle netCDF C library detection
Key Components
- Dataset (gateway, src/netCDF4) — Main entry point that opens netCDF files and provides hierarchical access to groups, dimensions, variables, and attributes with automatic resource management
- build backend (adapter, _build/backend.py) — Custom setuptools build backend that detects netCDF C library features at build time and conditionally adds the mpi4py dependency for parallel support
- config detector (resolver, _build/utils.py) — Introspects the netCDF C library installation by parsing nc-config output and checking header files to determine available features like parallel I/O
- compression benchmarks (validator, examples/bench_compress*.py) — Performance testing suite that measures read/write speeds across different compression algorithms (zlib, quantization) and compression levels
- parallel I/O handler (orchestrator, examples/mpi_example.py) — Demonstrates collective and independent parallel file access using MPI, coordinating multiple processes writing to the same netCDF file
- nc_complex (encoder, external/nc_complex/src/nc_complex.c) — Extends netCDF with complex number support by encoding them either as compound types with real/imaginary fields or as arrays with an extra dimension
Frequently Asked Questions
What is netcdf4-python used for?
netcdf4-python wraps the netCDF C library to read and write scientific array data with numpy integration. unidata/netcdf4-python is a 6-component library written in Cython; data flows through 6 distinct pipeline stages, and the codebase contains 83 files.
How is netcdf4-python architected?
netcdf4-python is organized into 4 architecture layers: Build System, Python API, Example Code, External Extensions. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through netcdf4-python?
Data moves through 6 stages: File Open and Introspection → Variable Access Setup → Array Data Reading → Array Data Writing → Complex Number Encoding → File Close and Sync. Data enters through the Dataset constructor, which opens netCDF files via the C library and creates Python proxy objects for the file's hierarchical structure. Variables act as lazy array proxies: reads trigger netCDF C library calls whose raw results are converted to numpy arrays with dtype mapping and missing-value handling, while writes flow the opposite direction, with numpy arrays validated, optionally compressed/chunked, and passed to the C library for storage, and complex numbers handled by the nc_complex extension. This pipeline design reflects a multi-stage processing system.
What technologies does netcdf4-python use?
The core stack includes netCDF C library (Provides the actual file I/O, data format handling, and compression algorithms that Python code calls through Cython bindings), numpy (Handles all array data representation and mathematical operations, providing the primary data container for scientific arrays), Cython (Compiles Python-like code to C extensions that can directly call netCDF C library functions with minimal overhead), mpi4py (Enables parallel I/O operations by providing Python bindings to MPI for coordinating multiple processes writing to the same file), HDF5 (Serves as the underlying storage layer for netCDF4 format files, providing chunking, compression, and hierarchical organization), setuptools (Manages package building and distribution, extended with custom build backend to handle netCDF C library detection). A focused set of dependencies that keeps the build manageable.
What system dynamics does netcdf4-python have?
netcdf4-python exhibits 2 data pools (netCDF file, numpy array cache), 2 feedback loops, 4 control points, 3 delays. The feedback loops handle polling and adaptive behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does netcdf4-python use?
4 design patterns detected: Proxy Pattern, Adapter Pattern, Bridge Pattern, Template Method.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.