huggingface/datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Load, preprocess, and stream datasets for machine learning with standardized formats and efficient caching
Under the hood, the system uses 3 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 10-component library. 219 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Data enters through load_dataset() which resolves the dataset identifier to a builder class. The builder downloads raw files, processes them through _generate_examples() yielding Python dictionaries, which ArrowWriter converts to Arrow tables stored on disk. When accessing data, ArrowReader memory-maps the files and Dataset applies any transformations (map/filter) with caching. Finally, formatters convert Arrow data to the target ML framework format on-demand.
- Resolve dataset identifier to builder — load_dataset() uses dataset_module_factory to find the appropriate DatasetBuilder class by checking local files, Hugging Face Hub, or built-in packaged modules
- Download and extract source files — DatasetBuilder.download_and_prepare() uses DownloadManager to fetch remote files with checksum validation and extract archives to local cache [URLs and checksums → local file paths]
- Generate examples from raw data — Builder calls _generate_examples() method which parses source files and yields Python dictionaries with column names matching the Features schema [raw data files → Python examples]
- Convert examples to Arrow format — ArrowWriter.write() takes Python dictionaries, applies Features.encode_example() for type conversion, and writes batches to Arrow files with metadata [Python examples → Arrow files]
- Load Arrow tables for access — ArrowReader reads Arrow files using memory mapping for efficient access and creates PyArrow tables that back Dataset objects [Arrow files → Dataset]
- Apply transformations with caching — Dataset.map() and filter() apply user functions to rows or batches, using multiprocessing when specified and caching results based on function fingerprints [Dataset → Dataset]
- Format data for ML frameworks — FormattedDataset converts Arrow columns to target format (torch.Tensor, tf.Tensor, numpy arrays) on __getitem__ access using format-specific converters [Dataset → formatted examples]
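Taken together, the stages above correspond to only a few user-facing calls. A minimal sketch (the "imdb" dataset and the derived column names are placeholders; any Hub dataset follows the same path):

```python
# Minimal sketch of the user-facing pipeline described above.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")   # resolve builder, download, build Arrow cache

def add_length(batch):
    # batched map: receives a dict of columns, returns new/updated columns
    return {"text_length": [len(t) for t in batch["text"]]}

ds = ds.map(add_length, batched=True, batch_size=1000)   # cached by function fingerprint
ds = ds.filter(lambda ex: ex["text_length"] > 0)

ds = ds.with_format("torch", columns=["label", "text_length"])
print(ds[0])   # Arrow columns converted to torch tensors on access
```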
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Dataset (src/datasets/arrow_dataset.py) — Wrapper around a PyArrow Table with features: Features mapping column names to types, indices: list[int] for row ordering, and format: dict specifying the output format (torch/tf/numpy/pandas). Created by builders from raw data, transformed through map/filter operations, and formatted for consumption by ML training code.
- IterableDataset (src/datasets/iterable_dataset.py) — Streaming dataset with ex_iterable: Callable returning Iterator[dict], features: Features, shuffling: bool, and distributed settings. Created for datasets too large for memory; yields examples on demand and supports distributed processing and shuffling.
- Features (src/datasets/features/features.py) — Dict-like mapping from column names to feature types (Value, Sequence, ClassLabel, Image, Audio, etc.) with encoding/decoding methods. Defined during dataset creation and used to validate and convert data throughout all transformations.
- DatasetInfo (src/datasets/info.py) — Dataclass with description: str, citation: str, features: Features, splits: dict, dataset_size: int, and download_size: int. Created during dataset building, stored alongside data files, and used for validation and user documentation.
- BuilderConfig (src/datasets/builder.py) — Configuration object with name: str, version: Version, description: str, plus builder-specific parameters. Defined by dataset builders to handle different configurations; used during dataset loading and caching.
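As a small illustration of how these structures fit together, here is a sketch that builds an in-memory Dataset with an explicit Features schema (the column names and label names are arbitrary):

```python
# Illustrative only: a tiny in-memory Dataset with an explicit Features schema.
from datasets import ClassLabel, Dataset, Features, Value

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})

ds = Dataset.from_dict(
    {"text": ["bad movie", "great movie"], "label": [0, 1]},
    features=features,
)
print(ds.features)                      # the schema travels with the data through every transform
print(ds.features["label"].int2str(1))  # 'pos' — ClassLabel maps between names and integer ids
```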
Hidden Assumptions
Assumptions this code relies on but never validates — the kind that cause silent failures when the system changes.
Memory mapping of large Arrow/Parquet files will not exhaust virtual memory space on 32-bit systems or when loading many datasets simultaneously
If this fails: On 32-bit systems or when opening hundreds of datasets, memory mapping could fail with 'Cannot allocate memory' errors, causing silent fallback to slower in-memory loading or crashes
src/datasets/arrow_reader.py:ArrowReader.read_files
When using multiprocessing (num_proc > 1), the order of processed batches matches the original dataset order after worker processes complete
If this fails: If worker processes complete out of order due to varying batch processing times, the resulting dataset could have shuffled rows despite no explicit shuffle parameter, silently corrupting ordered sequences like time series
src/datasets/arrow_dataset.py:Dataset.map
User-defined _generate_examples() methods yield dictionaries with keys exactly matching the Features schema column names
If this fails: If a custom builder yields examples with missing keys or extra keys not in Features, ArrowWriter will either crash with KeyError or silently drop data without validation
src/datasets/builder.py:DatasetBuilder._generate_examples
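For illustration, a hypothetical builder sketch (the class name, URL, and file format are made up) showing the contract: the keys yielded by _generate_examples() must line up with the Features declared in _info():

```python
# Hypothetical builder sketch — placeholder URL and format, not a real dataset.
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "id": datasets.Value("int64"),
                "text": datasets.Value("string"),
            })
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download_and_extract("https://example.com/data.txt")  # placeholder URL
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # the keys ("id", "text") must match the Features schema exactly
                yield i, {"id": i, "text": line.strip()}
```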
Function fingerprints based on source code and arguments uniquely identify transformation behavior across Python versions and environments
If this fails: When the same function code produces different results due to dependency version changes or environment differences, stale cached results could be returned instead of recomputing, leading to inconsistent outputs
src/datasets/arrow_dataset.py:Dataset fingerprint caching
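If stale cache hits are suspected, the fingerprint cache can be bypassed per call or disabled globally; a short sketch of both options:

```python
# Sketch: forcing recomputation when cached transform results may be stale.
from datasets import Dataset, disable_caching, enable_caching

ds = Dataset.from_dict({"x": [1, 2, 3]})

def double(example):
    return {"x": example["x"] * 2}

# Bypass the cache for a single call:
ds2 = ds.map(double, load_from_cache_file=False)

# Or disable transform caching globally (results go to temporary files instead of the cache dir):
disable_caching()
ds3 = ds.map(double)
enable_caching()
```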
Individual Arrow record batches fit in memory (typically defaulting to 10,000 examples per batch)
If this fails: When writing datasets with very large examples (e.g., high-resolution images, long documents), a single batch could exceed available RAM causing OOM crashes during dataset creation
src/datasets/arrow_writer.py:ArrowWriter.write_batch
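One practical mitigation is lowering the write batch size when individual examples are large; a sketch assuming a hypothetical long-document column:

```python
# Sketch: shrink the write batch for large examples, trading throughput for lower peak memory.
from datasets import Dataset

ds = Dataset.from_dict({"doc": ["a very long document ..."] * 10_000})

ds = ds.map(
    lambda batch: {"n_chars": [len(d) for d in batch["doc"]]},
    batched=True,
    batch_size=100,          # examples handed to the function at once
    writer_batch_size=100,   # examples buffered before each Arrow write
)
```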
The cache directory (~/.cache/huggingface/datasets/) is writable and has sufficient disk space for dataset storage and temporary files during processing
If this fails: When disk space is exhausted during dataset creation, partial Arrow files could be written and cached, leading to corrupted datasets that appear valid but contain incomplete data
src/datasets/builder.py:DatasetBuilder.download_and_prepare
Image, Audio, and Video features receive file paths or bytes that are valid formats supported by underlying libraries (PIL, librosa, etc.)
If this fails: When features receive corrupted media files or unsupported formats, encoding could fail silently and store None values instead of raising clear errors, corrupting the dataset
src/datasets/features/features.py:Features.encode_example
User-provided filter functions are deterministic and return consistent boolean results for the same input across multiple calls
If this fails: If filter functions have side effects or randomness, cached filter results could become inconsistent with fresh evaluations, leading to different dataset contents when cache hits vs misses occur
src/datasets/arrow_dataset.py:Dataset.filter
Formatted outputs (torch, tensorflow, numpy) maintain consistent shapes within each column across all examples in a dataset
If this fails: When examples have variable-length sequences or missing values, format converters could produce ragged tensors or inconsistent shapes that crash downstream ML training code
src/datasets/arrow_dataset.py:Dataset.__getitem__
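In practice, variable-length columns come back as per-example tensors of different lengths under the torch format, and padding is usually deferred to a DataLoader collate function; a sketch of that pattern:

```python
# Sketch: ragged sequences padded at DataLoader time rather than by datasets itself.
import torch
from datasets import Dataset

ds = Dataset.from_dict({"ids": [[1, 2], [3, 4, 5], [6]]}).with_format("torch")

def collate(batch):
    seqs = [ex["ids"] for ex in batch]                               # tensors of different lengths
    return torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)   # pad to a rectangle

loader = torch.utils.data.DataLoader(ds, batch_size=3, collate_fn=collate)
print(next(iter(loader)).shape)   # torch.Size([3, 3])
```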
Worker processes can serialize and pickle the user-defined transformation function along with any closures or lambda expressions
If this fails: When map functions contain unpickleable objects (database connections, file handles, complex closures), multiprocessing silently falls back to single-process mode without warning, causing unexpected performance degradation
src/datasets/arrow_dataset.py:Dataset.map multiprocessing
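A common way to stay on the safe side is to keep map functions at module level and free of unserializable state (open file handles, database connections); a minimal sketch:

```python
# Sketch: a module-level function is straightforward for multiprocess/dill to ship to workers.
from datasets import Dataset

def tokenize(example):
    # no captured file handles, connections, or other unserializable state
    return {"tokens": example["text"].split()}

ds = Dataset.from_dict({"text": ["a b c"] * 1_000})
ds = ds.map(tokenize, num_proc=4)
```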
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Arrow file cache — Stores processed datasets as Arrow files with metadata, organized by dataset name, config, and version hash
- Download cache — Caches downloaded source files and extracted archives to avoid re-downloading
- Map/filter result cache — Stores results of map/filter operations keyed by function fingerprints to avoid recomputation
- Packaged builder modules — Built-in parsers for common formats that don't require custom scripts
Feedback Loops
- Download retry with exponential backoff (retry, balancing) — Trigger: Network errors or HTTP failures during download. Action: DownloadManager retries with increasing delays and different mirrors. Exit: Success or max retries exceeded.
- Multiprocessing worker restart (retry, balancing) — Trigger: Worker process crashes during map/filter operations. Action: Dataset.map() detects failed workers and restarts them with remaining batches. Exit: All batches processed successfully.
- Cache invalidation on fingerprint mismatch (cache-invalidation, balancing) — Trigger: Function or data fingerprint differs from cached version. Action: System discards cached results and recomputes transformation. Exit: New results cached with updated fingerprint.
Delays
- Arrow file memory mapping (warmup, ~milliseconds to seconds) — Initial dataset access incurs OS page faults as memory-mapped files are loaded
- Multiprocess worker startup (warmup, ~1-5 seconds) — First map/filter operation with num_proc > 1 spawns worker processes and serializes functions
- Hub dataset download (async-processing, ~seconds to hours) — Large datasets block until download completes, though builders support resumable downloads
- Format conversion on access (compilation, ~microseconds per item) — Each __getitem__ call converts Arrow data to target format, adding per-example overhead
Control Points
- HF_DATASETS_CACHE (env-var) — Controls: Location where all cached datasets and downloads are stored. Default: ~/.cache/huggingface/datasets
- num_proc parameter (runtime-toggle) — Controls: Whether map/filter operations use multiprocessing and how many worker processes. Default: 1 (single process)
- batch_size parameter (runtime-toggle) — Controls: Number of examples processed together in batched map operations for efficiency. Default: 1000
- streaming parameter (architecture-switch) — Controls: Whether to return Dataset (full loading) or IterableDataset (streaming) for memory efficiency. Default: False
- disable_caching() (feature-flag) — Controls: Whether transformations are cached or recomputed every time. Default: False (caching enabled)
- format parameter (runtime-toggle) — Controls: Output format for dataset access (python objects, arrow, torch, tensorflow, numpy, pandas, jax). Default: None (plain Python objects)
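A sketch combining these controls (the cache path and the "imdb" dataset are placeholders; HF_DATASETS_CACHE has to be set before datasets is imported to take effect):

```python
# Sketch of the main runtime controls, combined in one place.
import os
os.environ["HF_DATASETS_CACHE"] = "/tmp/hf_datasets_cache"   # placeholder path, must be writable

from datasets import disable_caching, load_dataset

disable_caching()   # recompute transforms every time instead of caching them

# streaming=True returns an IterableDataset that yields examples without a full download
streamed = load_dataset("imdb", split="train", streaming=True)
first = next(iter(streamed))

# non-streaming: full Arrow-backed Dataset, parallel map, torch output format
ds = load_dataset("imdb", split="train")
ds = ds.map(lambda b: {"n": [len(t) for t in b["text"]]}, batched=True, batch_size=1000, num_proc=2)
ds = ds.with_format("torch")
```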
Technology Stack
- Apache Arrow — Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility
- PyArrow — Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations
- fsspec — Unified filesystem interface supporting local, S3, GCS, and HTTP downloads with a consistent API
- multiprocess — Fork of multiprocessing that handles lambda serialization for parallel map/filter operations
- huggingface_hub — Client library for downloading datasets and models from the Hugging Face Hub with authentication and caching
- tqdm — Progress bars for downloads, transformations, and other long-running operations
Key Components
- load_dataset (factory) — Main entry point that resolves dataset identifiers to builders, handles caching, and returns Dataset objects (src/datasets/load.py)
- DatasetBuilder (orchestrator) — Abstract base class that coordinates dataset downloading, processing, and Arrow file generation with caching and resumption (src/datasets/builder.py)
- ArrowWriter (serializer) — Converts Python dictionaries to Arrow format with type validation and batch optimization for efficient storage (src/datasets/arrow_writer.py)
- ArrowReader (loader) — Reads Arrow files with support for splits, sharding, and memory mapping for efficient data access (src/datasets/arrow_reader.py)
- Dataset.map (transformer) — Applies user-defined functions to dataset rows or batches with multiprocessing, caching, and progress tracking (src/datasets/arrow_dataset.py)
- Dataset.filter (processor) — Filters dataset rows based on user-defined predicates with multiprocessing support and indices management (src/datasets/arrow_dataset.py)
- FormattedDataset (adapter) — Wraps datasets to convert Arrow data to specific ML framework formats (torch tensors, tf datasets) on access (src/datasets/formatting/)
- DownloadManager (gateway) — Handles downloading and extracting remote files with checksums, retries, and caching using fsspec backends (src/datasets/download/download_manager.py)
- dataset_module_factory (factory) — Dynamically loads dataset builder modules from local files, Hub repositories, or packaged modules (src/datasets/load.py)
- Features.encode_example (encoder) — Converts Python objects to Arrow-compatible types according to feature specifications with nested structure support (src/datasets/features/features.py)
Frequently Asked Questions
What is datasets used for?
huggingface/datasets lets you load, preprocess, and stream datasets for machine learning with standardized formats and efficient caching. It is a 10-component library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 219 files.
How is datasets architected?
datasets is organized into 5 architecture layers: Dataset Interface Layer, Builder System, Arrow Storage Layer, Format Adapters, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through datasets?
Data moves through 7 stages: Resolve dataset identifier to builder → Download and extract source files → Generate examples from raw data → Convert examples to Arrow format → Load Arrow tables for access → Apply transformations with caching → Format data for ML frameworks. In short, load_dataset() resolves the identifier to a builder, the builder downloads and parses raw files into Python dictionaries, ArrowWriter persists them as Arrow tables on disk, ArrowReader memory-maps them back, Dataset applies cached map/filter transformations, and formatters convert the data to the target ML framework on demand.
What technologies does datasets use?
The core stack includes Apache Arrow (Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility), PyArrow (Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations), fsspec (Unified filesystem interface supporting local, S3, GCS, HTTP downloads with consistent API), multiprocess (Fork of multiprocessing that handles lambda serialization for parallel map/filter operations), huggingface_hub (Client library for downloading datasets and models from HuggingFace Hub with authentication and caching), tqdm (Progress bars for downloads, transformations, and other long-running operations). A focused set of dependencies that keeps the build manageable.
What system dynamics does datasets have?
datasets exhibits 4 data pools (including the Arrow file cache and download cache), 3 feedback loops, 6 control points, and 4 delays. The feedback loops handle download retries, worker restarts, and cache invalidation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does datasets use?
5 design patterns detected: Builder Pattern for Dataset Loading, Lazy Evaluation with Memory Mapping, Content-Based Caching, Format Adapter Strategy, Streaming Iterator Protocol.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.