huggingface/datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Load, preprocess, and stream datasets for machine learning with standardized formats and efficient caching
Under the hood, the system uses 3 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 10-component library. 219 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Data enters through load_dataset() which resolves the dataset identifier to a builder class. The builder downloads raw files, processes them through _generate_examples() yielding Python dictionaries, which ArrowWriter converts to Arrow tables stored on disk. When accessing data, ArrowReader memory-maps the files and Dataset applies any transformations (map/filter) with caching. Finally, formatters convert Arrow data to the target ML framework format on-demand.
- Resolve dataset identifier to builder — load_dataset() uses dataset_module_factory to find the appropriate DatasetBuilder class by checking local files, Hugging Face Hub, or built-in packaged modules
- Download and extract source files — DatasetBuilder.download_and_prepare() uses DownloadManager to fetch remote files with checksum validation and extract archives to local cache [URLs and checksums → local file paths]
- Generate examples from raw data — Builder calls _generate_examples() method which parses source files and yields Python dictionaries with column names matching the Features schema [raw data files → Python examples]
- Convert examples to Arrow format — ArrowWriter.write() takes Python dictionaries, applies Features.encode_example() for type conversion, and writes batches to Arrow files with metadata [Python examples → Arrow files]
- Load Arrow tables for access — ArrowReader reads Arrow files using memory mapping for efficient access and creates PyArrow tables that back Dataset objects [Arrow files → Dataset]
- Apply transformations with caching — Dataset.map() and filter() apply user functions to rows or batches, using multiprocessing when specified and caching results based on function fingerprints [Dataset → Dataset]
- Format data for ML frameworks — FormattedDataset converts Arrow columns to target format (torch.Tensor, tf.Tensor, numpy arrays) on __getitem__ access using format-specific converters [Dataset → formatted examples]
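Taken together, the stages above correspond to only a few user-facing calls. A minimal sketch (the "imdb" dataset and the derived column names are placeholders; any Hub dataset follows the same path):

```python
# Minimal sketch of the user-facing pipeline described above.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")   # resolve builder, download, build Arrow cache

def add_length(batch):
    # batched map: receives a dict of columns, returns new/updated columns
    return {"text_length": [len(t) for t in batch["text"]]}

ds = ds.map(add_length, batched=True, batch_size=1000)   # cached by function fingerprint
ds = ds.filter(lambda ex: ex["text_length"] > 0)

ds = ds.with_format("torch", columns=["label", "text_length"])
print(ds[0])   # Arrow columns converted to torch tensors on access
```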
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Dataset (src/datasets/arrow_dataset.py) — Wrapper around a PyArrow Table with features: Features mapping column names to types, indices: list[int] for row ordering, and format: dict specifying the output format (torch/tf/numpy/pandas). Created by builders from raw data, transformed through map/filter operations, and formatted for consumption by ML training code.
- IterableDataset (src/datasets/iterable_dataset.py) — Streaming dataset with ex_iterable: Callable returning Iterator[dict], features: Features, shuffling: bool, and distributed settings. Created for datasets too large for memory; yields examples on demand and supports distributed processing and shuffling.
- Features (src/datasets/features/features.py) — Dict-like mapping from column names to feature types (Value, Sequence, ClassLabel, Image, Audio, etc.) with encoding/decoding methods. Defined during dataset creation and used to validate and convert data throughout all transformations.
- DatasetInfo (src/datasets/info.py) — Dataclass with description: str, citation: str, features: Features, splits: dict, dataset_size: int, and download_size: int. Created during dataset building, stored alongside data files, and used for validation and user documentation.
- BuilderConfig (src/datasets/builder.py) — Configuration object with name: str, version: Version, description: str, plus builder-specific parameters. Defined by dataset builders to handle different configurations; used during dataset loading and caching.
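As a small illustration of how these structures fit together, here is a sketch that builds an in-memory Dataset with an explicit Features schema (the column names and label names are arbitrary):

```python
# Illustrative only: a tiny in-memory Dataset with an explicit Features schema.
from datasets import ClassLabel, Dataset, Features, Value

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})

ds = Dataset.from_dict(
    {"text": ["bad movie", "great movie"], "label": [0, 1]},
    features=features,
)
print(ds.features)                      # the schema travels with the data through every transform
print(ds.features["label"].int2str(1))  # 'pos' — ClassLabel maps between names and integer ids
```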
Hidden Assumptions
Assumptions this code relies on but never validates — the kind that cause silent failures when the system changes.
Memory mapping of large Arrow/Parquet files will not exhaust virtual memory space on 32-bit systems or when loading many datasets simultaneously
If this fails: On 32-bit systems or when opening hundreds of datasets, memory mapping could fail with 'Cannot allocate memory' errors, causing silent fallback to slower in-memory loading or crashes
src/datasets/arrow_reader.py:ArrowReader.read_files
When using multiprocessing (num_proc > 1), the order of processed batches matches the original dataset order after worker processes complete
If this fails: If worker processes complete out of order due to varying batch processing times, the resulting dataset could have shuffled rows despite no explicit shuffle parameter, silently corrupting ordered sequences like time series
src/datasets/arrow_dataset.py:Dataset.map
User-defined _generate_examples() methods yield dictionaries with keys exactly matching the Features schema column names
If this fails: If a custom builder yields examples with missing keys or extra keys not in Features, ArrowWriter will either crash with KeyError or silently drop data without validation
src/datasets/builder.py:DatasetBuilder._generate_examples
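For illustration, a hypothetical builder sketch (the class name, URL, and file format are made up) showing the contract: the keys yielded by _generate_examples() must line up with the Features declared in _info():

```python
# Hypothetical builder sketch — placeholder URL and format, not a real dataset.
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "id": datasets.Value("int64"),
                "text": datasets.Value("string"),
            })
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download_and_extract("https://example.com/data.txt")  # placeholder URL
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for i, line in enumerate(f):
                # the keys ("id", "text") must match the Features schema exactly
                yield i, {"id": i, "text": line.strip()}
```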
Function fingerprints based on source code and arguments uniquely identify transformation behavior across Python versions and environments
If this fails: When the same function code produces different results due to dependency version changes or environment differences, stale cached results could be returned instead of recomputing, leading to inconsistent outputs
src/datasets/arrow_dataset.py:Dataset fingerprint caching
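If stale cache hits are suspected, the fingerprint cache can be bypassed per call or disabled globally; a short sketch of both options:

```python
# Sketch: forcing recomputation when cached transform results may be stale.
from datasets import Dataset, disable_caching, enable_caching

ds = Dataset.from_dict({"x": [1, 2, 3]})

def double(example):
    return {"x": example["x"] * 2}

# Bypass the cache for a single call:
ds2 = ds.map(double, load_from_cache_file=False)

# Or disable transform caching globally (results go to temporary files instead of the cache dir):
disable_caching()
ds3 = ds.map(double)
enable_caching()
```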
Individual Arrow record batches fit in memory (typically defaulting to 10,000 examples per batch)
If this fails: When writing datasets with very large examples (e.g., high-resolution images, long documents), a single batch could exceed available RAM causing OOM crashes during dataset creation
src/datasets/arrow_writer.py:ArrowWriter.write_batch
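One practical mitigation is lowering the write batch size when individual examples are large; a sketch assuming a hypothetical long-document column:

```python
# Sketch: shrink the write batch for large examples, trading throughput for lower peak memory.
from datasets import Dataset

ds = Dataset.from_dict({"doc": ["a very long document ..."] * 10_000})

ds = ds.map(
    lambda batch: {"n_chars": [len(d) for d in batch["doc"]]},
    batched=True,
    batch_size=100,          # examples handed to the function at once
    writer_batch_size=100,   # examples buffered before each Arrow write
)
```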
The cache directory (~/.cache/huggingface/datasets/) is writable and has sufficient disk space for dataset storage and temporary files during processing
If this fails: When disk space is exhausted during dataset creation, partial Arrow files could be written and cached, leading to corrupted datasets that appear valid but contain incomplete data
src/datasets/builder.py:DatasetBuilder.download_and_prepare
Image, Audio, and Video features receive file paths or bytes that are valid formats supported by underlying libraries (PIL, librosa, etc.)
If this fails: When features receive corrupted media files or unsupported formats, encoding could fail silently and store None values instead of raising clear errors, corrupting the dataset
src/datasets/features/features.py:Features.encode_example
User-provided filter functions are deterministic and return consistent boolean results for the same input across multiple calls
If this fails: If filter functions have side effects or randomness, cached filter results could become inconsistent with fresh evaluations, leading to different dataset contents when cache hits vs misses occur
src/datasets/arrow_dataset.py:Dataset.filter
Formatted outputs (torch, tensorflow, numpy) maintain consistent shapes within each column across all examples in a dataset
If this fails: When examples have variable-length sequences or missing values, format converters could produce ragged tensors or inconsistent shapes that crash downstream ML training code
src/datasets/arrow_dataset.py:Dataset.__getitem__
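In practice, variable-length columns come back as per-example tensors of different lengths under the torch format, and padding is usually deferred to a DataLoader collate function; a sketch of that pattern:

```python
# Sketch: ragged sequences padded at DataLoader time rather than by datasets itself.
import torch
from datasets import Dataset

ds = Dataset.from_dict({"ids": [[1, 2], [3, 4, 5], [6]]}).with_format("torch")

def collate(batch):
    seqs = [ex["ids"] for ex in batch]                               # tensors of different lengths
    return torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)   # pad to a rectangle

loader = torch.utils.data.DataLoader(ds, batch_size=3, collate_fn=collate)
print(next(iter(loader)).shape)   # torch.Size([3, 3])
```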
Worker processes can serialize and pickle the user-defined transformation function along with any closures or lambda expressions
If this fails: When map functions contain unpickleable objects (database connections, file handles, complex closures), multiprocessing silently falls back to single-process mode without warning, causing unexpected performance degradation
src/datasets/arrow_dataset.py:Dataset.map multiprocessing
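A common way to stay on the safe side is to keep map functions at module level and free of unserializable state (open file handles, database connections); a minimal sketch:

```python
# Sketch: a module-level function is straightforward for multiprocess/dill to ship to workers.
from datasets import Dataset

def tokenize(example):
    # no captured file handles, connections, or other unserializable state
    return {"tokens": example["text"].split()}

ds = Dataset.from_dict({"text": ["a b c"] * 1_000})
ds = ds.map(tokenize, num_proc=4)
```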
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Arrow file cache — Stores processed datasets as Arrow files with metadata, organized by dataset name, config, and version hash
- Download cache — Caches downloaded source files and extracted archives to avoid re-downloading
- Map/filter result cache — Stores results of map/filter operations keyed by function fingerprints to avoid recomputation
- Packaged builder modules — Built-in parsers for common formats that don't require custom scripts
Feedback Loops
- Download retry with exponential backoff (retry, balancing) — Trigger: Network errors or HTTP failures during download. Action: DownloadManager retries with increasing delays and different mirrors. Exit: Success or max retries exceeded.
- Multiprocessing worker restart (retry, balancing) — Trigger: Worker process crashes during map/filter operations. Action: Dataset.map() detects failed workers and restarts them with remaining batches. Exit: All batches processed successfully.
- Cache invalidation on fingerprint mismatch (cache-invalidation, balancing) — Trigger: Function or data fingerprint differs from cached version. Action: System discards cached results and recomputes transformation. Exit: New results cached with updated fingerprint.
Delays
- Arrow file memory mapping (warmup, ~milliseconds to seconds) — Initial dataset access incurs OS page faults as memory-mapped files are loaded
- Multiprocess worker startup (warmup, ~1-5 seconds) — First map/filter operation with num_proc > 1 spawns worker processes and serializes functions
- Hub dataset download (async-processing, ~seconds to hours) — Large datasets block until download completes, though builders support resumable downloads
- Format conversion on access (compilation, ~microseconds per item) — Each __getitem__ call converts Arrow data to target format, adding per-example overhead
Control Points
- HF_DATASETS_CACHE (env-var) — Controls: Location where all cached datasets and downloads are stored. Default: ~/.cache/huggingface/datasets
- num_proc parameter (runtime-toggle) — Controls: Whether map/filter operations use multiprocessing and how many worker processes. Default: 1 (single process)
- batch_size parameter (runtime-toggle) — Controls: Number of examples processed together in batched map operations for efficiency. Default: 1000
- streaming parameter (architecture-switch) — Controls: Whether to return Dataset (full loading) or IterableDataset (streaming) for memory efficiency. Default: False
- disable_caching() (feature-flag) — Controls: Whether transformations are cached or recomputed every time. Default: False (caching enabled)
- format parameter (runtime-toggle) — Controls: Output format for dataset access (python objects, arrow, torch, tensorflow, numpy, pandas, jax). Default: None (plain Python objects)
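A sketch combining these controls (the cache path and the "imdb" dataset are placeholders; HF_DATASETS_CACHE has to be set before datasets is imported to take effect):

```python
# Sketch of the main runtime controls, combined in one place.
import os
os.environ["HF_DATASETS_CACHE"] = "/tmp/hf_datasets_cache"   # placeholder path, must be writable

from datasets import disable_caching, load_dataset

disable_caching()   # recompute transforms every time instead of caching them

# streaming=True returns an IterableDataset that yields examples without a full download
streamed = load_dataset("imdb", split="train", streaming=True)
first = next(iter(streamed))

# non-streaming: full Arrow-backed Dataset, parallel map, torch output format
ds = load_dataset("imdb", split="train")
ds = ds.map(lambda b: {"n": [len(t) for t in b["text"]]}, batched=True, batch_size=1000, num_proc=2)
ds = ds.with_format("torch")
```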
Technology Stack
- Apache Arrow — Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility
- PyArrow — Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations
- fsspec — Unified filesystem interface supporting local, S3, GCS, and HTTP downloads with a consistent API
- multiprocess — Fork of multiprocessing that handles lambda serialization for parallel map/filter operations
- huggingface_hub — Client library for downloading datasets and models from the Hugging Face Hub with authentication and caching
- tqdm — Progress bars for downloads, transformations, and other long-running operations
Key Components
- load_dataset (factory) — Main entry point that resolves dataset identifiers to builders, handles caching, and returns Dataset objects (src/datasets/load.py)
- DatasetBuilder (orchestrator) — Abstract base class that coordinates dataset downloading, processing, and Arrow file generation with caching and resumption (src/datasets/builder.py)
- ArrowWriter (serializer) — Converts Python dictionaries to Arrow format with type validation and batch optimization for efficient storage (src/datasets/arrow_writer.py)
- ArrowReader (loader) — Reads Arrow files with support for splits, sharding, and memory mapping for efficient data access (src/datasets/arrow_reader.py)
- Dataset.map (transformer) — Applies user-defined functions to dataset rows or batches with multiprocessing, caching, and progress tracking (src/datasets/arrow_dataset.py)
- Dataset.filter (processor) — Filters dataset rows based on user-defined predicates with multiprocessing support and indices management (src/datasets/arrow_dataset.py)
- FormattedDataset (adapter) — Wraps datasets to convert Arrow data to specific ML framework formats (torch tensors, tf datasets) on access (src/datasets/formatting/)
- DownloadManager (gateway) — Handles downloading and extracting remote files with checksums, retries, and caching using fsspec backends (src/datasets/download/download_manager.py)
- dataset_module_factory (factory) — Dynamically loads dataset builder modules from local files, Hub repositories, or packaged modules (src/datasets/load.py)
- Features.encode_example (encoder) — Converts Python objects to Arrow-compatible types according to feature specifications with nested structure support (src/datasets/features/features.py)
Frequently Asked Questions
What is datasets used for?
huggingface/datasets lets you load, preprocess, and stream datasets for machine learning with standardized formats and efficient caching. It is a 10-component library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 219 files.
How is datasets architected?
datasets is organized into 5 architecture layers: Dataset Interface Layer, Builder System, Arrow Storage Layer, Format Adapters, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through datasets?
Data moves through 7 stages: Resolve dataset identifier to builder → Download and extract source files → Generate examples from raw data → Convert examples to Arrow format → Load Arrow tables for access → Apply transformations with caching → Format data for ML frameworks. In short, load_dataset() resolves the identifier to a builder, the builder downloads and parses raw files into Python dictionaries, ArrowWriter persists them as Arrow tables on disk, ArrowReader memory-maps them back, Dataset applies cached map/filter transformations, and formatters convert the data to the target ML framework on demand.
What technologies does datasets use?
The core stack includes Apache Arrow (Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility), PyArrow (Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations), fsspec (Unified filesystem interface supporting local, S3, GCS, HTTP downloads with consistent API), multiprocess (Fork of multiprocessing that handles lambda serialization for parallel map/filter operations), huggingface_hub (Client library for downloading datasets and models from HuggingFace Hub with authentication and caching), tqdm (Progress bars for downloads, transformations, and other long-running operations). A focused set of dependencies that keeps the build manageable.
What system dynamics does datasets have?
datasets exhibits 4 data pools (including the Arrow file cache and download cache), 3 feedback loops, 6 control points, and 4 delays. The feedback loops handle download retries, worker restarts, and cache invalidation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does datasets use?
5 design patterns detected: Builder Pattern for Dataset Loading, Lazy Evaluation with Memory Mapping, Content-Based Caching, Format Adapter Strategy, Streaming Iterator Protocol.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.