huggingface/datasets

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

21,423 stars · Python · 10 components

Load, preprocess, and stream datasets for machine learning with standardized formats and efficient caching

Data enters through load_dataset() which resolves the dataset identifier to a builder class. The builder downloads raw files, processes them through _generate_examples() yielding Python dictionaries, which ArrowWriter converts to Arrow tables stored on disk. When accessing data, ArrowReader memory-maps the files and Dataset applies any transformations (map/filter) with caching. Finally, formatters convert Arrow data to the target ML framework format on-demand.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.

A 10-component library. 219 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

  1. Resolve dataset identifier to builder — load_dataset() uses dataset_module_factory to find the appropriate DatasetBuilder class by checking local files, Hugging Face Hub, or built-in packaged modules
  2. Download and extract source files — DatasetBuilder.download_and_prepare() uses DownloadManager to fetch remote files with checksum validation and extract archives to local cache [URLs and checksums → local file paths]
  3. Generate examples from raw data — Builder calls _generate_examples() method which parses source files and yields Python dictionaries with column names matching the Features schema [raw data files → Python examples]
  4. Convert examples to Arrow format — ArrowWriter.write() takes Python dictionaries, applies Features.encode_example() for type conversion, and writes batches to Parquet files with metadata [Python examples → Arrow files]
  5. Load Arrow tables for access — ArrowReader reads Parquet files using memory mapping for efficient access and creates PyArrow tables that back Dataset objects [Arrow files → Dataset]
  6. Apply transformations with caching — Dataset.map() and filter() apply user functions to rows or batches, using multiprocessing when specified and caching results based on function fingerprints [Dataset → Dataset]
  7. Format data for ML frameworks — FormattedDataset converts Arrow columns to target format (torch.Tensor, tf.Tensor, numpy arrays) on __getitem__ access using format-specific converters [Dataset → formatted examples]
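The hand-off between stages 3 and 4 can be sketched in plain Python: a generator stands in for a builder's _generate_examples(), and a batching helper mimics how an Arrow writer groups examples before flushing them to disk. This is a simplified stdlib mock of the shape of the pipeline, not the library's actual implementation.

```python
from itertools import islice

def generate_examples():
    """Stand-in for a builder's _generate_examples(): yields (key, example) pairs."""
    for i in range(25):
        yield i, {"id": i, "text": f"example {i}"}

def batched(iterable, batch_size):
    """Group examples into fixed-size batches, as an Arrow writer would before flushing."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

batches = list(batched((ex for _, ex in generate_examples()), batch_size=10))
print(len(batches))      # 3 batches: 10 + 10 + 5 examples
print(len(batches[-1]))  # 5
```

The real writer accumulates batches into PyArrow record batches and tracks schema metadata; the batching boundary shown here is the same one the Scale assumption below (batch fits in memory) hinges on.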

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Dataset src/datasets/arrow_dataset.py
Wrapper around PyArrow Table with features: Features mapping column names to types, indices: list[int] for row ordering, format: dict specifying output format (torch/tf/numpy/pandas)
Created by builders from raw data, transformed through map/filter operations, formatted for consumption by ML training code

IterableDataset src/datasets/iterable_dataset.py
Streaming dataset with ex_iterable: Callable returning Iterator[dict], features: Features, shuffling: bool, distributed settings
Created for datasets too large for memory, yields examples on-demand, supports distributed processing and shuffling

Features src/datasets/features/features.py
Dict-like mapping from column names to feature types (Value, Sequence, ClassLabel, Image, Audio, etc.) with encoding/decoding methods
Defined during dataset creation, used to validate and convert data throughout all transformations

DatasetInfo src/datasets/info.py
Dataclass with description: str, citation: str, features: Features, splits: dict, dataset_size: int, download_size: int
Created during dataset building, stored alongside data files, used for validation and user documentation

BuilderConfig src/datasets/builder.py
Configuration object with name: str, version: Version, description: str, plus builder-specific parameters
Defined by dataset builders to handle different configurations, used during dataset loading and caching
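As a rough illustration of the Features contract, the sketch below uses a plain dict as a stand-in schema and validates example keys the way Features.encode_example() validates columns. The SCHEMA mapping and encode_example helper here are hypothetical simplifications, not the library's API.

```python
# Toy schema: column name -> Python type (stand-in for Value("int64"), etc.)
SCHEMA = {"id": int, "label": str}

def encode_example(example: dict) -> dict:
    """Validate keys against the schema and coerce values to the declared types."""
    missing = SCHEMA.keys() - example.keys()
    extra = example.keys() - SCHEMA.keys()
    if missing or extra:
        raise KeyError(f"schema mismatch: missing={missing}, extra={extra}")
    return {name: typ(example[name]) for name, typ in SCHEMA.items()}

print(encode_example({"id": "7", "label": "cat"}))  # {'id': 7, 'label': 'cat'}
```

The real Features types also handle nested structures and media decoding, but the contract is the same: every example must match the declared columns before it is written to Arrow.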

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Resource unguarded

Memory mapping of large Arrow/Parquet files will not exhaust virtual memory space on 32-bit systems or when loading many datasets simultaneously

If this fails: On 32-bit systems or when opening hundreds of datasets, memory mapping could fail with 'Cannot allocate memory' errors, causing silent fallback to slower in-memory loading or crashes

src/datasets/arrow_reader.py:ArrowReader.read_files
critical Ordering unguarded

When using multiprocessing (num_proc > 1), the order of processed batches matches the original dataset order after worker processes complete

If this fails: If worker processes complete out of order due to varying batch processing times, the resulting dataset could have shuffled rows despite no explicit shuffle parameter, silently corrupting ordered sequences like time series

src/datasets/arrow_dataset.py:Dataset.map
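One way to see why ordering can hold despite out-of-order worker completion: Executor.map returns results in submission order regardless of which worker finishes first, whereas collecting completions as they arrive would not. The sketch below uses stdlib threads for a self-contained demo; it is an illustration of the ordering property, not the library's dispatch code.

```python
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):
    # Stand-in for a per-worker map over one shard of rows.
    return [row * 2 for row in shard]

shards = [[0, 1], [2, 3], [4, 5]]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map yields results in submission order, so merging shard
    # results back preserves the original row order even if workers
    # finish in a different order.
    results = list(pool.map(process_shard, shards))
merged = [row for shard in results for row in shard]
print(merged)  # [0, 2, 4, 6, 8, 10]
```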
critical Contract weakly guarded

User-defined _generate_examples() methods yield dictionaries with keys exactly matching the Features schema column names

If this fails: If a custom builder yields examples with missing keys or extra keys not in Features, ArrowWriter will either crash with KeyError or silently drop data without validation

src/datasets/builder.py:DatasetBuilder._generate_examples
critical Temporal unguarded

Function fingerprints based on source code and arguments uniquely identify transformation behavior across Python versions and environments

If this fails: When the same function code produces different results due to dependency version changes or environment differences, stale cached results could be returned instead of recomputing, leading to inconsistent outputs

src/datasets/arrow_dataset.py:Dataset fingerprint caching
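The fingerprinting idea can be mimicked in a few lines: hash the function's bytecode and arguments into a cache key. The fingerprint helper below is a hypothetical toy, not the library's implementation, and it shares the blind spot described above: it cannot see dependency versions or other environment state, so the same key can map to different behavior across environments.

```python
import hashlib

def fingerprint(func, *args, **kwargs):
    """Toy cache key: hash of function bytecode, constants, and arguments."""
    payload = (
        func.__code__.co_code
        + repr(func.__code__.co_consts).encode()
        + repr(args).encode()
        + repr(sorted(kwargs.items())).encode()
    )
    return hashlib.sha256(payload).hexdigest()[:16]

def add_prefix(example, prefix="x"):
    return {"text": prefix + example["text"]}

# Same function + same arguments -> same key; changed argument -> new key.
print(fingerprint(add_prefix, prefix="x") == fingerprint(add_prefix, prefix="x"))  # True
print(fingerprint(add_prefix, prefix="x") != fingerprint(add_prefix, prefix="y"))  # True
```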
warning Scale unguarded

Individual Arrow record batches fit in memory (typically defaulting to 10,000 examples per batch)

If this fails: When writing datasets with very large examples (e.g., high-resolution images, long documents), a single batch could exceed available RAM causing OOM crashes during dataset creation

src/datasets/arrow_writer.py:ArrowWriter.write_batch
warning Environment unguarded

The cache directory (~/.cache/huggingface/datasets/) is writable and has sufficient disk space for dataset storage and temporary files during processing

If this fails: When disk space is exhausted during dataset creation, partial Arrow files could be written and cached, leading to corrupted datasets that appear valid but contain incomplete data

src/datasets/builder.py:DatasetBuilder.download_and_prepare
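A defensive pattern against this hazard is to check free space and write through a temporary file with an atomic rename, so a crash or full disk never leaves a truncated file at the final path. The atomic_write helper below is an illustrative stdlib sketch, not code from the library.

```python
import os
import shutil
import tempfile

def atomic_write(path, data: bytes, headroom=1 << 20):
    """Write bytes to a temp file in the target directory, then rename into place.

    A full disk or crash mid-write leaves only a temp file behind,
    never a partially written file at the final path.
    """
    directory = os.path.dirname(os.path.abspath(path))
    if shutil.disk_usage(directory).free < len(data) + headroom:
        raise OSError(f"not enough free space in {directory}")
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
```

Writing the temp file in the same directory as the destination matters: os.replace is only atomic when source and target live on the same filesystem.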
warning Domain weakly guarded

Image, Audio, and Video features receive file paths or bytes that are valid formats supported by underlying libraries (PIL, librosa, etc.)

If this fails: When features receive corrupted media files or unsupported formats, encoding could fail silently and store None values instead of raising clear errors, corrupting the dataset

src/datasets/features/features.py:Features.encode_example
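A cheap guard against this failure mode is to fail fast on inputs that are clearly not a supported format, for example by checking magic numbers before storing bytes. The validate_image_bytes helper below is a hypothetical sketch that covers only PNG and JPEG signatures; real validation would decode the file.

```python
def validate_image_bytes(data: bytes) -> bytes:
    """Raise immediately on bytes that are clearly not a supported image,
    instead of letting a None value slip into the dataset."""
    if data.startswith(b"\x89PNG\r\n\x1a\n") or data.startswith(b"\xff\xd8\xff"):
        return data
    raise ValueError("unsupported or corrupted image bytes")
```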
warning Contract unguarded

User-provided filter functions are deterministic and return consistent boolean results for the same input across multiple calls

If this fails: If filter functions have side effects or randomness, cached filter results could become inconsistent with fresh evaluations, leading to different dataset contents when cache hits vs misses occur

src/datasets/arrow_dataset.py:Dataset.filter
warning Shape unguarded

Formatted outputs (torch, tensorflow, numpy) maintain consistent shapes within each column across all examples in a dataset

If this fails: When examples have variable-length sequences or missing values, format converters could produce ragged tensors or inconsistent shapes that crash downstream ML training code

src/datasets/arrow_dataset.py:Dataset.__getitem__
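A common mitigation is to pad variable-length rows into a rectangular batch before tensor conversion, so the converter sees one consistent shape. The pad_batch helper below is an illustrative sketch, not library code.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length rows to the length of the longest row in the batch."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]

print(pad_batch([[1, 2, 3], [4], [5, 6]]))  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```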
info Resource weakly guarded

Worker processes can serialize and pickle the user-defined transformation function along with any closures or lambda expressions

If this fails: When map functions contain unpickleable objects (database connections, file handles, complex closures), multiprocessing silently falls back to single-process mode without warning, causing unexpected performance degradation

src/datasets/arrow_dataset.py:Dataset.map multiprocessing
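A cheap preflight is to probe picklability before dispatching work to workers. The is_picklable helper below is hypothetical and uses the stdlib pickle, which rejects lambdas; the library depends on the dill-based multiprocess fork precisely because it can serialize lambdas and closures that stdlib pickle cannot.

```python
import pickle

def is_picklable(obj) -> bool:
    """Probe whether an object survives pickling before handing it to workers."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(is_picklable(len))          # True: builtins pickle by reference
print(is_picklable(lambda x: x))  # False with stdlib pickle (anonymous function)
```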

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Arrow file cache (file-store)
Stores processed datasets as Parquet files with metadata, organized by dataset name, config, and version hash
Download cache (file-store)
Caches downloaded source files and extracted archives to avoid re-downloading
Transformation cache (file-store)
Stores results of map/filter operations keyed by function fingerprints to avoid recomputation
Dataset registry (registry)
Built-in parsers for common formats that don't require custom scripts

Technology Stack

Apache Arrow (serialization)
Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility
PyArrow (library)
Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations
fsspec (library)
Unified filesystem interface supporting local, S3, GCS, HTTP downloads with consistent API
multiprocess (runtime)
Fork of multiprocessing that handles lambda serialization for parallel map/filter operations
huggingface_hub (library)
Client library for downloading datasets and models from HuggingFace Hub with authentication and caching
tqdm (library)
Progress bars for downloads, transformations, and other long-running operations

Frequently Asked Questions

What is datasets used for?

huggingface/datasets is a 10-component Python library for loading, preprocessing, and streaming machine learning datasets with standardized formats and efficient caching. Data flows through 7 distinct pipeline stages, and the codebase contains 219 files.

How is datasets architected?

datasets is organized into 5 architecture layers: Dataset Interface Layer, Builder System, Arrow Storage Layer, Format Adapters, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through datasets?

Data moves through 7 stages: Resolve dataset identifier to builder → Download and extract source files → Generate examples from raw data → Convert examples to Arrow format → Load Arrow tables for access → .... Data enters through load_dataset() which resolves the dataset identifier to a builder class. The builder downloads raw files, processes them through _generate_examples() yielding Python dictionaries, which ArrowWriter converts to Arrow tables stored on disk. When accessing data, ArrowReader memory-maps the files and Dataset applies any transformations (map/filter) with caching. Finally, formatters convert Arrow data to the target ML framework format on-demand. This pipeline design reflects a complex multi-stage processing system.

What technologies does datasets use?

The core stack includes Apache Arrow (Core columnar data format providing efficient storage, memory mapping, and cross-language compatibility), PyArrow (Python bindings for Arrow, used for all data operations including Parquet I/O and table manipulations), fsspec (Unified filesystem interface supporting local, S3, GCS, HTTP downloads with consistent API), multiprocess (Fork of multiprocessing that handles lambda serialization for parallel map/filter operations), huggingface_hub (Client library for downloading datasets and models from HuggingFace Hub with authentication and caching), tqdm (Progress bars for downloads, transformations, and other long-running operations). A focused set of dependencies that keeps the build manageable.

What system dynamics does datasets have?

datasets exhibits 4 data pools (Arrow file cache, Download cache), 3 feedback loops, 6 control points, and 4 delays. The feedback loops handle retry behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does datasets use?

5 design patterns detected: Builder Pattern for Dataset Loading, Lazy Evaluation with Memory Mapping, Content-Based Caching, Format Adapter Strategy, Streaming Iterator Protocol.

Analyzed on April 20, 2026 by CodeSea.