ray-project/ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

42,210 stars · Python · 8 components

Distributes AI/ML workloads across clusters with automatic scaling and resource management

Ray applications start with ray.init(), which connects to or starts a cluster. Users submit distributed functions via @ray.remote decorators, which create TaskSpecs that flow through the GCS for global coordination and through Raylets for local scheduling. Tasks execute on CoreWorkers, producing ObjectRefs that reference results in the distributed object store. For ML workloads, higher-level libraries like Tune coordinate multiple training jobs, while Serve handles model inference with automatic scaling based on request load.

Under the hood, the system uses 5 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

An 8-component repository. 7,101 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System


  1. Cluster Initialization — ray.init() establishes connection to existing cluster or starts local cluster by launching GCS server, Raylet processes, and CoreWorkers on each node [ClusterConfig → ClusterState]
  2. Task Submission — User calls @ray.remote decorated function, which serializes arguments, creates TaskSpec with resource requirements and function metadata, and submits to local Raylet [Function Arguments → TaskSpec]
  3. Global Scheduling — GCS receives TaskSpec from Raylet, consults cluster state for resource availability, and assigns task to node with sufficient resources using locality-aware placement [TaskSpec → Task Assignment]
  4. Local Execution — Target Raylet receives assignment, finds available CoreWorker, deserializes task arguments from object store, and executes user function in isolated process [Task Assignment → ObjectRef]
  5. Result Storage — CoreWorker serializes function return value, stores in local Plasma object store with unique ObjectID, and notifies GCS of completion for dependency resolution [Function Result → ObjectRef]
  6. Result Retrieval — ray.get() call resolves ObjectRef by fetching data from object store (local or remote), deserializing value, and returning to user application [ObjectRef → Python Object]

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ObjectRef python/ray/_raylet.pyx
Python wrapper around C++ ObjectID containing task_id: bytes, object_index: int, and metadata for distributed object references
Created when ray.remote() task returns a value, stored in distributed object store, resolved when ray.get() is called
TaskSpec src/ray/core_worker/task_spec.h
C++ structure with function_descriptor, args: vector<ObjectRef>, num_returns: int, resources: ResourceSet, and scheduling metadata
Built from ray.remote() call parameters, queued by scheduler, dispatched to worker process
WorkerSpec src/ray/core_worker/core_worker.h
C++ class containing worker_id: WorkerID, process_id: pid_t, assigned_resources: ResourceSet, and runtime environment config
Created during cluster initialization, updated during task assignment, destroyed on worker termination
PlacementGroup python/ray/util/placement_group.py
Python dataclass with bundles: List[Dict[str, float]], strategy: PlacementStrategy (PACK/SPREAD/STRICT_PACK), and name: Optional[str]
Created via ray.util.placement_group(), reserved on cluster nodes, used to co-locate related tasks
Dataset python/ray/data/dataset.py
Ray Data abstraction over distributed blocks with schema: pyarrow.Schema, _plan: ExecutionPlan containing read/transform operations
Lazy evaluation tree built by chaining operations, executed when action (write/iterate) is called
ServeDeployment python/ray/serve/_private/deployment_state.py
Pydantic model with name: str, replica_config: ReplicaConfig (num_replicas, resources), route_prefix: str, and health status
Defined in serve.deployment decorator, deployed to cluster, scales based on traffic patterns
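
The Dataset entry above hinges on lazy evaluation: chained operations build a plan, and nothing runs until an action. A toy sketch of that plan-then-execute idea (illustrative only, not Ray Data's actual ExecutionPlan):

```python
class LazyDataset:
    """Toy deferred-execution pipeline mirroring the plan-then-execute idea."""

    def __init__(self, items):
        self._items = list(items)
        self._ops = []  # recorded transforms, not yet applied

    def map(self, fn):
        self._ops.append(fn)  # record only; no work happens here
        return self

    def take(self, n):
        # Action: apply the recorded plan, then materialize n rows.
        out = []
        for item in self._items:
            for op in self._ops:
                item = op(item)
            out.append(item)
            if len(out) == n:
                break
        return out


ds = LazyDataset(range(10)).map(lambda x: x + 1).map(lambda x: x * 2)
first = ds.take(3)  # the plan executes only now
```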

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment unguarded

The function modifies sys.path by inserting local directories 'thirdparty_files' and current directory at index 0, assuming these directories exist and contain required modules like 'aiohttp' and 'runtime_env_agent'

If this fails: If the expected directory structure doesn't exist or required modules are missing, the subsequent imports will fail with ModuleNotFoundError, causing the runtime environment agent to crash during startup

python/ray/_private/runtime_env/agent/main.py:import_libs
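
A hypothetical hardened variant of import_libs would validate the directory before touching sys.path; the names and structure below are illustrative:

```python
import os
import sys
import tempfile


def import_libs_guarded(base_dir: str) -> None:
    """Hypothetical hardened variant: verify the vendored directory exists
    before mutating sys.path, so a missing layout fails with a clear error."""
    path = os.path.join(base_dir, "thirdparty_files")
    if not os.path.isdir(path):
        raise FileNotFoundError(f"expected directory missing: {path}")
    if path not in sys.path:
        sys.path.insert(0, path)


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "thirdparty_files"))
    import_libs_guarded(tmp)
    guarded_ok = os.path.join(tmp, "thirdparty_files") in sys.path
```
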
warning Contract unguarded

The formatUrl function assumes URLs starting with '/' should have the leading slash removed for reverse proxy compatibility, but never validates if the resulting URL is actually valid or reachable

If this fails: Malformed URLs like '//example.com' become '/example.com' which could redirect requests to wrong endpoints or cause 404 errors, leading to silent failures in dashboard API calls

python/ray/dashboard/client/src/service/requestHandlers.ts:formatUrl
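
A stricter formatter (sketched here in Python, though the real formatUrl is TypeScript) would reject protocol-relative URLs instead of silently mangling them:

```python
def format_url(url: str) -> str:
    """Sketch: strip a single leading slash for reverse-proxy compatibility,
    but reject protocol-relative '//' URLs outright."""
    if url.startswith("//"):
        raise ValueError(f"protocol-relative URL not allowed: {url}")
    return url[1:] if url.startswith("/") else url


safe = format_url("/api/v0/jobs")  # leading slash stripped

try:
    format_url("//example.com")  # would have become "/example.com" silently
    rejected = False
except ValueError:
    rejected = True
```
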
warning Domain unguarded

Status color mappings assume every status enum is complete and matching: if a new status value is added to TaskStatus, JobStatus, or another enum but not to the color map, the lookup returns an undefined color

If this fails: New status values render without colors, appearing as blank or default-styled chips in the dashboard, making status information invisible to users

python/ray/dashboard/client/src/components/StatusChip.tsx:getColorMap
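
A defensive lookup falls back to an explicit default so unknown statuses stay visible; the status names and palette below are illustrative, not the real enum:

```python
# Illustrative palette; not the dashboard's actual color map.
STATUS_COLORS = {"RUNNING": "blue", "FINISHED": "green", "FAILED": "red"}
DEFAULT_COLOR = "grey"


def color_for(status: str) -> str:
    # .get with an explicit fallback keeps new enum values from rendering blank.
    return STATUS_COLORS.get(status, DEFAULT_COLOR)


known = color_for("FAILED")
unknown = color_for("PENDING_NODE_ASSIGNMENT")  # falls back instead of undefined
```
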
warning Ordering unguarded

The updatePage function assumes pages are updated in the correct order by finding pageIndex via findIndex, but if multiple pages have the same ID, it will always update the first match

If this fails: If duplicate page IDs exist in the hierarchy, only the first page gets updated while later pages with the same ID remain stale, leading to inconsistent breadcrumb navigation

python/ray/dashboard/client/src/pages/layout/mainNavContext.ts:updatePage
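
A guarded variant of updatePage (sketched in Python; the real function is TypeScript) would fail loudly on duplicate IDs rather than silently patching only the first match:

```python
def update_page(pages: list, page_id: str, patch: dict) -> list:
    """Sketch: update by id, but require the id to be unique instead of
    trusting findIndex to pick the right match."""
    matches = [i for i, p in enumerate(pages) if p["id"] == page_id]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one page with id {page_id!r}, found {len(matches)}"
        )
    pages[matches[0]].update(patch)
    return pages


pages = [{"id": "jobs", "title": "Jobs"}, {"id": "actors", "title": "Actors"}]
update_page(pages, "actors", {"title": "Actors (live)"})
```
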
critical Resource unguarded

Hard-coded INFERENCE_BATCH_SIZE of 900 assumes T4 GPU memory capacity, but the actual device is determined at runtime (cpu if smoke_test else cuda) without checking available GPU memory

If this fails: On GPUs with less memory than T4, inference will fail with CUDA out-of-memory errors. On CPUs, the large batch size may cause system memory exhaustion and process crashes

release/nightly_tests/dataset/image_classification_from_parquet/main.py:INFERENCE_BATCH_SIZE
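
A memory-aware alternative derives the batch size from measured free memory instead of assuming a T4. The helper and the per-item size below are illustrative, not part of the release test:

```python
def pick_batch_size(free_memory_bytes: int, bytes_per_item: int, cap: int = 900) -> int:
    """Sketch: size the batch from what actually fits; the T4-derived 900
    survives only as an upper bound, not as an assumption."""
    if bytes_per_item <= 0:
        raise ValueError("bytes_per_item must be positive")
    fits = free_memory_bytes // bytes_per_item
    return max(1, min(cap, fits))


# Example: ~8 GiB free, ~16 MiB per image batch element (illustrative numbers).
bs = pick_batch_size(8 * 1024**3, 16 * 1024**2)
```
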
critical Scale unguarded

The calculation OVERRIDE_NUM_BLOCKS = int(NUM_CONTAINERS * NUM_UNITS / IMAGES_PER_BLOCK) assumes NUM_CONTAINERS (50), NUM_UNITS (1380), and IMAGES_PER_BLOCK (11) are fixed constants that scale linearly with cluster size

If this fails: On smaller clusters, this creates too many blocks leading to task overhead and scheduling delays. On larger clusters, blocks become too large causing memory pressure and potential OOM failures

release/nightly_tests/dataset/image_embedding_from_uris/main.py:NUM_CONTAINERS
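
The fixed calculation and a worker-count-based alternative side by side; blocks_per_worker is an illustrative knob, not a Ray parameter:

```python
NUM_CONTAINERS = 50
NUM_UNITS = 1380
IMAGES_PER_BLOCK = 11

# The release test's fixed calculation, valid for one cluster size only.
override_num_blocks = int(NUM_CONTAINERS * NUM_UNITS / IMAGES_PER_BLOCK)


def blocks_for_cluster(num_workers: int, blocks_per_worker: int = 4) -> int:
    """Sketch: derive block count from the actual worker count, so small
    clusters are not flooded with tiny tasks and large ones are not starved."""
    return max(1, num_workers * blocks_per_worker)
```
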
critical Temporal unguarded

Authentication error handling dispatches AUTHENTICATION_ERROR_EVENT immediately when receiving 401/403, assuming the event listener is already registered and the authentication dialog component is ready to handle it

If this fails: If the authentication dialog hasn't been initialized yet or the event listener is not registered, authentication errors are silently ignored, leaving users unable to authenticate and stuck with failed requests

python/ray/dashboard/client/src/service/requestHandlers.ts:axiosInstance.interceptors.response
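
One defensive pattern is an event bus that buffers events dispatched before any listener registers, so an early 401/403 is replayed once the dialog mounts. This Python sketch illustrates the idea (the real code is TypeScript):

```python
class EventBus:
    """Sketch: hold events with no registered listener and replay them on
    registration, instead of dropping them silently."""

    def __init__(self):
        self._listeners = {}
        self._pending = []

    def dispatch(self, event, payload=None):
        if event in self._listeners:
            self._listeners[event](payload)
        else:
            self._pending.append((event, payload))  # buffer, don't drop

    def listen(self, event, handler):
        self._listeners[event] = handler
        # Replay anything that arrived before this listener registered.
        still_pending = []
        for ev, payload in self._pending:
            if ev == event:
                handler(payload)
            else:
                still_pending.append((ev, payload))
        self._pending = still_pending


bus = EventBus()
bus.dispatch("AUTHENTICATION_ERROR_EVENT", {"status": 401})  # no listener yet
seen = []
bus.listen("AUTHENTICATION_ERROR_EVENT", seen.append)  # replayed on registration
```
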
critical Contract unguarded

The PyArrow schema defines fixed column names and types (metadata00-18, span_text) assuming the input Parquet files always contain exactly these columns in this exact format

If this fails: If input files have different schemas, missing columns, or type mismatches, Ray Data will fail during read operations with schema validation errors, causing the entire text embedding pipeline to crash

release/nightly_tests/dataset/text_embedding/main.py:SCHEMA
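
A fail-fast column check before the pipeline starts yields a precise error instead of a mid-read crash. This sketch validates names only, not types:

```python
# The schema described above: metadata00 through metadata18, plus span_text.
EXPECTED_COLUMNS = {f"metadata{i:02d}" for i in range(19)} | {"span_text"}


def check_columns(actual_columns) -> bool:
    """Sketch: surface missing columns up front with an exact list, rather
    than letting Ray Data fail deep inside a read operation."""
    missing = EXPECTED_COLUMNS - set(actual_columns)
    if missing:
        raise ValueError(f"input is missing required columns: {sorted(missing)}")
    return True


ok = check_columns(EXPECTED_COLUMNS)

try:
    check_columns(EXPECTED_COLUMNS - {"span_text"})
    strict = False
except ValueError:
    strict = True
```
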
critical Environment weakly guarded

The deployment assumes CUDA is available and the GPU has sufficient memory (≥4GB) for the StableDiffusion model with fp16 precision, but never checks GPU availability or memory before loading

If this fails: On CPU-only machines or GPUs with insufficient memory, the model loading fails with CUDA errors or OOM, causing the entire Serve deployment to crash and preventing the service from starting

release/workspace_templates/03_serving_stable_diffusion/app.py:StableDiffusionV2.__init__
warning Domain unguarded

The constant INFERENCE_LATENCY_PER_IMAGE_S = 0.0094 is hard-coded based on T4 GPU performance measurements, assuming all inference will run on identical hardware with consistent performance

If this fails: On different GPU types or when GPU is under load, actual inference times differ significantly from this assumption, leading to incorrect capacity planning and potential timeouts in production workloads

release/nightly_tests/dataset/image_embedding_from_uris/main.py:INFERENCE_LATENCY_PER_IMAGE_S
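
The constant's downstream arithmetic is easy to make explicit; safe_timeout and its margin are illustrative, not part of the release test:

```python
INFERENCE_LATENCY_PER_IMAGE_S = 0.0094  # measured on a T4; hardware-specific


def expected_throughput(latency_s: float, num_gpus: int = 1) -> float:
    """Images per second implied by the per-image latency constant."""
    return num_gpus / latency_s


def safe_timeout(num_images: int, latency_s: float, margin: float = 3.0) -> float:
    """Sketch: budget with headroom, since other GPUs (or a loaded GPU)
    may be far slower than the T4 the constant was measured on."""
    return num_images * latency_s * margin


per_gpu = expected_throughput(INFERENCE_LATENCY_PER_IMAGE_S)
```
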

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Plasma Object Store (in-memory)
Shared memory store for large arrays and tensors with zero-copy access across processes on same node
GCS State Store (database)
Persistent cluster metadata including node membership, job status, actor registry, and placement group state
Redis Backend (cache)
Optional persistent backend for GCS that provides crash recovery and cluster state persistence across restarts
Raylet Task Queue (queue)
Per-node priority queue of pending tasks waiting for worker assignment or resource availability

Technology Stack

Apache Arrow (serialization)
Handles zero-copy serialization of columnar data structures between Python and C++ components for efficient data transfer
gRPC (framework)
Provides inter-node communication protocol for control messages between GCS, Raylets, and CoreWorkers in the distributed system
Plasma (database)
Implements shared memory object store for zero-copy data access across processes on the same node using memory-mapped files
Bazel (build)
Builds the multi-language codebase with precise dependency management between C++, Python, and Java components
React/TypeScript (framework)
Powers the web dashboard for cluster monitoring, job tracking, and distributed debugging with real-time updates
PyTorch/TensorFlow (library)
Integrates with Ray Train and RLlib for distributed deep learning training with automatic data parallelism
Kubernetes (infra)
Provides container orchestration for Ray clusters with the KubeRay operator handling node lifecycle and autoscaling
Apache Parquet (serialization)
Default storage format for Ray Data enabling efficient columnar analytics and ML feature preprocessing

Key Components

Package Structure

python (app)
Core Ray runtime engine with distributed computing primitives, dashboard, and ML libraries for training, serving, and hyperparameter tuning
release (tooling)
Testing framework and benchmark suite for validating Ray's performance and correctness across different ML workloads

Frequently Asked Questions

What is ray used for?

Distributes AI/ML workloads across clusters with automatic scaling and resource management. ray-project/ray is an 8-component repository written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 7,101 files.

How is ray architected?

ray is organized into 5 architecture layers: Core Runtime, Python API Layer, ML Libraries, Dashboard & Monitoring, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through ray?

Data moves through 6 stages: Cluster Initialization → Task Submission → Global Scheduling → Local Execution → Result Storage → .... Ray applications start with ray.init(), which connects to or starts a cluster. Users submit distributed functions via @ray.remote decorators, which create TaskSpecs that flow through the GCS for global coordination and through Raylets for local scheduling. Tasks execute on CoreWorkers, producing ObjectRefs that reference results in the distributed object store. For ML workloads, higher-level libraries like Tune coordinate multiple training jobs, while Serve handles model inference with automatic scaling based on request load.

What technologies does ray use?

The core stack includes Apache Arrow (Handles zero-copy serialization of columnar data structures between Python and C++ components for efficient data transfer), gRPC (Provides inter-node communication protocol for control messages between GCS, Raylets, and CoreWorkers in the distributed system), Plasma (Implements shared memory object store for zero-copy data access across processes on the same node using memory-mapped files), Bazel (Builds the multi-language codebase with precise dependency management between C++, Python, and Java components), React/TypeScript (Powers the web dashboard for cluster monitoring, job tracking, and distributed debugging with real-time updates), PyTorch/TensorFlow (Integrates with Ray Train and RLlib for distributed deep learning training with automatic data parallelism), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does ray have?

ray exhibits 4 data pools (including the Plasma Object Store and GCS State Store), 5 feedback loops, 5 control points, and 4 delays. The feedback loops handle auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does ray use?

5 design patterns detected: Remote Procedure Call, Actor Model, Shared Memory Object Store, Hierarchical Scheduling, Lazy Evaluation DAG.

Analyzed on April 20, 2026 by CodeSea.