ray-project/ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Distributed AI compute engine with runtime and ML libraries
Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
Structural Verdict
A 10-component ML training system with 4 connections between components. 7,035 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
Ray processes distributed workloads through task graphs and actor systems, with data flowing from user code through the scheduler to workers and back via the object store.
- Submit Task/Actor — User submits remote functions or creates actors via Python API
- Schedule Work — GCS and Raylet schedule tasks to available worker nodes
- Execute on Workers — CoreWorker processes execute tasks and store results in ObjectStore
- Return Results — Object references returned to caller, data retrieved on-demand from ObjectStore
System Behavior
How the system actually operates at runtime — where data accumulates, which loops run, what waits, and what controls what.
Data Pools
- ObjectStore — Distributed shared memory storing Ray task results and large objects
- GCS — Global control store maintaining cluster metadata, job state, and actor registry
- Dashboard cache — Frontend caching layer for cluster metrics and status information
Feedback Loops
- Autoscaling (auto-scale, balancing) — Trigger: Resource utilization metrics from workers. Action: Add/remove worker nodes based on demand. Exit: Target resource utilization achieved.
- Actor Reconstruction (retry, balancing) — Trigger: Actor failure detection. Action: Recreate failed actors and restore state. Exit: Actor successfully reconstructed or max retries exceeded.
- Serve Replica Scaling (auto-scale, balancing) — Trigger: Request queue length and response times. Action: Scale deployment replicas up/down. Exit: Target ongoing requests threshold met.
Delays & Async Processing
- Lazy Dataset Evaluation (async-processing, ~Until .take() or .consume() called) — Computation deferred until results actually needed
- Object Store Eviction (cache-ttl, ~Based on LRU policy and memory pressure) — Objects may need reconstruction if evicted
- Health Check Intervals (polling, ~Configurable, typically 5-30 seconds) — Failure detection latency affects recovery times
Control Points
- RAY_ADDRESS (env-var) — Controls: Which Ray cluster to connect to. Default: null
- num_replicas (runtime-toggle) — Controls: Number of deployment replicas for serving. Default: null
- max_ongoing_requests (threshold) — Controls: Request batching and backpressure. Default: null
- object_store_memory (env-var) — Controls: Shared memory allocation for object store. Default: null
Package Structure
This monorepo contains 2 packages:
- python/ray — Main Ray Python package containing the core runtime, dashboard, and all AI libraries (Serve, Train, Tune, RLlib, Data).
- release — Release testing infrastructure with benchmarks, integration tests, and example applications for validating Ray deployments.
Technology Stack
- C++ — Core distributed runtime implementation
- Python — Primary user API and library implementations
- React — Dashboard web interface
- gRPC — Inter-service communication
- Plasma — Shared memory object store
- PyTorch — Deep learning integration
- Bazel — Build system
- TypeScript — Dashboard frontend
Key Components
- ray.init (function) — Initializes Ray runtime and connects to cluster (python/ray/_private/ray_init.py)
- ServeDeployment (class) — Manages lifecycle and scaling of individual model deployments (python/ray/serve/_private/deployment_state.py)
- Algorithm (class) — Base class for all reinforcement learning algorithms (rllib/algorithms/algorithm.py)
- Dataset (class) — Distributed data processing with lazy evaluation and streaming (python/ray/data/dataset.py)
- Tuner (class) — High-level API for hyperparameter tuning experiments (python/ray/tune/tuner.py)
- StatusChip (component) — React component displaying color-coded status indicators across dashboard (python/ray/dashboard/client/src/components/StatusChip.tsx)
- ReleaseTest (class) — Framework for defining and running distributed release tests (release/ray_release/test.py)
- CoreWorker (class) — C++ worker process managing task execution and object storage (src/ray/core_worker/core_worker.cc)
- ObjectStore (class) — Distributed shared memory system for Ray objects (src/ray/object_manager/plasma/store.cc)
- GcsServer (class) — Global control service managing cluster metadata and coordination (src/ray/gcs/gcs_server/gcs_server.cc)
Sub-Modules
- RLlib — Reinforcement learning library with algorithms, environments, and training infrastructure
- Serve — Model serving framework for ML inference with autoscaling and deployment management
- Data — Distributed data processing engine for ML preprocessing and ETL workloads
- Tune — Hyperparameter tuning library with search algorithms and schedulers
Configuration
semgrep.yml (yaml)
rules (array) — default: two rule objects (values not rendered in this export)
ray-images.json (json)
- ray.defaults.python (string) — default: 3.10
- ray.defaults.gpu_platform (string) — default: cu12.1.1-cudnn8
- ray.defaults.architecture (string) — default: x86_64
- ray.python (array) — default: 3.10, 3.11, 3.12, 3.13
- ray.architectures (array) — default: x86_64, aarch64
- ray.exceptions (array) — default: (object; not rendered in this export)
- ray.platforms (array) — default: cpu, tpu, cu11.7.1-cudnn8, cu11.8.0-cudnn8, cu12.1.1-cudnn8, cu12.3.2-cudnn9, cu12.4.1-cudnn, cu12.5.1-cudnn, cu12.6.3-cudnn, cu12.8.1-cudnn, cu12.9.1-cudnn, cu13.0.0-cudnn
- ray-ml.defaults.python (string) — default: 3.10
- +11 more parameters
ci/ray_ci/doc/api.py (python-dataclass)
name (str), annotation_type (AnnotationType), code_type (CodeType)
ci/raydepsets/workspace.py (python-dataclass)
build_args (Dict[str, str])
Science Pipeline
- Data Ingestion — ray.data.read_* functions load from various sources (S3, files, databases) [Variable depending on source format → Ray Dataset with inferred schema] (python/ray/data/datasource/)
- Data Transformation — map_batches applies user functions with configurable batch sizes [(batch_size, *feature_dims) → Transformed batch maintaining batch dimension] (python/ray/data/dataset.py)
- Model Training — Distributed training across multiple workers with gradient aggregation [(batch_size, *input_dims) → Updated model parameters] (python/ray/train/trainer.py)
- Model Serving — Deploy trained models with autoscaling based on request load [HTTP request body or batch of inputs → Model predictions as HTTP response] (python/ray/serve/deployment.py)
Assumptions & Constraints
- [warning] Assumes batch functions can handle variable batch sizes but doesn't validate batch size constraints (shape)
- [critical] Assumes model input/output formats are compatible between preprocessing and serving without schema validation (format)
- [warning] Assumes observation and action spaces match between environment and policy without runtime checks (shape)
- [info] Assumes metrics are numeric types but accepts any serializable object (dtype)
Frequently Asked Questions
What is ray used for?
ray-project/ray is a distributed AI compute engine with a core runtime and ML libraries: a 10-component ML training system written in Python. Loosely coupled — components are relatively independent. The codebase contains 7,035 files.
How is ray architected?
ray is organized into 4 architecture layers: Core Runtime, Python APIs, Dashboard, Testing Infrastructure. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through ray?
Data moves through 4 stages: Submit Task/Actor → Schedule Work → Execute on Workers → Return Results. Ray processes distributed workloads through task graphs and actor systems, with data flowing from user code through the scheduler to workers and back via the object store. This pipeline design keeps the data transformation process straightforward.
What technologies does ray use?
The core stack includes C++ (Core distributed runtime implementation), Python (Primary user API and library implementations), React (Dashboard web interface), gRPC (Inter-service communication), Plasma (Shared memory object store), PyTorch (Deep learning integration), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does ray have?
ray exhibits 3 data pools (ObjectStore, GCS), 3 feedback loops, 4 control points, 3 delays. The feedback loops handle auto-scale and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does ray use?
5 design patterns detected: Actor Model, Lazy Evaluation, Handle Pattern, Plugin Architecture, Event-Driven State Management.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.