zenml-io/zenml
ZenML 🙏: One AI Platform from Pipelines to Agents. https://zenml.io.
MLOps platform orchestrating AI/ML pipelines from classical ML to agentic workflows
Under the hood, the system uses three feedback loops, three data pools, and three control points to manage its runtime behavior.
Structural Verdict
A 10-component ML training system with 12 connections. 2,156 files analyzed. Highly interconnected — components depend on each other heavily.
How Data Flows Through the System
ML data flows through ingestion, preprocessing, training, evaluation, and deployment stages orchestrated by ZenML pipelines
- Data Ingestion — Load raw datasets from various sources including LakeFS, S3, or local files
- Preprocessing — Clean, transform, and prepare data using sklearn pipelines or custom transformers
- Model Training — Train ML models using frameworks like HuggingFace Transformers with LoRA fine-tuning
- Evaluation — Compute metrics and validate model performance using test datasets
- Deployment — Deploy trained models as web services using FastAPI runners with dashboard interfaces
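The five stages above can be sketched as a minimal, framework-agnostic pipeline. The function names, inline dataset, and threshold "model" below are illustrative stand-ins, not ZenML's actual API or the repository's code:

```python
# Minimal sketch of the five pipeline stages; all names and data are illustrative.

def ingest():
    # Stand-in for loading raw data from S3, LakeFS, or local files
    return [(0.1, 0), (0.9, 1), (0.8, 1), (0.2, 0)]

def preprocess(rows):
    # Stand-in for cleaning/transforming: clamp features into [0, 1]
    return [(min(max(x, 0.0), 1.0), y) for x, y in rows]

def train(rows):
    # Stand-in "model": a threshold at the midpoint of the class means
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def evaluate(model, rows):
    # Accuracy of the threshold model on the evaluation rows
    preds = [1 if x >= model else 0 for x, _ in rows]
    return sum(p == y for p, (_, y) in zip(preds, rows)) / len(rows)

def deploy(model):
    # Stand-in for serving: return a callable "endpoint"
    return lambda x: 1 if x >= model else 0

rows = preprocess(ingest())
model = train(rows)
accuracy = evaluate(model, rows)
endpoint = deploy(model)
```

In ZenML itself, each of these functions would be a decorated step whose outputs are versioned as artifacts between stages.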
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Artifact Store — Versioned ML artifacts, models, and datasets stored across pipeline runs
- Metadata Database — Pipeline run metadata, experiment tracking, and lineage information
- LakeFS Data Lake — Git-like versioned data lake with branch and commit semantics
Feedback Loops
- Pipeline Retry Logic (retry, balancing) — Trigger: Step failure or infrastructure error. Action: Retry failed pipeline steps with exponential backoff. Exit: Max retries reached or step succeeds.
- Model Performance Monitoring (training-loop, balancing) — Trigger: Scheduled evaluation runs. Action: Compare model metrics against thresholds. Exit: Performance degradation detected triggering retraining.
- Agent Outer Loop (recursive, reinforcing) — Trigger: Agent receives task. Action: Plan, execute tools, evaluate results, and iterate. Exit: Task completed or max iterations reached.
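The retry loop described above can be outlined in a few lines. The parameter names (`max_retries`, `base_delay`) are assumptions for this sketch, not ZenML's actual configuration keys:

```python
import time

def run_with_retries(step_fn, max_retries=3, base_delay=0.01):
    """Run a pipeline step, retrying on failure with exponential backoff.

    Exit conditions mirror the loop above: the step succeeds, or the
    retry budget is exhausted and the last error propagates.
    """
    for attempt in range(max_retries + 1):
        try:
            return step_fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Back off 1x, 2x, 4x, ... the base delay between attempts
            time.sleep(base_delay * (2 ** attempt))

# Usage: a flaky step that fails twice, then succeeds on the third attempt
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient infrastructure error")
    return "artifact"

result = run_with_retries(flaky_step)
```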
Delays & Async Processing
- Model Training (async-processing, ~minutes to hours) — Pipeline blocks until training completes with artifacts stored
- Container Build (async-processing, ~1-10 minutes) — Pipeline execution waits for custom environment containerization
- LakeFS Commit (eventual-consistency, ~seconds) — Data changes become visible after commit operation completes
Control Points
- Pipeline Configuration (env-var) — Controls: Model hyperparameters, data paths, and infrastructure settings
- Integration Registry (runtime-toggle) — Controls: Which ML framework integrations are active
- Deployment Settings (feature-flag) — Controls: Endpoint configuration, middleware, and security settings for deployed models
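An env-var control point like the Pipeline Configuration entry above typically reduces to reading variables with defaults. The variable names here (`ZENML_LR`, `ZENML_DATA_PATH`, `ZENML_MAX_RETRIES`) are hypothetical examples, not ZenML's actual keys:

```python
import os

def load_pipeline_config(env=os.environ):
    """Read pipeline settings from environment variables, falling back
    to defaults. Variable names are illustrative, not ZenML's own."""
    return {
        "learning_rate": float(env.get("ZENML_LR", "3e-4")),
        "data_path": env.get("ZENML_DATA_PATH", "data/train.parquet"),
        "max_retries": int(env.get("ZENML_MAX_RETRIES", "3")),
    }

# Usage: override one setting, inherit defaults for the rest
config = load_pipeline_config({"ZENML_LR": "1e-3"})
```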
Technology Stack
- FastAPI — Web framework for deployment services and REST APIs
- Pydantic — Data validation and settings management throughout the codebase
- SQLAlchemy — Database ORM for metadata and experiment tracking
- Click — Command-line interface framework
- Docker — Containerization for pipeline execution environments
- HuggingFace Transformers — Language model training and inference in examples
- Gradio — Web UI for model demos and interfaces
- LakeFS — Data versioning and branch management for ML datasets
Key Components
- Client (class) — Main entry point for interacting with ZenML services, managing pipelines and artifacts (src/zenml/client.py)
- BaseDeploymentAppRunner (class) — Abstract base for creating web applications that serve ML models and pipelines (src/zenml/deployers/server/app.py)
- FastAPIDeploymentAppRunner (class) — FastAPI implementation of the deployment runner with CORS, static files, and templating (src/zenml/deployers/server/fastapi/app.py)
- EndpointAdapter (class) — Converts framework-agnostic endpoint specifications to framework-specific endpoints (src/zenml/deployers/server/adapters.py)
- MiddlewareAdapter (class) — Adapts middleware specifications for different web frameworks (src/zenml/deployers/server/adapters.py)
- LakeFSRef (class) — Lightweight JSON-serializable pointer to datasets stored in LakeFS for data versioning (examples/lakefs_data_versioning/utils/lakefs_ref.py)
- lakefs_utils (module) — Helper functions for LakeFS interaction via SDK and S3-compatible gateway (examples/lakefs_data_versioning/utils/lakefs_utils.py)
- compute_metrics (function) — Computes accuracy metrics for NLP model evaluation using HuggingFace datasets (examples/e2e_nlp/utils/misc.py)
- sentiment_analysis (cli-command) — Launches Gradio interface for text classification sentiment analysis (examples/e2e_nlp/gradio/app.py)
- load_base_model (function) — Loads and configures base language models for fine-tuning with quantization options (examples/llm_finetuning/utils/loaders.py)
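A JSON-serializable pointer like LakeFSRef can be sketched as a small frozen dataclass. The field names below (`repository`, `ref`, `path`) are assumptions about what such a pointer would carry, not the example's actual schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LakeFSRef:
    """Lightweight pointer to a dataset in LakeFS (illustrative sketch).

    Fields are assumed: a repository, a branch name or commit ID,
    and an object path within that ref.
    """
    repository: str
    ref: str   # branch name or commit ID
    path: str  # object key, e.g. a parquet file

    def to_json(self) -> str:
        # Serialize to a compact JSON string for storage in run metadata
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "LakeFSRef":
        # Reconstruct the pointer from its JSON form
        return cls(**json.loads(raw))

# Usage: round-trip a pointer through JSON
ref = LakeFSRef("ml-data", "main", "datasets/train.parquet")
restored = LakeFSRef.from_json(ref.to_json())
```

Keeping the pointer tiny and immutable means pipelines can pass dataset references between steps without copying TB-scale data.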
Sub-Modules
- src/zenml/deployers/server — Web application framework for deploying ML models and pipelines as HTTP services
- examples/lakefs_data_versioning — Data versioning system using LakeFS for managing TB-scale datasets with git-like semantics
- examples/llm_finetuning — Complete framework for fine-tuning language models with LoRA, quantization, and monitoring
- examples/e2e_nlp — End-to-end NLP workflow with training, evaluation, and Gradio deployment
Configuration
pull_request_cloudbuild.yaml (yaml)
- steps (array of 5 build steps), timeout: 3600s, availableSecrets.secretManager (array of 2 secrets)
release-cloudbuild-nightly.yaml (yaml)
- steps (array of 5 build steps), timeout: 3600s, availableSecrets.secretManager (array of 2 secrets)
release-cloudbuild-preparation.yaml (yaml)
- steps (array of 3 build steps), timeout: 3600s, availableSecrets.secretManager (array of 2 secrets)
release-cloudbuild.yaml (yaml)
- steps (array of 8 build steps), timeout: 3600s, availableSecrets.secretManager (array of 2 secrets)
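The step and secret contents were lost in the export, but a Google Cloud Build file with these top-level keys generally follows the standard shape. The step image, args, and secret names below are placeholders, not the repository's actual values:

```yaml
# Illustrative Cloud Build layout; step contents and secret names are placeholders.
steps:
  - name: gcr.io/cloud-builders/docker   # one entry per build step
    args: ["build", "-t", "example-image", "."]
timeout: 3600s
availableSecrets:
  secretManager:
    - versionName: projects/example-project/secrets/example-secret/versions/latest
      env: EXAMPLE_SECRET
```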
Science Pipeline
- Load NLP dataset — datasets.load_dataset then tokenization (examples/e2e_nlp/steps/)
- Model training — HuggingFace Trainer with LoRA fine-tuning [(batch_size, sequence_length) → (batch_size, num_classes)] (examples/llm_finetuning/utils/loaders.py)
- Metric computation — argmax on logits then accuracy calculation [(batch_size, num_classes) → scalar] (examples/e2e_nlp/utils/misc.py)
- LakeFS data read — S3 gateway read then pandas parquet parse (examples/lakefs_data_versioning/utils/lakefs_utils.py)
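The metric-computation step (argmax over logits, then accuracy) reduces to a few lines. This pure-Python sketch over nested lists stands in for the HuggingFace-based compute_metrics in the repository:

```python
def argmax(row):
    # Index of the largest logit in one row
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """(batch_size, num_classes) logits + (batch_size,) labels -> scalar accuracy."""
    preds = [argmax(row) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Usage: a batch of 3 examples with 2 classes, all predicted correctly
logits = [[0.1, 2.3], [1.5, 0.2], [0.0, 0.9]]
labels = [1, 0, 1]
acc = accuracy(logits, labels)
```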
Assumptions & Constraints
- [warning] Assumes logits has shape (batch_size, num_classes) and labels has shape (batch_size,) but no assertion enforces this (shape)
- [critical] Assumes GPU availability when use_accelerate=True but no device validation occurs (device)
- [info] Assumes parquet format but no file type validation before pandas read_parquet call (format)
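The shape warning above could be closed with an explicit guard run before metric computation. This is a suggested check for plain nested-list inputs, not code from the repository:

```python
def check_shapes(logits, labels):
    """Raise early if logits is not (batch, num_classes) matching labels (batch,)."""
    if len(logits) != len(labels):
        raise ValueError(
            f"batch mismatch: {len(logits)} logit rows vs {len(labels)} labels"
        )
    widths = {len(row) for row in logits}
    if len(widths) != 1:
        # Ragged rows mean logits is not a rectangular (batch, num_classes) array
        raise ValueError(f"ragged logits: row widths {sorted(widths)}")

# Usage: a well-formed batch passes silently; a mismatched one raises
check_shapes([[0.1, 0.9], [0.7, 0.3]], [1, 0])
try:
    check_shapes([[0.1, 0.9]], [1, 0])
    raised = False
except ValueError:
    raised = True
```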
Frequently Asked Questions
What is zenml used for?
zenml-io/zenml is an MLOps platform orchestrating AI/ML pipelines from classical ML to agentic workflows. It is a 10-component ML training system written in Python, spanning 2,156 files. Highly interconnected — components depend on each other heavily.
How is zenml architected?
zenml is organized into 4 architecture layers: Core SDK, Server Backend, Integrations, Deployment Framework. Highly interconnected — components depend on each other heavily. This layered structure enables tight integration between components.
How does data flow through zenml?
Data moves through 5 stages: Data Ingestion → Preprocessing → Model Training → Evaluation → Deployment. ML data flows through ingestion, preprocessing, training, evaluation, and deployment stages orchestrated by ZenML pipelines. This pipeline design reflects a complex multi-stage processing system.
What technologies does zenml use?
The core stack includes FastAPI (Web framework for deployment services and REST APIs), Pydantic (Data validation and settings management throughout the codebase), SQLAlchemy (Database ORM for metadata and experiment tracking), Click (Command-line interface framework), Docker (Containerization for pipeline execution environments), HuggingFace Transformers (Language model training and inference in examples), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does zenml have?
zenml exhibits 3 data pools (Artifact Store, Metadata Database, LakeFS data lake), 3 feedback loops, 3 control points, and 3 delays. The feedback loops handle step retries, model-performance monitoring, and agent iteration. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does zenml use?
4 design patterns detected: Plugin Architecture, Pipeline as Code, Framework Abstraction, Data Versioning.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.