apache/superset
Apache Superset is a Data Visualization and Data Exploration Platform
Transforms database query results into interactive charts, dashboards, and reports via a web interface
Data flows through Superset starting with database connections that provide access to source tables. Users explore data through the frontend interface, building charts by selecting datasets, metrics, and visualizations. The Explore interface sends form data to the backend, which transforms it into SQL queries, executes them against databases, caches results, and returns formatted data for visualization. Charts can be organized into dashboards with cross-filtering capabilities, and the system maintains user permissions and audit logs throughout.
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 10-component fullstack. 5546 files analyzed. Data flows through 8 distinct pipeline stages.
How Data Flows Through the System
- Connect to data source — DatabaseDAO validates connection parameters, tests connectivity using database-specific engine specs (like PostgresEngineSpec), encrypts credentials, and stores Database model with sqlalchemy_uri and configuration [Database connection parameters → Database]
- Sync dataset metadata — DatasetDAO queries information_schema tables through SQLAlchemy, creates Dataset models with column definitions and data types, auto-generates basic metrics like COUNT(*) [Database → Dataset]
- Build chart configuration — Explore UI components collect user selections (viz_type, metrics, groupby, filters) into FormData object, validates against chart type requirements, sends to ChartRestApi via POST /api/v1/chart [Dataset → FormData]
- Transform to query context — ChartRestApi converts FormData into QueryContext with datasource reference and query specifications, applies security filters from SecurityManager, validates user permissions [FormData → QueryContext]
- Generate and execute SQL — QueryContextProcessor builds SQL queries from QueryContext using dataset table/column metadata, applies row-level security filters, executes via database connection pool, handles query timeouts [QueryContext → Query results]
- Cache and format results — CacheManager stores query results in Redis with generated cache keys, QueryContextProcessor formats data for visualization (timestamps, numbers, nulls), returns JSON response to frontend [Query results → Formatted chart data]
- Render visualization — Frontend chart components (in superset-frontend/src/visualizations/) receive data and form_data, apply client-side transformations, render using visualization libraries like D3, Plotly [Formatted chart data → Visual chart]
- Compose dashboard — Dashboard layout engine positions charts in grid system, DashboardFilterStateProcessor coordinates cross-filtering between charts, maintains filter state in URL parameters [Chart → Dashboard]
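The middle stages of this pipeline can be pictured as plain functions. The helper names and structures below are simplified illustrations, not Superset's actual code; the real QueryContextProcessor in superset/query_context.py also handles security filters, caching, and result formatting.

```python
def form_data_to_query_context(form_data: dict) -> dict:
    """Wrap chart form_data in a query-context-like structure."""
    return {
        "datasource": form_data["datasource"],
        "queries": [{
            "metrics": form_data.get("metrics", []),
            "groupby": form_data.get("groupby", []),
            "filters": form_data.get("filters", []),
        }],
    }


def query_context_to_sql(query_context: dict, table: str) -> str:
    """Render one query spec as a simple SELECT (illustrative only)."""
    query = query_context["queries"][0]
    columns = query["groupby"] + query["metrics"]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if query["groupby"]:
        sql += f" GROUP BY {', '.join(query['groupby'])}"
    return sql


form_data = {"datasource": "5__table", "metrics": ["COUNT(*)"], "groupby": ["country"]}
qc = form_data_to_query_context(form_data)
print(query_context_to_sql(qc, "birth_names"))
# SELECT country, COUNT(*) FROM birth_names GROUP BY country
```

The point of the intermediate QueryContext shape is that the frontend never ships SQL; it ships a declarative request that the backend validates and compiles.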
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Database (superset/models/core.py) — SQLAlchemy model with database_name: str, sqlalchemy_uri: str, encrypted_extra: JSON config, engine-specific parameters, and connection pooling settings. Created via UI/API with connection parameters, tested for connectivity, stores encrypted credentials, and is used by the query execution engine.
- Dataset (superset/models/core.py) — SQLAlchemy model with table_name: str, database_id: FK, columns: relationship to Column objects, metrics: list of calculated measures, and filter configurations. Synced from database tables or created manually, defines the available columns and metrics, and serves as the data source for visualizations.
- Chart (superset/models/slice.py) — SQLAlchemy model with slice_name: str, viz_type: str, params: JSON form_data configuration, datasource_id: FK, and query_context for data fetching. Built in the Explore interface with a form_data configuration, saves visualization state, is embedded in dashboards, and generates SQL queries for rendering.
- Dashboard (superset/models/dashboard.py) — SQLAlchemy model with dashboard_title: str, position_json: layout configuration, slices: M2M relationship to charts, and roles/owners for access control. Created by arranging charts in a grid layout, applies cross-filtering between charts, and manages user permissions and sharing settings.
- QueryContext (superset/query_context.py) — Dict with datasource: Dataset reference and queries: list of Query objects containing metrics, groupby, filters, and ordering; represents a complete chart data request. Built from chart form_data, validated and transformed into SQL queries; results are cached with cache keys and returned as JSON to the frontend.
- FormData (superset-frontend/src/explore/types.ts) — TypeScript interface with viz_type: string, datasource: string, metrics: QueryFormMetric[], groupby: QueryFormColumn[], and filters: adhoc and simple filter arrays. Built incrementally in Explore UI controls, sent to the backend via the chart API, persisted as Chart.params, and drives query generation.
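As a rough sketch of these contracts, assuming the field names listed above, the models can be pictured as plain dataclasses. The real code uses SQLAlchemy models (and a TypeScript interface for FormData); this is only a minimal shape check.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    database_name: str
    sqlalchemy_uri: str
    encrypted_extra: dict = field(default_factory=dict)


@dataclass
class Dataset:
    table_name: str
    database_id: int
    columns: list = field(default_factory=list)
    metrics: list = field(default_factory=list)


@dataclass
class Chart:
    slice_name: str
    viz_type: str
    datasource_id: int
    params: dict = field(default_factory=dict)


examples_db = Database("examples", "postgresql://localhost/examples")
births = Dataset("birth_names", database_id=1, columns=["ds", "name"], metrics=["count"])
chart = Chart("Births", "line", datasource_id=1, params={"metrics": ["count"]})
print(chart.viz_type)  # line
```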
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- superset-extensions-cli/src/superset_extensions_cli/cli.py:_check_npm_version — Assumes the npm command is available in PATH and returns a version in the format 'v{major}.{minor}.{patch}' when called with --version, but never validates the output format before parsing. If this fails: when npm returns an unexpected version format or is aliased to a different tool, semver.compare() crashes with a parsing error instead of a graceful failure message.
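A defensive version of this check would parse the version before comparing. The helper names below are illustrative, not the CLI's actual code.

```python
import re
import subprocess

NPM_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def parse_npm_version(raw: str):
    """Return (major, minor, patch), or None if the output is unexpected."""
    match = NPM_VERSION_RE.match(raw.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def check_npm(min_major: int = 10) -> bool:
    """Return False (rather than crash) when npm is missing, hung, or odd."""
    try:
        out = subprocess.run(["npm", "--version"], capture_output=True,
                             text=True, timeout=10).stdout
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    version = parse_npm_version(out)
    return version is not None and version[0] >= min_major


print(parse_npm_version("10.2.4"))   # (10, 2, 4)
print(parse_npm_version("not-npm"))  # None
```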
- superset-core/src/superset_core/common/daos.py:BaseDAO — Assumes the host application will inject concrete DAO implementations by replacing the abstract BaseDAO classes, but provides no validation that injected classes implement the required abstract methods. If this fails: extension code using DatasetDAO.find_all() gets an AttributeError at runtime when the host fails to inject implementations properly, breaking all extensions that depend on data access.
- superset-core/src/superset_core/common/models.py:get_session — Assumes the host application will initialize the global session variable before any model operations, but never validates that the session is configured. If this fails: get_session() returns None when the host hasn't initialized the database session, causing all database operations to fail silently or with confusing errors.
- superset-core/src/superset_core/extensions/constants.py:VERSION_PATTERN — Assumes extensions use semantic versioning with exactly three numeric components (major.minor.patch), but many real projects use four-component versions like '1.2.3.4' or pre-release identifiers like '1.0.0-beta'. If this fails: an extension with version '1.0.0-alpha' or '2.1.0.1' fails validation with a confusing regex-mismatch error instead of a helpful message about the expected version format.
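A more permissive pattern could accept pre-release and four-component versions. The regex below is a sketch following SemVer conventions, not the actual VERSION_PATTERN.

```python
import re

# Accepts major.minor.patch, an optional fourth numeric component,
# and an optional pre-release tag such as -beta or -rc.1.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?(?:-([0-9A-Za-z.-]+))?$")


def validate_version(version: str) -> bool:
    return VERSION_RE.match(version) is not None


for candidate in ["1.0.0", "1.0.0-beta", "2.1.0.1", "1.0"]:
    print(candidate, validate_version(candidate))
# 1.0.0 True / 1.0.0-beta True / 2.1.0.1 True / 1.0 False
```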
- superset-extensions-cli/src/superset_extensions_cli/cli.py:build_frontend — Assumes npm install and build commands complete within reasonable time limits, but runs subprocess.run() with no timeout parameter. If this fails: the build process can hang indefinitely when the npm registry is slow or a build script loops forever, blocking the CLI tool with no way to recover except killing the process.
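Passing a timeout to subprocess.run turns an indefinite hang into a recoverable error. A minimal sketch, assuming the build step is a plain subprocess call:

```python
import subprocess
import sys


def run_build(cmd: list, timeout_s: int = 600) -> bool:
    """Run one build step; return False instead of hanging forever."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        print(f"{cmd[0]} timed out after {timeout_s}s")
        return False
    except FileNotFoundError:
        print(f"{cmd[0]} not found on PATH")
        return False


print(run_build([sys.executable, "-c", "print('built')"]))  # True
```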
- superset-extensions-cli/src/superset_extensions_cli/cli.py:_create_manifest — Assumes extension.json contains valid JSON matching the ExtensionConfig schema, but validates only the schema and never handles JSON parsing failures. If this fails: malformed JSON in extension.json raises a JSONDecodeError before Pydantic validation runs, producing an unhelpful stack trace rather than a specific message about the JSON syntax error.
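One way to surface JSON syntax errors cleanly is to catch JSONDecodeError before schema validation. The required-keys check below is a stand-in for the real Pydantic model.

```python
import json

REQUIRED_KEYS = {"name", "version"}  # illustrative subset of the schema


def load_extension_config(text: str) -> dict:
    """Parse extension.json text, reporting JSON errors with position."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"extension.json is not valid JSON "
            f"(line {exc.lineno}, column {exc.colno}): {exc.msg}"
        ) from exc
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"extension.json missing keys: {sorted(missing)}")
    return data


print(load_extension_config('{"name": "hello-world", "version": "1.0.0"}'))
```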
- superset-extensions-cli/src/superset_extensions_cli/cli.py:build_command — Assumes build steps execute in a fixed order (frontend first, then backend, then manifest) but provides no rollback mechanism if later steps fail. If this fails: a manifest-creation failure after a successful frontend build leaves the extension in an inconsistent state, with built assets but no manifest, requiring manual cleanup or a full rebuild.
- superset-extensions-cli/src/superset_extensions_cli/cli.py:WatchHandler — Assumes file system events fire in deterministic order and that file writes are atomic, but watchdog can deliver events out of order or for partial writes. If this fails: rapid file changes during development can trigger multiple concurrent builds or attempt to read partially written files, leading to build failures or corrupted output.
- superset-extensions-cli/src/superset_extensions_cli/cli.py:create_zip_bundle — Assumes extension files fit comfortably in memory when creating the zip archive, with no size limits or streaming for large extensions. If this fails: extensions with large assets (videos, datasets, ML models) cause a MemoryError during zip creation, with no indication of size limits or alternative approaches.
- superset-core/src/superset_core/extensions/constants.py:TECHNICAL_NAME_PATTERN — Assumes technical names follow DNS-like naming conventions (lowercase, hyphens), but many developers expect underscore_separated or camelCase naming from other ecosystems. If this fails: valid Python package names like 'my_extension' or JavaScript conventions like 'myExtension' are rejected, forcing developers to rename projects and potentially break existing imports.
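The contrast is easy to demonstrate. The regex below approximates a DNS-style pattern; the actual TECHNICAL_NAME_PATTERN may differ.

```python
import re

# Lowercase words separated by single hyphens, e.g. "my-extension".
DNS_STYLE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

for name in ["my-extension", "my_extension", "myExtension"]:
    print(name, bool(DNS_STYLE.match(name)))
# my-extension True / my_extension False / myExtension False
```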
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Metadata Database — SQLAlchemy models storing all Superset configuration: databases, datasets, charts, dashboards, users, and permissions, persisted in PostgreSQL/MySQL
- Query Results Cache — Redis cache storing query execution results, keyed by a hash of the SQL query, user context, and dataset version to avoid re-executing expensive queries
- Celery Task Queue — Asynchronous task processing for reports, thumbnails, cache warming, and data imports, using Redis as broker and result backend
- Filter State Store — Temporary storage for dashboard filter states and explore form data, enabling URL sharing and cross-tab persistence
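The query-results cache keying described above can be sketched as a hash over the query and its context. Field names here are illustrative, not Superset's actual cache-key code.

```python
import hashlib
import json


def results_cache_key(sql: str, user_id: int, dataset_version: str) -> str:
    """Deterministic key: identical inputs always map to one cache entry."""
    payload = json.dumps(
        {"sql": sql, "user": user_id, "dataset": dataset_version},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"superset:results:{digest}"


key_a = results_cache_key("SELECT 1", 42, "v3")
key_b = results_cache_key("SELECT 1", 42, "v3")
print(key_a == key_b)  # True
```

Including the dataset version in the key means a schema change naturally misses the old entries, which complements the explicit invalidation loop described below.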
Feedback Loops
- Query Cache Invalidation (cache-invalidation, balancing) — Trigger: Dataset schema changes or manual cache clear. Action: CacheManager removes cached query results matching dataset patterns. Exit: All related cache entries cleared.
- Dashboard Filter Propagation (recursive, reinforcing) — Trigger: User applies filter in dashboard. Action: DashboardFilterStateProcessor updates all eligible charts, each chart re-executes with new filters, triggers cache lookups. Exit: All charts finish loading with applied filters.
- Connection Pool Recovery (circuit-breaker, balancing) — Trigger: Database connection failures exceed threshold. Action: SQLAlchemy pool marks connections as invalid, creates new connections, retries failed queries. Exit: Connection health restored or max retries reached.
- Async Task Retry (retry, balancing) — Trigger: Celery task failure (report generation, thumbnail creation). Action: Task requeued with exponential backoff delay, increments retry counter. Exit: Task succeeds or max retries exceeded.
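The Async Task Retry loop follows a standard exponential-backoff shape. This standalone sketch shows the pattern without Celery, whose real retry API differs.

```python
import time


def run_with_retry(task, max_retries: int = 3, base_delay: float = 0.01):
    """Retry a callable with exponentially growing delays between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # max retries exceeded: give up
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...


calls = {"count": 0}


def flaky_report():
    """Simulated report task that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "report generated"


print(run_with_retry(flaky_report))  # report generated
```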
Delays
- Query Execution (async-processing, ~Varies by query complexity and data size) — Frontend shows loading spinner while waiting for query results, user can navigate away
- Cache TTL Expiry (cache-ttl, ~Configurable per chart/dashboard, default 1 hour) — Cached query results expire and trigger fresh database queries on next access
- Celery Task Processing (queue-drain, ~Depends on worker capacity and queue depth) — Reports and thumbnails generated asynchronously, users notified when complete
- Database Connection Pool Warmup (warmup, ~2-5 seconds on application start) — First queries to each database may be slower while connections are established
Control Points
- FEATURE_FLAGS (feature-flag) — Controls: Enables/disables entire feature sets like dashboard filters, SQL Lab, embedded mode, async chart loading. Default: Dict of feature names to boolean values
- SQLLAB_QUERY_TIMEOUT (threshold) — Controls: Maximum execution time for ad-hoc SQL queries before termination. Default: 300 seconds
- RESULTS_BACKEND_USE_MSGPACK (serialization-mode) — Controls: Whether Celery uses msgpack vs JSON for result serialization, affects performance and data size limits. Default: True for performance
- CACHE_CONFIG (cache-strategy) — Controls: Redis connection parameters, default TTL values, cache key patterns for different data types. Default: Redis with 1 hour default TTL
- DATABASE_DIALECT_LIMITS (database-specific-config) — Controls: Per-database limits on query size, result rows, connection pooling, SQL dialect features. Default: Varies by engine spec class
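These control points map to keys in superset_config.py. The fragment below is illustrative; flag and key names should be checked against the current Superset configuration reference before use.

```python
# Illustrative superset_config.py fragment covering the control points above.
FEATURE_FLAGS = {
    "DASHBOARD_CROSS_FILTERS": True,   # cross-filtering between charts
    "EMBEDDED_SUPERSET": False,        # embedded mode off by default
}
SQLLAB_QUERY_TIMEOUT = 300             # seconds before ad-hoc queries stop
RESULTS_BACKEND_USE_MSGPACK = True     # msgpack instead of JSON for results
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 3600,     # 1 hour TTL
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": "redis://localhost:6379/0",
}
```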
Technology Stack
- Flask — Web framework providing HTTP routing, request handling, and application structure with blueprint organization
- Flask-AppBuilder — Extension providing REST API generation, user authentication, role-based permissions, and admin interface scaffolding
- SQLAlchemy — ORM handling database connections, model definitions, query generation, and connection pooling across multiple database types
- React — Frontend framework building interactive dashboard and chart interfaces with component-based architecture and state management
- Redux — Frontend state management coordinating data flow between dashboard filters, chart configurations, and API responses
- Celery — Asynchronous task processing for report generation, cache warming, thumbnail creation, and data import jobs
- Redis — Caching layer for query results and metadata, Celery task queue broker, and session storage for user state persistence
- pandas — Data processing library transforming query results, handling time series operations, and preparing data for visualization
Key Components
- SupersetAppInitializer (factory, superset/initialization/__init__.py) — Initializes the Flask application with all extensions, database connections, and security configuration, and registers blueprints; coordinates the entire application bootstrap
- ChartRestApi (gateway, superset/charts/api.py) — REST API controller that handles chart CRUD operations, executes chart queries via QueryContext, manages permissions, and returns visualization data
- QueryContextProcessor (processor, superset/query_context.py) — Transforms chart configuration into SQL queries, applies security filters, executes against databases, and formats results for frontend consumption
- ConnectorRegistry (registry, superset/connectors/__init__.py) — Maps datasource types to their corresponding model classes and provides factory methods for creating datasource instances based on type
- DatabaseDAO (store, superset/daos/database.py) — Data access object providing CRUD operations for Database models with connection testing, metadata extraction, and query validation capabilities
- CacheManager (store, superset/extensions/__init__.py) — Manages Redis-based caching for query results, metadata, and thumbnails with configurable TTL and cache key generation strategies
- SecurityManager (validator, superset/security/__init__.py) — Enforces role-based access control, row-level security filters, and database access permissions across all data operations
- SqlLabExecutor (executor, superset/sqllab/query_render.py) — Executes ad-hoc SQL queries in the SQL Lab interface with templating support, query limits, and result streaming for large datasets
- DashboardFilterStateProcessor (processor, superset/dashboards/filter_state/) — Manages cross-filtering state across dashboard charts, coordinates filter propagation, and maintains filter persistence in URLs and cache
- ThumbnailGenerator (processor, superset/thumbnails/) — Generates thumbnail images of dashboards and charts using headless browser automation with Selenium for preview and sharing
Package Structure
- superset/ — Main Superset application with Flask web server, database connectivity, chart/dashboard management, and user authentication
- superset-core/ — Core API library providing extension points and model abstractions for Superset plugins
- superset-extensions-cli/ — Command-line tool for scaffolding, building, and packaging Superset extensions
- superset-frontend/ — React-based web interface with chart builders, dashboard editors, and data exploration tools
- superset-websocket/ — WebSocket server for real-time communication between backend and frontend
Frequently Asked Questions
What is superset used for?
apache/superset transforms database query results into interactive charts, dashboards, and reports via a web interface. It is a 10-component fullstack written in TypeScript; data flows through 8 distinct pipeline stages across 5546 analyzed files.
How is superset architected?
superset is organized into 4 architecture layers: Data Access Layer, Business Logic Layer, API Layer, Presentation Layer. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through superset?
Data moves through 8 stages: Connect to data source → Sync dataset metadata → Build chart configuration → Transform to query context → Generate and execute SQL → .... Starting with database connections, the backend turns Explore form data into SQL queries, executes them, caches the results, and returns formatted data for visualization; charts compose into dashboards with cross-filtering. This pipeline design reflects a complex multi-stage processing system.
What technologies does superset use?
The core stack includes Flask (Web framework providing HTTP routing, request handling, and application structure with blueprint organization), Flask-AppBuilder (Extension providing REST API generation, user authentication, role-based permissions, and admin interface scaffolding), SQLAlchemy (ORM handling database connections, model definitions, query generation, and connection pooling across multiple database types), React (Frontend framework building interactive dashboard and chart interfaces with component-based architecture and state management), Redux (Frontend state management coordinating data flow between dashboard filters, chart configurations, and API responses), Celery (Asynchronous task processing for report generation, cache warming, thumbnail creation, and data import jobs), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does superset have?
superset exhibits 4 data pools (including the Metadata Database and Query Results Cache), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle cache invalidation, recursive filter propagation, connection recovery, and task retries. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does superset use?
5 design patterns detected: Command Pattern, Data Access Object (DAO), Engine Specification Pattern, Plugin Architecture, Layered Security.
How does superset compare to alternatives?
CodeSea has side-by-side architecture comparisons of superset with redash, metabase. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages above for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.