openstatushq/openstatus
🫖 Status page with uptime monitoring & API monitoring as code 🫖
Runs uptime monitoring checks, status pages, and incident notifications across global regions
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component fullstack. 1483 files analyzed. Data flows through 8 distinct pipeline stages.
How Data Flows Through the System
The monitoring pipeline starts when distributed checker applications probe monitored endpoints every few minutes from multiple global regions, measuring response times and status codes. These results flow into the central workflows service which detects status changes and creates incident records when services go down. Status changes trigger notification workflows that send alerts via email, Slack, Discord, and other channels. Public status pages query the incident database to display current service health, while the admin dashboard allows users to configure monitors and view analytics. Screenshots are captured during incidents for documentation purposes.
- Schedule monitoring checks — The cron service in the workflows app sends HTTP requests to checker applications across regions every 30s/1m/5m/10m/30m, based on each monitor's configuration, triggering distributed health checks [Monitor configurations → Check triggers] (config: monitors.periodicity, regions.enabled)
- Execute endpoint probes — JobRunner in checker apps makes HTTP requests to monitored URLs, measuring response time and status code, handling timeouts and connection errors with configurable retry logic [Check triggers → MonitorResult] (config: monitors.timeout, monitors.retry_count)
- Aggregate regional results — The checkerRoute handler in workflows service receives MonitorResult payloads from all regions and applies consensus logic to determine overall service status [MonitorResult → Status aggregation] (config: monitors.regions, monitors.degraded_after_failures)
- Detect status transitions — upsertMonitorStatus compares new status against previous status in database, identifying transitions from operational→degraded→down or recovery patterns [Status aggregation → Status change events] (config: monitors.status_threshold)
- Create and manage incidents — When failures are detected, findOpenIncident checks for existing incidents and either creates new IncidentRecord or updates resolution timestamp for recovery [Status change events → IncidentRecord]
- Trigger notifications — triggerNotifications sends alerts to configured channels (Slack, Discord, email, webhooks) when incidents are created or resolved, with customizable message templates [IncidentRecord → Notification messages] (config: notifications.channels, notifications.templates)
- Display public status — Status page apps query incident and monitor tables to render current service health, historical uptime percentages, and incident timeline for public viewing [IncidentRecord → Status page HTML] (config: status_page.theme, status_page.custom_domain)
- Capture incident screenshots — Screenshot service receives QStash webhooks when incidents occur, uses Playwright to capture full-page screenshots, and stores them in R2 bucket with incident ID [ScreenshotRequest → Screenshot URLs] (config: screenshot.enabled, storage.r2_bucket)
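Stages 4 and 5 above hinge on a single comparison: the newly aggregated status versus the previously stored one. A minimal TypeScript sketch of that hand-off follows; the helper signatures are hypothetical, and the repository's actual upsertMonitorStatus and findOpenIncident logic differs in detail.

```typescript
type MonitorStatus = "operational" | "degraded" | "down";

interface StatusChange {
  monitorId: string;
  previous: MonitorStatus;
  next: MonitorStatus;
  cronTimestamp: number; // Unix milliseconds, set by the regional checker
}

// Hypothetical helper signatures; the repository's own functions differ in detail.
declare function findOpenIncident(monitorId: string): Promise<{ id: number } | undefined>;
declare function createIncident(args: { monitorId: string; startedAt: Date }): Promise<{ id: number }>;
declare function resolveIncident(id: number, args: { resolvedAt: Date; autoResolved: boolean }): Promise<void>;
declare function triggerNotifications(args: { incidentId: number; kind: "incident" | "recovery" }): Promise<void>;

async function handleStatusChange(change: StatusChange): Promise<void> {
  if (change.previous === change.next) return; // no transition, nothing to do

  const open = await findOpenIncident(change.monitorId);

  if (change.next === "down" && !open) {
    // New failure with no open incident: create one and alert subscribers.
    const incident = await createIncident({
      monitorId: change.monitorId,
      startedAt: new Date(change.cronTimestamp),
    });
    await triggerNotifications({ incidentId: incident.id, kind: "incident" });
  } else if (change.next === "operational" && open) {
    // Recovery: close the incident and send recovery notifications.
    await resolveIncident(open.id, {
      resolvedAt: new Date(change.cronTimestamp),
      autoResolved: true,
    });
    await triggerNotifications({ incidentId: open.id, kind: "recovery" });
  }
}
```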
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- MonitorResult (apps/workflows/src/checker/index.ts) — Object with monitorId: string, statusCode: number, region: monitorRegions enum, cronTimestamp: number, status: monitorStatusSchema, latency: number, message: string. Created by checker agents during endpoint probes, processed by the workflows service to detect incidents, stored in the database for analytics.
- IncidentRecord (apps/workflows/src/checker/index.ts) — Database record with id, monitorId, resolvedAt timestamp, autoResolved boolean flag, and incident metadata. Created when monitor failures are detected, displayed on status pages during outages, resolved when monitors recover.
- NextAuth session (apps/dashboard/src/lib/auth/index.ts) — Session object with user id, email, OAuth provider details, and profile information. Generated during the OAuth flow with GitHub/Google, persisted in the auth adapter, validated on each authenticated request.
- ResolvedRoute (apps/status-page/src/lib/resolve-route.ts) — Object with type: 'hostname'|'pathname', prefix: string, locale: Locale, localeExplicit: boolean, rewritePath: string. Resolved from incoming HTTP requests based on hostname or path, used to determine which status page to display and in what language.
- ScreenshotRequest (apps/screenshot-service/src/index.ts) — Zod schema with url: URL, incidentId: number, kind: 'incident'|'recovery' enum. Triggered via QStash when incidents occur, processed by Playwright to capture page screenshots, stored in an R2 bucket for incident documentation.
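To make the first contract above concrete, here is a minimal Zod sketch of the MonitorResult payload. The field names come from the list above; the enum values are illustrative assumptions rather than the schema actually defined in apps/workflows/src/checker/index.ts.

```typescript
import { z } from "zod";

// Illustrative sketch of the MonitorResult contract described above. Enum values
// are placeholders; the authoritative schema lives in apps/workflows/src/checker/index.ts.
const monitorRegions = z.enum(["ams", "iad", "syd", "gru"]); // placeholder region codes
const monitorStatusSchema = z.enum(["operational", "degraded", "down"]); // placeholder statuses

export const monitorResultSchema = z.object({
  monitorId: z.string(),
  statusCode: z.number().int(),
  region: monitorRegions,
  cronTimestamp: z.number().int(), // Unix milliseconds, assumed UTC
  status: monitorStatusSchema,
  latency: z.number(), // response time in milliseconds
  message: z.string(),
});

export type MonitorResult = z.infer<typeof monitorResultSchema>;
```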
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- OPENSTATUS_KEY environment variable contains a valid API key that never expires and has sufficient permissions (apps/checker/cmd/private/main.go:main). If this fails: all monitor checks fail silently without alerting operators; the checker appears to run but produces no results.
- Monitor configuration updates can wait up to 10 minutes to be picked up by checker agents (apps/checker/cmd/private/main.go:configRefreshInterval). If this fails: critical monitors added during outages won't be checked for up to 10 minutes, and disabled monitors keep running unnecessary checks, wasting resources and potentially triggering false alerts.
- The container has sufficient memory to launch Chromium browser instances without being killed by the OOM killer (apps/screenshot-service/src/index.ts:playwright.chromium.launch). If this fails: screenshot capture fails silently during high incident volume when multiple Chromium instances exhaust container memory, leaving incidents without visual evidence.
- Railway region headers contain exactly one of four hardcoded region values (apps/railway-proxy/main.go:proxy). If this fails: when Railway adds new regions or changes region identifiers, requests route to an undefined targetUrl and cause a panic, taking down the entire proxy service.
- MonitorResult payloads from regional checkers always include cronTimestamp as Unix milliseconds in the same timezone (apps/workflows/src/checker/index.ts:payloadSchema). If this fails: checkers sending timestamps in different formats or timezones corrupt incident timing, causing false recovery notifications and incorrect SLA calculations.
- Only one incident can be open per monitor at any given time (apps/workflows/src/checker/index.ts:findOpenIncident). If this fails: when multiple incident creation requests race during rapid status changes, duplicate incidents are created but only one gets resolved, leaving phantom open incidents that block future incident creation.
- Custom domain hostnames always have at least 3 segments separated by dots, with the subdomain as the first segment (apps/status-page/src/lib/resolve-route.ts:resolveRoute; see the sketch after this list). If this fails: status pages hosted on unusual domains (e.g., single-level domains, IPv6 addresses, or domains with multiple subdomain levels) are misrouted, showing wrong status pages or 404 errors to customers.
- OAuth profile objects from Google and GitHub providers always contain the expected fields (given_name, family_name, picture, avatar_url) (apps/dashboard/src/lib/auth/index.ts:signIn). If this fails: authentication succeeds but user profile updates fail silently when OAuth providers change their response schema, leaving users with incomplete profiles and broken avatars.
- Screenshot filenames using Date.now() are globally unique across all incident captures (apps/screenshot-service/src/index.ts:Date.now). If this fails: simultaneous screenshots for the same incident ID overwrite each other in R2 storage, leaving only the last screenshot and losing evidence of the incident's progression.
- AXIOM_TOKEN environment variable provides unlimited log ingestion quota (apps/server/src/index.ts:configure). If this fails: when the Axiom quota is exceeded, all application logging silently stops without fallback, making it impossible to debug production issues during high-traffic periods.
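The custom-domain assumption is the easiest one to illustrate in code. Below is a hedged sketch of subdomain extraction with the guards that assumption skips; the helper name is invented, and the real resolveRoute does considerably more.

```typescript
// Hypothetical helper: pull the status-page slug out of a hostname such as
// "acme.openstatus.dev" → "acme". The real resolveRoute in
// apps/status-page/src/lib/resolve-route.ts also handles locales and rewrites;
// this only shows the guard the assumption above skips.
function extractSlug(hostname: string): string | null {
  const host = hostname.split(":")[0]; // drop an explicit port, if any
  const segments = host.split(".");
  // Fewer than three segments means an apex domain or a bare host: no subdomain to use.
  if (segments.length < 3) return null;
  // More than three segments is ambiguous (nested subdomains): refuse to guess.
  if (segments.length > 3) return null;
  return segments[0];
}

// extractSlug("acme.openstatus.dev") === "acme"
// extractSlug("openstatus.dev")      === null
```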
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- LibSQL database storing monitor configurations, status history, incident records, and user accounts with real-time status updates
- In-memory task queue managed by github.com/madflojo/tasks that schedules periodic monitor checks with configurable intervals
- Upstash QStash message queue that buffers screenshot requests during incidents with webhook delivery and retry logic
- Cloudflare R2 bucket storing incident screenshots with public URLs for viewing in incident reports
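The screenshot storage pool also intersects with the Date.now() uniqueness assumption listed earlier. Because R2 is S3-compatible, an upload with a collision-resistant object key could look roughly like the sketch below; the bucket name, environment variables, and key scheme are assumptions, not the repository's values.

```typescript
import { randomUUID } from "node:crypto";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Hedged sketch: R2 is S3-compatible, so an upload can go through the AWS SDK.
// Bucket name, environment variables, and key scheme are assumptions, not the
// repository's actual values.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

export async function storeScreenshot(incidentId: number, kind: string, png: Buffer) {
  // A random suffix ensures two captures in the same millisecond cannot collide.
  const key = `incidents/${incidentId}/${kind}-${Date.now()}-${randomUUID()}.png`;
  await r2.send(
    new PutObjectCommand({
      Bucket: process.env.R2_BUCKET ?? "screenshots",
      Key: key,
      Body: png,
      ContentType: "image/png",
    })
  );
  return key;
}
```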
Feedback Loops
- Monitor Configuration Refresh (polling, balancing) — Trigger: 10-minute timer in MonitorManager main loop. Action: Fetches latest monitor configs from API and updates scheduled checks. Exit: Context cancellation on shutdown.
- Status Recovery Detection (self-correction, balancing) — Trigger: Successful health checks after previous failures. Action: Resolves open incidents and sends recovery notifications. Exit: When autoResolved flag is set.
- Retry Checker Tasks (retry, balancing) — Trigger: Failed API calls or timeout errors in checker execution. Action: Effect.retry with exponential backoff up to 3 attempts. Exit: Successful execution or max retries exceeded.
- Incident Escalation (recursive, reinforcing) — Trigger: Continued failures after incident creation. Action: Escalates notifications and updates incident severity. Exit: Service recovery or manual acknowledgment.
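The Retry Checker Tasks loop above relies on Effect.retry in the repository. The same behavior written in plain TypeScript, with an illustrative base delay, looks roughly like this:

```typescript
// Plain-TypeScript sketch of "retry with exponential backoff up to 3 attempts".
// The repository uses Effect.retry; this only illustrates the shape of the loop,
// and the base delay is an arbitrary illustrative value.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // exit: max retries exceeded
      const delay = baseDelayMs * 2 ** attempt; // 250ms, 500ms, 1s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // surfaced to the caller after the final failure
}
```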
Delays
- Monitor Check Intervals (scheduled-job, ~30s to 30m based on monitor configuration) — Controls how quickly service outages are detected; shorter intervals provide faster detection but higher resource usage
- Config Refresh Delay (cache-ttl, ~10 minutes) — New monitor configurations take up to 10 minutes to be picked up by checker agents
- Screenshot Processing (async-processing, ~5-30 seconds for Playwright browser launch and page capture) — Incident screenshots appear in reports after brief processing delay
- QStash Delivery (queue-drain, variable depending on queue backlog and webhook processing) — Screenshot requests may be delayed during high incident volume
Control Points
- Monitor Periodicity (hyperparameter) — Controls: How frequently monitors are checked (30s/1m/5m/10m/30m). Default: Configurable per monitor
- Regional Coverage (architecture-switch) — Controls: Which global regions participate in monitoring checks. Default: 28 regions across 3 cloud providers
- Notification Channels (feature-flag) — Controls: Which notification methods are enabled (Slack, Discord, email, webhooks). Default: Configurable per workspace
- Screenshot Capture (feature-flag) — Controls: Whether to capture screenshots during incidents. Default: Enabled based on plan tier
- Self-Host Mode (runtime-toggle) — Controls: Authentication providers and feature availability in self-hosted deployments. Default: process.env.SELF_HOST
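Most of these control points reduce to configuration reads at startup or per request. The sketch below shows how such toggles might be consumed; only SELF_HOST is named in the list above, and the other variable names are invented for illustration.

```typescript
// Hypothetical sketch of reading runtime toggles like those listed above.
// Only SELF_HOST appears in the document; the other names are illustrative.
const selfHosted = process.env.SELF_HOST === "true";

// Feature flag: which notification channels are enabled for a workspace.
const enabledChannels = (process.env.NOTIFICATION_CHANNELS ?? "email")
  .split(",")
  .map((channel) => channel.trim())
  .filter(Boolean);

// Runtime toggle: self-hosted deployments might restrict OAuth providers.
const oauthProviders = selfHosted ? ["credentials"] : ["github", "google"];

console.log({ selfHosted, enabledChannels, oauthProviders });
```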
Technology Stack
- LibSQL/Turso — Primary database storing monitor configurations, incidents, and user data with edge replication
- Hono — HTTP framework for API servers and webhook handlers with TypeScript support
- Next.js — React framework powering the dashboard, status pages, and marketing site with SSR/SSG
- Go — High-performance language for checker agents and proxy services requiring low latency
- NextAuth.js — Authentication handling for OAuth providers (GitHub, Google) and session management
- Playwright — Browser automation for capturing incident screenshots in the screenshot-service
- Upstash QStash — Message queue for asynchronous screenshot processing with webhook delivery
- Cloudflare R2 — Object storage for incident screenshots with CDN delivery
- OpenTelemetry/Axiom — Observability stack providing structured logging, metrics, and distributed tracing
Key Components
- MonitorManager (orchestrator, apps/checker/cmd/private/main.go) — Coordinates monitor execution by fetching monitor configs from the API, scheduling checks with JobRunner, and managing the task scheduler lifecycle
- JobRunner (executor, apps/checker/pkg/job) — Executes individual HTTP checks against monitored endpoints, measures response times and status codes, handles timeouts and network errors
- checkerRoute (processor, apps/workflows/src/checker/index.ts) — Receives monitoring results from checker agents, detects when services transition between operational/down states, creates and resolves incidents
- upsertMonitorStatus (processor, apps/workflows/src/checker/alerting.ts) — Updates monitor status in the database and triggers notification workflows when status changes from operational to degraded/down or vice versa
- NextAuth handlers (gateway, apps/dashboard/src/lib/auth/index.ts) — Handles OAuth authentication flows with GitHub and Google providers, manages user sessions, updates user profile data on sign-in
- resolveRoute (resolver, apps/status-page/src/lib/resolve-route.ts) — Parses incoming requests to determine which status page to display based on hostname or pathname routing, resolves locale preferences
- screenshot handler (processor, apps/screenshot-service/src/index.ts) — Receives QStash webhook requests for incident screenshots, launches a Playwright browser to capture full-page screenshots, uploads images to R2 storage
- railway proxy (dispatcher, apps/railway-proxy/main.go) — Routes incoming checker requests to appropriate regional checker instances based on Railway region headers
- Hono app server (gateway, apps/server/src/index.ts) — Main API server that mounts RPC routes, handles public status endpoints, manages request logging with OpenTelemetry, serves OpenAPI documentation
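To show how the Hono-based services above are typically wired, here is a minimal webhook route in the style of the screenshot handler. The route path, validation schema, and response shape are assumptions for illustration; the real handler in apps/screenshot-service/src/index.ts launches Playwright and uploads the capture to R2.

```typescript
import { Hono } from "hono";
import { z } from "zod";

// Hypothetical sketch of a webhook route in the style of the screenshot handler
// described above. Route path, schema, and response shape are assumptions.
const screenshotRequestSchema = z.object({
  url: z.string().url(),
  incidentId: z.number().int(),
  kind: z.enum(["incident", "recovery"]),
});

const app = new Hono();

app.post("/screenshot", async (c) => {
  const parsed = screenshotRequestSchema.safeParse(await c.req.json());
  if (!parsed.success) return c.json({ error: "invalid payload" }, 400);

  const { url, incidentId, kind } = parsed.data;
  // Placeholder for the capture-and-upload step performed by the real service.
  console.log(`would capture a ${kind} screenshot of ${url} for incident ${incidentId}`);
  return c.json({ ok: true });
});

export default app;
```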
Package Structure
- Go-based monitoring agent that performs HTTP checks and reports results to the central platform
- Next.js admin interface for configuring monitors, viewing analytics, and managing incidents
- Astro-powered documentation site with API reference and user guides
- Self-hosted Go server that runs monitoring checks from private networks
- Go reverse proxy that routes checker requests based on Railway region headers
- Hono service that captures web page screenshots during incidents using Playwright
- Main Hono API server handling checker results, user management, and external integrations
- Go SSH server that displays service status over SSH connections
- Public Next.js status pages that customers can view to see service health
- Marketing website and blog built with Next.js and MDX content
- Hono service managing cron jobs for monitoring tasks and notification workflows
Frequently Asked Questions
What is openstatus used for?
openstatushq/openstatus runs uptime monitoring checks, status pages, and incident notifications across global regions. It is a 9-component fullstack written in TypeScript; data flows through 8 distinct pipeline stages, and the codebase contains 1483 files.
How is openstatus architected?
openstatus is organized into 4 architecture layers: Monitoring Layer, API Gateway, User Interfaces, Support Services. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through openstatus?
Data moves through 8 stages: Schedule monitoring checks → Execute endpoint probes → Aggregate regional results → Detect status transitions → Create and manage incidents → Trigger notifications → Display public status → Capture incident screenshots. Checker applications probe monitored endpoints from multiple global regions, the central workflows service detects status changes and creates incident records when services go down, notification workflows send alerts via email, Slack, Discord, and other channels, and public status pages query the incident database to display current service health. This pipeline design reflects a complex multi-stage processing system.
What technologies does openstatus use?
The core stack includes LibSQL/Turso (Primary database storing monitor configurations, incidents, user data with edge replication), Hono (HTTP framework for API servers and webhook handlers with TypeScript support), Next.js (React framework powering dashboard, status pages, and marketing site with SSR/SSG), Go (High-performance language for checker agents and proxy services requiring low latency), NextAuth.js (Authentication handling for OAuth providers (GitHub, Google) and session management), Playwright (Browser automation for capturing incident screenshots in screenshot-service), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does openstatus have?
openstatus exhibits 4 data pools (including the Monitor Status Database and Task Scheduler), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle polling and self-correction. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does openstatus use?
5 design patterns detected: Multi-Region Consensus, Event-Driven Incident Management, Domain-Based Routing, Graceful Degradation, Observability First.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.