openstatushq/openstatus
🫖 Status page with uptime monitoring & API monitoring as code 🫖
Runs uptime monitoring checks, status pages, and incident notifications across global regions
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component fullstack. 1483 files analyzed. Data flows through 8 distinct pipeline stages.
How Data Flows Through the System
The monitoring pipeline starts when distributed checker applications probe monitored endpoints every few minutes from multiple global regions, measuring response times and status codes. These results flow into the central workflows service which detects status changes and creates incident records when services go down. Status changes trigger notification workflows that send alerts via email, Slack, Discord, and other channels. Public status pages query the incident database to display current service health, while the admin dashboard allows users to configure monitors and view analytics. Screenshots are captured during incidents for documentation purposes.
- Schedule monitoring checks — The cron service in the workflows app sends HTTP requests to checker applications across regions every 30s/1m/5m/10m/30m, based on each monitor's configuration, triggering distributed health checks [Monitor configurations → Check triggers] (config: monitors.periodicity, regions.enabled)
- Execute endpoint probes — JobRunner in checker apps makes HTTP requests to monitored URLs, measuring response time and status code, handling timeouts and connection errors with configurable retry logic [Check triggers → MonitorResult] (config: monitors.timeout, monitors.retry_count)
- Aggregate regional results — The checkerRoute handler in workflows service receives MonitorResult payloads from all regions and applies consensus logic to determine overall service status [MonitorResult → Status aggregation] (config: monitors.regions, monitors.degraded_after_failures)
- Detect status transitions — upsertMonitorStatus compares new status against previous status in database, identifying transitions from operational→degraded→down or recovery patterns [Status aggregation → Status change events] (config: monitors.status_threshold)
- Create and manage incidents — When failures are detected, findOpenIncident checks for existing incidents and either creates new IncidentRecord or updates resolution timestamp for recovery [Status change events → IncidentRecord]
- Trigger notifications — triggerNotifications sends alerts to configured channels (Slack, Discord, email, webhooks) when incidents are created or resolved, with customizable message templates [IncidentRecord → Notification messages] (config: notifications.channels, notifications.templates)
- Display public status — Status page apps query incident and monitor tables to render current service health, historical uptime percentages, and incident timeline for public viewing [IncidentRecord → Status page HTML] (config: status_page.theme, status_page.custom_domain)
- Capture incident screenshots — Screenshot service receives QStash webhooks when incidents occur, uses Playwright to capture full-page screenshots, and stores them in R2 bucket with incident ID [ScreenshotRequest → Screenshot URLs] (config: screenshot.enabled, storage.r2_bucket)
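Stages 4 and 5 above hinge on a single comparison: the newly aggregated status versus the previously stored one. A minimal TypeScript sketch of that hand-off follows; the helper signatures are hypothetical, and the repository's actual upsertMonitorStatus and findOpenIncident logic differs in detail.

```typescript
type MonitorStatus = "operational" | "degraded" | "down";

interface StatusChange {
  monitorId: string;
  previous: MonitorStatus;
  next: MonitorStatus;
  cronTimestamp: number; // Unix milliseconds, set by the regional checker
}

// Hypothetical helper signatures; the repository's own functions differ in detail.
declare function findOpenIncident(monitorId: string): Promise<{ id: number } | undefined>;
declare function createIncident(args: { monitorId: string; startedAt: Date }): Promise<{ id: number }>;
declare function resolveIncident(id: number, args: { resolvedAt: Date; autoResolved: boolean }): Promise<void>;
declare function triggerNotifications(args: { incidentId: number; kind: "incident" | "recovery" }): Promise<void>;

async function handleStatusChange(change: StatusChange): Promise<void> {
  if (change.previous === change.next) return; // no transition, nothing to do

  const open = await findOpenIncident(change.monitorId);

  if (change.next === "down" && !open) {
    // New failure with no open incident: create one and alert subscribers.
    const incident = await createIncident({
      monitorId: change.monitorId,
      startedAt: new Date(change.cronTimestamp),
    });
    await triggerNotifications({ incidentId: incident.id, kind: "incident" });
  } else if (change.next === "operational" && open) {
    // Recovery: close the incident and send recovery notifications.
    await resolveIncident(open.id, {
      resolvedAt: new Date(change.cronTimestamp),
      autoResolved: true,
    });
    await triggerNotifications({ incidentId: open.id, kind: "recovery" });
  }
}
```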
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- MonitorResult (apps/workflows/src/checker/index.ts) — Object with monitorId: string, statusCode: number, region: monitorRegions enum, cronTimestamp: number, status: monitorStatusSchema, latency: number, message: string. Created by checker agents during endpoint probes, processed by the workflows service to detect incidents, stored in the database for analytics.
- IncidentRecord (apps/workflows/src/checker/index.ts) — Database record with id, monitorId, resolvedAt timestamp, autoResolved boolean flag, and incident metadata. Created when monitor failures are detected, displayed on status pages during outages, resolved when monitors recover.
- NextAuth session (apps/dashboard/src/lib/auth/index.ts) — Session object with user id, email, OAuth provider details, and profile information. Generated during the OAuth flow with GitHub/Google, persisted in the auth adapter, validated on each authenticated request.
- ResolvedRoute (apps/status-page/src/lib/resolve-route.ts) — Object with type: 'hostname'|'pathname', prefix: string, locale: Locale, localeExplicit: boolean, rewritePath: string. Resolved from incoming HTTP requests based on hostname or path, used to determine which status page to display and in what language.
- ScreenshotRequest (apps/screenshot-service/src/index.ts) — Zod schema with url: URL, incidentId: number, kind: 'incident'|'recovery' enum. Triggered via QStash when incidents occur, processed by Playwright to capture page screenshots, stored in an R2 bucket for incident documentation.
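To make the first contract above concrete, here is a minimal Zod sketch of the MonitorResult payload. The field names come from the list above; the enum values are illustrative assumptions rather than the schema actually defined in apps/workflows/src/checker/index.ts.

```typescript
import { z } from "zod";

// Illustrative sketch of the MonitorResult contract described above. Enum values
// are placeholders; the authoritative schema lives in apps/workflows/src/checker/index.ts.
const monitorRegions = z.enum(["ams", "iad", "syd", "gru"]); // placeholder region codes
const monitorStatusSchema = z.enum(["operational", "degraded", "down"]); // placeholder statuses

export const monitorResultSchema = z.object({
  monitorId: z.string(),
  statusCode: z.number().int(),
  region: monitorRegions,
  cronTimestamp: z.number().int(), // Unix milliseconds, assumed UTC
  status: monitorStatusSchema,
  latency: z.number(), // response time in milliseconds
  message: z.string(),
});

export type MonitorResult = z.infer<typeof monitorResultSchema>;
```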
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- OPENSTATUS_KEY environment variable contains a valid API key that never expires and has sufficient permissions (apps/checker/cmd/private/main.go:main). If this fails: all monitor checks fail silently without alerting operators; the checker appears to run but produces no results.
- Monitor configuration updates can wait up to 10 minutes to be picked up by checker agents (apps/checker/cmd/private/main.go:configRefreshInterval). If this fails: critical monitors added during outages won't be checked for up to 10 minutes, and disabled monitors keep running unnecessary checks, wasting resources and potentially triggering false alerts.
- The container has sufficient memory to launch Chromium browser instances without being killed by the OOM killer (apps/screenshot-service/src/index.ts:playwright.chromium.launch). If this fails: screenshot capture fails silently during high incident volume when multiple Chromium instances exhaust container memory, leaving incidents without visual evidence.
- Railway region headers contain exactly one of four hardcoded region values (apps/railway-proxy/main.go:proxy). If this fails: when Railway adds new regions or changes region identifiers, requests route to an undefined targetUrl and cause a panic, taking down the entire proxy service.
- MonitorResult payloads from regional checkers always include cronTimestamp as Unix milliseconds in the same timezone (apps/workflows/src/checker/index.ts:payloadSchema). If this fails: checkers sending timestamps in different formats or timezones corrupt incident timing, causing false recovery notifications and incorrect SLA calculations.
- Only one incident can be open per monitor at any given time (apps/workflows/src/checker/index.ts:findOpenIncident). If this fails: when multiple incident creation requests race during rapid status changes, duplicate incidents are created but only one gets resolved, leaving phantom open incidents that block future incident creation.
- Custom domain hostnames always have at least 3 segments separated by dots, with the subdomain as the first segment (apps/status-page/src/lib/resolve-route.ts:resolveRoute; see the sketch after this list). If this fails: status pages hosted on unusual domains (e.g., single-level domains, IPv6 addresses, or domains with multiple subdomain levels) are misrouted, showing wrong status pages or 404 errors to customers.
- OAuth profile objects from Google and GitHub providers always contain the expected fields (given_name, family_name, picture, avatar_url) (apps/dashboard/src/lib/auth/index.ts:signIn). If this fails: authentication succeeds but user profile updates fail silently when OAuth providers change their response schema, leaving users with incomplete profiles and broken avatars.
- Screenshot filenames using Date.now() are globally unique across all incident captures (apps/screenshot-service/src/index.ts:Date.now). If this fails: simultaneous screenshots for the same incident ID overwrite each other in R2 storage, leaving only the last screenshot and losing evidence of the incident's progression.
- AXIOM_TOKEN environment variable provides unlimited log ingestion quota (apps/server/src/index.ts:configure). If this fails: when the Axiom quota is exceeded, all application logging silently stops without fallback, making it impossible to debug production issues during high-traffic periods.
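The custom-domain assumption is the easiest one to illustrate in code. Below is a hedged sketch of subdomain extraction with the guards that assumption skips; the helper name is invented, and the real resolveRoute does considerably more.

```typescript
// Hypothetical helper: pull the status-page slug out of a hostname such as
// "acme.openstatus.dev" → "acme". The real resolveRoute in
// apps/status-page/src/lib/resolve-route.ts also handles locales and rewrites;
// this only shows the guard the assumption above skips.
function extractSlug(hostname: string): string | null {
  const host = hostname.split(":")[0]; // drop an explicit port, if any
  const segments = host.split(".");
  // Fewer than three segments means an apex domain or a bare host: no subdomain to use.
  if (segments.length < 3) return null;
  // More than three segments is ambiguous (nested subdomains): refuse to guess.
  if (segments.length > 3) return null;
  return segments[0];
}

// extractSlug("acme.openstatus.dev") === "acme"
// extractSlug("openstatus.dev")      === null
```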
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- LibSQL database storing monitor configurations, status history, incident records, and user accounts with real-time status updates
- In-memory task queue managed by github.com/madflojo/tasks that schedules periodic monitor checks with configurable intervals
- Upstash QStash message queue that buffers screenshot requests during incidents with webhook delivery and retry logic
- Cloudflare R2 bucket storing incident screenshots with public URLs for viewing in incident reports
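The screenshot storage pool also intersects with the Date.now() uniqueness assumption listed earlier. Because R2 is S3-compatible, an upload with a collision-resistant object key could look roughly like the sketch below; the bucket name, environment variables, and key scheme are assumptions, not the repository's values.

```typescript
import { randomUUID } from "node:crypto";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Hedged sketch: R2 is S3-compatible, so an upload can go through the AWS SDK.
// Bucket name, environment variables, and key scheme are assumptions, not the
// repository's actual values.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

export async function storeScreenshot(incidentId: number, kind: string, png: Buffer) {
  // A random suffix ensures two captures in the same millisecond cannot collide.
  const key = `incidents/${incidentId}/${kind}-${Date.now()}-${randomUUID()}.png`;
  await r2.send(
    new PutObjectCommand({
      Bucket: process.env.R2_BUCKET ?? "screenshots",
      Key: key,
      Body: png,
      ContentType: "image/png",
    })
  );
  return key;
}
```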
Feedback Loops
- Monitor Configuration Refresh (polling, balancing) — Trigger: 10-minute timer in MonitorManager main loop. Action: Fetches latest monitor configs from API and updates scheduled checks. Exit: Context cancellation on shutdown.
- Status Recovery Detection (self-correction, balancing) — Trigger: Successful health checks after previous failures. Action: Resolves open incidents and sends recovery notifications. Exit: When autoResolved flag is set.
- Retry Checker Tasks (retry, balancing) — Trigger: Failed API calls or timeout errors in checker execution. Action: Effect.retry with exponential backoff up to 3 attempts. Exit: Successful execution or max retries exceeded.
- Incident Escalation (recursive, reinforcing) — Trigger: Continued failures after incident creation. Action: Escalates notifications and updates incident severity. Exit: Service recovery or manual acknowledgment.
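The Retry Checker Tasks loop above relies on Effect.retry in the repository. The same behavior written in plain TypeScript, with an illustrative base delay, looks roughly like this:

```typescript
// Plain-TypeScript sketch of "retry with exponential backoff up to 3 attempts".
// The repository uses Effect.retry; this only illustrates the shape of the loop,
// and the base delay is an arbitrary illustrative value.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // exit: max retries exceeded
      const delay = baseDelayMs * 2 ** attempt; // 250ms, 500ms, 1s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // surfaced to the caller after the final failure
}
```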
Delays
- Monitor Check Intervals (scheduled-job, ~30s to 30m based on monitor configuration) — Controls how quickly service outages are detected; shorter intervals provide faster detection but higher resource usage
- Config Refresh Delay (cache-ttl, ~10 minutes) — New monitor configurations take up to 10 minutes to be picked up by checker agents
- Screenshot Processing (async-processing, ~5-30 seconds for Playwright browser launch and page capture) — Incident screenshots appear in reports after brief processing delay
- QStash Delivery (queue-drain, variable depending on queue backlog and webhook processing) — Screenshot requests may be delayed during high incident volume
Control Points
- Monitor Periodicity (hyperparameter) — Controls: How frequently monitors are checked (30s/1m/5m/10m/30m). Default: Configurable per monitor
- Regional Coverage (architecture-switch) — Controls: Which global regions participate in monitoring checks. Default: 28 regions across 3 cloud providers
- Notification Channels (feature-flag) — Controls: Which notification methods are enabled (Slack, Discord, email, webhooks). Default: Configurable per workspace
- Screenshot Capture (feature-flag) — Controls: Whether to capture screenshots during incidents. Default: Enabled based on plan tier
- Self-Host Mode (runtime-toggle) — Controls: Authentication providers and feature availability in self-hosted deployments. Default: process.env.SELF_HOST
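Most of these control points reduce to configuration reads at startup or per request. The sketch below shows how such toggles might be consumed; only SELF_HOST is named in the list above, and the other variable names are invented for illustration.

```typescript
// Hypothetical sketch of reading runtime toggles like those listed above.
// Only SELF_HOST appears in the document; the other names are illustrative.
const selfHosted = process.env.SELF_HOST === "true";

// Feature flag: which notification channels are enabled for a workspace.
const enabledChannels = (process.env.NOTIFICATION_CHANNELS ?? "email")
  .split(",")
  .map((channel) => channel.trim())
  .filter(Boolean);

// Runtime toggle: self-hosted deployments might restrict OAuth providers.
const oauthProviders = selfHosted ? ["credentials"] : ["github", "google"];

console.log({ selfHosted, enabledChannels, oauthProviders });
```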
Technology Stack
- LibSQL/Turso — Primary database storing monitor configurations, incidents, and user data with edge replication
- Hono — HTTP framework for API servers and webhook handlers with TypeScript support
- Next.js — React framework powering the dashboard, status pages, and marketing site with SSR/SSG
- Go — High-performance language for checker agents and proxy services requiring low latency
- NextAuth.js — Authentication handling for OAuth providers (GitHub, Google) and session management
- Playwright — Browser automation for capturing incident screenshots in the screenshot-service
- Upstash QStash — Message queue for asynchronous screenshot processing with webhook delivery
- Cloudflare R2 — Object storage for incident screenshots with CDN delivery
- OpenTelemetry/Axiom — Observability stack providing structured logging, metrics, and distributed tracing
Key Components
- MonitorManager (orchestrator, apps/checker/cmd/private/main.go) — Coordinates monitor execution by fetching monitor configs from the API, scheduling checks with JobRunner, and managing the task scheduler lifecycle
- JobRunner (executor, apps/checker/pkg/job) — Executes individual HTTP checks against monitored endpoints, measures response times and status codes, handles timeouts and network errors
- checkerRoute (processor, apps/workflows/src/checker/index.ts) — Receives monitoring results from checker agents, detects when services transition between operational/down states, creates and resolves incidents
- upsertMonitorStatus (processor, apps/workflows/src/checker/alerting.ts) — Updates monitor status in the database and triggers notification workflows when status changes from operational to degraded/down or vice versa
- NextAuth handlers (gateway, apps/dashboard/src/lib/auth/index.ts) — Handles OAuth authentication flows with GitHub and Google providers, manages user sessions, updates user profile data on sign-in
- resolveRoute (resolver, apps/status-page/src/lib/resolve-route.ts) — Parses incoming requests to determine which status page to display based on hostname or pathname routing, resolves locale preferences
- screenshot handler (processor, apps/screenshot-service/src/index.ts) — Receives QStash webhook requests for incident screenshots, launches a Playwright browser to capture full-page screenshots, uploads images to R2 storage
- railway proxy (dispatcher, apps/railway-proxy/main.go) — Routes incoming checker requests to appropriate regional checker instances based on Railway region headers
- Hono app server (gateway, apps/server/src/index.ts) — Main API server that mounts RPC routes, handles public status endpoints, manages request logging with OpenTelemetry, serves OpenAPI documentation
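To show how the Hono-based services above are typically wired, here is a minimal webhook route in the style of the screenshot handler. The route path, validation schema, and response shape are assumptions for illustration; the real handler in apps/screenshot-service/src/index.ts launches Playwright and uploads the capture to R2.

```typescript
import { Hono } from "hono";
import { z } from "zod";

// Hypothetical sketch of a webhook route in the style of the screenshot handler
// described above. Route path, schema, and response shape are assumptions.
const screenshotRequestSchema = z.object({
  url: z.string().url(),
  incidentId: z.number().int(),
  kind: z.enum(["incident", "recovery"]),
});

const app = new Hono();

app.post("/screenshot", async (c) => {
  const parsed = screenshotRequestSchema.safeParse(await c.req.json());
  if (!parsed.success) return c.json({ error: "invalid payload" }, 400);

  const { url, incidentId, kind } = parsed.data;
  // Placeholder for the capture-and-upload step performed by the real service.
  console.log(`would capture a ${kind} screenshot of ${url} for incident ${incidentId}`);
  return c.json({ ok: true });
});

export default app;
```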
Package Structure
- Go-based monitoring agent that performs HTTP checks and reports results to the central platform
- Next.js admin interface for configuring monitors, viewing analytics, and managing incidents
- Astro-powered documentation site with API reference and user guides
- Self-hosted Go server that runs monitoring checks from private networks
- Go reverse proxy that routes checker requests based on Railway region headers
- Hono service that captures web page screenshots during incidents using Playwright
- Main Hono API server handling checker results, user management, and external integrations
- Go SSH server that displays service status over SSH connections
- Public Next.js status pages that customers can view to see service health
- Marketing website and blog built with Next.js and MDX content
- Hono service managing cron jobs for monitoring tasks and notification workflows
Frequently Asked Questions
What is openstatus used for?
openstatushq/openstatus runs uptime monitoring checks, status pages, and incident notifications across global regions. It is a 9-component fullstack written in TypeScript; data flows through 8 distinct pipeline stages, and the codebase contains 1483 files.
How is openstatus architected?
openstatus is organized into 4 architecture layers: Monitoring Layer, API Gateway, User Interfaces, Support Services. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through openstatus?
Data moves through 8 stages: Schedule monitoring checks → Execute endpoint probes → Aggregate regional results → Detect status transitions → Create and manage incidents → Trigger notifications → Display public status → Capture incident screenshots. Checker applications probe monitored endpoints from multiple global regions, the central workflows service detects status changes and creates incident records when services go down, notification workflows send alerts via email, Slack, Discord, and other channels, and public status pages query the incident database to display current service health. This pipeline design reflects a complex multi-stage processing system.
What technologies does openstatus use?
The core stack includes LibSQL/Turso (Primary database storing monitor configurations, incidents, user data with edge replication), Hono (HTTP framework for API servers and webhook handlers with TypeScript support), Next.js (React framework powering dashboard, status pages, and marketing site with SSR/SSG), Go (High-performance language for checker agents and proxy services requiring low latency), NextAuth.js (Authentication handling for OAuth providers (GitHub, Google) and session management), Playwright (Browser automation for capturing incident screenshots in screenshot-service), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does openstatus have?
openstatus exhibits 4 data pools (including the Monitor Status Database and Task Scheduler), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle polling and self-correction. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does openstatus use?
5 design patterns detected: Multi-Region Consensus, Event-Driven Incident Management, Domain-Based Routing, Graceful Degradation, Observability First.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.