Status Page (Public)
A public status page is an out-of-band site that communicates the current and historical health of your services. It summarizes availability, ongoing incidents, planned maintenance, and component-level impact so users know what’s happening without opening a support ticket.
What it shows
- Overall status: “Operational”, “Degraded”, “Partial outage”, etc.
- Components: e.g., Website, Admin, Checkout, API, CDN, Payments, Search.
- Incident timeline: Investigating → Identified → Monitoring → Resolved, with timestamps.
- Maintenance windows: Scheduled tasks with start/end, impact, and regions.
- Metrics & history: Uptime %, latency, error rate; 30–90 day history.
- Subscriptions: Email/RSS/webhooks for incident and maintenance updates.
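One way to keep the overall status consistent with the component list is to derive it from the components instead of setting it by hand. The sketch below assumes a simple "worst component wins" rule; the `Status` enum, the "Major outage" level, and the `overall_status` function are illustrative names, not part of any particular status page product.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by severity so that max() returns the worst state.
    OPERATIONAL = 0
    DEGRADED = 1
    PARTIAL_OUTAGE = 2
    MAJOR_OUTAGE = 3  # assumed extra level; the list above only says "etc."


LABELS = {
    Status.OPERATIONAL: "Operational",
    Status.DEGRADED: "Degraded",
    Status.PARTIAL_OUTAGE: "Partial outage",
    Status.MAJOR_OUTAGE: "Major outage",
}


def overall_status(components: dict[str, Status]) -> Status:
    """Return the worst component status; an empty page counts as operational."""
    return max(components.values(), default=Status.OPERATIONAL)


components = {
    "Website": Status.OPERATIONAL,
    "Checkout": Status.DEGRADED,
    "API": Status.OPERATIONAL,
}
print(LABELS[overall_status(components)])  # Degraded
```

With this rule the page cannot show "Operational" while any single component is degraded. A real page might weight components differently, for example treating a degraded Search as less severe than a degraded Checkout.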
Why it matters
- Reduces inbound tickets during incidents.
- Builds trust via transparency and clear expectations.
- Helps meet SLA/contractual notification requirements.
- Provides a canonical source for comms and post-incident reviews.
Architecture & data flow
- Out-of-band hosting: Host on a separate provider/region from your main site so it stays up during outages (e.g., status.example.com).
- Automated inputs: Pull from uptime checks, synthetic transactions, TLS/DNS monitors, and error rates.
- Manual controls: Let on-call staff create/update incidents and override component states.
- Versioned history: Keep immutable logs for audits and SLA evidence.
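For the versioned history, one possible approach (an assumption here, not something the list above prescribes) is an append-only log in which each entry stores the hash of the previous one, so later edits or deletions are detectable. The `append_entry` and `verify` helpers and the event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Serialize deterministically so the same event always hashes the same way.
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> dict:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify(log: list[dict]) -> bool:
    """Return True if no entry has been altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"component": "Checkout", "state": "Degraded", "at": "2024-01-15T14:05:00Z"})
append_entry(log, {"component": "Checkout", "state": "Operational", "at": "2024-01-15T15:10:00Z"})
print(verify(log))  # True
```

A hash chain only detects tampering; it does not prevent it. For audit or SLA evidence the log would typically also be stored somewhere the status page's own editors cannot rewrite.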
Communication best practices
- Post within minutes with clear user impact, not just internal metrics.
- State the next update time (“Next update in 30 minutes”).
- Use consistent severity levels and templates.
- Close with remediation and prevention actions (post-incident summary).
- Localize timezones; keep language plain and non-defensive.
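A small template function can enforce several of these practices at once: a consistent stage label, user-facing impact text, a localized timestamp, and an explicit next-update time. This is a minimal sketch; `render_update` and its parameters are hypothetical, and the fixed-offset "CET" zone is used only to keep the example self-contained (a `zoneinfo.ZoneInfo` object could be passed instead).

```python
from datetime import datetime, timedelta, timezone, tzinfo


def render_update(
    stage: str,
    impact: str,
    posted_at: datetime,
    next_update_minutes: int,
    tz: tzinfo,
) -> str:
    """Format one incident update. posted_at must be timezone-aware."""
    local = posted_at.astimezone(tz)
    # Add the interval before converting, so the result is correct in any zone.
    next_at = (posted_at + timedelta(minutes=next_update_minutes)).astimezone(tz)
    return (
        f"[{stage}] {local:%Y-%m-%d %H:%M %Z}: {impact} "
        f"Next update by {next_at:%H:%M %Z}."
    )


cet = timezone(timedelta(hours=1), "CET")  # assumed reader timezone
message = render_update(
    stage="Investigating",
    impact="Checkout requests are failing for some users.",
    posted_at=datetime(2024, 1, 15, 14, 5, tzinfo=timezone.utc),
    next_update_minutes=30,
    tz=cet,
)
print(message)
# [Investigating] 2024-01-15 15:05 CET: Checkout requests are failing for some users. Next update by 15:35 CET.
```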
Implementation tips for website support teams
- Model components to match real dependencies (Origin, CDN, Images, Admin, Checkout, Payments).
- Auto-open incidents when a quorum of checks fails; auto-close only after a stability window has passed.
- Integrate alerts with Slack/Email/SMS; allow customers to subscribe per component.
- Prewrite templates for investigation, identified root cause, and resolution.
- Keep DNS TTL low for the status. subdomain and back it with a static site/CDN for resilience.
- Separate access paths (SSO + break-glass account) so you can update during auth/provider issues.
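The auto-open and auto-close rule can be sketched as a small state machine: open when at least a quorum of probes fails in one round, and close only after several consecutive fully healthy rounds. The `IncidentGate` class, the probe names, and the thresholds below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentGate:
    quorum: int = 2         # failing probes needed in one round to open
    stable_rounds: int = 3  # consecutive healthy rounds needed to close
    is_open: bool = False
    healthy_streak: int = 0

    def observe(self, probe_results: dict[str, bool]) -> bool:
        """Feed one round of probe results (True = healthy); return open state."""
        failures = sum(1 for ok in probe_results.values() if not ok)
        if failures >= self.quorum:
            self.is_open = True
            self.healthy_streak = 0
        elif self.is_open:
            if failures == 0:
                self.healthy_streak += 1
                if self.healthy_streak >= self.stable_rounds:
                    self.is_open = False
                    self.healthy_streak = 0
            else:
                # A single failing probe does not open an incident,
                # but it does restart the stability window.
                self.healthy_streak = 0
        return self.is_open


gate = IncidentGate()
healthy = {"us": True, "eu": True, "ap": True}
rounds = [
    {"us": False, "eu": True, "ap": True},   # one failure: below quorum
    {"us": False, "eu": False, "ap": True},  # two failures: opens
    healthy,
    healthy,
    {"us": False, "eu": True, "ap": True},   # resets the stability window
    healthy,
    healthy,
    healthy,                                 # third healthy round: closes
]
states = [gate.observe(r) for r in rounds]
print(states)  # [False, True, True, True, True, True, True, False]
```

Requiring a quorum avoids opening incidents on a single flaky probe, and the stability window avoids flapping between open and resolved while a fix is still settling.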
KPIs
- Time to first status update (TTFSU).
- Update cadence during incidents (e.g., every 30 min).
- Subscriber count and incident notification open rates.
- Ticket deflection (tickets vs. incident views).
- Uptime by component and SLA attainment.
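Two of these KPIs reduce to simple arithmetic on timestamps. The sketch below assumes you already have the detection time, the times of posted updates, and a list of downtime durations per component; the function names and the 99.9% target are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def time_to_first_update(
    detected_at: datetime, update_times: list[datetime]
) -> Optional[timedelta]:
    """TTFSU: delay between detection and the first update posted after it."""
    later = [t for t in update_times if t >= detected_at]
    return min(later) - detected_at if later else None


def uptime_percent(period: timedelta, downtime: list[timedelta]) -> float:
    """Uptime over a period, given the durations of each outage in it."""
    total_down = sum(downtime, timedelta())
    return 100.0 * (1 - total_down / period)


detected = datetime(2024, 1, 15, 14, 0)
updates = [datetime(2024, 1, 15, 14, 7), datetime(2024, 1, 15, 14, 37)]
ttfsu = time_to_first_update(detected, updates)
print(ttfsu)  # 0:07:00

# 43.2 minutes of downtime in a 30-day window is 0.1% of the period.
uptime = uptime_percent(timedelta(days=30), [timedelta(minutes=43.2)])
sla_target = 99.9  # assumed contractual target
print(round(uptime, 3), round(uptime, 3) >= sla_target)  # 99.9 True
```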
Common pitfalls
- Hosting the status page on the same infrastructure as the product.
- Vague, infrequent updates; no “next update” commitment.
- Marking “All Operational” while components are degraded.
- Hiding history or deleting incidents.
- Missing mobile layout, or accessibility problems (missing screen-reader labels, poor contrast).