Uptime Monitoring
Uptime monitoring continuously checks whether your website and its critical services are reachable and responding within acceptable time limits. It verifies availability (HTTP 2xx/3xx vs. 4xx/5xx), performance (latency), and sometimes content (keyword checks) from multiple global locations to detect outages quickly and trigger alerts.
What it measures
- Availability: Is the site responding with a healthy status code?
- Latency: TCP connect time, TLS handshake time, time to first byte (TTFB), and full response time.
- Content validity: Presence/absence of keywords or regex in the response.
- Dependencies: DNS resolution, TLS/SSL validity and expiry, CDN edge reachability, origin health, APIs, and database connectivity.
- Transactional flows (synthetics): Can users sign in, add to cart, and checkout? (Scripted journeys.)
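As a rough illustration of how the availability, latency, and content checks combine into one verdict, here is a minimal Python sketch. The `ProbeResult` type, the `is_healthy` function, and the 2-second default latency limit are assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Hypothetical container for what a single probe observed."""
    status_code: int        # HTTP status returned by the target
    elapsed_seconds: float  # full response time measured by the probe
    body: str               # response body (or a prefix of it)


def is_healthy(result: ProbeResult,
               max_latency_seconds: float = 2.0,
               required_keyword: Optional[str] = None) -> bool:
    """Return True only if status, latency, and content checks all pass."""
    # Availability: 2xx/3xx counts as up, 4xx/5xx as down.
    if not 200 <= result.status_code < 400:
        return False
    # Latency: a response slower than the limit counts as a failure.
    if result.elapsed_seconds > max_latency_seconds:
        return False
    # Content validity: the optional keyword must appear in the body.
    if required_keyword is not None and required_keyword not in result.body:
        return False
    return True
```

A 200 response that contains a database error template instead of the expected keyword is reported as unhealthy, which is the case a status-code-only check would miss.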
How it works (under the hood)
- Probes run on a schedule (e.g., every 30–60s) from multiple regions.
- Quorum logic (e.g., “2 of 3 locations failing”) reduces false positives.
- Alert policies fire after N consecutive failures or threshold breaches.
- Integrations push notifications to Slack/Teams/email/SMS/on-call.
- Status aggregation drives dashboards and public status pages.
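The quorum and consecutive-failure rules above can be sketched in a few lines of Python. The function names, the region labels, and the defaults (2 failing regions, 3 consecutive rounds) are illustrative assumptions; real tools expose these as configuration.

```python
from typing import Dict, List


def quorum_failed(region_ok: Dict[str, bool], min_failing: int = 2) -> bool:
    """True when at least `min_failing` regions failed in one probe round."""
    failing = sum(1 for ok in region_ok.values() if not ok)
    return failing >= min_failing


def should_alert(rounds: List[Dict[str, bool]],
                 min_failing: int = 2,
                 consecutive: int = 3) -> bool:
    """True when the most recent `consecutive` rounds all failed quorum."""
    if len(rounds) < consecutive:
        return False  # not enough history to decide yet
    recent = rounds[-consecutive:]
    return all(quorum_failed(r, min_failing) for r in recent)
```

With these defaults, a single region failing never alerts, and two regions failing must persist for three rounds before an alert fires.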
Best practices
- Multi-region checks: Always test from at least 3 geographically distinct locations.
- Layered health checks: Ping/TCP + HTTP(S) + keyword/content + synthetic flows for critical funnels.
- Maintenance windows: Silence alerts during planned work and re-enable them automatically when the window ends.
- Stable content checks: If you check for a keyword, pick text the page returns reliably (avoid strings affected by A/B tests or geo-variant content).
- TLS & DNS monitoring: Watch certificate expiry/chain issues and DNS changes/propagation.
- Alert routing & escalation: Start with Slack/email; escalate to SMS/call only when quorum + duration thresholds are met.
- SLOs & error budgets: Define targets (e.g., 99.9%) and alert on burn-rate rather than single events.
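To make the burn-rate idea concrete: burn rate is the observed error rate divided by the error rate the SLO allows, so a value of 1.0 spends the budget exactly over the SLO period. The sketch below assumes probe counts as input; the function names and the paging threshold of 14.4 are illustrative choices (at that rate a 30-day budget would be gone in about 50 hours), not fixed rules.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0  # no probes in the window, nothing burned
    error_budget = 1.0 - slo_target
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


def should_page(short_window_rate: float,
                long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn fast.

    Requiring both windows avoids paging on a brief blip (short window
    only) or on an incident that has already recovered (long window only).
    """
    return short_window_rate >= threshold and long_window_rate >= threshold
```

For a 99.9% target, 1 failed probe out of 1,000 gives a burn rate of about 1.0, while 20 out of 1,000 gives about 20.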
KPIs & helpful math (per 30-day month)
- 99% uptime ⇒ max downtime ≈ 7h 12m
- 99.9% ⇒ ≈ 43m 12s
- 99.99% ⇒ ≈ 4m 19s
- 99.999% ⇒ ≈ 26s
Track MTTD/MTTR (mean time to detect/recover), p95/p99 latency, and % of successful synthetic runs.
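The downtime figures above follow from multiplying the period length by the allowed failure fraction. A small Python sketch, with function names chosen for this example, reproduces them for a 30-day period:

```python
def allowed_downtime_seconds(uptime_percent: float, period_days: int = 30) -> float:
    """Maximum downtime in seconds for a given uptime target and period."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (100.0 - uptime_percent) / 100.0


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes, and whole seconds (rounded)."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


for target in (99.0, 99.9, 99.99, 99.999):
    print(target, format_duration(allowed_downtime_seconds(target)))
```

Passing a different `period_days` (for example 365) gives the equivalent yearly budget.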
Implementation tips for website support teams
- WordPress/WooCommerce:
  - Monitor GET / and a lightweight health endpoint (e.g., a custom /health route that checks DB & object cache quickly).
  - Add keyword checks for expected text (e.g., site name) and a cart/checkout synthetic for revenue-critical paths.
  - Watch wp-cron with a heartbeat (record last run; alert if stale).
  - Monitor disk usage and PHP error rate via logs/APM; spikes often precede downtime.
- CDN/Edge + Origin: Probe both the CDN URL and origin to isolate cache vs. origin failures.
- Change management: Tie deployments to a status page notice and temporary alert dampening.
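Two of the tips above lend themselves to small helper functions: isolating CDN versus origin failures, and detecting a stale wp-cron heartbeat. The sketch below is one possible shape. The function names, the category labels, and the 15-minute staleness limit are assumptions, and how the last-run timestamp gets recorded on the WordPress side is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_failure(cdn_ok: bool, origin_ok: bool) -> str:
    """Combine a CDN-URL probe and a direct origin probe into one label."""
    if cdn_ok and origin_ok:
        return "healthy"
    if origin_ok:
        return "edge"    # origin answers directly, so suspect the CDN/edge
    if cdn_ok:
        return "origin"  # CDN still answers (likely from cache); origin is down
    return "outage"      # neither path responds


def heartbeat_is_stale(last_run: datetime,
                       now: Optional[datetime] = None,
                       max_age: timedelta = timedelta(minutes=15)) -> bool:
    """True when the recorded last run is older than `max_age`."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_run > max_age
```

Both `last_run` and `now` should be timezone-aware (or both naive); mixing the two raises a `TypeError` in Python.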
Common pitfalls
- Relying on a single location (false positives from local network issues).
- Treating “200 OK” as healthy when the page shows an error template—use keyword checks.
- Alerting on every probe failure (flapping) instead of using quorums and consecutive-failure rules.
- Forgetting renewals (TLS/SSL, domains) — set explicit expiry alerts.
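For the expiry pitfall, a certificate check can be sketched with Python's standard `ssl` module. The function names and the 21-day warning threshold are assumptions for illustration; `fetch_not_after` needs network access, while the other two functions work on the `notAfter` string alone.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before expiry; negative if already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def needs_renewal_alert(not_after: str,
                        warn_days: float = 21,
                        now: Optional[float] = None) -> bool:
    """True when the certificate expires within `warn_days`."""
    return days_until_expiry(not_after, now) <= warn_days
```

Because the default context verifies the certificate, connecting to a host whose certificate has already expired raises an `ssl.SSLCertVerificationError` rather than returning a date, so callers should treat that exception as an alert condition too. Domain registration expiry is a separate check and is not covered here.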