Runbook — Alarm Triage (Generic Example)

Version: v1.0 — 2026-01-01 · Sanitized portfolio sample — not operational guidance.

Tags: Runbook · Incident response · Triage

These examples are intentionally generic. See Sanitization checklist.

Trigger

  • Alarm fires for a critical component (example: “Subsystem Offline”).
  • Or: multiple related alarms fire within a 5-minute window.
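
The two trigger conditions can be sketched in code. This is only an illustration under stated assumptions: the Alarm fields, the use of a shared component name to mean "related", and the reading of "multiple" as two or more are all hypothetical and not defined by this runbook.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

RELATED_WINDOW = timedelta(minutes=5)  # from the Trigger section
MIN_RELATED = 2  # assumption: "multiple" means two or more


@dataclass(frozen=True)
class Alarm:
    component: str       # assumption: alarms on the same component are "related"
    text: str
    raised_at: datetime
    critical: bool = False


def should_trigger(alarms: List[Alarm]) -> bool:
    """Return True if the runbook's trigger conditions are met."""
    # Condition 1: any alarm on a critical component.
    if any(a.critical for a in alarms):
        return True

    # Condition 2: MIN_RELATED or more related alarms inside the window.
    by_component: Dict[str, List[datetime]] = {}
    for a in alarms:
        by_component.setdefault(a.component, []).append(a.raised_at)

    for times in by_component.values():
        times.sort()
        for i in range(len(times) - MIN_RELATED + 1):
            if times[i + MIN_RELATED - 1] - times[i] <= RELATED_WINDOW:
                return True
    return False
```

A real system would likely define "related" through its own alarm grouping rules rather than a single component field.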

Safety checks (do first)

  1. Confirm you are working in the correct environment (site/system).
  2. Check for any “Do Not Operate” status or active maintenance window.
  3. If there is any safety risk, stop and escalate immediately.
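
If the safety checks are tracked in a tool, they could act as a gate in front of every other step. The sketch below is hypothetical: the SafetyStatus fields and the "hold" outcome are assumptions, since the runbook only says to stop and escalate on a safety risk and does not say what to do when a "Do Not Operate" status or maintenance window is found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyStatus:
    correct_environment: bool        # check 1: right site/system confirmed
    do_not_operate: bool             # check 2: "Do Not Operate" status present
    maintenance_window_active: bool  # check 2: active maintenance window
    safety_risk: bool                # check 3: any safety risk observed


def safety_gate(status: SafetyStatus) -> str:
    """Return 'stop_and_escalate', 'hold', or 'proceed'."""
    # A safety risk overrides everything else.
    if status.safety_risk:
        return "stop_and_escalate"
    # Assumption: an unconfirmed environment or an operating restriction
    # means no action is taken until someone resolves it.
    if not status.correct_environment:
        return "hold"
    if status.do_not_operate or status.maintenance_window_active:
        return "hold"
    return "proceed"
```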

Triage steps

  1. Record the timestamp, alarm text, and any related alerts.
  2. Check last known healthy status (example: last telemetry time).
  3. Check dependencies (power, communications, upstream service).
  4. Attempt the least-risk recovery action (example: re-poll / reset comms) if permitted.
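
The information gathered in steps 1 to 3 could be captured in one record, so it can be reused later in the escalation package. This is a minimal sketch; the TriageRecord class and its field names are illustrative assumptions, not part of any real system. It does not cover step 4, because recovery actions depend on what is permitted at the site.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class TriageRecord:
    recorded_at: datetime                               # step 1: timestamp
    alarm_text: str                                     # step 1: alarm text
    related_alerts: List[str] = field(default_factory=list)
    last_telemetry_at: Optional[datetime] = None        # step 2: last healthy
    # Step 3: dependency name -> True if healthy.
    dependencies: Dict[str, bool] = field(default_factory=dict)

    def telemetry_age(self) -> Optional[timedelta]:
        """Time between the last telemetry and the moment of recording."""
        if self.last_telemetry_at is None:
            return None
        return self.recorded_at - self.last_telemetry_at

    def failed_dependencies(self) -> List[str]:
        """Names of dependencies that were found unhealthy."""
        return sorted(n for n, ok in self.dependencies.items() if not ok)
```

For example, a record with dependencies {"power": True, "communications": False} reports "communications" as the only failed dependency.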

Decision points

  • If telemetry returns and alarm clears, go to Verification.
  • If telemetry remains offline for more than 10 minutes, escalate per the escalation matrix.
  • If the error indicates possible equipment damage, stop all remote actions and escalate.
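
The decision points above can be expressed as one small function. This is a sketch under assumptions: the names are hypothetical, checking for suspected damage first is a design choice made here (the runbook does not rank the conditions), and the "continue triage" outcome for the case no bullet covers is also an assumption.

```python
from datetime import timedelta
from enum import Enum

OFFLINE_LIMIT = timedelta(minutes=10)  # from the Decision points section


class NextStep(Enum):
    VERIFY = "go to Verification"
    ESCALATE = "escalate per matrix"
    STOP_AND_ESCALATE = "stop remote actions and escalate"
    CONTINUE_TRIAGE = "continue triage"  # assumption: not stated in the runbook


def decide(telemetry_online: bool, alarm_cleared: bool,
           offline_for: timedelta, damage_suspected: bool) -> NextStep:
    """Map the current observations to the next runbook step."""
    # Assumption: possible equipment damage takes priority over everything.
    if damage_suspected:
        return NextStep.STOP_AND_ESCALATE
    if telemetry_online and alarm_cleared:
        return NextStep.VERIFY
    # Strictly more than 10 minutes offline, matching "> 10 minutes".
    if not telemetry_online and offline_for > OFFLINE_LIMIT:
        return NextStep.ESCALATE
    return NextStep.CONTINUE_TRIAGE
```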

Escalation

  • Escalate to: on-call engineer / field tech / manager (based on severity).
  • Provide: symptom summary, steps attempted, current status, any safety concerns.

Escalation package (copy/paste)

Summary:
- Trigger/alarm:
- First observed (time):
- Scope (single site vs widespread):
- Checks completed:
- Actions attempted:
- Current status:
- Safety concerns:
- Requested next steps / owner:
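
The template above could also be filled in by a script, so the field order stays consistent between incidents. The following is a hedged sketch: render_escalation_package and PACKAGE_FIELDS are hypothetical names, and only the field labels come from the template itself.

```python
from typing import Mapping

# Field labels copied from the escalation package template, in order.
PACKAGE_FIELDS = (
    "Trigger/alarm",
    "First observed (time)",
    "Scope (single site vs widespread)",
    "Checks completed",
    "Actions attempted",
    "Current status",
    "Safety concerns",
    "Requested next steps / owner",
)


def render_escalation_package(values: Mapping[str, str]) -> str:
    """Render the template, leaving unknown-yet fields blank."""
    unknown = set(values) - set(PACKAGE_FIELDS)
    if unknown:
        # Reject typos in labels rather than silently dropping content.
        raise KeyError(f"Unknown field(s): {sorted(unknown)}")

    lines = ["Summary:"]
    for label in PACKAGE_FIELDS:
        value = values.get(label, "").strip()
        lines.append(f"- {label}: {value}" if value else f"- {label}:")
    return "\n".join(lines)
```

Calling it with an empty mapping reproduces the blank template, which makes it easy to see which fields still need an answer before escalating.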

Verification

  1. Confirm steady-state readings stay within the expected range for 10 minutes.
  2. Confirm the alarm has cleared and no new related alarms have appeared.
  3. Update log/ticket with actions taken and outcome.
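
Step 1 can be checked mechanically if readings are available as timestamped samples. This is an illustrative sketch only: the sample format (a time-sorted list of timestamp and value pairs), the function name, and the rule that less than 10 minutes of history counts as "not verified" are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

VERIFY_WINDOW = timedelta(minutes=10)  # from the Verification section


def steady_for_window(samples: List[Tuple[datetime, float]],
                      low: float, high: float,
                      window: timedelta = VERIFY_WINDOW) -> bool:
    """True if every sample in the most recent window is within [low, high].

    samples must be sorted by timestamp, oldest first.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window
    # Assumption: without a full window of history, do not call it verified.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(low <= value <= high for value in recent)
```

An out-of-range reading older than the window does not block verification, which matches the idea that the system only needs to be steady for the most recent 10 minutes.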

Closeout notes (recordkeeping)

  • What happened (1–2 sentences)
  • What you did (bullets)
  • Who you notified
  • What still needs follow-up

Changes

  • v1.0 — Initial published sample.