Runbook — Alarm Triage (Generic Example)
Version: v1.0 — 2026-01-01 · Sanitized portfolio sample — not operational guidance.
Runbook
Incident response
Triage
These examples are intentionally generic. See Sanitization checklist.
Trigger
- Alarm fires for a critical component (example: “Subsystem Offline”).
- Or: multiple related alarms within 5 minutes.
Safety checks (do first)
- Confirm you are working in the correct environment (site/system).
- Check for any “Do Not Operate” status or active maintenance window.
- If there is any safety risk, stop and escalate immediately.
Triage steps
- Record timestamp + alarm text + any related alerts.
- Check last known healthy status (example: last telemetry time).
- Check dependencies (power, communications, upstream service).
- Attempt the least-risk recovery action (example: re-poll / reset comms) if permitted.
Decision points
- If telemetry returns and alarm clears, go to Verification.
- If telemetry remains offline > 10 minutes, escalate per matrix.
- If error indicates possible equipment damage, stop remote actions and escalate.
Escalation
- Escalate to: on-call engineer / field tech / manager (based on severity).
- Provide: symptom summary, steps attempted, current status, any safety concerns.
Escalation package (copy/paste)
Summary:
- Trigger/alarm:
- First observed (time):
- Scope (single site vs widespread):
- Checks completed:
- Actions attempted:
- Current status:
- Safety concerns:
- Requested next steps / owner:
Verification
- Confirm steady-state readings within expected range for 10 minutes.
- Confirm alarms cleared and no new related alarms.
- Update log/ticket with actions taken and outcome.
Closeout notes (recordkeeping)
- What happened (1–2 sentences)
- What you did (bullets)
- Who you notified
- What still needs follow-up
Changes
- v1.0 — Initial published sample.