Runbook — Alarm Triage (Generic Example)

Version: v1.0 — 2026-01-01 · Sanitized portfolio sample — not operational guidance.

Runbook Incident response Triage

These examples are intentionally generic. See Sanitization checklist.

Trigger

Alarm fires for a critical component (example: “Subsystem Offline”).
Or: multiple related alarms within 5 minutes.

Safety checks (do first)

Confirm you are working in the correct environment (site/system).
Check for any “Do Not Operate” status or active maintenance window.
If there is any safety risk, stop and escalate immediately.

Triage steps

Record timestamp + alarm text + any related alerts.
Check last known healthy status (example: last telemetry time).
Check dependencies (power, communications, upstream service).
Attempt the least-risk recovery action (example: re-poll / reset comms) if permitted.

Decision points

If telemetry returns and alarm clears, go to Verification.
If telemetry remains offline > 10 minutes, escalate per matrix.
If error indicates possible equipment damage, stop remote actions and escalate.

Escalation

Escalate to: on-call engineer / field tech / manager (based on severity).
Provide: symptom summary, steps attempted, current status, any safety concerns.

Escalation package (copy/paste)

Summary:
- Trigger/alarm:
- First observed (time):
- Scope (single site vs widespread):
- Checks completed:
- Actions attempted:
- Current status:
- Safety concerns:
- Requested next steps / owner:

Verification

Confirm steady-state readings within expected range for 10 minutes.
Confirm alarms cleared and no new related alarms.
Update log/ticket with actions taken and outcome.

Closeout notes (recordkeeping)

What happened (1–2 sentences)
What you did (bullets)
Who you notified
What still needs follow-up

Changes

v1.0 — Initial published sample.