Incident response at enterprise scale is a discipline that separates organisations that recover quickly and learn from failures from those that repeat the same incidents indefinitely. The technical components — alerting, runbooks, communication channels — are necessary but insufficient. The organisational components — roles, escalation paths, and a culture of blameless post-mortems — are what actually determine how well an organisation responds to and learns from incidents.
The incident commander role
The most impactful structural change an organisation can make to its incident response practice is establishing a dedicated incident commander role. During an incident, the incident commander is responsible for coordination — not investigation. They manage communication, assign investigation tasks, track mitigation progress, and make the call on customer communication timing. They do not diagnose the incident themselves.
This separation of coordination from investigation prevents the most common incident response failure mode: a senior engineer who is simultaneously trying to diagnose the root cause, update stakeholders, coordinate other responders, and decide on mitigation strategy. Splitting these responsibilities across roles reduces mean time to mitigation significantly.
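The role split above can be made concrete as a small data model: the commander delegates investigation tasks but is barred from taking one themselves. This is an illustrative sketch, not a real incident-management API; the class and field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Hypothetical incident record; names and fields are illustrative."""
    id: str
    commander: str                                  # coordinates, never investigates
    responders: list = field(default_factory=list)  # engineers doing diagnosis
    tasks: dict = field(default_factory=dict)       # task -> assigned responder

    def assign_task(self, task: str, responder: str) -> None:
        """Commander delegates an investigation task to a responder."""
        if responder == self.commander:
            # Enforce the separation of coordination from investigation.
            raise ValueError("commander coordinates, does not investigate")
        if responder not in self.responders:
            self.responders.append(responder)
        self.tasks[task] = responder

inc = Incident(id="INC-1042", commander="alice")
inc.assign_task("check db replica lag", "bob")
```

The hard error on self-assignment is the point of the sketch: the structure itself prevents the failure mode of a commander drifting into diagnosis.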
Severity classification
A clear severity classification system is essential for calibrating the response effort to the actual impact. A common enterprise framework uses four levels: SEV-1 for complete service outage or data loss, SEV-2 for significant degradation affecting a meaningful portion of users, SEV-3 for minor degradation or elevated error rates below defined thresholds, and SEV-4 for potential issues not yet causing user impact.
Severity determines response time SLOs, communication requirements, and escalation paths. SEV-1 incidents require immediate all-hands response and executive notification within 15 minutes. SEV-4 incidents can be queued for investigation during business hours. Misclassifying severity — treating a SEV-2 as a SEV-4 — is one of the most common sources of delayed mitigation in organisations without clear classification criteria.
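One way to make such a framework unambiguous is to encode it as data rather than leave it in a wiki page. The sketch below captures the four levels described above; the 15-minute SEV-1 executive notification comes from the text, while the other SLO values and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class SeverityPolicy:
    """Illustrative per-level response policy; field names are assumptions."""
    description: str
    exec_notify: Optional[timedelta]  # deadline for executive notification, if any
    business_hours_only: bool         # may investigation wait for business hours?

SEVERITY = {
    "SEV-1": SeverityPolicy("complete service outage or data loss",
                            exec_notify=timedelta(minutes=15),
                            business_hours_only=False),
    "SEV-2": SeverityPolicy("significant degradation for a meaningful portion of users",
                            exec_notify=timedelta(hours=1),   # assumed value
                            business_hours_only=False),
    "SEV-3": SeverityPolicy("minor degradation or elevated error rates",
                            exec_notify=None,
                            business_hours_only=False),
    "SEV-4": SeverityPolicy("potential issue, no user impact yet",
                            exec_notify=None,
                            business_hours_only=True),
}
```

Keeping the policy in code (or versioned configuration) means the classification criteria can be reviewed in post-mortems and changed deliberately, not interpreted differently by each on-call engineer.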
Runbooks that actually work
Runbooks fail in production for two reasons: they are outdated, or they assume knowledge that the on-call engineer does not have at 3am. Effective runbooks are concise, command-level, and verified on real incidents. They do not explain background context — they answer the question: what do I do right now to mitigate this specific alert?
Runbooks should be linked directly from alert definitions. When an alert fires, the on-call engineer should be able to navigate to the runbook in one click. Runbooks that require searching documentation are runbooks that will not be used under pressure.
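The one-click rule can be enforced mechanically, for example with a lint check in CI that rejects any alert definition lacking a runbook link. The alert fields below loosely mimic a Prometheus-style rule but are illustrative, not a real schema.

```python
# Hypothetical alert definitions; names and URLs are illustrative only.
ALERTS = [
    {"name": "HighErrorRate",
     "expr": "error_rate > 0.05",
     "runbook_url": "https://runbooks.example.com/high-error-rate"},
    {"name": "ReplicaLag",
     "expr": "replica_lag_seconds > 30",
     "runbook_url": "https://runbooks.example.com/replica-lag"},
]

def missing_runbooks(alerts):
    """Return names of alerts that cannot reach a runbook in one click."""
    return [a["name"] for a in alerts if not a.get("runbook_url")]
```

Running `missing_runbooks` as a pre-merge check turns the policy into a hard gate: an alert without a runbook never reaches production.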
Blameless post-mortems
The post-mortem is where an organisation either learns from an incident or performs the ritual of documenting it without changing anything. Blameless post-mortems — where the focus is on system and process failures rather than individual errors — produce actionable findings. Post-mortems that identify a person as the root cause produce nothing except a culture of fear that suppresses honest incident reporting.
Every post-mortem should produce a short list of concrete action items with owners and due dates. Action items that are not tracked to completion mean the same incident will recur. The post-mortem action item completion rate is a leading indicator of an organisation's SRE maturity.
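Measuring that completion rate requires little more than a record per action item. A minimal sketch, with assumed field names and illustrative data:

```python
from datetime import date

# Hypothetical action-item records; owners, dates, and statuses are illustrative.
action_items = [
    {"owner": "alice", "due": date(2024, 6, 1),  "done": True},
    {"owner": "bob",   "due": date(2024, 6, 15), "done": False},
    {"owner": "carol", "due": date(2024, 7, 1),  "done": True},
]

def completion_rate(items):
    """Fraction of post-mortem action items that were completed."""
    if not items:
        return 1.0  # vacuously complete
    return sum(i["done"] for i in items) / len(items)
```

Plotting this rate per quarter shows whether post-mortems are driving change or merely documenting incidents.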