Estimating a performance regression

The story where p95 doubled last week and the room is voting on the wrong thing.

Most of the time, performance regressions are diagnosis-heavy and fix-light. The team that's already found the N+1 query knows the fix is small. The team that hasn't is voting on a story whose actual content is "spend three days reading flame graphs." Those are different stories.
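For the record, the N+1 shape looks like this. A minimal sketch in Python over an in-memory sqlite3 database; the orders/customers schema is invented for illustration, not pulled from the story:

```python
import sqlite3

# Invented schema for illustration; not the system in the story.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 9.50), (2, 2, 12.00), (3, 1, 3.25);
""")

# The N+1 shape: one query for the list, then one more query per row.
# Invisible at 10 rows, a doubled p95 at 10,000.
orders = conn.execute("SELECT id, customer_id, total FROM orders").fetchall()
for _order_id, customer_id, _total in orders:
    conn.execute("SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()

# The fix, once someone has actually found it, is small: one JOIN.
rows = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
```

Finding that loop is the three days. Replacing it with the JOIN is the afternoon.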

Two questions decide the estimate. First: is the cause known? If yes, size the fix. If no, size the investigation. Second: what's the SLO the regression broke, and what's the deadline for restoring it? A regression that violates a customer-facing SLO has a different urgency profile from a 10% slowdown nobody noticed.
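The second question is answerable with arithmetic before anyone votes. A minimal sketch, assuming you can pull raw latency samples and that the SLO is a 300ms p95; both the threshold and the samples below are made up:

```python
from statistics import quantiles

SLO_P95_MS = 300.0  # hypothetical target; substitute your service's real SLO


def p95(samples_ms: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(samples_ms, n=20)[18]


def breach_report(samples_ms: list[float]) -> str:
    observed = p95(samples_ms)
    if observed <= SLO_P95_MS:
        return f"p95 {observed:.0f}ms within SLO ({SLO_P95_MS:.0f}ms): time to investigate properly"
    return f"p95 {observed:.0f}ms breaches SLO ({SLO_P95_MS:.0f}ms): mitigate first, diagnose second"


# Fabricated samples standing in for last week's latencies.
print(breach_report([180, 210, 220, 250, 260, 280, 310, 340, 420, 450]))
```

If all you have is histogram dashboards rather than raw samples, the same check runs in your metrics system instead. The point is that "are we breached" is a computed fact, not a vote.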

What gets said in the room

SRE: "p95 went from 200ms to 450ms last Wednesday."

Backend: "Anything obvious in the deploys that day?"

Lead: "Three deploys. None of them touched the slow endpoint."

Backend: "Or didn't look like they touched the slow endpoint."

SRE: "Are we within SLO? Do we have time to investigate properly?"

Questions worth asking before voting

  • Is the cause known, or are we sizing an investigation?
  • What's the SLO, and is it currently breached?
  • What changed between the last good window and the first bad window? (There's a sketch of this first pass after the list.)
  • Do we have flame graphs, traces, or just dashboards?
  • Is rollback a viable mitigation while we investigate?
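On the last-good/first-bad question, the first pass is mechanical: list everything that shipped between the two windows. A sketch of that pass using git log, with stand-in timestamps, assuming it runs inside the service's repo; deploy manifests and feature-flag audit logs deserve the same diff:

```python
import subprocess

# Stand-in timestamps: the last window where p95 looked normal,
# and the first window where it didn't.
LAST_GOOD = "2024-05-01 09:00"
FIRST_BAD = "2024-05-01 14:00"

# Every commit that landed between the two windows is a suspect,
# including the ones that "didn't touch the slow endpoint".
log = subprocess.run(
    ["git", "log", "--oneline", f"--since={LAST_GOOD}", f"--until={FIRST_BAD}"],
    capture_output=True, text=True, check=True,
).stdout

suspects = log.splitlines()
print(f"{len(suspects)} commits between last-good and first-bad:")
for line in suspects:
    print(" ", line)
```

The three deploys that didn't look like they touched the slow endpoint still show up in this list, which is the point.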

As with a bug that has no repro, the work hides in the diagnosis. Vote on what you actually have, not on what you wish you had. Open a session when the cause is named.