Estimating a performance regression

The story where p95 doubled last week and the room is voting on the wrong thing.

Most of the time, performance regressions are diagnosis-heavy and fix-light. The team that's already found the N+1 query knows the fix is small. The team that hasn't is voting on a story whose actual content is "spend three days reading flame graphs." Those are different stories.
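For the record, the N+1 shape looks like this. A minimal sketch in Python over an in-memory sqlite3 database; the orders/customers schema is invented for illustration, not pulled from the story:

```python
import sqlite3

# Invented schema for illustration; not the system in the story.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 9.50), (2, 2, 12.00), (3, 1, 3.25);
""")

# The N+1 shape: one query for the list, then one more query per row.
# Invisible at 10 rows, a doubled p95 at 10,000.
orders = conn.execute("SELECT id, customer_id, total FROM orders").fetchall()
for _order_id, customer_id, _total in orders:
    conn.execute("SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()

# The fix, once someone has actually found it, is small: one JOIN.
rows = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchall()
```

Finding that loop is the three days. Replacing it with the JOIN is the afternoon.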

Two questions decide the estimate. First: is the cause known? If yes, size the fix. If no, size the investigation. Second: what's the SLO the regression broke, and what's the deadline for restoring it? A regression that violates a customer-facing SLO has a different urgency profile from a 10% slowdown nobody noticed.
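The second question is answerable with arithmetic before anyone votes. A minimal sketch, assuming you can pull raw latency samples and that the SLO is a 300ms p95; both the threshold and the samples below are made up:

```python
from statistics import quantiles

SLO_P95_MS = 300.0  # hypothetical target; substitute your service's real SLO


def p95(samples_ms: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(samples_ms, n=20)[18]


def breach_report(samples_ms: list[float]) -> str:
    observed = p95(samples_ms)
    if observed <= SLO_P95_MS:
        return f"p95 {observed:.0f}ms within SLO ({SLO_P95_MS:.0f}ms): time to investigate properly"
    return f"p95 {observed:.0f}ms breaches SLO ({SLO_P95_MS:.0f}ms): mitigate first, diagnose second"


# Fabricated samples standing in for last week's latencies.
print(breach_report([180, 210, 220, 250, 260, 280, 310, 340, 420, 450]))
```

If all you have is histogram dashboards rather than raw samples, the same check runs in your metrics system instead. The point is that "are we breached" is a computed fact, not a vote.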

What gets said in the room

SRE: "p95 went from 200ms to 450ms last Wednesday."

Backend: "Anything obvious in the deploys that day?"

Lead: "Three deploys. None of them touched the slow endpoint."

Backend: "Or didn't look like they touched the slow endpoint."

SRE: "Are we within SLO? Do we have time to investigate properly?"

Questions worth asking before voting

  • Is the cause known, or are we sizing an investigation?
  • What's the SLO, and is it currently breached?
  • What changed between the last good window and the first bad window? (There's a sketch of this first pass after the list.)
  • Do we have flame graphs, traces, or just dashboards?
  • Is rollback a viable mitigation while we investigate?
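On the last-good/first-bad question, the first pass is mechanical: list everything that shipped between the two windows. A sketch of that pass using git log, with stand-in timestamps, assuming it runs inside the service's repo; deploy manifests and feature-flag audit logs deserve the same diff:

```python
import subprocess

# Stand-in timestamps: the last window where p95 looked normal,
# and the first window where it didn't.
LAST_GOOD = "2024-05-01 09:00"
FIRST_BAD = "2024-05-01 14:00"

# Every commit that landed between the two windows is a suspect,
# including the ones that "didn't touch the slow endpoint".
log = subprocess.run(
    ["git", "log", "--oneline", f"--since={LAST_GOOD}", f"--until={FIRST_BAD}"],
    capture_output=True, text=True, check=True,
).stdout

suspects = log.splitlines()
print(f"{len(suspects)} commits between last-good and first-bad:")
for line in suspects:
    print(" ", line)
```

The three deploys that didn't look like they touched the slow endpoint still show up in this list, which is the point.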

As with a bug that has no repro, the work hides in the diagnosis. Vote on what you actually have, not on what you wish you had. Open a session when the cause is named.