Estimating a flaky test
The story where the cards never converge — because the answer depends on what you find.
A flaky test is two estimates. You don't know which yet.
One version is a one-line fix: a timing assumption, a test order dependency, a missing await. The other version is three days deep into a race condition in production code that the test happened to catch. The team isn't disagreeing about size when the cards spread from 2 to 13 — they're disagreeing about which kind of bug it is, which is the actual unknown.
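The 2-point version looks something like this. A minimal sketch, assuming a Jest-style runner; `saveUser` and `getUser` are hypothetical stand-ins for any async write-then-read pair.

```typescript
import { test, expect } from "@jest/globals";
import { saveUser, getUser } from "./users"; // hypothetical async store

test("saved user is readable", async () => {
  // BUG: missing await -- the write races the read below. On a fast
  // dev machine the write usually lands in time; on a loaded CI
  // runner it sometimes doesn't, and the test flakes.
  saveUser({ id: 1, name: "Ada" });
  const user = await getUser(1);
  expect(user?.name).toBe("Ada"); // one-line fix: await the save
});
```

The 13-point version can look identical in the diff: the same race lives in production code, and awaiting it in the test only hides the symptom.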
Estimating before you find out which kind of bug it is, instead of after, is how this story ends up at 5 in week one and 13 in week three. The fix isn't a sharper number; it's a different shape of work. Time-box the investigation, then estimate the actual fix when the shape is known.
What gets said in the room
Engineer A: "If it's flaky for the reason I think, this is a 2."
Engineer B: "If it's flaky for the reason I think, it's a 13."
Lead: "Has anyone actually run it locally with the seed pinned?"
QA: "It only fails in CI. Never on a dev machine."
Questions worth asking before voting
- Does it reproduce locally, or only in CI?
- How often does it fail — 1 in 50, 1 in 5, every other run?
- When did it start failing? Which commit? (see the bisect sketch after this list)
- Is the test wrong, or is it catching something real?
- Is anyone disabling it in the meantime, and at what cost?
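The "which commit" question can be answered mechanically once there's a local reproduction, using `git bisect run`, which replays history against a command's exit code. A sketch under two assumptions: the check command is the rerun loop above (a single run isn't enough, since a flaky test can pass on a bad commit by luck), and the good ref is a placeholder for your last known-green build.

```typescript
// bisect.ts -- find the commit that introduced the flake.
import { execSync } from "node:child_process";

const BAD = "HEAD";
const GOOD = "v1.4.0"; // placeholder: last ref where CI was green
// Use the rerun loop, not a single run: one lucky pass on a bad
// commit would send the bisect down the wrong half of history.
const CHECK = "npx tsx rerun.ts";

execSync(`git bisect start ${BAD} ${GOOD}`, { stdio: "inherit" });
execSync(`git bisect run ${CHECK}`, { stdio: "inherit" }); // prints the first bad commit
execSync("git bisect reset", { stdio: "inherit" });
```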
Vote a time-box, not a fix. Half a day to investigate, then a real story for whatever you find.
See estimating bugs at feature precision for the broader pattern. Open a session when the investigation has a result.