Estimating a rate-limit rollout

The story where the code is half a day and the rollout is two months.

Rate limits are easy to implement and hard to roll out. The middleware is a known pattern. The hard part is picking thresholds nobody complains about, and the only way to pick them is to measure existing usage, announce the limits in advance, watch a dry-run period to see who would have been blocked, raise the thresholds that trip too often, and only then enforce. That sequence is the work, and it doesn't fit in a sprint.
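The measurement step can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes usage has already been exported as (customer, requests-per-minute) samples, and the function name and sample shape are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def per_customer_percentiles(samples):
    """Summarize usage per customer.

    samples: iterable of (customer_id, requests_per_minute) pairs,
    one pair per customer per observed minute (an assumed export format).
    Returns {customer_id: {"p50": ..., "p95": ..., "p99": ...}}.
    """
    by_customer = defaultdict(list)
    for customer_id, rpm in samples:
        by_customer[customer_id].append(rpm)

    summary = {}
    for customer_id, values in by_customer.items():
        if len(values) < 2:
            # Too few samples to interpolate; report the single value.
            only = values[0]
            summary[customer_id] = {"p50": only, "p95": only, "p99": only}
            continue
        # n=100 yields 99 cut points; index i-1 is the i-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        summary[customer_id] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return summary
```

Comparing each customer's p99 against a candidate threshold gives a first list of accounts to contact before anything is enforced.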

Teams that estimate the middleware miss the rollout entirely. Teams that estimate the rollout get a much bigger number, ask whether it's actually urgent, and usually decide to phase it across multiple cycles, which is the right answer. Sizing the whole thing at once forces the team into a false binary: take on all of it now or none of it.

What gets said in the room

Backend: "The middleware is a day. We have a library."

SRE: "What thresholds? Have we looked at p99 of current usage?"

PM: "Who do we need to email before this turns on?"

Support: "What does the 429 response say? Is there a retry-after?"

Lead: "Dry-run mode first, or straight to enforcement?"
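The Lead's question is easier to answer when dry-run and enforcement are the same code path behind a flag. A rough sketch, assuming a simple fixed-window counter (the class and attribute names are hypothetical, and a real library will likely use a different algorithm):

```python
import time
from collections import defaultdict


class DryRunLimiter:
    """Fixed-window limiter. With enforce=False it allows every request
    and only records which keys would have been blocked."""

    def __init__(self, limit_per_window, window_seconds=60,
                 enforce=False, clock=time.monotonic):
        self.limit = limit_per_window
        self.window = window_seconds
        self.enforce = enforce
        self.clock = clock
        # (key, window index) -> requests seen. Old windows are never
        # evicted here; a real implementation would expire them.
        self.counts = defaultdict(int)
        # key -> number of requests over the limit
        self.would_block = defaultdict(int)

    def allow(self, key):
        window_index = int(self.clock() // self.window)
        self.counts[(key, window_index)] += 1
        if self.counts[(key, window_index)] <= self.limit:
            return True
        self.would_block[key] += 1
        # Dry-run: let the request through. Enforcing: reject it.
        return not self.enforce
```

At the end of the dry-run period, `would_block` is the list of who would have been surprised. Flipping `enforce` to True is then a configuration change rather than a new deploy.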

Questions worth asking before voting

  • Have we measured current usage — p50, p95, p99 per customer?
  • What thresholds — and how were they chosen?
  • Per-account, per-IP, per-API-key? Combinations?
  • Dry-run period: how long, and what counts as "no surprises"?
  • Customer comms: who do we email, and how far in advance?
  • What does the 429 response look like — message, retry-after, docs link?
  • What's the override path for a customer who needs a higher limit?
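For the 429 question, it can help to put a concrete response in front of Support before voting. The sketch below is one possible shape, not a standard: `Retry-After` is a real HTTP header, but the body field names and the docs URL are assumptions made up for illustration.

```python
import json
import math


def build_429(retry_after_seconds, limit, window_seconds, docs_url):
    """Return (status, headers, body) for a rate-limited request."""
    # Retry-After takes a whole number of seconds; round up so clients
    # never retry too early.
    retry_after = max(0, math.ceil(retry_after_seconds))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
    }
    body = {
        # Field names below are hypothetical.
        "error": "rate_limited",
        "message": (
            f"Limit of {limit} requests per {window_seconds}s exceeded. "
            f"Retry in {retry_after}s."
        ),
        "docs": docs_url,
    }
    return 429, headers, json.dumps(body)


# Example with a placeholder docs link.
status, headers, body = build_429(12.3, 100, 60, "https://example.com/docs/rate-limits")
```

The docs page the response links to is a natural place to describe the override path for customers who need a higher limit.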

Split it: enforcement is one story, dry-run + comms is another, threshold tuning is a third. Each can be sized on its own. See database migrations for the same shape: small code change, big rollout.