Estimating a rate-limit rollout
The story where the code is half a day and the rollout is two months.
Rate limits are easy to implement and hard to roll out. The middleware is a known pattern. The hard part is picking thresholds nobody complains about, and the only way to pick them is to measure existing usage, announce the limits in advance, watch a dry-run period for who would have been blocked, raise the noisy ones, and only then enforce. That sequence is the work, and it doesn't fit in a sprint.
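To make the measure, dry-run, enforce sequence concrete, here is a minimal sketch of a limiter with a mode switch. It is not taken from any particular library: `FixedWindowLimiter`, `Mode`, `would_block`, and `overrides` are hypothetical names, and the fixed-window counting is just one simple choice among several.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    DRY_RUN = "dry_run"  # record would-be blocks, but let the request through
    ENFORCE = "enforce"  # reject requests that exceed the limit


@dataclass
class FixedWindowLimiter:
    """Hypothetical sketch: per-key fixed-window counter with a dry-run mode."""
    limit: int                       # max requests per window, per key
    window_seconds: int = 60
    mode: Mode = Mode.DRY_RUN
    # Assumed override path: key -> raised limit for noisy but legitimate callers.
    overrides: dict = field(default_factory=dict)
    # Dry-run output: how many requests per key would have been blocked.
    would_block: dict = field(default_factory=lambda: defaultdict(int))
    # (key, window index) -> request count. Old windows are never evicted here;
    # a real deployment would need expiry.
    _counts: dict = field(default_factory=lambda: defaultdict(int))

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        self._counts[(key, window)] += 1
        limit = self.overrides.get(key, self.limit)
        if self._counts[(key, window)] <= limit:
            return True
        # Over the limit: always record it, only reject when enforcing.
        self.would_block[key] += 1
        return self.mode is Mode.DRY_RUN
```

Running in `Mode.DRY_RUN` for the announced period and reading `would_block` shows who would have been hit. Raising the noisy ones then means adding entries to `overrides` (or raising `limit`) before flipping the mode to `Mode.ENFORCE`.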
Teams that estimate only the middleware miss the rollout entirely. Teams that estimate the rollout get a much bigger number, ask whether it's actually urgent, and usually decide to phase it across multiple cycles, which is the right answer. Sizing the whole thing as a single item forces a false binary: commit to all of it now or defer all of it.
What gets said in the room
Backend: "The middleware is a day. We have a library."
SRE: "What thresholds? Have we looked at p99 of current usage?"
PM: "Who do we need to email before this turns on?"
Support: "What does the 429 response say? Is there a retry-after?"
Lead: "Dry-run mode first, or straight to enforcement?"
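Support's question about the 429 can be answered with a concrete shape. The sketch below assumes a fixed-window limit and a JSON body; the function name `build_429`, the body fields, and the docs URL are all placeholders for illustration. The `Retry-After` header itself is standard HTTP and takes a number of seconds.

```python
import json
import math


def build_429(limit, window_seconds, now,
              docs_url="https://example.com/docs/rate-limits"):
    """Hypothetical helper: returns (status, headers, body) for a blocked request."""
    # Seconds until the current fixed window ends, rounded up, at least 1.
    window_end = (math.floor(now / window_seconds) + 1) * window_seconds
    retry_after = max(1, math.ceil(window_end - now))
    headers = {
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    body = {
        "error": "rate_limited",
        "message": (
            f"Limit of {limit} requests per {window_seconds}s exceeded. "
            f"Retry in {retry_after}s."
        ),
        "retry_after_seconds": retry_after,
        "docs": docs_url,  # placeholder link to the published limits
    }
    return 429, headers, json.dumps(body)
```

Whatever the real shape ends up being, agreeing on the message, the retry hint, and the docs link before enforcement gives Support something to point customers at.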
Questions worth asking before voting
- Have we measured current usage — p50, p95, p99 per customer?
- What thresholds — and how were they chosen?
- Per-account, per-IP, per-API-key? Combinations?
- Dry-run period: how long, and what counts as "no surprises"?
- Customer comms: who do we email, and how far in advance?
- What does the 429 response look like — message, retry-after, docs link?
- What's the override path for a customer who needs a higher limit?
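The first two questions can be answered from request logs. A minimal sketch, assuming usage has already been aggregated into `(customer_id, requests_per_minute)` samples; the function names and the 2x headroom factor are illustrative assumptions, not a recommended policy.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_values[rank - 1]


def usage_percentiles(samples):
    """samples: iterable of (customer_id, requests_per_minute) pairs."""
    by_customer = defaultdict(list)
    for customer, rpm in samples:
        by_customer[customer].append(rpm)
    return {
        customer: {f"p{p}": nearest_rank(sorted(values), p) for p in (50, 95, 99)}
        for customer, values in by_customer.items()
    }


def suggest_threshold(p99, headroom=2.0):
    """Assumed rule of thumb: observed p99 times a headroom factor, rounded up."""
    return math.ceil(p99 * headroom)
```

For example, feeding one customer's per-minute counts through `usage_percentiles` and then `suggest_threshold(stats["p99"])` gives a starting number to announce. The dry-run period is what tells you whether that number was right.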
Split it: enforcement is one story, dry-run + comms is another, threshold tuning is a third. Each can be sized on its own. See database migrations for the same shape: small code change, big rollout.