Story points and AI coding tools

Copilot, Cursor, Claude Code. The points don't change. The velocity does, and only after the fact.

Copilot doesn't change complexity or uncertainty. Effort drops; velocity rebases empirically. Don't re-point the backlog.

The question shows up in every team that's been using an AI coding assistant for a few months. We're shipping work faster than we used to — should we re-size the backlog? Should the new "5" be what last quarter's "3" was? Should we adjust the team's reference story? The honest answer is no, and the reason is that story points were never measuring the thing that changed.

Points measure relative complexity and uncertainty. Hours measure effort. Story points and hours are not different units of the same thing (see story points vs hours), and the failure mode AI coding tools create is the temptation to collapse the distinction. The tool reduces the keyboard time on a story, sometimes dramatically. The complexity of the story is unchanged. The unknowns are unchanged. The integration risks, the edge cases, the deployment dance: all unchanged. If you re-point the story from a 5 to a 3 because Copilot wrote the boilerplate, you're pointing in hours wearing a points jersey.

What does change

Velocity. The same team shipping the same kind of work with an AI assistant ships more points per sprint than they did six months ago. Not because the points got smaller — because the team got faster at converting points into shipped code. That's the velocity number doing exactly what it's supposed to: reflecting the team's actual rate of delivery, including whatever the team is using to deliver it. Velocity rebases naturally over three or four sprints; no re-pointing required.
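The rebasing arithmetic is nothing more than a trailing average over recent sprints. A minimal sketch, with hypothetical sprint numbers (the function name and window size are illustrative, not from any tool):

```python
# Hypothetical points completed per sprint, oldest first.
# Suppose the AI assistant arrived around sprint 4; throughput
# drifts upward on its own, with no re-pointing of stories.
completed = [21, 23, 22, 27, 30, 31]

def forecast(history, window=3):
    """Forecast next sprint's capacity as the average of the last `window` sprints."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# The trailing average already reflects the speed-up: the last three
# sprints were shipped with the tool in use, so the forecast is too.
print(forecast(completed))
```

The point of the window is that old sprints age out: after three or four sprints with the new tooling, the forecast is built entirely from post-tool data, and the "rebase" has happened without anyone touching an estimate.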

The mix of work that fits the team also changes. Stories that used to be "1 day of boilerplate + 1 hour of thinking" are now closer to "10 minutes of generation + 1 hour of thinking." That's a real shift, and it shows up in two places: the team picks up more of those stories per sprint, and the work that takes disproportionate time is now the work where AI doesn't help — novel algorithms, integration with quirky external systems, code that has to be readable by humans without an LLM in the loop.

What doesn't change

Stories with high uncertainty are still high-uncertainty. AI coding tools don't reduce the unknown around a spike; they speed up some of the investigation, but the thing the spike is trying to learn hasn't gotten smaller. Stories with complex integration are still complex; Copilot is bad at the parts of the work where you have to hold three systems' constraints in your head simultaneously. The team's reference story should still be one the team shipped recently — the AI tool was part of how they shipped it, so the calibration already accounts for whatever speed-up exists.

Bugs are the place this gets subtle. Some bugs that used to be 5s are now 2s because the diagnosis is faster when you can paste a stack trace into a chat. Some bugs that used to be 5s are still 5s because the diagnosis was never the bottleneck. The team's reference bug should be one they fixed recently with the tools they currently have. If it isn't, the calibration drifted, and the fix is the usual one — pick a new reference, not re-point the backlog.

What to push back on

"Velocity is up 30% — let's commit to 30% more next sprint." No, velocity is up 30% on the kind of work the team has been doing. The next sprint's work may be the kind AI helps with or the kind it doesn't. The forecast input is "the team's recent average," which already includes the AI-tooling effect; don't manually scale it again. That's double-counting the improvement.

"We should re-estimate the backlog because everything got smaller." No, the backlog stories haven't gotten smaller; the team got faster. The forecast that velocity-times-points produces will already reflect that. Re-estimating breaks the calibration without changing the forecast — see re-estimating for why touching historical points is almost always wrong.

"Our story-point scale is now wrong because AI changes the work." No, story points were never measuring effort directly. They measured how big a story is compared to a reference story. As long as the reference is current, the scale is current. The scale gets "wrong" when nobody on the team remembers the reference, which is a calibration problem AI tools didn't cause and don't worsen.

Let velocity rebase. Pick a recent reference story that was shipped with the tools you have now. Don't re-point what you've already shipped.

Adjacent: story points vs hours for the unit-confusion this question rests on; velocity for the rebasing pattern that does the heavy lifting; re-estimating for what to do (and what not to do) with historical estimates.