Estimating an ML experiment
The story where the model is the easy part and the data is the work.
An ML experiment is research with engineering attached. The model architecture is usually an off-the-shelf choice, and training is a known pattern. The work is upstream: getting clean training data, defining the evaluation metric the team will live or die by, building the pipeline that takes the model from notebook to production. The first 80% of the experiment is data work; the last 20% is the model.
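To make "the evaluation metric the team will live or die by" concrete, here is a minimal sketch of an evaluation harness: score the candidate model on held-out data and compare it against a trivial baseline, with the success margin agreed before training starts. The binary-classification setup, F1 as the metric, and the `MARGIN` value are all assumptions for illustration, not prescriptions.

```python
# Minimal evaluation harness sketch. Assumptions: binary classification,
# F1 as the agreed metric, a majority-class baseline as the bar to clear.
from sklearn.metrics import f1_score

def evaluate(predict, examples, labels):
    """Score a predict(examples) -> predictions callable on held-out data."""
    return f1_score(labels, predict(examples))

def majority_baseline(examples):
    # The simplest baseline: always predict the most common class (here: 1).
    return [1] * len(examples)

# The experiment "succeeds" only if the model clears the baseline by a
# margin that was agreed before training, not after.
MARGIN = 0.05

def experiment_result(model_predict, held_out_x, held_out_y):
    model_f1 = evaluate(model_predict, held_out_x, held_out_y)
    base_f1 = evaluate(majority_baseline, held_out_x, held_out_y)
    return {"model_f1": model_f1, "baseline_f1": base_f1,
            "success": model_f1 >= base_f1 + MARGIN}
```

Writing this harness first, before any training run, is what turns "the model works" from an opinion into a reportable result.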
Estimating an ML experiment with the feature deck is the same trap as estimating a research spike — you don't know what you'll find. The estimate has to be a budget for the experiment, not a forecast of the result. If the result is "the model doesn't work", that's still a successful experiment that consumed the budget; the team can't pretend that outcome makes the points wrong.
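One way to encode "a budget, not a forecast" directly: run trials under a wall-clock deadline and report whatever came out, negative result included. A sketch under stated assumptions; `run_one_trial` is a hypothetical callable and the 16-hour budget is a placeholder.

```python
import time

# Time-boxed experiment loop: the budget is the deadline, not the result.
# `run_one_trial` is a hypothetical callable that runs one training/eval
# cycle and returns a metric value.
def run_experiment(run_one_trial, budget_hours=16):
    deadline = time.monotonic() + budget_hours * 3600
    results = []
    while time.monotonic() < deadline:
        results.append(run_one_trial())
    # "No trial beat the baseline" is a valid, reportable outcome:
    # the budget was consumed, the knowledge was delivered.
    return {"trials": len(results),
            "best": max(results, default=None)}
```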
What gets said in the room
ML engineer: "Training is a day. The model is standard."
Lead: "Where's the training data coming from?"
Data: "We need to label about 5,000 examples first."
PM: "What metric tells us the experiment succeeded?"
SRE: "If it works, what does shipping it look like? Inference latency? Cost?"
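The 5,000 examples in that exchange are where the estimate usually hides. A back-of-envelope check, where the seconds-per-label rate and the team size are assumptions to replace with measured numbers:

```python
# Back-of-envelope labelling effort. The 5,000 comes from the exchange
# above; the 30 seconds per label is an assumed rate, not a measurement.
examples = 5_000
seconds_per_label = 30
labellers = 2

hours_total = examples * seconds_per_label / 3600    # ~41.7 hours
days_per_labeller = hours_total / labellers / 6      # ~3.5 six-hour days each
print(f"{hours_total:.0f} hours of labelling, "
      f"~{days_per_labeller:.1f} working days each for {labellers} people")
```

If that arithmetic says days of labelling and the vote says "training is a day", the points are measuring the wrong thing.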
Questions worth asking before voting
- Is the training data ready, or is labelling part of this story?
- What's the evaluation metric, and what's the baseline to beat?
- Notebook-only experiment, or end-to-end including a serving path?
- What's the time-box, and what's the deliverable at the end of it?
- If the model works, what's the path to production — and is that in scope?
- Cost of training and inference — does the team have the budget? (See the sketch after this list.)
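For that last question, a back-of-envelope cost sketch. Every number below (GPU rate, run count, request volume, per-request cost) is a placeholder to swap for the team's own figures:

```python
# Back-of-envelope training and inference cost. All rates and volumes
# are placeholder assumptions, not quotes.
gpu_hourly_rate = 2.50          # USD per GPU-hour (assumed)
training_gpu_hours = 3 * 8      # 3 runs of 8 hours each (assumed)
training_cost = gpu_hourly_rate * training_gpu_hours

requests_per_day = 100_000      # assumed production volume
cost_per_1k_requests = 0.02     # assumed inference cost in USD (assumed)
monthly_inference = requests_per_day * 30 * cost_per_1k_requests / 1000

print(f"training: ~${training_cost:.0f} per iteration cycle")
print(f"inference: ~${monthly_inference:.0f} per month at assumed volume")
```

Two minutes of this arithmetic in the room often changes the vote more than any debate about the model architecture.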
Like spikes and prototypes, an ML experiment's deliverable is knowledge first; treat it the same way, and open a session when the experiment has shipped its result.