The model is the easy part

A fresh-food operator came to us with a clean question: can you predict tomorrow's demand per dish, per site, accurately enough that we stop binning margin and stop running out at lunch? Eighteen months later the platform forecasts at 98% accuracy in production. People assume the hard part was the machine learning. It wasn't.

The model that does the predicting is a gradient-boosted regressor with some seasonality features bolted on. A competent data scientist can stand up something in that family in an afternoon. We had a respectable first version inside three weeks — accurate enough on the backtest to make everyone in the room nod.

Then we spent five months building everything that turns "an accurate backtest" into "a number an operations manager bets the day on." That gap is the entire job, and almost nobody writes about it because it isn't glamorous. So here it is.

The three-week model

Forecasting demand for a stable, well-recorded business is, frankly, a solved problem. You have a target (units sold), a calendar, and a pile of historical rows. You engineer a few dozen features — day of week, lag windows, rolling means, holidays, a weather join — and you let a boosting library find the interactions. The math is not where projects die.
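
To make that feature list concrete, here is a minimal sketch of the calendar, lag and rolling-mean features for a single dish at a single site. It is illustrative rather than the production code: `build_features`, the lag choices and the seven-day window are assumptions, and the holiday and weather joins are left out. The one rule it does enforce is that only sales recorded before the target day are visible.

```python
from datetime import date, timedelta
from statistics import mean

NAN = float("nan")

def build_features(history, target_day, lags=(1, 7, 14), window=7):
    """Build features for target_day from a {date: units_sold} history.

    Only days strictly before target_day are used, so the same function
    gives the same answer in a backtest and at prediction time.
    """
    past = {d: u for d, u in history.items() if d < target_day}
    features = {"day_of_week": float(target_day.weekday())}  # Monday == 0
    for lag in lags:
        # A missing day becomes NaN rather than a silently invented zero.
        features[f"lag_{lag}"] = float(past.get(target_day - timedelta(days=lag), NAN))
    window_days = (target_day - timedelta(days=i) for i in range(1, window + 1))
    recent = [past[d] for d in window_days if d in past]
    features[f"rolling_mean_{window}"] = float(mean(recent)) if recent else NAN
    return features
```

The resulting dictionary is what a boosting library would consume as one training or prediction row. Adding a sales figure for the target day itself changes nothing in the output, which is the property that matters.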

What the three weeks actually bought us was confidence that the signal existed at all. The backtest told us demand was predictable to within a couple of percentage points given clean inputs. That's the green light. It is not the product.

The trap

A good backtest number is the most dangerous artifact in machine learning. It looks like the finish line and it is barely the starting gun. Backtests run on data that has been cleaned, joined and time-aligned by a human who already knows the answer. Production has none of those luxuries.

Where five months went

Here is the honest accounting of the next five months. None of it is modelling. All of it is what made the model usable.

Ingestion that survives reality. Sales feeds arrive late, doubled, or not at all. A POS reboots and replays yesterday. We built idempotent ingestion that can be re-run safely and that quarantines rows it doesn't trust rather than poisoning the training set.
A feature store with a memory. The features the model trains on must be computable at prediction time with only the data you'd actually have then — no peeking at the future. Enforcing that point-in-time correctness was weeks of work and caught two leaks that had inflated the original backtest.
Backfill and replay. When a site's history was wrong, we needed to rebuild every downstream forecast for that site without taking the live system down. Replay is plumbing nobody demos and everybody needs.
Monitoring before product features. We shipped drift and freshness alarms before we shipped half the UI. A silently wrong forecast is worse than a visibly missing one.
The human override. A new site opens, a festival lands, a road closes. The model can't know. Planners needed a sanctioned way to nudge the number and have the system learn from the nudge.
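
The first item on that list, ingestion that can be re-run safely, comes down to writing by key instead of appending. The sketch below is a toy in-memory version under stated assumptions: `SaleRow`, `SalesStore` and the `(site_id, dish_id, day)` key are hypothetical names, and each feed row is assumed to carry the full daily total rather than an increment, so overwriting is safe.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SaleRow:
    site_id: str
    dish_id: str
    day: str      # ISO date, e.g. "2024-03-01"
    units: int    # assumed to be the full daily total, not a delta

@dataclass
class SalesStore:
    rows: dict = field(default_factory=dict)        # keyed by (site, dish, day)
    quarantine: list = field(default_factory=list)  # rows we do not trust

    def ingest(self, batch, known_sites):
        """Upsert a batch; replaying the same batch leaves the store unchanged."""
        for row in batch:
            if row.site_id not in known_sites or row.units < 0:
                # Untrusted rows are set aside once, never written to rows.
                if row not in self.quarantine:
                    self.quarantine.append(row)
                continue
            # Keyed overwrite, not append: duplicates and replays collapse.
            self.rows[(row.site_id, row.dish_id, row.day)] = row
```

When a POS replays yesterday, the second pass writes the same keys with the same values, so nothing downstream needs to know the replay happened.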

The model answers a question. The platform decides which question, with what data, for whom, and what happens when the answer is wrong.

— On why the wrapper is the work

The data contract

The single highest-leverage thing we built was not a model improvement. It was a data contract: an explicit, validated schema between every upstream source and our pipeline. Column types, allowed ranges, freshness windows, null policies — all declared, all checked at the door.

Before the contract, a forecast could quietly degrade because a POS vendor changed a currency field from cents to dollars and nobody told us. After the contract, that change is rejected at ingestion with a named, paged error — and the last good forecast stays on screen instead of a confidently wrong new one.

contract · sales_daily

# every source is validated at the door, not after it poisons training
sales_daily:
  units:        int  >= 0      # reject negatives — refunds go elsewhere
  revenue:      decimal(10,2)  # cents → flagged in v3, now enforced
  site_id:      fk(sites)      # unknown site → quarantine, page on-call
  recorded_at:  freshness <= 6h # stale feed → hold last good forecast
on_violation: quarantine + alert  # never: silently train on it

This is the unsexy heart of every production ML system we have shipped. The model is a function; the contract is what guarantees the function is fed the inputs it was trained to expect. Skip it and you don't have a forecasting platform — you have a very expensive random number generator that's right most of the time.
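
For readers who want to see what "checked at the door" might look like in code, here is a hedged sketch of a validator for the four `sales_daily` rules above. The function names, the dict-shaped rows and the timezone-aware timestamps are assumptions made for illustration, and paging is left out. Note that a type check like `decimal(10,2)` cannot on its own detect a cents-to-dollars switch; that needs a range or unit check on top.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal, InvalidOperation

MAX_AGE = timedelta(hours=6)  # freshness window from the contract

def violations(row, known_sites, now):
    """Return a list of contract violations for one sales_daily row."""
    problems = []
    units = row.get("units")
    if not isinstance(units, int) or isinstance(units, bool) or units < 0:
        problems.append("units: expected int >= 0")
    try:
        revenue = Decimal(str(row.get("revenue")))
        # decimal(10,2): finite, at most 2 decimal places, 8 integer digits.
        if (not revenue.is_finite()
                or revenue.as_tuple().exponent < -2
                or abs(revenue) >= Decimal("1e8")):
            problems.append("revenue: expected decimal(10,2)")
    except InvalidOperation:
        problems.append("revenue: not a decimal")
    if row.get("site_id") not in known_sites:
        problems.append("site_id: unknown site")
    recorded_at = row.get("recorded_at")
    if (not isinstance(recorded_at, datetime)
            or recorded_at.tzinfo is None
            or now - recorded_at > MAX_AGE):
        problems.append("recorded_at: missing, naive, or older than 6h")
    return problems

def admit(rows, known_sites, now):
    """Split rows into accepted rows and (row, problems) quarantine entries."""
    accepted, quarantined = [], []
    for row in rows:
        problems = violations(row, known_sites, now)
        if problems:
            quarantined.append((row, problems))
        else:
            accepted.append(row)
    return accepted, quarantined
```

Only the accepted rows would ever reach training. The quarantine entries carry their reasons with them, which is what lets the alert name the exact rule that failed.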

Drift is a feature, not a failure

Every model decays. Tastes shift, a new menu lands, a competitor opens across the street. The question is never whether the world will move out from under your model — it's whether you'll find out from a dashboard or from an angry phone call.

We treat drift detection as a first-class product feature. The platform continuously compares live input distributions and live error against training baselines. When either crosses a threshold, it does three things, in order:

It tells someone — a specific human, with the site, the metric, and how far it has moved.
It protects the output — widening confidence bands or falling back to a simpler, more robust baseline rather than trusting a model that's now extrapolating.
It schedules a retrain — with the new data, gated behind the same backtest bar the original had to clear.
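
The three steps above can be sketched for the error side of the comparison. This is a deliberately simple illustration, not the platform's detector: `check_error_drift`, the mean absolute percentage error metric and the 1.5x tolerance are all assumptions, and the input-distribution comparison is omitted.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class DriftDecision:
    drifted: bool
    alert: Optional[str]      # step 1: message for a specific human
    use_fallback: bool        # step 2: protect the output
    schedule_retrain: bool    # step 3: retrain, still gated by the backtest bar

def check_error_drift(site_id, baseline_errors, live_errors, tolerance=1.5):
    """Compare live absolute percentage errors with the training baseline."""
    baseline = mean(baseline_errors)
    live = mean(live_errors)
    if live <= baseline * tolerance:
        return DriftDecision(False, None, False, False)
    alert = (
        f"site={site_id} metric=abs_pct_error "
        f"baseline={baseline:.3f} live={live:.3f} moved_by={live - baseline:+.3f}"
    )
    return DriftDecision(True, alert, True, True)
```

The decision object keeps the three responses separate, so each can be wired to its own mechanism (a pager, a fallback model, a retrain queue) without the detector knowing about any of them.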

Forecast accuracy: 98%, sustained in production, not just on the backtest.
Stockouts: empty fridges at lunch, cut by more than half.
Excess stock: margin that used to be binned at end of day.

Notice that the headline number — 98% — is not the interesting one. The interesting numbers are the two next to it, because those are what the business feels. Accuracy is the input; less waste and fewer stockouts are the output. A platform that optimises the first while ignoring the second is a science project.

The dashboard someone checks at 8 a.m.

The forecast is consumed by a kitchen lead at the start of a shift, on a tablet, with coffee, in ninety seconds. That constraint shaped more decisions than the model architecture did.

It meant the answer had to be a quantity, not a probability distribution. It meant "I disagree, here's why" had to be one tap. It meant the screen had to show yesterday's forecast versus what actually happened, because trust is earned by being visibly accountable, not by being confident. A model that can't show its track record to the person relying on it will be quietly ignored within a week.
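
One way to picture the data behind that screen is a record that holds the model's quantity, any one-tap disagreement, and what actually sold. The sketch below is an assumption-laden illustration: `ForecastRecord`, its field names and the choice of mean absolute error are hypothetical, not the platform's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastRecord:
    dish: str
    day: str
    model_units: int                       # what the model said
    override_units: Optional[int] = None   # the planner's nudge, if any
    override_reason: Optional[str] = None  # the "here's why"
    actual_units: Optional[int] = None     # filled in once the day closes

    @property
    def shown_units(self):
        """The single quantity on screen: the override if one exists."""
        if self.override_units is None:
            return self.model_units
        return self.override_units

def track_record(records):
    """Mean absolute error of the model and of the shown number, on closed days."""
    closed = [r for r in records if r.actual_units is not None]
    if not closed:
        return None
    n = len(closed)
    return {
        "model_mae": sum(abs(r.model_units - r.actual_units) for r in closed) / n,
        "shown_mae": sum(abs(r.shown_units - r.actual_units) for r in closed) / n,
    }
```

Keeping the model's number alongside the override means the same table can show the kitchen lead yesterday's forecast against reality and show the team whether the nudges helped.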

The real acceptance test

Not the MAPE. Not the RMSE. The acceptance test was a kitchen lead in week two saying "yeah, I just go with what it says now." That sentence is worth more than any offline metric, and you only earn it by designing the last ninety seconds as carefully as the model.

Notes to our past selves

If you're about to start something in this shape, here is what we'd tell the team that began eighteen months ago:

Budget the wrapper, not the model. Assume the model is 15% of the effort and plan the other 85% deliberately. The teams that miss deadlines are the ones who budgeted the reverse.
Write the data contract first. Before a single feature. It will surface a leak in your backtest and save you from shipping a number you can't defend.
Ship monitoring before UI. You cannot operate what you cannot see, and a wrong forecast nobody noticed is the failure mode that loses contracts.
Design the override. Humans will always know things the model can't. Give them a sanctioned lever and learn from it, or they'll route around the whole system in a spreadsheet.
Make the model accountable on screen. Show its history next to its prediction. Trust is a UI decision as much as a math one.

The machine learning was the easy part. We say that not to diminish the model — it's genuinely good — but to point at where the difficulty actually lives. The hype cycle sells the three weeks. The five months are what you're really paying an engineering team for.