Every week we talk to companies who want to "use AI". Sometimes that means a recommendation engine. Sometimes it means predictive churn modelling. Sometimes it means "something with ChatGPT". But in almost every conversation, before we can discuss what's possible with AI, we need to have a harder conversation about the data that would power it.
Because here's the inconvenient truth: most AI projects fail not because of the model, but because of the data. Bad inputs, missing data, inconsistent definitions, untracked pipelines: these are the real failure points.
"You can't build a skyscraper on sand. And you can't build a reliable AI system on unreliable data."
The Readiness Checklist
Go through this section by section. Be honest. The goal isn't to score perfectly; it's to know exactly where the gaps are before you start writing a brief for an AI vendor.
12-24 months of historical data for the target problem
Most ML models need substantial history to find meaningful patterns; less than a year rarely works for anything seasonal.
Data covers the full range of scenarios (including edge cases)
If your data only captures "normal" operations, your model will fail on outliers, which are often the exact cases you care about most.
Data is in a central warehouse, not scattered in source systems
You need a queryable, consolidated layer. Spreadsheets and API calls don't count.
Key fields have low null rates (<5% for critical features)
High null rates mean the data was never captured or there's a systematic collection problem. Both are serious.
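As a rough illustration, a null-rate audit is only a few lines of code. This sketch assumes the records are already loaded in memory as a list of dicts with illustrative field names; in practice you would run the equivalent query against your warehouse.

```python
from collections import Counter

def null_rates(rows):
    """Fraction of missing (None or empty-string) values per field."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in rows[0]}

# Toy records for illustration only
customers = [
    {"id": 1, "email": "a@example.com", "region": "UK"},
    {"id": 2, "email": None,            "region": "UK"},
    {"id": 3, "email": "c@example.com", "region": ""},
    {"id": 4, "email": "d@example.com", "region": "FR"},
]

rates = null_rates(customers)
# email and region are each missing in 1 of 4 rows (25%),
# which already fails a 5% threshold for a critical feature
```

The same check against a warehouse is a `COUNT(*) FILTER (WHERE col IS NULL)` per column; the point is that it should be a routine report, not a one-off investigation.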
No duplicate records in entity tables (customers, orders, products)
Duplicates inflate counts, distort aggregates, and confuse models. This is the #1 quality issue we encounter.
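One quick way to surface duplicates, sketched here against the same kind of in-memory records (field names are illustrative), is to count rows per natural key:

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return natural-key values that appear on more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Toy customer table: the same person registered twice
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]

dupes = duplicate_keys(customers, ["email"])
# {('a@example.com',): 2} -> one email address backs two customer rows
```

Choosing the right natural key (email, company number, SKU) is the hard part; the counting itself is trivial once the key is agreed.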
Categorical fields use consistent values
No "UK", "United Kingdom", "U.K." for the same thing. Inconsistent categories require significant cleaning effort before modelling.
You know who owns each data source
AI models drift when underlying data changes. You need a contact when source systems are updated.
Business definitions are documented and agreed
If your team debates what an "active customer" is, your model will be trained on an ambiguous target, and nobody will trust the output.
Data pipelines are automated with no manual steps
A model needing fresh data is only as reliable as the pipeline feeding it. Manual steps are failure points.
You can version and store model outputs alongside input data
When a prediction goes wrong, you need to reconstruct exactly what data the model saw. This requires infrastructure, not just code.
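One lightweight pattern is to log every prediction together with a snapshot of its inputs, plus a hash of that snapshot so records can be compared cheaply. This is a sketch, not a production design: the in-memory list stands in for a predictions table, and the model name and feature names are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_prediction(store, model_version, features, prediction):
    """Store a prediction next to the exact inputs the model saw."""
    snapshot = json.dumps(features, sort_keys=True)
    record = {
        "model_version": model_version,
        "features": features,  # full input snapshot for later replay
        "input_hash": hashlib.sha256(snapshot.encode()).hexdigest(),
        "prediction": prediction,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    store.append(record)
    return record

predictions = []  # stands in for a predictions table
log_prediction(predictions, "churn-model-v3",
               {"tenure_months": 7, "orders": 2}, 0.81)
```

With records like these, a disputed prediction can be replayed against the exact inputs it was made from, and the hash makes it easy to spot when the "same" entity was scored on different data.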
How to Read Your Score
The Question We Always Ask
Before a client commits to an AI project, we ask: "Can you currently answer your top five business questions reliably from your data?"
If the answer is no, if reports take days to produce, if numbers are disputed, if there's no agreed source of truth, then AI is not the next step. Good analytics is.
The honest truth: A business that can reliably answer its operational questions with clean data will get more value from that capability than from a complex ML model built on shaky foundations. Get the basics right first.
What "Ready" Actually Looks Like
Ready for AI doesn't mean perfect data. It means:
- You have enough clean historical data covering the problem space.
- The data is in a warehouse, automated, and reasonably well governed.
- You have a specific, well-scoped question you want the model to answer.
- There's a business owner who will act on the model's output.
That last point matters more than any technical criterion. The best model in the world produces no value if the organisation isn't set up to act on its predictions.
If you'd like to run through this checklist with your own data, we offer a half-day AI readiness workshop. Practical, honest, and no sales pitch.