Using AI to run quality control in delivery operations
AI is very good at the part of quality control that humans are worst at — looking at everything, tirelessly, the same way every time. It is dangerous precisely where it is most seductive: when it lets a team stop being accountable for the output. The discipline is in keeping the first and refusing the second.
I am not an AI evangelist, and this is not a piece about transformation. It is a practical account of where machine assistance earns its place in a quality function and where it does not, written from inside operations rather than from a vendor deck.
Start from what quality control actually is: comparing real output against a standard, consistently, at a volume humans cannot sustain. Read that sentence again, because it contains the entire case for AI in QC and its entire limit. "Consistently, at volume" is exactly what machines do well. "Against a standard" is exactly the part that still requires human judgement to define and own.
Where AI genuinely raises the ceiling
There are three places where machine assistance does real work in a delivery operation's quality control.
Total-population review instead of sampling. Human QC samples because humans cannot look at everything. Sampling is a compromise that leaves most of the output unexamined. A well-scoped model can screen the entire population for the specific, well-defined failure modes you care about, turning what might be a 2% sample into 100% coverage for those checks. This is the single biggest gain, and it is unavailable to a human team at any headcount.
Consistency that does not drift. A human reviewer at 4pm on a Friday is not the same reviewer as at 9am on Tuesday. Fatigue, mood, and recency all bend human judgement. A machine applies the same criterion the ten-thousandth time as the first. For high-volume, well-specified checks, that consistency is worth more than the marginal sophistication of any individual human judgement.
Surfacing patterns humans cannot hold in their heads. Defects often cluster — by time, by team, by input type, by upstream source. A human catches the defect; a model is far better at noticing that 60% of a defect class traces to one source. That pattern is the most valuable output of a quality function, because it points directly at the systemic cause worth designing out.
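A minimal sketch of that kind of pattern check, assuming each defect record carries the attributes you want to group by. The field names (`source`, `team`) and the records themselves are hypothetical, invented for illustration.

```python
from collections import Counter


def source_concentration(defects, key="source"):
    """Return (value, share) pairs for one attribute, largest share first."""
    counts = Counter(d[key] for d in defects)
    total = sum(counts.values())
    return [(value, n / total) for value, n in counts.most_common()]


# Hypothetical defect records for a single defect class.
defects = [
    {"id": 1, "source": "vendor_a", "team": "north"},
    {"id": 2, "source": "vendor_a", "team": "south"},
    {"id": 3, "source": "vendor_a", "team": "north"},
    {"id": 4, "source": "vendor_b", "team": "north"},
    {"id": 5, "source": "vendor_c", "team": "south"},
]

top_source, top_share = source_concentration(defects)[0]
print(f"{top_source}: {top_share:.0%} of this defect class")
```

The same function can be pointed at any other attribute (`key="team"`, a time bucket, an input type) to see where else the defects concentrate.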
Where AI quietly lowers the floor
The failures are subtler than the wins, which is why they are dangerous.
It is confidently wrong on the edge cases that matter most. A model will pass the unusual, high-stakes case with the same confidence it passes a routine one. The defects that cost the most are frequently the novel ones — and novelty is exactly where statistical systems are weakest. A quality function that trusts the machine on the cases it has never seen has automated its own blind spot.
It moves the standard without telling you. When a model does the checking, the standard quietly becomes "whatever the model checks." Anything the model was not built to catch slips out of the definition of quality entirely — not by decision, but by omission. The standard erodes silently, which is the worst way for a standard to erode.
The danger is never that AI checks badly. It is that AI checks confidently, and a team stops checking the things the machine was never asked to look at.
It dissolves accountability. This is the real risk. When a person signs off on quality, there is an owner. When the answer is "the system approved it," responsibility diffuses in a way no governance model survives. The moment "the AI passed it" becomes an acceptable answer to "why did this defect reach the client," the quality function has lost the one thing that made it work.
How to deploy it without losing the plot
The pattern that works treats AI as an instrument the quality function operates, not a replacement for it.
- Keep the human as the owner of the standard. The machine checks against the standard; people define, maintain, and are accountable for it. The standard is a human artefact, always.
- Use AI to widen coverage, then route exceptions to people. Let the machine screen the whole population and handle the routine pass/fail; escalate the ambiguous, the novel, and the high-exposure to human judgement. You get total coverage and human attention where it counts.
- Audit the machine the way you audit a team. A model is a reviewer, and reviewers drift. Sample its decisions, measure its false-pass and false-flag rates against the standard, and treat a degrading model exactly as you would a degrading analyst.
- Keep accountability attached to a person. Someone owns the quality outcome of every process, full stop. The AI is a tool that person uses; it is never the answer to "who is responsible."
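The routing rule in the second point can be sketched as follows. Everything named here is an assumption made for illustration: the `score` field, the two thresholds, and above all the `novel` flag, since deciding what counts as novel is itself a judgement the quality owner has to define.

```python
from dataclasses import dataclass

# Assumed thresholds, for illustration only; real values would come from a pilot.
PASS_ABOVE = 0.95
FAIL_BELOW = 0.05


@dataclass
class Item:
    item_id: str
    score: float         # hypothetical model score: confidence the item meets the standard
    novel: bool          # the case does not resemble what the check was piloted on
    high_exposure: bool  # a defect here would be costly


def route(item: Item) -> str:
    """Return 'pass', 'fail', or 'human' for one screened item."""
    # Novel and high-exposure cases go to a person regardless of the score.
    if item.novel or item.high_exposure:
        return "human"
    if item.score >= PASS_ABOVE:
        return "pass"
    if item.score <= FAIL_BELOW:
        return "fail"
    # Anything in between is ambiguous and is escalated.
    return "human"


items = [
    Item("A-1", 0.99, novel=False, high_exposure=False),
    Item("A-2", 0.60, novel=False, high_exposure=False),
    Item("A-3", 0.99, novel=True, high_exposure=False),
    Item("A-4", 0.01, novel=False, high_exposure=False),
]
for item in items:
    print(item.item_id, route(item))
```

Note that the novelty and exposure tests run before the score is consulted, so a confident score can never override them.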
The honest summary
AI changes the economics of quality control in delivery operations — it makes total-population review affordable and brings a consistency no human team can match. That is a genuine, large gain, and operations that ignore it will be out-competed on both cost and coverage.
But it changes nothing about the fundamentals. Quality is still comparison against a standard, and the standard is still a human responsibility. The operations that get this right will use AI to do what machines do best while keeping people firmly in charge of what only people can own. The ones that get it wrong will hand the machine the standard along with the checking — and discover, usually through an expensive failure, that they automated away the wrong half of the job.
A worked example: where to draw the line
Take any quality check in a delivery operation and you can decide, before deploying anything, whether it belongs to the machine or to a person. The deciding questions are always the same three.
Is the check well-specified? Can you write the pass/fail criterion precisely enough that two people would apply it identically? If yes, a machine can apply it too. If the criterion lives partly in a reviewer's judgement — "does this feel on-brand," "is this the right call for this client" — then the check is not specifiable, and handing it to a machine does not automate the judgement; it discards it.
Is the check high-volume? Is this a comparison a human would otherwise make hundreds or thousands of times? If yes, consistency matters more than nuance, and the machine's tireless sameness is worth more than any single human's marginal sophistication. If the check happens rarely, the case for automating it is weak — a person can simply do it, and will bring context the machine lacks.
Is the check consequential and bounded? Does it catch failures that matter, and does it operate over cases that resemble each other? A machine is strong on consequential, bounded, repetitive checks and weak exactly where cases are novel and high-stakes at once.
Run a check through those three questions and the answer falls out. A high-volume, well-specified, bounded comparison goes to the machine, with exceptions routed to a person. A low-volume, judgement-laden, or novel check stays with a person, perhaps with the machine surfacing candidates. The line is not ideological. It is a property of the check.
Do not ask whether AI is good or bad at quality control. Ask whether this specific check is well-specified, high-volume and bounded. The check tells you where it belongs.
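As a sketch, the three questions reduce to a small decision function. The `Check` type and the example checks are hypothetical. Answering each question honestly for a real check is the hard part, and it stays with people.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    well_specified: bool  # two people would apply the criterion identically
    high_volume: bool     # made hundreds or thousands of times
    bounded: bool         # cases resemble each other


def assign(check: Check) -> str:
    """Decide where a check belongs, using the three questions."""
    if check.well_specified and check.high_volume and check.bounded:
        return "machine, exceptions to a person"
    return "person"


# Hypothetical checks, for illustration only.
checks = [
    Check("required fields present", True, True, True),
    Check("feels on-brand", False, True, True),
    Check("one-off contract exception", True, False, False),
]
for check in checks:
    print(f"{check.name}: {assign(check)}")
```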
How to deploy it: a sequence
Standing this up well is a matter of order. Rushing to automate before the standard is solid is the most common way to automate a mess.
1. Write the standard down first, in human terms. Before any model touches the work, the standard has to exist as a clear, owned artefact — what good looks like, specific enough to score. If the standard is fuzzy, automating against it just encodes the fuzziness at scale. The standard is the foundation, and it is human.
2. Decompose the standard into discrete checks. A standard is rarely one comparison; it is many. Break it into individual checks and sort each one with the three questions above. Now you have two lists: checks the machine can run, and checks that stay with people.
3. Pilot the machine checks against a known answer. Before trusting a machine check, run it over work a human has already graded and compare. You are measuring two error rates: how often it passes something that should have failed (the dangerous one) and how often it flags something that was fine (the expensive-but-safe one). You cannot deploy responsibly without these baselines.
4. Deploy for coverage, route exceptions to people. Let the machine screen the whole population for its checks and handle the routine pass/fail. Send the ambiguous, the novel, and the high-exposure to a person. This is the configuration that buys total coverage and reserves human attention for where it counts.
5. Put the patterns in front of someone who can act. The machine's most valuable output is not individual verdicts; it is the clustering — this defect class traces to that source. That insight only pays if it reaches a person with the authority to design the cause out. Wire the pattern into the governance that already steers the operation.
6. Audit the machine on a cadence, like any reviewer. Schedule regular sampling of the machine's decisions against the standard, exactly as you would review a human analyst's work. A model drifts; the world it was tuned on shifts under it. Treat a degrading model as you would a degrading reviewer.
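The pilot in step 3 can be sketched as a comparison of machine verdicts against human grades. One assumption is worth stating loudly: here the false-pass rate is taken over items the human graded as failing, and the false-flag rate over items the human graded as passing. Other denominators are defensible, so pick one and keep it fixed. The graded sample below is made up.

```python
def pilot_error_rates(graded):
    """graded: iterable of (human_verdict, machine_verdict), each 'pass' or 'fail'."""
    false_pass = false_flag = should_fail = should_pass = 0
    for human, machine in graded:
        if human == "fail":
            should_fail += 1
            false_pass += machine == "pass"  # machine passed a real defect
        else:
            should_pass += 1
            false_flag += machine == "fail"  # machine flagged good work
    return {
        # None means the sample had no items of that kind to measure against.
        "false_pass_rate": false_pass / should_fail if should_fail else None,
        "false_flag_rate": false_flag / should_pass if should_pass else None,
    }


# Hypothetical pilot set: 100 items a human has already graded.
graded = (
    [("pass", "pass")] * 90
    + [("pass", "fail")] * 5
    + [("fail", "fail")] * 4
    + [("fail", "pass")] * 1
)
rates = pilot_error_rates(graded)
print(rates)
```

In this made-up sample the machine agrees with the human on 94 of 100 items, yet it passes one of the five real defects. Overall agreement hides the number that matters, which is why the two rates are reported separately.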
Common mistakes, and how to avoid them
The ways this goes wrong are consistent, and most are quiet until something expensive happens.
Letting the model become the standard. The subtlest failure. When the machine does the checking, "quality" silently contracts to "whatever the machine checks," and anything outside its scope drops out of the definition entirely — not by decision, by omission. Avoid it by keeping the written standard primary and periodically asking what it requires that the machine does not check. The gap is your blind spot.
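One way to keep that question concrete, assuming the written standard has been decomposed into named requirements, is a plain set difference. The requirement names below are hypothetical.

```python
# Hypothetical requirement names; the written standard is the source of truth.
standard_requirements = {
    "address_complete",
    "signature_present",
    "delivered_on_time",
    "tone_fits_client",
}
machine_checks = {
    "address_complete",
    "signature_present",
    "delivered_on_time",
}

# What the standard requires that the machine never looks at.
unchecked = sorted(standard_requirements - machine_checks)
print("Not covered by any machine check:", unchecked)
```

Anything in that list needs a human owner and a human check, or it has silently left the definition of quality.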
Trusting the machine on novelty. A model passes the unusual, high-stakes case with the same confidence it passes a routine one, and the costly defects are disproportionately the novel ones. Never let "the machine passed it" stand for the cases it has never seen. Route novelty to people by design.
Deploying without measuring the error rates. A check whose false-pass rate you have not measured is a check you do not understand. Always baseline against known answers before trusting a machine check, and keep measuring after.
Letting accountability diffuse. The moment "the system approved it" becomes an acceptable answer to "why did this defect reach the client," ownership has evaporated and no governance survives. Keep a named person accountable for the quality outcome of every process. The machine is a tool that person uses; it is never the answer to who is responsible.
Skipping the audit. Teams audit the team and exempt the machine, as though software does not drift. It does. An unaudited model degrades invisibly until a failure exposes it. Audit it on the same cadence you audit people.
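A minimal sketch of drawing an audit sample from the machine's decisions for human re-grading. Splitting the sample evenly between passes and flags is an assumption of this sketch, made so that rare flags are not crowded out. The record shape (`id`, `verdict`) is hypothetical.

```python
import random


def draw_audit_sample(decisions, n, seed=None):
    """Pick up to n machine decisions for re-grading, half passes and half flags."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    passed = [d for d in decisions if d["verdict"] == "pass"]
    flagged = [d for d in decisions if d["verdict"] == "fail"]
    half = n // 2
    sample = rng.sample(passed, min(half, len(passed)))
    sample += rng.sample(flagged, min(n - half, len(flagged)))
    return sample


# Hypothetical decision log: 90 passes and 10 flags.
decisions = [{"id": i, "verdict": "pass"} for i in range(90)]
decisions += [{"id": i, "verdict": "fail"} for i in range(90, 100)]

sample = draw_audit_sample(decisions, n=10, seed=7)
print(len(sample), "decisions queued for human re-grading")
```

The re-graded sample then feeds the same false-pass and false-flag measurement used at pilot time, so the audit is directly comparable to the baseline.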
What to measure
Hold the machine to numbers, and watch the numbers over time.
- False-pass rate. How often the machine passes work that breaches the standard. This is the dangerous error — the one that reaches the client — and the single most important number to watch. It should be low and stable.
- False-flag rate. How often it flags work that was actually fine. Less dangerous, but it taxes the human reviewers who handle exceptions, and a rising rate erodes trust in the system.
- Coverage. The share of the population the machine actually screened. The promise of AI here is total-population review; coverage gaps are where that promise quietly fails.
- Exception volume. How much work is routed to people, and whether that volume is sustainable. If exceptions balloon, either the checks are mis-scoped or the standard is shifting.
- Pattern yield. Whether the clustering the machine surfaces is actually leading to causes being designed out. The patterns are the highest-value output; if nobody acts on them, you have an expensive screening tool and nothing more.
The number that protects you is the false-pass rate. The number that pays you back is pattern yield. Watch both, because a system can look healthy on one while failing on the other.
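A rough sketch of tracking the first four measures per period and comparing them with the pilot baseline. The field names, the tolerance, and the sample figures are all assumptions. Pattern yield is left out because it is a judgement about action taken, not a ratio.

```python
from dataclasses import dataclass


@dataclass
class Period:
    label: str
    population: int            # items produced in the period
    screened: int              # items the machine actually screened
    exceptions: int            # items routed to a person
    audited_fails: int         # audited items a human graded as failing
    audited_false_passes: int  # of those, how many the machine had passed


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def summarise(p: Period) -> dict:
    return {
        "coverage": ratio(p.screened, p.population),
        "exception_share": ratio(p.exceptions, p.screened),
        "false_pass_rate": ratio(p.audited_false_passes, p.audited_fails),
    }


def alerts(p: Period, baseline_false_pass: float, tolerance: float = 0.02) -> list:
    """List the measures that moved the wrong way; tolerance is an assumed value."""
    m = summarise(p)
    found = []
    if m["coverage"] is not None and m["coverage"] < 1.0:
        found.append("coverage gap")
    fpr = m["false_pass_rate"]
    if fpr is not None and fpr > baseline_false_pass + tolerance:
        found.append("false-pass rate above baseline")
    return found


# Hypothetical figures for one week.
week = Period("week 12", population=1000, screened=980, exceptions=49,
              audited_fails=20, audited_false_passes=2)
print(summarise(week))
print(alerts(week, baseline_false_pass=0.05))
```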
Where to start
Do not begin with a platform decision or a vendor conversation. Begin with the standard.
Take one process and write down what good looks like, precisely enough that two reviewers would score it the same way. Decompose that standard into its individual checks. Then pick the single check that is most clearly well-specified, high-volume and bounded — the most obvious candidate, the one a human is currently doing the same way thousands of times. That is your first machine check.
Pilot it against work a person has already graded. Measure how often it passes what it should have failed. If that false-pass rate is low enough to trust, deploy it for total-population screening and route the exceptions to a person. Keep the human as the owner of the standard, audit the machine on a cadence, and put whatever patterns it surfaces in front of someone who can act on them.
That is the entire approach, proven on one check before you widen it. When the first works, you extend to the next clear candidate, and the next — always keeping people in charge of the standard, the exceptions, and the accountability. Coverage rises, consistency rises, and the one thing that made the quality function work in the first place stays exactly where it belongs: with a person who owns the result.