Real-time Machine Learning Inference at Scale
Deploying a machine learning model to production is a different problem from training one. The document verification system we built for a client processes thousands of certificate scans per month. Here is what we learned building the inference pipeline.
The challenge
Certificate forgery is a real problem in Zimbabwe — employers and universities need fast, reliable verification. The client needed sub-second response times and high accuracy on scanned documents of varying quality.
Model serving
We use TensorFlow Serving behind a lightweight FastAPI gateway. Models are versioned and swapped without downtime using a canary deployment pattern — new versions receive 10% of traffic before full rollout.
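The routing logic above can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: the model name `certverify`, the internal TensorFlow Serving address, and the version labels are all assumptions; only the 10% canary split and the TF Serving REST predict endpoint shape (`/v1/models/<name>/versions/<ver>:predict`) come from the pattern described.

```python
# Sketch: canary traffic splitting in a FastAPI gateway in front of
# TensorFlow Serving. Names marked below are hypothetical.
import base64
import random
from typing import Optional

CANARY_FRACTION = 0.10  # new model version receives 10% of traffic


def pick_model_version(stable: str = "v1", canary: str = "v2",
                       rng: Optional[random.Random] = None) -> str:
    """Route roughly 10% of requests to the canary version."""
    rng = rng or random
    return canary if rng.random() < CANARY_FRACTION else stable


# The gateway itself (requires fastapi + httpx, hence guarded here):
try:
    from fastapi import FastAPI, UploadFile
    import httpx

    app = FastAPI()
    TF_SERVING = "http://tf-serving:8501"  # assumed internal address

    @app.post("/verify")
    async def verify(file: UploadFile):
        version = pick_model_version()
        raw = await file.read()
        # TF Serving accepts binary inputs base64-encoded under a "b64" key.
        payload = {"instances": [{"b64": base64.b64encode(raw).decode()}]}
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{TF_SERVING}/v1/models/certverify/versions/{version}:predict",
                json=payload,
            )
        return resp.json()
except ImportError:
    pass  # fastapi/httpx not installed; the routing function still works
```

Because the split is decided per request in the gateway, promoting the canary to full rollout is just a config change rather than a redeploy.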
Retraining
False positives go into a human review queue, and confirmed errors become new training data. A weekly retraining job runs automatically, and the new model is promoted to canary only if accuracy metrics on a held-out set improve. This loop has pushed accuracy up from 94% to 98.2% over six months.
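The promote-if-better gate at the end of the retraining job can be sketched like this. The function and field names are hypothetical; the only facts taken from the post are that promotion is automatic and conditional on improved accuracy.

```python
# Sketch: gating canary promotion on held-out accuracy.
# `EvalResult`, `should_promote`, and `min_gain` are illustrative names.
from dataclasses import dataclass


@dataclass
class EvalResult:
    version: str
    accuracy: float  # accuracy on a fixed held-out evaluation set


def should_promote(candidate: EvalResult, current: EvalResult,
                   min_gain: float = 0.001) -> bool:
    """Promote the retrained model to canary only if it beats the
    currently serving model by at least `min_gain` on the holdout set."""
    return candidate.accuracy >= current.accuracy + min_gain
```

Requiring a small margin (`min_gain`) rather than any improvement guards against promoting on evaluation noise alone.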