Real-time Machine Learning Inference at Scale

Deploying a machine learning model to production is a different problem from training one. The document verification system we built for a client processes thousands of certificate scans per month. Here is what we learned building the inference pipeline.

The challenge

Certificate forgery is a real problem in Zimbabwe — employers and universities need fast, reliable verification. The client needed sub-second response times and high accuracy on scanned documents of varying quality.

Model serving

We use TensorFlow Serving behind a lightweight FastAPI gateway. Models are versioned and swapped without downtime using a canary deployment pattern — new versions receive 10% of traffic before full rollout.
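The traffic split can live entirely in the gateway. A minimal sketch of the routing decision, assuming TF Serving's standard REST path format — the model name `verifier` and the version numbers are illustrative, not our actual deployment:

```python
import random

# Hypothetical service name and versions; the /v1/models/.../versions/N:predict
# path format is TF Serving's standard REST API.
SERVING_BASE = "http://tf-serving:8501/v1/models/verifier"
STABLE_VERSION = 3
CANARY_VERSION = 4
CANARY_FRACTION = 0.10  # canary receives 10% of traffic


def predict_url(rng=random.random) -> str:
    """Choose the TF Serving endpoint for one request.

    Roughly 10% of requests go to the canary version; the rest hit
    the stable version. `rng` is injectable for testing.
    """
    version = CANARY_VERSION if rng() < CANARY_FRACTION else STABLE_VERSION
    return f"{SERVING_BASE}/versions/{version}:predict"
```

The gateway then POSTs the preprocessed scan to the returned URL. Keeping the split in the gateway, rather than in a load balancer, makes the canary fraction a one-line config change.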

Retraining

Documents the model flags as suspect go into a human review queue, and confirmed misclassifications become training data. A weekly retraining job runs automatically, and the new model is promoted to canary if its accuracy metrics improve. This feedback loop has pushed accuracy from 94% to 98.2% over six months.
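The promotion gate reduces to a simple comparison of evaluation results. A sketch of that decision, with a hypothetical `EvalResult` type and an assumed minimum-improvement threshold (the real pipeline's criteria may differ):

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Accuracy of one model version on an evaluation set (hypothetical shape)."""
    model_version: int
    accuracy: float


# Assumed threshold: require a measurable gain, not noise, before promoting.
MIN_IMPROVEMENT = 0.001


def should_promote(current: EvalResult, candidate: EvalResult) -> bool:
    """Promote the retrained model to canary only if it beats the
    current model's accuracy by at least MIN_IMPROVEMENT."""
    return candidate.accuracy >= current.accuracy + MIN_IMPROVEMENT
```

A retrained model that regresses, or improves by less than the threshold, simply stays out of the canary slot until the next weekly run.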