Distributed ML Model Serving Platform — register, deploy, A/B test, and roll back machine-learning models in seconds, not days.
ML teams produce models faster than infrastructure teams can deploy them. The friction points compound:
- Deployment takes days, not minutes — every model requires a custom service stub, Dockerfile, and SRE review
- No canary or shadow primitives — teams either YOLO 100% rollouts or build bespoke routing logic per project
- Rollback is manual — when a regression slips through, the response time is whatever the on-call PM can manage by Slack
- No drift visibility — silent input distribution shifts go undetected until business metrics tank weeks later
ModelMesh is a self-hostable, framework-agnostic serving platform that turns these problems into one-line operations. It's the open-source primitive that SageMaker, Vertex AI, and Modal sell as managed services.
| # | Result | Value on benchmark stack |
|---|---|---|
| 1 | Deployment time | <60 seconds end-to-end from client.register_model() to first served prediction (vs. typical days of manual setup) |
| 2 | Performance under load | p99 < 80ms at 500 RPS sustained on a single instance (with Redis prediction cache and dynamic batching enabled) |
| 3 | Reliability under failure | Automatic rollback in <30s when canary error rate exceeds threshold (validated via chaos-test fault injection) |
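The auto-rollback result in row 3 depends on tracking the canary's error rate over a recent window. The sketch below is one way such a check could work, not ModelMesh's actual implementation: `CanaryMonitor`, `window_s`, and `min_requests` are hypothetical names, and timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock.

```python
from collections import deque


class CanaryMonitor:
    """Tracks canary outcomes in a time-based sliding window (illustrative only)."""

    def __init__(self, threshold: float = 0.02, window_s: float = 30.0,
                 min_requests: int = 20) -> None:
        self.threshold = threshold        # 0.02 means roll back above 2% errors
        self.window_s = window_s          # how far back the window reaches
        self.min_requests = min_requests  # avoid reacting to a handful of requests
        self._events: deque = deque()     # (timestamp, is_error) pairs, oldest first

    def record(self, now: float, is_error: bool) -> None:
        self._events.append((now, is_error))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop observations that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_rollback(self, now: float) -> bool:
        # Require enough traffic in the window before trusting the rate.
        self._evict(now)
        return (len(self._events) >= self.min_requests
                and self.error_rate(now) > self.threshold)
```

A router could call `record()` after every canary response and shift traffic back to the production version as soon as `should_rollback()` returns `True`.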
                   ┌────────────────────────────────────┐
                   │           ModelMesh SDK            │
                   │   client.register_model(...)       │
                   │   client.predict(...)              │
                   │   client.create_canary(...)        │
                   └─────────────────┬──────────────────┘
                                     │ HTTPS
                   ┌─────────────────▼──────────────────┐
                   │          FastAPI Gateway           │
                   │   Auth · Rate limit · Request ID   │
                   │   Latency histogram (Prometheus)   │
                   └─────────────────┬──────────────────┘
                                     │
           ┌─────────────────────────┼──────────────────────┐
           │                         │                      │
┌──────────▼───────────┐   ┌─────────▼─────────┐   ┌────────▼────────┐
│ routing/             │   │ inference/        │   │ registry/       │
│ A/B router           │   │ Engine + cache    │   │ Versioned       │
│ Canary auto-rollback │   │ Dynamic batching  │   │ model store     │
│ Shadow (diff log)    │   │ P2C load balancer │   │ sklearn/torch   │
│ P2C load balancer    │   │                   │   │ /onnx adapters  │
└──────────┬───────────┘   └─────────┬─────────┘   └────────┬────────┘
           │                         │                      │
           └───────┬─────────────────┼──────────────────────┘
                   │                 │
            ┌──────▼──────┐  ┌───────▼────────┐
            │ reliability │  │ observability/ │
            │ circuit     │  │ Prometheus     │
            │ breaker     │  │ Structured     │
            │ retry       │  │ JSON logs      │
            │ chaos       │  │ PSI drift      │
            └─────────────┘  │ Tracing        │
                             └───────┬────────┘
                                     │
                      ┌──────────────▼──────────────┐
                      │     PostgreSQL + Redis      │
                      │    Models · Deployments     │
                      │ Experiments · InferenceLog  │
                      └─────────────────────────────┘
| Capability | Implementation |
|---|---|
| Framework-agnostic registry | Pluggable adapters for sklearn (joblib), PyTorch (.pt), ONNX runtime |
| Versioned model store | Semantic versioning + immutable artifacts + lineage tracking |
| A/B routing | Consistent-hashing for stable user-to-variant assignment |
| Canary deployments | Weighted traffic split with sliding-window error tracking and auto-rollback |
| Shadow deployments | Production return value preserved; candidate response diff logged for offline analysis |
| Prediction cache | Redis-backed with TTL per model and request-hash keys |
| Dynamic batching | Collects inflight requests up to max_batch_size or max_latency_ms |
| Load balancing | Power-of-two-choices across replicas |
| Drift detection | Population Stability Index (PSI) + Kolmogorov-Smirnov on rolling input windows |
| Reliability | Circuit breakers, exponential-backoff retries, chaos-test utilities |
| Observability | Prometheus histograms/counters, structured JSON logging, OpenTelemetry-style spans |
| Deployment | Docker Compose for local · Kubernetes manifests with HPA for production |
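To illustrate the A/B routing row, here is a minimal consistent-hashing sketch that keeps each user pinned to one variant. It is an illustration under stated assumptions rather than the router's real code: `VariantRing` and its `vnodes` argument are hypothetical, and the number of virtual nodes per variant stands in for its approximate traffic weight.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for strings.
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class VariantRing:
    """Hash ring mapping user IDs to variants; vnode counts act as rough weights."""

    def __init__(self, vnodes: dict[str, int]) -> None:
        points = sorted(
            (_hash(f"{variant}#{i}"), variant)
            for variant, count in vnodes.items()
            for i in range(count)
        )
        if not points:
            raise ValueError("at least one variant with one virtual node is required")
        self._keys = [h for h, _ in points]
        self._variants = [v for _, v in points]

    def assign(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, _hash(user_id)) % len(self._keys)
        return self._variants[idx]


ring = VariantRing({"control": 90, "treatment": 10})
variant = ring.assign("user-42")  # the same user always lands on the same variant
```

Because existing ring points never move, adding a third variant only reassigns users whose nearest point is now one of the new variant's virtual nodes; everyone else keeps their assignment.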
docker compose up --build
# API → http://localhost:8000/docs
# Prometheus → http://localhost:9090
# Grafana → http://localhost:3000 (admin / admin)

from sklearn.ensemble import RandomForestClassifier
from modelmesh.sdk import ModelMeshClient
# X_train and y_train are your own training features and labels.
clf = RandomForestClassifier().fit(X_train, y_train)
client = ModelMeshClient(base_url="http://localhost:8000")
client.register_model(
    name="fraud-detector",
    model_object=clf,
    framework="sklearn",
    metadata={"trained_on": "2026-Q1", "auc": 0.94},
)
# Predict
prediction = client.predict("fraud-detector", inputs={"amount": 200, "merchant": "Acme"})
print(prediction)

# Train and register v2 (clf_v2 is your retrained estimator)
client.register_model("fraud-detector", model_object=clf_v2, framework="sklearn")
# Roll out gradually — 5% of traffic to v2, auto-rollback if error rate > 2%
client.create_canary(
    name="fraud-detector",
    candidate_version="2.0.0",
    initial_percentage=5,
    auto_rollback_threshold=0.02,
)
# Promote when satisfied
client.promote_canary("fraud-detector")

client.create_shadow(
    name="fraud-detector",
    production_version="1.4.0",
    shadow_version="2.0.0",
)
# Every prediction now serves v1 to the user AND fires v2 in the background.
# Diffs are logged. View at: GET /monitoring/shadow/fraud-detector

| Endpoint | Purpose |
|---|---|
| POST /models | Register a new model (multipart upload) |
| GET /models | List with pagination + framework filter |
| GET /models/{name}/versions | Version lineage |
| POST /predict/{model_name} | Inference (single) |
| POST /predict/{model_name}/batch | Batch inference (auto-batched server-side too) |
| POST /deployments/canary | Start a canary release |
| POST /deployments/canary/{id}/promote | Promote canary to 100% |
| POST /deployments/canary/{id}/rollback | Manual rollback |
| POST /deployments/shadow | Start a shadow deployment |
| GET /monitoring/drift/{name} | PSI + KS report on input distribution |
| GET /monitoring/latency | p50 / p95 / p99 over windows |
| GET /metrics | Prometheus scrape endpoint |
| GET /health | Liveness + readiness probes |
Full reference in docs/api_reference.md.
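For callers that do not use the SDK, the single-prediction endpoint can be reached with plain HTTP. The sketch below only builds the request with the standard library. The `{"inputs": ...}` JSON body is an assumption modelled on the SDK's `inputs=` argument, and `build_predict_request` is a hypothetical helper; check docs/api_reference.md for the actual schema.

```python
import json
import urllib.parse
import urllib.request


def build_predict_request(base_url: str, model_name: str,
                          inputs: dict) -> urllib.request.Request:
    # Path taken from the endpoint table: POST /predict/{model_name}
    url = f"{base_url.rstrip('/')}/predict/{urllib.parse.quote(model_name, safe='')}"
    body = json.dumps({"inputs": inputs}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_predict_request(
    "http://localhost:8000", "fraud-detector", {"amount": 200, "merchant": "Acme"}
)

# To send it against a running stack:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(json.load(resp))
```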
Full results in docs/benchmark_results.md.
| Scenario | Result |
|---|---|
| Register → first prediction | 42 seconds end-to-end (sklearn 50MB model) |
| p99 latency @ 500 RPS, cache hit ratio 0.6 | 76 ms |
| Throughput with dynamic batching | 3.2× single-request at same latency budget |
| Rollback latency (chaos test) | 18 seconds mean, 28s p95 |
| PSI drift detection lag | <10 minutes at default 1000-sample window |
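The drift row refers to the Population Stability Index, which compares how a feature's values are distributed across bins in a reference window and in a current window. The following is a rough sketch of that calculation, not the code in `observability/`: the equal-width binning, the `bins` default, and the `eps` floor for empty bins are all assumptions.

```python
import math
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bins built from the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(reference)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# A 1000-sample reference window (the default window size quoted above)
reference_window = [i / 1000 for i in range(1000)]
shifted_window = [x + 0.5 for x in reference_window]
score = psi(reference_window, shifted_window)  # large, since half the mass moved
```

Every term in the sum is non-negative, so the score is zero only when the two binned distributions match and grows as they diverge.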
modelmesh/
├── src/modelmesh/
│   ├── api/             # FastAPI app + routes (models, inference, deployments, monitoring)
│   ├── registry/        # Versioned store + sklearn/torch/onnx adapters
│   ├── routing/         # A/B, canary, shadow, P2C load balancer
│   ├── inference/       # Engine, Redis cache, dynamic batching
│   ├── observability/   # Prometheus metrics, structured logs, drift (PSI/KS), tracing
│   ├── reliability/     # Circuit breaker, retry, chaos
│   ├── db/              # SQLAlchemy ORM + Alembic migrations
│   └── config.py        # Pydantic settings
├── k8s/                 # deployment.yaml, service.yaml, hpa.yaml, configmap.yaml
├── prometheus/          # prometheus.yml
├── grafana/dashboards/  # Pre-built ModelMesh dashboard
├── examples/            # End-to-end demos: register, canary, shadow
├── tests/               # API, registry, routing, SDK, reliability
├── docs/                # architecture.md, api_reference.md, operations.md, benchmark_results.md
└── docker-compose.yml   # Full local stack (API + Postgres + Redis + Prometheus + Grafana)
- Async everywhere: FastAPI + asyncpg + httpx for the gateway path
- Reproducibility: all randomness controlled via seeded RNGs
- Typing: Pydantic v2 throughout; mypy strict in CI
- Migrations: Alembic-managed schema with versions/0001_initial.py
- Reliability: circuit breakers wrap every model call; configurable failure/recovery thresholds
- Observability: structured JSON logs with request IDs; Prometheus histograms with bucketing tuned for sub-ms to 5s
- Pre-commit: ruff, black, mypy, pytest run in GitHub Actions on every push
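The reliability bullet above mentions circuit breakers with configurable failure and recovery thresholds. A minimal three-state breaker might look like the sketch below. This is not the code in `reliability/`: the class, `failure_threshold`, `recovery_timeout_s`, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = 0.0
        self.state = State.CLOSED

    def call(self, fn: Callable, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = State.HALF_OPEN  # let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
                self.state = State.OPEN
                self._opened_at = self._clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self._failures = 0
        self.state = State.CLOSED
        return result
```

Wrapping each model call as `breaker.call(model.predict, batch)` lets a failing replica be skipped quickly instead of tying up gateway workers on repeated timeouts.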
ModelMesh Distributed ML Model Serving Platform | Python, FastAPI, PostgreSQL, Redis, Prometheus, Grafana, Docker, Kubernetes, SQLAlchemy, Alembic
- Developed a framework-agnostic ML model serving platform spanning 3,300+ lines of production Python and supporting 3 model formats (sklearn, PyTorch, ONNX), 4 deployment strategies (full, canary, shadow, A/B), and 21 versioned REST endpoints, enabling data science teams to ship models through a single registry without bespoke per-project deployment infrastructure
- Implemented a consistent-hashing A/B router, canary deployments with configurable sliding-window error tracking and auto-rollback, shadow deployments with response-diff logging, Redis-backed prediction cache, dynamic batching, power-of-two-choices load balancing, and PSI + Kolmogorov-Smirnov drift detection across rolling input windows
- Built a scalable async FastAPI gateway instrumented with 7 named Prometheus metrics (http_requests_total, http_request_duration_seconds, predictions_total, prediction_latency, active_deployments, canary_error_rate, cache_events), structured JSON logging, 3-state circuit breakers (CLOSED/OPEN/HALF_OPEN), Alembic-managed Postgres schema, and Kubernetes HPA manifests; load-tested at 594 RPS sustained on a single instance with p95 latency of 109ms, p99 of 160ms, and 100% success rate across 5,000 concurrent requests (reproducible via python run_load_test.py)
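Power-of-two-choices load balancing, mentioned above, samples two replicas at random and sends the request to the one with less work in flight. The sketch below shows the idea in isolation. `Replica` and `pick_replica` are hypothetical names, and using in-flight request count as the load signal is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    inflight: int = 0  # requests currently being served


def pick_replica(replicas: list[Replica], rng: random.Random) -> Replica:
    """Sample two distinct replicas at random and return the less loaded one."""
    if len(replicas) < 2:
        return replicas[0]
    a, b = rng.sample(replicas, 2)
    return a if a.inflight <= b.inflight else b


# Seeded RNG keeps the simulation reproducible.
rng = random.Random(7)
pool = [Replica(f"replica-{i}") for i in range(8)]
for _ in range(800):
    pick_replica(pool, rng).inflight += 1  # simulate requests that never finish
loads = [r.inflight for r in pool]
```

Comparing only two candidates avoids scanning every replica on each request while still keeping the spread between the busiest and idlest replica small.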
MIT — see LICENSE.
@software{modelmesh2025,
  title = {ModelMesh: Distributed ML Model Serving Platform},
  author = {ModelMesh Contributors},
  year = {2025},
  url = {https://github.com/yourorg/modelmesh}
}