
robust mproxy sidecar and billing-grade GCS cost tracking #1228

Merged
dodamih merged 15 commits into main from akhilesh/sidecar-hardening
Apr 21, 2026
Conversation

@akhileshh
Contributor

Sidecar robustness

  • Resources: mproxy 1Gi/2Gi, oom_tracker 256Mi/512Mi (previous limits OOM'd under load).
  • Self-heal: _run_main_loop restarts mitmdump in-place when it crashes.
  • stream_large_bodies=64k keeps mproxy memory bounded.
  • restart_policy=OnFailure for mazepa workers so kubelet restarts a dead sidecar in-place (training keeps Never).
  • Hard dependency: no silent "tracking disabled" fallback; wait_for_ca fails non-zero on timeout.
  • Disruption detection now walks all containers, tags container=<name>, and covers mproxy events.
  • Resilient watchers with retry-inside-loop + capped backoff; a transient K8s-API blip no longer kills disruption detection for the run.
  • Probes: dropped redundant startup probe; liveness relaxed to 60s grace.

Cost tracking accuracy

  • Shared classifier in new gcs_classification.py (stdlib-only) used by both sidecar and addon.
  • XML API (/{bucket}/{object}) now handled correctly — root cause of the 2× inflation.
  • Batch API routed to _batch (uncounted) instead of _unknown.
  • _unclassified is a bug signal — unrecognised paths log loudly, don't silently fold into Class A.
  • Egress on streamed bodies now counted via Content-Length (was missing: 320 MB vs 2.5 MB observed).
  • Per-bucket egress consistent with billed semantics (zero for same-region buckets).
  • Stats preservation across mitmdump/container restarts (addon resume + atexit flush).
  • ProxyError is transient so SQS visibility-timeout retry covers the restart window.

Move env-var setup, CA volume, wait_for_ca script, and the two sidecar
builders out of get_pod_spec. Dedupes the V1Container boilerplate via
_build_sidecar_container. No behavior change.

mproxy: cpu=200m/mem=512Mi requests, 1Gi memory limit, TCP startup+
liveness probes on :8080. oom_tracker: cpu=50m/mem=128Mi requests,
256Mi limit. Without guaranteed resources mproxy was starved during
main's CPU bursts and its TLS handshakes deadlocked — ProxyError on
main's GCS calls. Probes also catch hangs, not just crashes.

_check_pod_disruption walks all container statuses and tags messages
with container=<name>, so an mproxy OOM is distinguishable from a main
OOM. watch_for_run_events widens the "container main" filter to also
allow "container mproxy" so its K8s events land in events.log.
_resilient_watch moves the try/except inside the while loop with
capped backoff, so a transient GKE-master connection error no longer
permanently disables disruption detection. Both watchers collapse to
thin adapters on top of shared _resilient_watch / _open_watcher_log /
_append_watcher_log helpers.
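The retry-inside-the-loop shape can be sketched as below (a sketch under assumed names — `resilient_watch`, `open_stream`, and the `ConnectionError` stand-in for a GKE-master blip are illustrative, not the PR's exact `_resilient_watch` signature):

```python
import time

def resilient_watch(open_stream, handle_event, base_backoff=1.0, max_backoff=60.0):
    """Keep a K8s-style event watch alive for the whole run: a transient
    connection error re-opens the stream after a capped backoff instead
    of permanently disabling the watcher. open_stream() returning None
    signals a deliberate shutdown."""
    backoff = base_backoff
    while True:
        stream = open_stream()
        if stream is None:
            return  # shutdown requested
        try:
            for event in stream:
                handle_event(event)
                backoff = base_backoff  # stream is healthy again; reset
        except ConnectionError:
            time.sleep(backoff)
            backoff = min(backoff * 2.0, max_backoff)
```

The key difference from the old shape is that the try/except sits inside the while loop, so an exception re-enters the loop instead of propagating out and killing the watcher thread.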

wait_for_ca bounds its wait at 120s and exits 1 on timeout instead of
falling back to a "tracking disabled" path. _setup_mitmproxy_ca raises
on any setup failure; run_sidecar no longer has the "mproxy unavailable"
branch. Dockerfile asserts mitmdump is installed. GCS cost tracking is
a billing requirement — a broken mproxy must manifest as a visible
pod-level crash-loop, not silently run without tracking.

Extract pure-Python classifier + GCSStats into gcs_classification (no
zetta_utils deps) so the mitmproxy addon subprocess can use them without
pulling in firestore. Removes the duplicate copies in mitm_addon that
had drifted. Tightens the addon's host filter to an exact match.
extract_bucket_from_api_path now recognises XML-style /{bucket}/{object}
paths (cloudvolume / tensorstore use these — they were the bulk of the
_unknown bucket attribution). The classifier's JSON-API substring
heuristics for /o and /b are gated on the path actually being JSON,
fixing the regression where XML object reads with /o substrings
(.../object, .../output) flipped from Class B get to Class A list_objects.
17 new unit tests cover XML cases and the inflation regression.
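The XML-vs-JSON path distinction might look roughly like this (a simplified sketch, not the actual extractor — real GCS JSON-API paths are /storage/v1/b/{bucket}/..., with /upload and /download prefixed variants, while the XML API uses bare /{bucket}/{object} paths):

```python
def extract_bucket_from_api_path(path):
    """Return the bucket for a GCS request path, or None for a coverage gap."""
    parts = [p for p in path.split("?", 1)[0].split("/") if p]
    if parts and parts[0] in ("upload", "download"):  # /upload/storage/v1/...
        parts = parts[1:]
    if not parts or parts[0] == "batch":
        return None  # batch sub-operations are not parsed here
    if parts[0] == "storage":  # JSON API: /storage/v1/b/{bucket}/...
        return parts[3] if len(parts) >= 4 and parts[2] == "b" else None
    return parts[0]  # XML API: /{bucket}/{object...}
```

Because the XML branch matches on path shape rather than on /o or /b substrings, object names like .../output no longer flip a read into a list_objects classification.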

extract_bucket_from_api_path returning None for a non-batch path is a
classifier coverage gap, not a real bucket. Record under _unclassified
with op_class=None (operations count + egress increment, billed totals
do not), and log the full method+URL at WARNING so the gap can be
fixed. GCSStats.record now accepts op_class=None for this and the
upcoming _batch case.

route_request now categorises every request into one of: real bucket
(classified), _batch (POST /batch/storage/v1; sub-ops not yet parsed,
so uncounted), or _unclassified (coverage gap; callback logs the URL).
mitm_addon._record collapses to a one-line wrapper. _batch and
_unclassified make billing-impacting categories visible without
inflating Class A totals.
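The three-way routing described above can be sketched as follows (a hedged sketch — the signature and the injected `extract_bucket` callable are illustrative; the real route_request lives in gcs_classification.py):

```python
def route_request(method, path, extract_bucket):
    """Categorise a request: a real bucket, _batch (uncounted for now),
    or _unclassified (a classifier coverage gap worth logging loudly)."""
    clean = path.split("?", 1)[0]
    if method == "POST" and clean.startswith("/batch/storage/v1"):
        return "_batch"  # sub-operations not yet parsed, so uncounted
    bucket = extract_bucket(path)
    if bucket is None:
        # coverage gap: count ops + egress, keep billed totals honest,
        # and log the full method+URL at WARNING in the real addon
        return "_unclassified"
    return bucket
```

Keeping _batch and _unclassified as distinct sentinel categories is what makes billing-impacting blind spots visible without polluting Class A totals.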

Move classifier / extractor / GCSStats / route_request tests out of
test_gcs_tracker.py into a new test_gcs_classifier.py, parametrize the
per-row cases, and add a captured-traffic fixture asserting expected
(bucket, class, operation) end-to-end via route_request. A coverage
guard fails if any fixture row falls through to _unclassified, so a
classifier-coverage gap surfaces as a test failure rather than silent
cost drift.
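A coverage guard of this shape might look like the following (a sketch with a stand-in classifier and hypothetical fixture rows — the real guard runs the shared route_request over captured traffic):

```python
# Hypothetical fixture rows: (method, path, expected_category)
FIXTURE_ROWS = [
    ("GET", "/my-bucket/layer/0/chunk", "my-bucket"),
    ("POST", "/batch/storage/v1", "_batch"),
]

def classify(method, path):
    # stand-in for route_request; see gcs_classification in the PR
    if method == "POST" and path.startswith("/batch/storage/v1"):
        return "_batch"
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else "_unclassified"

def coverage_guard(rows):
    """Fail if any captured-traffic row falls through to _unclassified,
    so a classifier gap surfaces as a test failure, not cost drift."""
    gaps = [(m, p) for m, p, _ in rows if classify(m, p) == "_unclassified"]
    assert not gaps, f"classifier coverage gaps: {gaps}"
    for method, path, expected in rows:
        assert classify(method, path) == expected
```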

Egress was only counted on GET responses and used the decoded body
length. GCP bills wire bytes for egress, and listings / metadata /
resumable-upload status responses also have egress. Now counts on any
response with a body, using Content-Length when present (with
len(body) fallback for chunked / multipart).
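That counting rule can be sketched as (function name and dict-shaped headers are illustrative; the real addon reads mitmproxy response objects):

```python
def compute_egress_bytes(headers, body):
    """GCP bills wire bytes: prefer Content-Length (it covers streamed
    bodies that never reach the addon), falling back to len(body) for
    chunked / multipart responses with no Content-Length header."""
    content_length = headers.get("Content-Length") or headers.get("content-length")
    if content_length is not None:
        try:
            return int(content_length)
        except ValueError:
            pass  # malformed header; fall through to the body length
    return len(body) if body else 0
```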

Procedure to diff per-run sidecar totals (Class A/B ops, egress) against
GCP BigQuery billing export on a dedicated project. Run after classifier
changes, before promising new tracked SKUs to a customer, and on a
quarterly cadence. Captures per-discrepancy iteration history.

- Resource reservations: mproxy 1Gi/2Gi, oom_tracker 256Mi/512Mi
  (previous limits OOM'd under sustained concurrent TLS + firestore
  imports).
- stream_large_bodies=64k: stream large response bodies instead of
  buffering them for the addon.
- Self-heal: _run_main_loop restarts mitmdump in-place (bounded) when
  it crashes, so port 8080 stays bound without depending on K8s
  restart-policy.
- Drop TCP startup probe, relax liveness to 60s grace — self-heal +
  wait_for_ca already cover readiness; the probe was actively harmful
  under chaos.
- Mazepa worker pods use restart_policy=OnFailure so kubelet restarts
  a dead sidecar in-place. Training keeps Never (pod-level recovery
  via torch elastic).
- Preserve stats across mitmdump restart: mitm_addon resumes from
  /tmp/gcs_tracker_stats.json on startup so self-heal does not reset
  _stats to empty. Collapse module-level side effects into _init().
- atexit flush in gcs_tracker alongside the SIGTERM handler so any
  clean Python exit path flushes stats to Firestore.
- Count egress on streamed bodies: drop the "if body else 0" guard
  so compute_egress_bytes falls back to Content-Length when
  stream_large_bodies is in effect. Observed 320 MB tracked on a
  workload that previously reported 2.5 MB.
- Per-bucket egress_bytes zeroed for same-region buckets at
  aggregation, matching the cross-region-only top-level total.
- Treat "Unable to connect to proxy" as a transient error so GCS
  calls that hit mproxy during a restart window retry via SQS
  visibility timeout instead of aborting the run.
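The resume-plus-atexit-flush pattern from the list above can be sketched as (a minimal sketch — the helper names are illustrative; only the /tmp/gcs_tracker_stats.json path comes from the PR):

```python
import atexit
import json
import os
import tempfile

STATS_PATH = "/tmp/gcs_tracker_stats.json"  # path from the PR description

def load_stats(path=STATS_PATH):
    """Resume from the previous addon process so an in-place mitmdump
    restart does not reset the stats to empty."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}  # first start, or an unreadable file: start fresh

def flush_stats(stats, path=STATS_PATH):
    """Atomic write so a crash mid-flush never truncates the stats file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(stats, f)
    os.replace(tmp, path)

def _init(stats):
    # registered alongside the SIGTERM handler so any clean exit flushes
    atexit.register(flush_stats, stats)
```

The atomic rename is the important detail: a self-heal restart that lands mid-flush should see the previous complete snapshot, never a half-written JSON file.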

CloudFiles-driven flow that exercises every classifier branch in one
task: XML puts, gets, heads, JSON list, batch delete. Used by the
stress spec to validate stats preservation and egress accuracy
without a heavy real workload.
@codecov

codecov bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.98%. Comparing base (bd079b4) to head (12dc983).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1228   +/-   ##
=======================================
  Coverage   99.98%   99.98%           
=======================================
  Files         200      200           
  Lines       10642    10642           
=======================================
  Hits        10640    10640           
  Misses          2        2           


@dodamih dodamih merged commit 8c416ba into main Apr 21, 2026
7 of 10 checks passed