
robust mproxy sidecar and billing-grade GCS cost tracking #1228

Merged
dodamih merged 15 commits into main from akhilesh/sidecar-hardening
Apr 21, 2026
Conversation

@akhileshh
Contributor

Sidecar robustness

  • Resources: mproxy 1Gi/2Gi, oom_tracker 256Mi/512Mi (previous limits OOM'd under load).
  • Self-heal: _run_main_loop restarts mitmdump in-place when it crashes.
  • stream_large_bodies=64k keeps mproxy memory bounded.
  • restart_policy=OnFailure for mazepa workers so kubelet restarts a dead sidecar in-place (training keeps Never).
  • Hard dependency: no silent "tracking disabled" fallback; wait_for_ca fails non-zero on timeout.
  • Disruption detection now walks all containers, tags container=<name>, and covers mproxy events.
  • Resilient watchers with retry-inside-loop + capped backoff; a transient K8s-API blip no longer kills disruption detection for the run.
  • Probes: dropped redundant startup probe; liveness relaxed to 60s grace.

Cost tracking accuracy

  • Shared classifier in new gcs_classification.py (stdlib-only) used by both sidecar and addon.
  • XML API (/{bucket}/{object}) now handled correctly — root cause of the 2× inflation.
  • Batch API routed to _batch (uncounted) instead of _unknown.
  • _unclassified is a bug signal — unrecognised paths log loudly, don't silently fold into Class A.
  • Egress on streamed bodies now counted via Content-Length (was missing: 320 MB vs 2.5 MB observed).
  • Per-bucket egress consistent with billed semantics (zero for same-region buckets).
  • Stats preservation across mitmdump/container restarts (addon resume + atexit flush).
  • ProxyError is transient so SQS visibility-timeout retry covers the restart window.

Move env-var setup, CA volume, wait_for_ca script, and the two sidecar
builders out of get_pod_spec. Dedupes the V1Container boilerplate via
_build_sidecar_container. No behavior change.

mproxy: cpu=200m/mem=512Mi requests, 1Gi memory limit, TCP startup+
liveness probes on :8080. oom_tracker: cpu=50m/mem=128Mi requests,
256Mi limit. Without guaranteed resources mproxy was starved during
main's CPU bursts and its TLS handshakes deadlocked — ProxyError on
main's GCS calls. Probes also catch hangs, not just crashes.

_check_pod_disruption walks all container statuses and tags messages
with container=<name>, so an mproxy OOM is distinguishable from a main
OOM. watch_for_run_events widens the "container main" filter to also
allow "container mproxy" so its K8s events land in events.log.
_resilient_watch moves the try/except inside the while loop with
capped backoff, so a transient GKE-master connection error no longer
permanently disables disruption detection. Both watchers collapse to
thin adapters on top of shared _resilient_watch / _open_watcher_log /
_append_watcher_log helpers.
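The retry-inside-the-loop shape can be sketched as below (a sketch under assumed names — `resilient_watch`, `open_stream`, and the `ConnectionError` stand-in for a GKE-master blip are illustrative, not the PR's exact `_resilient_watch` signature):

```python
import time

def resilient_watch(open_stream, handle_event, base_backoff=1.0, max_backoff=60.0):
    """Keep a K8s-style event watch alive for the whole run: a transient
    connection error re-opens the stream after a capped backoff instead
    of permanently disabling the watcher. open_stream() returning None
    signals a deliberate shutdown."""
    backoff = base_backoff
    while True:
        stream = open_stream()
        if stream is None:
            return  # shutdown requested
        try:
            for event in stream:
                handle_event(event)
                backoff = base_backoff  # stream is healthy again; reset
        except ConnectionError:
            time.sleep(backoff)
            backoff = min(backoff * 2.0, max_backoff)
```

The key difference from the old shape is that the try/except sits inside the while loop, so an exception re-enters the loop instead of propagating out and killing the watcher thread.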

wait_for_ca bounds its wait at 120s and exits 1 on timeout instead of
falling back to a "tracking disabled" path. _setup_mitmproxy_ca raises
on any setup failure; run_sidecar no longer has the "mproxy unavailable"
branch. Dockerfile asserts mitmdump is installed. GCS cost tracking is
a billing requirement — a broken mproxy must manifest as a visible
pod-level crash-loop, not silently run without tracking.

Extract pure-Python classifier + GCSStats into gcs_classification (no
zetta_utils deps) so the mitmproxy addon subprocess can use them without
pulling in firestore. Removes the duplicate copies in mitm_addon that
had drifted. Tightens the addon's host filter to an exact match.
extract_bucket_from_api_path now recognises XML-style /{bucket}/{object}
paths (cloudvolume / tensorstore use these — they were the bulk of the
_unknown bucket attribution). The classifier's JSON-API substring
heuristics for /o and /b are gated on the path actually being JSON,
fixing the regression where XML object reads with /o substrings
(.../object, .../output) flipped from Class B get to Class A list_objects.
17 new unit tests cover XML cases and the inflation regression.
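The XML-vs-JSON path distinction might look roughly like this (a simplified sketch, not the actual extractor — real GCS JSON-API paths are /storage/v1/b/{bucket}/..., with /upload and /download prefixed variants, while the XML API uses bare /{bucket}/{object} paths):

```python
def extract_bucket_from_api_path(path):
    """Return the bucket for a GCS request path, or None for a coverage gap."""
    parts = [p for p in path.split("?", 1)[0].split("/") if p]
    if parts and parts[0] in ("upload", "download"):  # /upload/storage/v1/...
        parts = parts[1:]
    if not parts or parts[0] == "batch":
        return None  # batch sub-operations are not parsed here
    if parts[0] == "storage":  # JSON API: /storage/v1/b/{bucket}/...
        return parts[3] if len(parts) >= 4 and parts[2] == "b" else None
    return parts[0]  # XML API: /{bucket}/{object...}
```

Because the XML branch matches on path shape rather than on /o or /b substrings, object names like .../output no longer flip a read into a list_objects classification.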

extract_bucket_from_api_path returning None for a non-batch path is a
classifier coverage gap, not a real bucket. Record under _unclassified
with op_class=None (operations count + egress increment, billed totals
do not), and log the full method+URL at WARNING so the gap can be
fixed. GCSStats.record now accepts op_class=None for this and the
upcoming _batch case.

route_request now categorises every request into one of: real bucket
(classified), _batch (POST /batch/storage/v1; sub-ops not yet parsed,
so uncounted), or _unclassified (coverage gap; callback logs the URL).
mitm_addon._record collapses to a one-line wrapper. _batch and
_unclassified make billing-impacting categories visible without
inflating Class A totals.
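The three-way routing described above can be sketched as follows (a hedged sketch — the signature and the injected `extract_bucket` callable are illustrative; the real route_request lives in gcs_classification.py):

```python
def route_request(method, path, extract_bucket):
    """Categorise a request: a real bucket, _batch (uncounted for now),
    or _unclassified (a classifier coverage gap worth logging loudly)."""
    clean = path.split("?", 1)[0]
    if method == "POST" and clean.startswith("/batch/storage/v1"):
        return "_batch"  # sub-operations not yet parsed, so uncounted
    bucket = extract_bucket(path)
    if bucket is None:
        # coverage gap: count ops + egress, keep billed totals honest,
        # and log the full method+URL at WARNING in the real addon
        return "_unclassified"
    return bucket
```

Keeping _batch and _unclassified as distinct sentinel categories is what makes billing-impacting blind spots visible without polluting Class A totals.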

Move classifier / extractor / GCSStats / route_request tests out of
test_gcs_tracker.py into a new test_gcs_classifier.py, parametrize the
per-row cases, and add a captured-traffic fixture asserting expected
(bucket, class, operation) end-to-end via route_request. A coverage
guard fails if any fixture row falls through to _unclassified, so a
classifier-coverage gap surfaces as a test failure rather than silent
cost drift.
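A coverage guard of this shape might look like the following (a sketch with a stand-in classifier and hypothetical fixture rows — the real guard runs the shared route_request over captured traffic):

```python
# Hypothetical fixture rows: (method, path, expected_category)
FIXTURE_ROWS = [
    ("GET", "/my-bucket/layer/0/chunk", "my-bucket"),
    ("POST", "/batch/storage/v1", "_batch"),
]

def classify(method, path):
    # stand-in for route_request; see gcs_classification in the PR
    if method == "POST" and path.startswith("/batch/storage/v1"):
        return "_batch"
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else "_unclassified"

def coverage_guard(rows):
    """Fail if any captured-traffic row falls through to _unclassified,
    so a classifier gap surfaces as a test failure, not cost drift."""
    gaps = [(m, p) for m, p, _ in rows if classify(m, p) == "_unclassified"]
    assert not gaps, f"classifier coverage gaps: {gaps}"
    for method, path, expected in rows:
        assert classify(method, path) == expected
```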

Egress was only counted on GET responses and used the decoded body
length. GCP bills wire bytes for egress, and listings / metadata /
resumable-upload status responses also have egress. Now counts on any
response with a body, using Content-Length when present (with
len(body) fallback for chunked / multipart).
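That counting rule can be sketched as (function name and dict-shaped headers are illustrative; the real addon reads mitmproxy response objects):

```python
def compute_egress_bytes(headers, body):
    """GCP bills wire bytes: prefer Content-Length (it covers streamed
    bodies that never reach the addon), falling back to len(body) for
    chunked / multipart responses with no Content-Length header."""
    content_length = headers.get("Content-Length") or headers.get("content-length")
    if content_length is not None:
        try:
            return int(content_length)
        except ValueError:
            pass  # malformed header; fall through to the body length
    return len(body) if body else 0
```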

Procedure to diff per-run sidecar totals (Class A/B ops, egress) against
GCP BigQuery billing export on a dedicated project. Run after classifier
changes, before promising new tracked SKUs to a customer, and on a
quarterly cadence. Captures per-discrepancy iteration history.

- Resource reservations: mproxy 1Gi/2Gi, oom_tracker 256Mi/512Mi
  (previous limits OOM'd under sustained concurrent TLS + firestore
  imports).
- stream_large_bodies=64k: stream large response bodies instead of
  buffering them for the addon.
- Self-heal: _run_main_loop restarts mitmdump in-place (bounded) when
  it crashes, so port 8080 stays bound without depending on K8s
  restart-policy.
- Drop TCP startup probe, relax liveness to 60s grace — self-heal +
  wait_for_ca already cover readiness; the probe was actively harmful
  under chaos.
- Mazepa worker pods use restart_policy=OnFailure so kubelet restarts
  a dead sidecar in-place. Training keeps Never (pod-level recovery
  via torch elastic).
- Preserve stats across mitmdump restart: mitm_addon resumes from
  /tmp/gcs_tracker_stats.json on startup so self-heal does not reset
  _stats to empty. Collapse module-level side effects into _init().
- atexit flush in gcs_tracker alongside the SIGTERM handler so any
  clean Python exit path flushes stats to Firestore.
- Count egress on streamed bodies: drop the "if body else 0" guard
  so compute_egress_bytes falls back to Content-Length when
  stream_large_bodies is in effect. Observed 320 MB tracked on a
  workload that previously reported 2.5 MB.
- Per-bucket egress_bytes zeroed for same-region buckets at
  aggregation, matching the cross-region-only top-level total.
- Treat "Unable to connect to proxy" as a transient error so GCS
  calls that hit mproxy during a restart window retry via SQS
  visibility timeout instead of aborting the run.
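The resume-plus-atexit-flush pattern from the list above can be sketched as (a minimal sketch — the helper names are illustrative; only the /tmp/gcs_tracker_stats.json path comes from the PR):

```python
import atexit
import json
import os
import tempfile

STATS_PATH = "/tmp/gcs_tracker_stats.json"  # path from the PR description

def load_stats(path=STATS_PATH):
    """Resume from the previous addon process so an in-place mitmdump
    restart does not reset the stats to empty."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}  # first start, or an unreadable file: start fresh

def flush_stats(stats, path=STATS_PATH):
    """Atomic write so a crash mid-flush never truncates the stats file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(stats, f)
    os.replace(tmp, path)

def _init(stats):
    # registered alongside the SIGTERM handler so any clean exit flushes
    atexit.register(flush_stats, stats)
```

The atomic rename is the important detail: a self-heal restart that lands mid-flush should see the previous complete snapshot, never a half-written JSON file.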

CloudFiles-driven flow that exercises every classifier branch in one
task: XML puts, gets, heads, JSON list, batch delete. Used by the
stress spec to validate stats preservation and egress accuracy
without a heavy real workload.
@codecov

codecov bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.98%. Comparing base (bd079b4) to head (12dc983).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1228   +/-   ##
=======================================
  Coverage   99.98%   99.98%           
=======================================
  Files         200      200           
  Lines       10642    10642           
=======================================
  Hits        10640    10640           
  Misses          2        2           


@dodamih dodamih merged commit 8c416ba into main Apr 21, 2026
7 of 10 checks passed