fix(e2e): eliminate docker network address-pool exhaustion by joshua-temple · Pull Request #125 · stablekernel/cascade

joshua-temple · 2026-06-12T02:15:14Z

Problem

The e2e suite fails non-deterministically with the docker error "all predefined address pools have been fully subnetted". Each scenario creates its own docker network, and the previous teardown removed it fire-and-forget, without waiting for or checking the result. Across the serial scenario suite, leaked networks drain the daemon's small default address pool until a late scenario cannot allocate a subnet and dies at setup. It is not a timeout and not a defect in the failing scenario.

A secondary symptom: a few scenarios exhausted all of the transient act-exec retries under heavy contention, even though the retry recovered several others on a later attempt.

Fix

Synchronous, verified network teardown in the harness Cleanup(). Removal now waits for and checks the result, with a short bounded retry so a container that is still detaching cannot leave the network behind. The scenario-retry layer already defers Cleanup() per attempt, so every attempt (including a failed one) releases its network. The invariant: running many scenarios does not grow the docker network count.
CI daemon address-pool headroom. A setup step writes a generous default-address-pools setting (10.99.0.0/16 carved into /24 subnets, 256 networks) to the docker daemon configuration and restarts docker before the tests. This is belt and suspenders alongside the harness fix, so brief cleanup lag cannot exhaust the pool.
Retry cap raised from 3 to 5 with a short inter-attempt backoff and a best-effort network prune between attempts. The transient-vs-real classification is untouched: only transient act-exec failures retry; real failures and expect_failure never do.

Verification

go build ./..., go vet ./... (e2e module), and golangci-lint run ./e2e/... all clean.
Retry/cleanup logic covered by unit tests (no docker), including prune-between-attempts and the new attempt count.
Network-leak proof: recorded docker network ls count before and after running the multi-repo scenario plus multi-step scenarios at -parallel 1; the count returns to baseline rather than growing.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

joshua-temple added 4 commits June 11, 2026 22:14

fix(e2e): remove docker network leak with synchronous verified teardown

cf5a154

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

fix(e2e): raise scenario retry cap to 5 with backoff and network prune

0091d7b

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

ci: widen docker address pool for e2e network headroom

d7c137f

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

fix(e2e): reap act-spawned job containers blocking network removal

869998e

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

joshua-temple merged commit 91610eb into main Jun 12, 2026
6 checks passed

joshua-temple merged 4 commits into main from fix/e2e-network-leak
