
fix(e2e): eliminate docker network address-pool exhaustion #125

Merged: joshua-temple merged 4 commits into main from fix/e2e-network-leak on Jun 12, 2026


@joshua-temple

Collaborator

Problem

The e2e suite fails non-deterministically with the docker error "all predefined address pools have been fully subnetted". Each scenario creates its own docker network, and the previous teardown removed it fire-and-forget, without waiting for or checking the result. Over the serial scenario suite, leaked networks accumulate and drain the daemon's small default address pool until a late scenario cannot allocate a network and dies at setup. The failure is not a timeout and not a defect in the scenario that happens to fail.

A secondary symptom: under heavy contention, a few scenarios exhausted all of their transient act-exec retries, even though the same retry mechanism recovered several other scenarios on a later attempt.

Fix

  1. Synchronous, verified network teardown in the harness Cleanup(). Removal now waits for and checks the result, with a short bounded retry so a container that is still detaching cannot leave the network behind. The scenario-retry layer already defers Cleanup() per attempt, so every attempt (including a failed one) releases its network. The invariant: running many scenarios does not grow the docker network count.
  2. CI daemon address-pool headroom. A setup step writes a generous default-address-pools setting (10.99.0.0/16 carved into /24 subnets, 256 networks) and restarts docker before the tests. This is belt and suspenders alongside the harness fix, so that brief cleanup lag cannot exhaust the pool.
  3. Retry cap raised from 3 to 5 with a short inter-attempt backoff and a best-effort network prune between attempts. The transient-vs-real classification is untouched: only transient act-exec failures retry; real failures and expect_failure never do.

Verification

  • go build ./..., go vet ./... (e2e module), and golangci-lint run ./e2e/... all clean.
  • Retry/cleanup logic covered by unit tests (no docker), including prune-between-attempts and the new attempt count.
  • Network-leak proof: recorded docker network ls count before and after running the multi-repo scenario plus multi-step scenarios at -parallel 1; the count returns to baseline rather than growing.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>
@joshua-temple joshua-temple merged commit 91610eb into main Jun 12, 2026
6 checks passed