fix: harden e2e against gitea throttling and container contention #121

Merged
joshua-temple merged 1 commit into main from fix/e2e-transient-contention-flake
Jun 11, 2026

@joshua-temple

Collaborator

Problem

The act-heavy e2e scenarios pass individually but fail intermittently when many run concurrently (in CI with -parallel 2 on a 4-core runner, and locally when the full suite runs at the default parallelism, GOMAXPROCS). Two transient symptoms recur under load:

  • gitea returns 405 Method Not Allowed - {"message":"Please try again later"} (throttling under load), notably on PR merge.
  • act runs error transiently under container/gitea pressure, which also cascades into downstream assertion mismatches when a prior step's effect does not land.

These are resource-contention flakes, not product or scenario defects.

Fix

  1. Serial e2e in CI. build-cli.yaml now runs go test -parallel 1 -timeout 60m, with the job's timeout-minutes set to 70. e2e.yaml's dispatch default parallelism drops from 2 to 1. Serial execution removes the container contention that is the root cause; the longer timeouts cover the increased wall-clock time of a serial run.
  2. Bounded retry on transient gitea responses. The gitea REST calls observed to throttle (merge, create-pr, create-branch, change-files, label create/apply) are wrapped in a small retry (5 attempts, short linear backoff) that retries only on 405 "try again later" and on 5xx. Real 4xx client errors are surfaced immediately, so expect-failure assertions stay deterministic. The retry is safe because gitea returns these throttle responses before applying any state change, so a re-issue cannot double-apply a mutation.
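
The retry policy in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `response` type, `isTransient`, `retryTransient`, and `errExhausted` are hypothetical names, the 10ms backoff step is an assumed value, and context cancellation is left out to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// response is a hypothetical, minimal view of a gitea REST reply.
type response struct {
	Status int
	Body   string
}

// isTransient reports whether a reply looks like load-induced throttling:
// a 405 whose body says "try again later", or any 5xx.
// Every other status, including ordinary 4xx errors, is treated as final.
func isTransient(r response) bool {
	if r.Status >= 500 && r.Status <= 599 {
		return true
	}
	return r.Status == 405 && strings.Contains(strings.ToLower(r.Body), "try again later")
}

var errExhausted = errors.New("transient failures exhausted all attempts")

// retryTransient calls fn up to attempts times. After the nth transient
// reply it sleeps n*step (linear backoff) before trying again.
func retryTransient(attempts int, step time.Duration, fn func() (response, error)) (response, error) {
	var last response
	for n := 1; n <= attempts; n++ {
		r, err := fn()
		if err != nil {
			return r, err // transport errors are out of scope for this sketch
		}
		if !isTransient(r) {
			return r, nil // success or a real client error: surface immediately
		}
		last = r
		if n < attempts {
			time.Sleep(time.Duration(n) * step)
		}
	}
	return last, fmt.Errorf("%w: last status %d", errExhausted, last.Status)
}

func main() {
	calls := 0
	r, err := retryTransient(5, 10*time.Millisecond, func() (response, error) {
		calls++
		if calls < 3 { // simulate two throttled replies, then success
			return response{Status: 405, Body: `{"message":"Please try again later"}`}, nil
		}
		return response{Status: 200, Body: "{}"}, nil
	})
	fmt.Println(r.Status, calls, err)
}
```

Because a 422 or 404 is returned on the first attempt, a scenario that expects a failure sees exactly one request, which is what keeps expect-failure assertions deterministic.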

No retry was added at the act-run layer: a bare retry of an act run that may already have mutated gitea state is unsafe, and the act exit code does not reliably distinguish a transient infra failure from a genuine job-failure conclusion (which legitimate expect-failure scenarios produce). The gitea-client retry covers the transient setup-step throttle that is the usual upstream cause, and the serial execution removes the contention itself.

Verification

  • go build ./..., go vet ./... (e2e), and golangci-lint run ./e2e/... all clean.
  • New retry unit tests pass (transient classification plus retry/no-retry/exhaustion/cancellation behaviour over an httptest server).
  • One gitea+act scenario (Hotfix Clean Apply, which exercises the retried merge path) passes locally with the change.
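
For readers unfamiliar with the pattern, checks of this kind can be driven by an httptest server that throttles a fixed number of requests before answering normally. The sketch below is an assumption about the general shape, not the tests added in this PR: `postWithRetry` and `probe` are hypothetical helpers, and the 1ms backoff step is chosen only to keep the run fast.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
	"time"
)

// postWithRetry is a hypothetical stand-in for a retried gitea client call.
// It retries only on 405 "try again later" and 5xx, and stops on ctx cancel.
func postWithRetry(ctx context.Context, url string, maxAttempts int, step time.Duration) (int, error) {
	status := 0
	for n := 1; n <= maxAttempts; n++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
		if err != nil {
			return 0, err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return 0, err
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		status = resp.StatusCode
		throttled := status == http.StatusMethodNotAllowed &&
			strings.Contains(string(body), "try again later")
		if !throttled && status < 500 {
			return status, nil // success or a real client error
		}
		if n == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return status, ctx.Err()
		case <-time.After(time.Duration(n) * step): // linear backoff
		}
	}
	return status, fmt.Errorf("gave up after %d attempts, last status %d", maxAttempts, status)
}

// probe starts a server that throttles the first `throttle` requests and then
// answers with `final`. It returns how many requests the server received,
// plus the status and error seen by the client.
func probe(throttle int32, final, maxAttempts int) (hits int32, status int, err error) {
	var count atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if count.Add(1) <= throttle {
			w.WriteHeader(http.StatusMethodNotAllowed)
			io.WriteString(w, `{"message":"Please try again later"}`)
			return
		}
		w.WriteHeader(final)
	}))
	defer srv.Close()
	status, err = postWithRetry(context.Background(), srv.URL, maxAttempts, time.Millisecond)
	return count.Load(), status, err
}

func main() {
	fmt.Println(probe(2, http.StatusOK, 5))       // retry: two throttles, then success
	fmt.Println(probe(0, http.StatusNotFound, 5)) // no retry: real 4xx
	fmt.Println(probe(9, http.StatusOK, 3))       // exhaustion: throttled every time
}
```

Counting requests on the server side, rather than in the client, is what lets the no-retry case assert that a real 4xx produced exactly one request.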

Note: the Promote force scenario fails deterministically on current main in isolation (a separate pre-existing scenario defect, reproduced on a clean checkout with no concurrency), so it is tracked separately and is not the contention flake this PR targets.

Run the act-heavy e2e scenarios serially (-parallel 1) in CI and raise the
go test timeout to 60m. Under concurrent execution the 4-core runner's
gitea + act + job containers contend, throttling gitea (405 'try again
later') and destabilising act runs, which surfaces as intermittent,
product-unrelated failures. Serial execution removes that contention.

Wrap the gitea REST calls that have been observed to throttle (merge,
create-pr, create-branch, change-files, label create/apply) with a bounded
retry on transient 405 'try again later' and 5xx responses. The retry is
safe: gitea returns these before applying any state change, so re-issuing
cannot double-apply a mutation. Real 4xx client errors are surfaced
immediately so expect-failure assertions stay deterministic.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>
@joshua-temple merged commit 1d10818 into main on Jun 11, 2026
6 checks passed