fix: retry whole scenarios on transient act/docker execution flake #124

Merged: joshua-temple merged 3 commits into main from fix/e2e-scenario-transient-retry on Jun 12, 2026

@joshua-temple

Collaborator

Problem

Even serialized at -parallel 1 and with the gitea-API retry from #121, the act-heavy
multi-step e2e scenarios still occasionally fail on an act run with a transient
"workflow execution failed" error. This is an act/docker exec hiccup, not a
gitea throttle and not a real assertion failure. At roughly one transient per full
suite run across about 30 scenarios, a full run is only ~40-60% likely to be fully
green, which makes three consecutive green Orchestrate runs unlikely.

Retrying a partial, state-mutating act run in place is unsafe, so #121 deliberately
left act-run retry out.

Fix

Each multi-step scenario already runs against a fresh per-scenario gitea repo and
fresh act containers, so re-running an entire scenario from scratch is a clean slate
with no carried-over mutation. That is the safe layer to retry at.

  • Tag the act/docker exec class precisely. A non-zero act exit sets a new ExecError
    flag on the workflow result; the orchestrate, promote, and hotfix failure paths
    return an error wrapping a sentinel (errTransientWorkflow) only in that case. A
    real job-level failure conclusion, a missing-workflow failure, and every
    assertion mismatch stay plain errors.
  • Wrap each scenario in a bounded retry (up to 3 attempts) that re-runs the whole
    scenario from a fresh harness, retrying only on the sentinel. Real assertion
    failures, expect_failure mismatches, and genuine job-level failures fail on the
    first attempt with no retry, so a flake can never mask a regression. Every retry is
    logged so flakes stay visible.

Verification

  • go build ./..., go vet ./..., and golangci-lint run ./... clean (root and the
    e2e module).
  • New unit tests cover the transient classification and the retry decision loop
    (passes first attempt, recovers after a transient, fails immediately on an
    assertion error, does not retry a real job failure, exhausts the bound on a
    persistent transient). Full e2e short suite green; new tests pass under -race.
  • Ran the act-heavy scenarios that have flaked (Hotfix_Stacked,
    Four_Environment_Cascade_Promotion, and the promote rollback scenario) live under
    Docker. One run reproduced the exact transient at the promote baseline step and the
    retry recovered it on the second attempt; a second clean run passed all three with
    no retry needed.
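
The five retry-loop cases listed in the unit-test bullet could be laid out as a table like the one below. This is a self-contained sketch, not the PR's tests: retry, replay, and errTransient are hypothetical stand-ins, and a real version would presumably use the testing package rather than main.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for the errTransientWorkflow sentinel.
var errTransient = errors.New("transient")

// retry is a hypothetical stand-in for the loop under test: it stops on
// success or on any error that does not wrap the sentinel.
func retry(max int, run func() error) (attempts int, err error) {
	for attempts < max {
		attempts++
		if err = run(); err == nil || !errors.Is(err, errTransient) {
			return attempts, err
		}
	}
	return attempts, err
}

// replay returns a scenario that yields the given outcomes in order and
// then keeps repeating the last one.
func replay(outcomes ...error) func() error {
	i := 0
	return func() error {
		err := outcomes[i]
		if i < len(outcomes)-1 {
			i++
		}
		return err
	}
}

type retryCase struct {
	name         string
	outcomes     []error
	wantAttempts int
	wantErr      bool
}

func cases() []retryCase {
	transient := fmt.Errorf("act run: %w", errTransient)
	assertion := errors.New("assertion mismatch")
	jobFail := errors.New("job concluded failure")
	return []retryCase{
		{"passes first attempt", []error{nil}, 1, false},
		{"recovers after a transient", []error{transient, nil}, 2, false},
		{"fails immediately on an assertion error", []error{assertion}, 1, true},
		{"does not retry a real job failure", []error{jobFail}, 1, true},
		{"exhausts the bound on a persistent transient", []error{transient}, 3, true},
	}
}

func main() {
	for _, c := range cases() {
		attempts, err := retry(3, replay(c.outcomes...))
		ok := attempts == c.wantAttempts && (err != nil) == c.wantErr
		fmt.Printf("%-45s attempts=%d ok=%v\n", c.name, attempts, ok)
	}
}
```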

Tag a non-zero act exit as an execution-layer hiccup (ExecError) distinct from a real job-level failure conclusion, and return a sentinel-wrapped error from the orchestrate/promote/hotfix failure paths so callers can tell a retryable infrastructure flake from a deterministic outcome.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

Wrap each multi-step scenario in a bounded retry that re-runs the entire scenario from a fresh gitea repo and fresh act containers when, and only when, a step fails with a transient act/docker execution error. Real assertion mismatches, expect_failure mismatches, and genuine job-level failures fail on the first attempt with no retry, so a flake never masks a real regression.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>
Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>
@joshua-temple joshua-temple merged commit a76b35c into main Jun 12, 2026
6 checks passed