fix: retry whole scenarios on transient act/docker execution flake by joshua-temple · Pull Request #124 · stablekernel/cascade

joshua-temple · 2026-06-12T00:42:01Z

Problem

Even serialized at -parallel 1 with the gitea-API retry from #121, the act-heavy
multi-step e2e scenarios still occasionally fail with a transient
"workflow execution failed" error on an act run. This is an act/docker exec hiccup, not a
gitea throttle and not a real assertion failure. At roughly one transient per full
suite run across about 30 scenarios, a full run is only ~40-60% likely to be fully
green, which makes three consecutive green Orchestrate runs unlikely.
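
To make the "three consecutive green runs" concern concrete, here is a rough back-of-the-envelope sketch. It assumes each full run passes independently with the quoted 40-60% probability; the independence assumption and the helper name consecutiveGreen are illustrative and not part of the PR.

```go
package main

import (
	"fmt"
	"math"
)

// consecutiveGreen returns the chance of n fully green runs in a row,
// assuming (hypothetically) that runs are independent and each one
// passes with probability p.
func consecutiveGreen(p float64, n int) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// The 0.4 and 0.6 bounds come from the "~40-60%" estimate above.
	for _, p := range []float64{0.4, 0.6} {
		fmt.Printf("p=%.1f: three green runs in a row: %.1f%%\n", p, 100*consecutiveGreen(p, 3))
	}
}
```

Under that assumption, three green runs in a row land somewhere around 6% to 22%, which is why retrying at some layer is needed.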

Retrying a partial, state-mutating act run in place is unsafe, so #121 deliberately
left act-run retry out.

Fix

Each multi-step scenario already runs against a fresh per-scenario gitea repo and
fresh act containers, so re-running an entire scenario from scratch is a clean slate
with no carried-over mutation. That is the safe layer to retry at.

Tag the act/docker exec class precisely. A non-zero act exit sets a new ExecError
flag on the workflow result; the orchestrate, promote, and hotfix failure paths
return an error wrapping a sentinel (errTransientWorkflow) only in that case. A
real job-level failure conclusion, a missing-workflow failure, and every
assertion mismatch stay plain errors.
Wrap each scenario in a bounded retry (up to 3 attempts) that re-runs the whole
scenario from a fresh harness, retrying only on the sentinel. Real assertion
failures, expect_failure mismatches, and genuine job-level failures fail on the
first attempt with no retry, so a flake can never mask a regression. Every retry is
logged so flakes stay visible.
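
The two pieces above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only errTransientWorkflow and the ExecError flag are named in the description, while workflowResult, classify, isTransient, and retryScenario are hypothetical names, and the real harness would build a fresh gitea repo and act containers inside the scenario callback.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransientWorkflow is the sentinel named in the PR description.
var errTransientWorkflow = errors.New("transient workflow execution failure")

// workflowResult is a hypothetical stand-in for the harness's workflow result.
type workflowResult struct {
	ExecError  bool   // act itself exited non-zero (exec-layer hiccup)
	Conclusion string // job-level conclusion, e.g. "success" or "failure"
}

// classify wraps the sentinel only for the act/docker exec class.
// A real job-level failure stays a plain error. Letting ExecError take
// precedence here is an assumption of this sketch.
func classify(step string, res workflowResult) error {
	switch {
	case res.ExecError:
		return fmt.Errorf("%s: workflow execution failed: %w", step, errTransientWorkflow)
	case res.Conclusion != "success":
		return fmt.Errorf("%s: job concluded %q", step, res.Conclusion)
	}
	return nil
}

// isTransient reports whether err wraps the sentinel anywhere in its chain.
func isTransient(err error) bool { return errors.Is(err, errTransientWorkflow) }

// retryScenario re-runs the whole scenario up to maxAttempts times, but only
// when the failure is transient. It returns the number of attempts made.
func retryScenario(maxAttempts int, scenario func(attempt int) error, logf func(string, ...any)) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = scenario(attempt) // each call is assumed to start from a fresh harness
		if err == nil || !isTransient(err) {
			return attempt, err // success or a real failure: no retry
		}
		logf("attempt %d/%d hit a transient act/docker failure: %v", attempt, maxAttempts, err)
	}
	return maxAttempts, err
}

func main() {
	// Simulated scenario: the first attempt flakes, the second succeeds.
	flaky := func(attempt int) error {
		if attempt == 1 {
			return classify("promote baseline", workflowResult{ExecError: true})
		}
		return classify("promote baseline", workflowResult{Conclusion: "success"})
	}
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
	attempts, err := retryScenario(3, flaky, logf)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Because the loop checks errors.Is against the sentinel rather than matching on message text, an assertion mismatch or a job-level failure returns on the first attempt, while a persistent transient still surfaces as an error once the bound is exhausted.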

Verification

go build ./..., go vet ./..., and golangci-lint run ./... clean (root and the
e2e module).

New unit tests cover the transient classification and the retry decision loop
(passes first attempt, recovers after a transient, fails immediately on an
assertion error, does not retry a real job failure, exhausts the bound on a
persistent transient). Full e2e short suite green; new tests pass under -race.

Ran the act-heavy scenarios that have flaked (Hotfix_Stacked,
Four_Environment_Cascade_Promotion, and the promote rollback scenario) live under
Docker. One run reproduced the exact transient at the promote baseline step and the
retry recovered it on the second attempt; a second clean run passed all three with
no retry needed.

Tag a non-zero act exit as an execution-layer hiccup (ExecError) distinct from a real job-level failure conclusion, and return a sentinel-wrapped error from the orchestrate/promote/hotfix failure paths so callers can tell a retryable infrastructure flake from a deterministic outcome.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

Wrap each multi-step scenario in a bounded retry that re-runs the entire scenario from a fresh gitea repo and fresh act containers when, and only when, a step fails with a transient act/docker execution error. Real assertion mismatches, expect_failure mismatches, and genuine job-level failures fail on the first attempt with no retry, so a flake never masks a real regression.

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

joshua-temple added 3 commits June 11, 2026 20:35

fix: classify act job-level failures as non-transient, not exec errors

be1f884

Signed-off-by: Joshua Temple <joshua.temple@stablekernel.com>

joshua-temple merged commit a76b35c into main Jun 12, 2026
6 checks passed

joshua-temple merged 3 commits into main from fix/e2e-scenario-transient-retry
