Skip to content

feat(backoff): implement full-jitter in ExponentialBackoff jitter parameter#51

Merged
BlindMaster24 merged 1 commit intomainfrom
devin/1776940627-backoff-jitter
Apr 24, 2026
Merged

feat(backoff): implement full-jitter in ExponentialBackoff jitter parameter#51
BlindMaster24 merged 1 commit intomainfrom
devin/1776940627-backoff-jitter

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

Summary

ExponentialBackoff::new(initial, max, factor, _jitter) previously
accepted the jitter parameter but ignored it entirely: the
schedule was hard-coded to full jitter (random_range(0..=cap)) and
callers had no way to trade safety for determinism. The parameter
is now wired end-to-end as an AWS-style jitter knob j in [0, 1]:

cap_n = min(initial * factor^n, max)
delay = cap_n * (1 - j) + rand(0, cap_n * j)

with the extremes:

  • j = 1.0 — full jitter: delay = rand(0, cap_n) (preserves
    pre-change default behaviour; safe thundering-herd avoidance).
  • j = 0.0 — no jitter: delay = cap_n (deterministic; useful
    for tests and tight SLAs where ordering matters).
  • j = 0.5 — equal jitter: delay in [cap_n/2, cap_n].

Values outside [0, 1] (including NaN) are silently clamped via
f32::clamp. A jitter() accessor is added so callers can read
the clamped value.

Hardened cap computation (bug fixes uncovered while wiring jitter)

The previous code computed the cap via
Duration::from_secs_f32(base.as_secs_f32() * factor.powf(n)),
which had two real problems:

  1. Panics on overflow / NaN. With factor = 2.0 and n > ~30
    the float product overflows and Duration::from_secs_f32 panics
    with "cannot convert float seconds to Duration". Tests that ran
    next_delay() many times were susceptible.
  2. Precision drift on small millisecond bases. With
    base = 10 ms and factor = 2.0, the schedule at attempt 2 was
    39 ms instead of 40 ms because f32 round-trips through
    as_secs_f32() lose a millisecond.
  3. initial > max produced delay > max. The property test
    backoff_respects_max with initial_ms = 3, max_ms = 1
    exposed this: the cap was never clamped to max on attempt 0.

The rewrite iterates integer u128 millis with an f64 multiply,
clamps to max_millis on each step (including on the very first
attempt), and exits early on factor <= 1.0, non-finite factor,
and saturation. This removes the panic path, fixes the
10 ms ×2^n = 40 ms precision issue, and enforces
delay <= max for every input.

Tests

  • crates/teamtalk/tests/backoff_jitter_tests.rs adds 9 new
    integration tests covering:
    • jitter == 0 is deterministic and always returns the cap.
    • jitter == 1 spans most of the [0, cap] range over
      1 000 samples (loose bounds to stay non-flaky).
    • jitter == 0.5 stays in [cap/2, cap].
    • Out-of-range jitter (5.0, -0.5, NaN) is clamped.
    • Exact cap progression 10, 20, 40, ... up to max.
    • reset() clears current_delay and attempts.
    • Default is full jitter.
  • Existing utils_tests.rs and utils_property_tests.rs pass
    unchanged.

Local verification

  • cargo fmt --all --check clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings clean
  • cargo test --workspace --all-features — 296 passed / 0 failed
    (previously 287; +9 from backoff_jitter_tests.rs).

Review & Testing Checklist for Human

  • Spot-check the formula and the clamping semantics
    f32::clamp(NaN, 0, 1) -> 0.0 on the current toolchain
    (Rust documents that clamp returns min when the value is
    NaN; the jitter_nan_is_clamped_to_zero test covers this).
  • Confirm that full-jitter being the default preserves the
    pre-change behaviour seen by all existing callers (the old
    code hard-coded rand(0..=cap) regardless of the unused
    parameter; the new default picks jitter = 1.0 in
    Default::default()).
  • Confirm the integer-millis cap loop is acceptable vs the
    previous float-only cap: both are exact on typical
    initial ∈ [1 ms, 10 s], factor ∈ [1.1, 2.0],
    max ∈ [10 ms, 10 min] schedules, and the new code removes
    the panic path on overflow.

Notes

First P1 API improvement after the P0 structural refactor queue
(#44#50). Branches from clean main, independent of the other
PRs in flight, so can be merged in any order.

Next up: StreamTypes newtype (PR #53), SdkErrorCode mapping
for Error::CommandFailed (PR #54), TimeoutKind enum for
Error::Timeout (PR #55), SecretString/zeroize for passwords
(PR #56), indexed EventBus::dispatch / Router::dispatch via
HashMap (PR #57), then a big coverage-gap test-fill PR driven by
scripts/audit_teamtalk_coverage.py.

The pre-existing semver CI gate is expected to remain red;
release-plz handles the eventual version bump.

Link to Devin session: https://app.devin.ai/sessions/71fdd6196cb74723a2e277bb81993a9c
Requested by: @BlindMaster24

…ameter

Previously `ExponentialBackoff::new(initial, max, factor, _jitter)`
accepted the `jitter` parameter but ignored it entirely: the
schedule was hard-coded to full jitter
(`random_range(0..=cap)`) and callers had no way to trade safety
for determinism.

This wires the parameter end-to-end as an AWS-style jitter knob
`j in [0, 1]`:

  cap_n = min(initial * factor^n, max)
  delay = cap_n * (1 - j) + rand(0, cap_n * j)

with the extremes:

* `j = 1.0` — full jitter: `delay = rand(0, cap_n)` (preserves
  pre-change default behaviour; safe thundering-herd avoidance).
* `j = 0.0` — no jitter: `delay = cap_n` (deterministic; useful
  for tests and for tight SLAs where ordering matters).
* `j = 0.5` — equal jitter: `delay in [cap_n/2, cap_n]`.

Values outside `[0, 1]` (including NaN) are silently clamped via
`f32::clamp`. A `jitter()` accessor is added so callers can read
the clamped value.

Also hardens the cap computation:

* The previous code computed `cap` via
  `Duration::from_secs_f32(base.as_secs_f32() * factor.powf(n))`,
  which panics on overflow / NaN and drifts on small millisecond
  bases (e.g. 10ms * 2^2 was observed as 39ms instead of 40ms).
* New code iterates integer `u128` millis with a f64 multiply and
  clamps to `max_millis` on each step. It exits early on
  `factor <= 1.0`, non-finite factor, and saturation. This also
  makes `initial > max` produce `delay <= max` (a real property-
  test failure before this change).

Tests:

* `crates/teamtalk/tests/backoff_jitter_tests.rs` adds 9 new
  integration tests covering jitter == 0 / 0.5 / 1.0 / out-of-range /
  NaN, exact cap progression, reset, and the default being full
  jitter.
* Existing `utils_tests.rs` and `utils_property_tests.rs` pass
  unchanged.

Local verification:

* cargo fmt --all --check clean
* cargo clippy --workspace --all-targets --all-features -- -D warnings clean
* cargo test --workspace --all-features -> 296 passed / 0 failed
  (previously 287; +9 from the new backoff_jitter_tests.rs)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant