fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend by danbraunai-goodfire · Pull Request #561 · goodfire-ai/param-decomp

danbraunai-goodfire · 2026-06-12T12:46:39Z

Description

Adds 408 to the status_forcelist of the retrying HTTP backend installed by configure_hf_http_retries (and updates the module docstring to match).

Related Issue

Follow-up to #557.

Motivation and Context

Job 703812 (16-rank multi-node pd-lm) died ~2 minutes in, at dataloader setup: a ranged parquet read of a Pile shard got a 408 Request Time-out from the HF CDN, one rank raised HfHubHTTPError, and torchrun tore down the whole job (the TCPStore broken-pipe spam in the log is just the other ranks reacting).

The #557 retry backend was active (Configured huggingface_hub HTTP retries (total=5) in the log) but its status_forcelist only covered 429/5xx, so urllib3 handed the 408 back unretried and hf_raise_for_status raised. datasets' own read_with_retries doesn't retry HfHubHTTPError either, so nothing caught it.

408 is exactly the transient-timeout class that backend exists for, and the retry config already restricts itself to idempotent methods (GET/HEAD/OPTIONS), so retrying a ranged read is safe.

How Has This Been Tested?

basedpyright + ruff pass. Behavior change is a one-element addition to urllib3's Retry forcelist; the retry machinery itself is unchanged from #557 and has been exercised on cluster since.

Does this PR introduce a breaking change?

No.

🤖 Generated with Claude Code

Job 703812 died at dataloader setup when a ranged parquet read got a 408 (Request Time-out) from the HF CDN. The #557 retry backend was active but its status_forcelist only covered 429/5xx, so urllib3 returned the 408 unretried and hf_raise_for_status raised, tearing down all 16 ranks. 408 is the same transient-timeout class that backend exists for, and only idempotent methods are retried, so adding it is safe. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend#561

fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend#561
danbraunai-goodfire wants to merge 1 commit into
mainfrom
fix/hf-retry-408

danbraunai-goodfire commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danbraunai-goodfire commented Jun 12, 2026

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Does this PR introduce a breaking change?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant