Skip to content

fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend#561

Open
danbraunai-goodfire wants to merge 1 commit into
mainfrom
fix/hf-retry-408
Open

fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend#561
danbraunai-goodfire wants to merge 1 commit into
mainfrom
fix/hf-retry-408

Conversation

@danbraunai-goodfire

Copy link
Copy Markdown
Collaborator

Description

Adds 408 to the status_forcelist of the retrying HTTP backend installed by configure_hf_http_retries (and updates the module docstring to match).

Related Issue

Follow-up to #557.

Motivation and Context

Job 703812 (16-rank multi-node pd-lm) died ~2 minutes in, at dataloader setup: a ranged parquet read of a Pile shard got a 408 Request Time-out from the HF CDN, one rank raised HfHubHTTPError, and torchrun tore down the whole job (the TCPStore broken-pipe spam in the log is just the other ranks reacting).

The #557 retry backend was active (Configured huggingface_hub HTTP retries (total=5) in the log) but its status_forcelist only covered 429/5xx, so urllib3 handed the 408 back unretried and hf_raise_for_status raised. datasets' own read_with_retries doesn't retry HfHubHTTPError either, so nothing caught it.

408 is exactly the transient-timeout class that backend exists for, and the retry config already restricts itself to idempotent methods (GET/HEAD/OPTIONS), so retrying a ranged read is safe.

How Has This Been Tested?

basedpyright + ruff pass. Behavior change is a one-element addition to urllib3's Retry forcelist; the retry machinery itself is unchanged from #557 and has been exercised on cluster since.

Does this PR introduce a breaking change?

No.

🤖 Generated with Claude Code

Job 703812 died at dataloader setup when a ranged parquet read got a 408
(Request Time-out) from the HF CDN. The #557 retry backend was active but
its status_forcelist only covered 429/5xx, so urllib3 returned the 408
unretried and hf_raise_for_status raised, tearing down all 16 ranks.
408 is the same transient-timeout class that backend exists for, and only
idempotent methods are retried, so adding it is safe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant