fix(lm): retry HTTP 408 from the HF CDN in the hub retry backend #561
Open
danbraunai-goodfire wants to merge 1 commit into
Job 703812 died at dataloader setup when a ranged parquet read got a 408 (Request Time-out) from the HF CDN. The #557 retry backend was active but its `status_forcelist` only covered 429/5xx, so urllib3 returned the 408 unretried and `hf_raise_for_status` raised, tearing down all 16 ranks. 408 is the same transient-timeout class that backend exists for, and only idempotent methods are retried, so adding it is safe.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Description
Adds `408` to the `status_forcelist` of the retrying HTTP backend installed by `configure_hf_http_retries` (and updates the module docstring to match).
Related Issue
Follow-up to #557.
Motivation and Context
Job 703812 (16-rank multi-node pd-lm) died ~2 minutes in, at dataloader setup: a ranged parquet read of a Pile shard got a 408 Request Time-out from the HF CDN, one rank raised `HfHubHTTPError`, and torchrun tore down the whole job (the TCPStore broken-pipe spam in the log is just the other ranks reacting).
The #557 retry backend was active (`Configured huggingface_hub HTTP retries (total=5)` in the log) but its `status_forcelist` only covered 429/5xx, so urllib3 handed the 408 back unretried and `hf_raise_for_status` raised. `datasets`' own `read_with_retries` doesn't retry `HfHubHTTPError` either, so nothing caught it.
408 is exactly the transient-timeout class that backend exists for, and the retry config already restricts itself to idempotent methods (GET/HEAD/OPTIONS), so retrying a ranged read is safe.
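For reviewers who want to see the decision rule in isolation, here is a minimal sketch using urllib3's `Retry` directly. It is not the code from this PR or from #557: the exact 5xx codes, the `backoff_factor`, and the `make_retry` helper are assumptions made for illustration; only `total=5` and the GET/HEAD/OPTIONS restriction come from the description above. It assumes urllib3 >= 1.26, where the parameter is named `allowed_methods`.

```python
from urllib3.util.retry import Retry

# Assumed status lists: the description only says "429/5xx", so the
# specific 5xx codes here are illustrative, not taken from the PR.
OLD_FORCELIST = (429, 500, 502, 503, 504)
NEW_FORCELIST = (408, *OLD_FORCELIST)

# The description states that only idempotent methods are retried.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def make_retry(status_forcelist):
    """Hypothetical helper: build a Retry policy shaped like the one described."""
    return Retry(
        total=5,  # matches "total=5" in the log line quoted above
        status_forcelist=status_forcelist,
        allowed_methods=IDEMPOTENT_METHODS,
        backoff_factor=1.0,  # assumed value, not stated in the PR
    )


old_retry = make_retry(OLD_FORCELIST)
new_retry = make_retry(NEW_FORCELIST)

# Retry.is_retry(method, status_code) reports whether a response with this
# status would be retried for this method under the policy.
print("old policy, GET 408:", bool(old_retry.is_retry("GET", 408)))
print("new policy, GET 408:", bool(new_retry.is_retry("GET", 408)))
print("new policy, POST 408:", bool(new_retry.is_retry("POST", 408)))
```

Under these assumptions, the old policy declines to retry a GET that returns 408, the new policy retries it, and a POST that returns 408 is still not retried because the method is outside the allowed set.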
How Has This Been Tested?
`basedpyright` + `ruff` pass. Behavior change is a one-element addition to urllib3's `Retry` forcelist; the retry machinery itself is unchanged from #557 and has been exercised on cluster since.
Does this PR introduce a breaking change?
No.
🤖 Generated with Claude Code