fix(dist): propagate v2 manifest checksum failure #4830

Open

cachebag wants to merge 2 commits into rust-lang:main from cachebag:main

Conversation

@cachebag
Contributor

Our nightly CI in a project I am working on started failing intermittently after we bumped the MSRV. A bunch of agents were still on the old toolchain even though `rustup update stable` was reporting `unchanged` and exiting 0.

On one agent, `rustc --version` was a full release behind what our Artifactory mirror was serving. See below:

```
info: syncing channel updates for stable-x86_64-unknown-linux-gnu
warn: checksum failed for 'https://blahblahblah/rust-dist/dist/channel-rust-stable.toml',
      expected: '7f3c91ad024a1cbe1e8915f35aff0a3fbbdf8d293ad48ab8f31e3b0440c581f9',
      calculated: '821ff14e4c4a1cbe1e8915f35aff0a3fbbdf8d293ad48ab8f31e3b0440c581f9'
info: this might indicate an issue with the third-party release server 'https://blahblahblah/rust-dist/dist'

  stable-x86_64-unknown-linux-gnu unchanged - rustc 1.94.0 (4a4ef493e 2026-03-02)
```

This PR drops the arm in `try_update_from_dist_` so the error propagates.

…g "unchanged"


When the v2 channel manifest's `.toml` did not match its published `.sha256`, `try_update_from_dist_` mapped `RustupError::ChecksumFailed` to `Ok(None)`. Callers interpret `Ok(None)` as "no update available", so `rustup update` printed only a `warn:` line, reported the toolchain as `unchanged`, and exited 0, even though the manifest could not be integrity-checked.
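
To make the failure mode concrete, here is a minimal self-contained sketch of the behavior described above. The names (`RustupError::ChecksumFailed`, `Manifest`, the `Result<Option<Manifest>, _>` shape) mirror the PR text, but this is an illustrative model, not rustup's actual code:

```rust
// Stand-ins for rustup's types; illustrative only.
#[derive(Debug)]
enum RustupError {
    ChecksumFailed {
        url: String,
        expected: String,
        calculated: String,
    },
}

struct Manifest; // parsed v2 channel manifest (stand-in)

// Before the fix: the checksum mismatch was swallowed and mapped to
// Ok(None), which callers read as "no update available", so
// `rustup update` printed "unchanged" and exited 0.
fn try_update_swallowing(
    res: Result<Manifest, RustupError>,
) -> Result<Option<Manifest>, RustupError> {
    match res {
        Ok(m) => Ok(Some(m)),
        Err(RustupError::ChecksumFailed { url, .. }) => {
            eprintln!("warn: checksum failed for '{url}'");
            Ok(None) // the integrity failure disappears here
        }
    }
}

// After the fix: the arm is dropped, so the error propagates to the
// caller and the update fails loudly instead of reporting "unchanged".
fn try_update_propagating(
    res: Result<Manifest, RustupError>,
) -> Result<Option<Manifest>, RustupError> {
    res.map(Some)
}

fn main() {
    let err = || RustupError::ChecksumFailed {
        url: "https://example.invalid/channel-rust-stable.toml".into(),
        expected: "7f3c...".into(),
        calculated: "821f...".into(),
    };
    // Old behavior: indistinguishable from "no update available".
    assert!(matches!(try_update_swallowing(Err(err())), Ok(None)));
    // New behavior: the failure surfaces to the caller.
    assert!(try_update_propagating(Err(err())).is_err());
}
```
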
Contributor

@djc left a comment


Oh, good catch. Any chance you can write up a test for this so that we don't regress it?


@cachebag
Contributor Author

Reposting to fix grammar/spelling errors:

Worth noting that this also tightens behaviour against static.rust-lang.org itself, where the v2 error-swallowing was historically there to absorb brief mismatches during releases (cf. #3390).

If you prefer, I think it would make sense to gate this on `dl_cfg.tmp_cx.dist_server != DEFAULT_DIST_SERVER`.
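
For reference, a minimal sketch of what such a gate could look like, reusing the `RustupError`/`Manifest` stand-ins from the sketch above. `dl_cfg.tmp_cx.dist_server` and `DEFAULT_DIST_SERVER` are the names used in this comment; rustup's actual plumbing may differ:

```rust
// Hypothetical gate, not rustup's actual code: only propagate the
// checksum failure when rustup points at a non-official dist server.
const DEFAULT_DIST_SERVER: &str = "https://static.rust-lang.org";

fn handle_checksum_failure(
    dist_server: &str,
    err: RustupError, // stand-in type from the sketch above
) -> Result<Option<Manifest>, RustupError> {
    if dist_server != DEFAULT_DIST_SERVER {
        // Third-party mirror: stale or corrupt content is a hard failure.
        Err(err)
    } else {
        // Official server: tolerate brief mismatches during a release.
        eprintln!("warn: {err:?}; a release may be in progress, retry later");
        Ok(None)
    }
}
```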

@cachebag requested a review from djc on April 24, 2026 at 18:36
@rami3l
Member

rami3l commented Apr 24, 2026

@cachebag Thanks for your investigation!

As you have already investigated, I think the original behavior is somewhat intentional: if the manifest update fails with a checksum mismatch, it usually indicates an inconsistent state within the release server itself, and in most cases it means a new Rust release is still in progress on the official release server.

> I think it would make sense to gate this on `dl_cfg.tmp_cx.dist_server != DEFAULT_DIST_SERVER`.

... and no, I don't want to treat the official release server that differently in this sense; if the release server causes rustup to fail, the type of release server shouldn't matter.

Swallowing the error in that routine completely is absolutely not ideal, but this does look like the right thing to do, so maybe we can try it first and see how it goes... In the worst case we can downgrade it to a warning 🤔

@ChrisDenton
Member

Could we talk to infra about whether such inconsistent states still happen and whether they could be mitigated on their side somehow?

@rami3l
Member

rami3l commented Apr 24, 2026

> Could we talk to infra about whether such inconsistent states still happen and whether they could be mitigated on their side somehow?

@ChrisDenton I think we can try (cc @marcoieni), but since the error has been swallowed for such a long time, I doubt there has been any real feedback to collect.

@djc
Contributor

djc commented Apr 24, 2026

Could/should we retry it a few times and really fail if it still fails?

@rami3l
Member

rami3l commented Apr 24, 2026

@djc I think we could add retries, but if this is mostly not a client-side problem, they should not help very much...

Thinking about it, if we get the same "wrong" checksum each time, we can be more confident that this is a server-side problem, so retrying does have some benefit after all.

@cachebag
Contributor Author

cachebag commented Apr 24, 2026

@rami3l
We already have retry logic in `download_file_with_resume` for network failures; I could wrap the `download_and_check` call in a small retry loop?

Also, I just want to mention that retries wouldn't have helped my case. The mirror was consistently serving stale content, so all retries would fail identically.
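
For illustration, a sketch of the kind of retry loop being discussed, using the `Manifest`/`RustupError` stand-ins from the earlier sketch. `download_and_check` is passed in as a hypothetical closure here, not rustup's real function signature:

```rust
// Hypothetical retry wrapper; not rustup's code. Note the caveat above:
// against a consistently stale mirror, every attempt returns the same
// wrong checksum, so retries only help with transient server states.
fn download_with_retries(
    attempts: u32,
    mut download_and_check: impl FnMut() -> Result<Manifest, RustupError>,
) -> Result<Manifest, RustupError> {
    let mut last_err = None;
    for attempt in 1..=attempts {
        match download_and_check() {
            Ok(manifest) => return Ok(manifest),
            Err(e) => {
                eprintln!("warn: attempt {attempt}/{attempts} failed: {e:?}");
                last_err = Some(e);
            }
        }
    }
    Err(last_err.expect("attempts must be at least 1"))
}
```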

@rami3l
Member

rami3l commented Apr 24, 2026

@cachebag I suggest that we wait for a bit and see how the infra people reply.

IMHO, as you said, this is most likely a server-side problem, so it is hard to say whether issuing multiple retries would actually turn a failure into a success; my point was just that the user could be more sure it is a server-side problem if they happen to get the same shasum every time.
