Skip to content

feat(bootstrapper): run phase items in parallel when possible#1150

Closed
dhellmann wants to merge 2 commits into
python-wheel-build:mainfrom
dhellmann:bootstrap-parallel-resolve
Closed

feat(bootstrapper): run phase items in parallel when possible#1150
dhellmann wants to merge 2 commits into
python-wheel-build:mainfrom
dhellmann:bootstrap-parallel-resolve

Conversation

@dhellmann
Copy link
Copy Markdown
Member

@dhellmann dhellmann commented May 13, 2026

Summary

Restructure the bootstrap-parallel command's main loop to run independent work items concurrently using a ThreadPoolExecutor.

How it works

The bootstrap loop maintains a LIFO stack of WorkItem objects. Each iteration collects a batch from the top of the stack and submits it to the executor. BootstrapPhase.can_parallelize controls batch size: parallel phases collect all consecutive same-phase items (running them concurrently); serial phases pop only the single top item (allowing other phases to interleave while still processing one at a time). All items — serial or parallel — go through the same executor.

Phase breakdown after this change

Phase Parallelism Reason
RESOLVE parallel resolver is thread-safe; independent per-package
START parallel graph writes are locked
PREPARE_SOURCE parallel downloads/unpacking are independent
PREPARE_BUILD serial installs build system deps; must complete before backend queries
GET_BUILD_DEPS parallel backend queries are independent (each package has its own build env)
INSTALL_BUILD_DEPS serial installs build backend/sdist deps; dep's wheel must exist in mirror first
BUILD parallel compilation is independent per package
UPDATE_BUILD_SEQUENCE serial serialises writes to build-order.json
PROCESS_INSTALL_DEPS parallel hook + dep extraction are independent
COMPLETE parallel cleanup is independent

Thread-safety changes

  • why dependency-chain stack converted to a contextvars.ContextVar; ThreadPoolExecutor copies the caller's context automatically for each task
  • _state_lock guards all check-then-set write sites (_seen_requirements, _build_requirements, _build_stack, failed_packages, _failed_versions, dependency graph, progress bar)
  • _wheel_dir_lock guards wheel directory operations to prevent TOCTOU races between _find_cached_wheel and update_wheel_mirror
  • Progress bar updates via add_done_callback so items tick off as they complete rather than at batch boundaries

Testing

New e2e test test_bootstrap_parallel_multiple_versions bootstraps flit-core 3.0.0--3.12.0 with 5 workers, verifying that all buildable versions are produced and that each parallel phase logged at least one parallel batch.

Closes: #1149

@dhellmann dhellmann requested a review from a team as a code owner May 13, 2026 00:20
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 553f3aa6-84c5-43fe-a7c2-8f48e5580d63

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements parallel execution for the iterative bootstrap loop. The Bootstrapper now batches same-phase WorkItems and processes them concurrently via ThreadPoolExecutor. A new BootstrapPhase.UPDATE_BUILD_SEQUENCE separates build-order recording from install-dependency processing. Thread-safety is enforced via _state_lock (dependency graph, build-order, progress) and _wheel_dir_lock (wheel discovery/mirror updates). The requirement resolver's session cache is protected with threading.Lock. Command-level integration passes --max-workers through to the bootstrap subcommand, with bootstrap enforcing serial execution. Comprehensive unit tests validate phase parallelization, ordering constraints, lock acquisition, and test-mode behavior. A new e2e test confirms multi-version bootstrap with parallel batching.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: enabling parallel execution of bootstrap phase items where safe.
Linked Issues check ✅ Passed All code changes implement the objective from #1149: determining which bootstrap phases can run concurrently and implementing parallel execution via ThreadPoolExecutor with appropriate synchronization.
Out of Scope Changes check ✅ Passed All changes are scoped to #1149: thread-safety additions, batching/parallel execution logic, new UPDATE_BUILD_SEQUENCE phase, test coverage, and CLI wiring for max_workers.
Description check ✅ Passed The PR description clearly explains the parallel executor restructuring, phase breakdown table, thread-safety changes, and testing approach.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dhellmann dhellmann marked this pull request as draft May 13, 2026 00:21
@mergify mergify Bot added the ci label May 13, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/fromager/commands/bootstrap.py (1)

193-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add max_workers parameter to bootstrap() function signature and pass it through to Bootstrapper.

bootstrap_parallel() forwards max_workers via ctx.invoke(bootstrap, ..., max_workers=max_workers, ...), but the bootstrap() function does not declare this parameter, so the forwarded value is rejected or ignored. Additionally, the Bootstrapper is hardcoded with max_workers=1 at line 200, preventing any configuration of parallel bootstrap execution. Either add max_workers: int | None to the bootstrap() signature and pass it through to the Bootstrapper constructor instead of hardcoding 1, or remove the forwarded kwarg from bootstrap_parallel().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/fromager/commands/bootstrap.py` around lines 193 - 202, The bootstrap
function is missing a max_workers parameter and always constructs Bootstrapper
with max_workers=1; update the bootstrap(...) function signature to accept
max_workers: int | None (default as needed) and pass that value into the
Bootstrapper constructor instead of the hardcoded 1 so ctx.invoke(bootstrap,
..., max_workers=max_workers, ...) from bootstrap_parallel() is honored; ensure
the parameter name matches exactly (max_workers) and propagate it through any
intermediate calls or default handling in bootstrap to the Bootstrapper
instantiation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e/test_bootstrap_parallel_multiple_versions.sh`:
- Around line 70-110: The current test inspects "starting batch: phase=${phase}
items=N" which is logged before work begins and therefore doesn't prove
concurrency; change the check in the phase loop to detect actual overlapping
execution by parsing per-item start/finish logs with timestamps (or
worker/thread IDs) in bootstrap.log and asserting that for at least one phase an
item start timestamp occurs before another item's finish timestamp (or two
different worker IDs run concurrently) for that phase; update the grep/sed logic
that currently looks for "starting batch: phase=${phase} items=" and instead
search for per-item "start"/"finish" log lines (or worker-id tokens) and
implement an overlap detection routine to set PHASES_WITHOUT_PARALLEL_BATCH when
no overlap is found for a phase (keep using PARALLEL_PHASES and
PHASES_WITHOUT_PARALLEL_BATCH identifiers).

In `@src/fromager/bootstrapper.py`:
- Around line 716-720: The lock _wheel_dir_lock currently protects
update_wheel_mirror calls from _download_prebuilt() but not the identical
server.update_wheel_mirror(self.ctx) invocation inside _build_wheel(), leaving a
TOCTOU race with _find_cached_wheel; fix by acquiring the same _wheel_dir_lock
around the server.update_wheel_mirror(self.ctx) call in _build_wheel() (and any
other places that move files out of wheels_build) so that _download_prebuilt(),
_build_wheel(), and _find_cached_wheel() serialize on the same lock and
eliminate the race window.
- Around line 457-465: The ThreadPoolExecutor block with futures and run_one
allows started workers to perform irreversible writes even if one future fails;
modify run_one to avoid side effects (return intended mutations/data instead of
writing wheels/state), collect all_new_items from futures, then perform a single
commit/apply step that writes to disk only if every future succeeded;
alternatively, if run_one cannot be made pure, detect the phase and fall back to
sequential execution (no ThreadPoolExecutor) so writes are atomic per-item or
wrap mutations behind an explicit commit function (e.g., apply_batch_changes)
invoked only after all futures complete successfully.

In `@tests/test_bootstrapper.py`:
- Around line 614-662: The tests mock _dispatch_phase away so _phase_resolve
never runs (first test) and the second only exercises a single RESOLVE (no
siblings), so neither actually exercises parallel RESOLVE batching; update the
tests to let _phase_resolve run by removing or changing the _dispatch_phase
patch and instead arrange for a RESOLVE phase with multiple items (e.g., patch
_dispatch_phase to return a list containing two RESOLVE work items or call
bootstrap with two top-level Requirement objects) so bt._resolver.resolve is
invoked concurrently twice; for the test_mode failure case, likewise ensure two
sibling RESOLVE items are present so the first resolve raises (via the existing
mock_resolve) and the second succeeds, validating bt.failed_packages records the
failure while the sibling completes.

---

Outside diff comments:
In `@src/fromager/commands/bootstrap.py`:
- Around line 193-202: The bootstrap function is missing a max_workers parameter
and always constructs Bootstrapper with max_workers=1; update the bootstrap(...)
function signature to accept max_workers: int | None (default as needed) and
pass that value into the Bootstrapper constructor instead of the hardcoded 1 so
ctx.invoke(bootstrap, ..., max_workers=max_workers, ...) from
bootstrap_parallel() is honored; ensure the parameter name matches exactly
(max_workers) and propagate it through any intermediate calls or default
handling in bootstrap to the Bootstrapper instantiation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2d93f827-752d-448c-a6c8-055c4f8afa79

📥 Commits

Reviewing files that changed from the base of the PR and between 2132ef3 and 502ee11.

📒 Files selected for processing (7)
  • e2e/ci_bootstrap_parallel_suite.sh
  • e2e/test_bootstrap_parallel_multiple_versions.sh
  • src/fromager/bootstrap_requirement_resolver.py
  • src/fromager/bootstrapper.py
  • src/fromager/commands/bootstrap.py
  • tests/test_bootstrapper.py
  • tests/test_bootstrapper_iterative.py

Comment thread e2e/test_bootstrap_parallel_multiple_versions.sh Outdated
Comment thread src/fromager/bootstrapper.py Outdated
Comment thread src/fromager/bootstrapper.py
Comment thread tests/test_bootstrapper.py
Comment thread src/fromager/bootstrap_requirement_resolver.py Outdated
Comment thread src/fromager/bootstrapper.py Outdated
Comment thread src/fromager/bootstrapper.py
Comment thread src/fromager/bootstrapper.py Outdated
Comment thread src/fromager/bootstrapper.py Outdated
dhellmann added a commit to dhellmann/fromager that referenced this pull request May 13, 2026
Create the executor once per `bootstrap()` call rather than creating
and destroying it on every batch iteration. Serial (non-parallelizable)
phases now run `run_one()` directly in the main thread instead of going
through a single-worker pool, eliminating unnecessary thread-pool
overhead for those phases. Wrap the executor block in `try/finally`
to guarantee the `why` stack is restored even if an exception escapes.

Addresses tiran's review comment on PR python-wheel-build#1150.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
dhellmann added a commit to dhellmann/fromager that referenced this pull request May 13, 2026
Register an `add_done_callback` on each submitted future so the
progress bar updates as individual items complete rather than waiting
for the entire batch. Serial phases update inline after each item.
The callback holds `_state_lock` around progress bar calls for
thread safety.

Addresses tiran's review comment on PR python-wheel-build#1150.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Make all shared state in `Bootstrapper` and `BootstrapRequirementResolver`
thread-safe, then restructure the `bootstrap()` main loop to process work
in uniform batches: collect consecutive same-phase items → run via
`ThreadPoolExecutor` → restore results to stack in original order.

`BootstrapPhase.can_parallelize` controls whether a batch collects
multiple items or just one. All work goes through the executor;
batch size determines effective parallelism.

Phases that install packages are marked as needing to be run in serial
to ensure that they are processed one at a time after all dependencies
are actually built. This allows most work to happen in parallel, at
least somewhat, based on the content of the work item stack.

Pass `ctx.settings.max_jobs` as `max_workers` to `ThreadPoolExecutor`
so the number of parallel threads respects the `--jobs` flag.  `None`
(no flag given) uses the executor's default; a positive int caps
the thread count.

Thread-safety changes:
- Use contextvars.ContextVar for the `why`
  dependency chain stack. ContextVar is safe across both threads and
  asyncio contexts. ThreadPoolExecutor copies the current context for
  each submitted task (Python 3.7+), so each task naturally inherits
  the caller's context without manual snapshot/restore overhead.
- `self._state_lock` added to `Bootstrapper`; held at all check-then-set
  write sites: `_phase_start` (seen dedup), `_add_to_build_order`,
  `_add_to_graph`, `_handle_phase_error` (multiple-versions block),
  `_record_test_mode_failure`, and progressbar calls in
  `_handle_build_requirements`
- Add `_wheel_dir_lock` to `Bootstrapper.__init__` and hold it around
  the operations that can race.
- Extract `UPDATE_BUILD_SEQUENCE` as a separate phase so it can be run
  serially to ensure items are added in the right order.
- The `bootstrap` command always sets the worker pool size to 1, so even
  though tasks run in a thread they are serialized.
- Register an `add_done_callback` on each submitted future so the
  progress bar updates as individual items complete rather than waiting
  for the entire batch. Serial phases update inline after each item.
  The callback holds `_state_lock` around progress bar calls for
  thread safety.
- Add `GET_BUILD_DEPS` as a new parallel phase after the existing `PREPARE_BUILD`
  phase. `PREPARE_BUILD` now only installs build system deps (must be serial to
  avoid races between packages sharing build deps). `GET_BUILD_DEPS` queries
  build backend and sdist deps, which are independent per-package and safe to
  run concurrently.
- Split `INSTALL_BUILD_DEPS` phase out of the `BUILD` phase to
  install build backend/sdist deps into the build environment;
  guaranteed safe because only one package runs this at a time, so
  every dep's wheel is in the mirror before installation.

Closes: python-wheel-build#1149

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
@dhellmann dhellmann force-pushed the bootstrap-parallel-resolve branch from 470f7dd to 45a2420 Compare May 14, 2026 16:19
@dhellmann
Copy link
Copy Markdown
Member Author

After experimenting further, this approach isn't going to work. It's too hard to combine a stack with a topology-aware prioritization system. I keep running into issues because we're marking something seen when we start it, even though it's not built.

I'm going to salvage an idea or two and experiment with other approaches.

@dhellmann dhellmann closed this May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

investigate running some phases of bootstrap in parallel

2 participants