Implement instance pipeline #3636

Open
r4victor wants to merge 49 commits into master from issue_3551_instance_pipeline

r4victor (Collaborator) commented Mar 5, 2026

Part of #3551

This PR:

  • Refactors and fixes scheduled_tasks/instances.py for easier migration to pipelines.
  • Implements InstancePipeline.
  • Updates FleetPipeline to respect instance locks.
  • Updates API and provisioning logic to respect instance locks.
  • Introduces FleetModel.current_master_instance_id column for cloud clusters. Implements current master instance selection logic in FleetPipeline. Updates instance provisioning logic to use FleetModel.current_master_instance_id. (This fixes a race condition with updating siblings in case of master failure and allows re-selecting master.)
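
To illustrate the master re-selection idea from the last bullet, here is a minimal sketch. It is not the code from this PR: the `Instance` dataclass, the `alive` flag, and the "lowest `instance_num` wins" rule are all assumptions made for illustration. The only thing taken from the description is that the fleet stores a current master instance ID and that a new master can be selected when the current one fails.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    # Hypothetical stand-in for an instance row; not the real model.
    id: str
    instance_num: int
    alive: bool


def select_master(
    current_master_id: Optional[str], instances: list[Instance]
) -> Optional[str]:
    """Keep the current master while it is alive, otherwise pick a new one."""
    alive = [i for i in instances if i.alive]
    # Keep the existing master if it is still usable, so siblings stay stable.
    if any(i.id == current_master_id for i in alive):
        return current_master_id
    # No usable instances at all: the fleet has no master.
    if not alive:
        return None
    # Assumed tie-break rule: the alive instance with the lowest number wins.
    return min(alive, key=lambda i: i.instance_num).id
```

Because the selected ID would be stored in one place (the fleet row in this sketch's assumption), sibling instances can read a single value instead of each deciding independently who the master is.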

Benchmarks (SQLite, one server replica):

  • Provisioning 50 AWS instances (from submission to idle) goes from ~3-4 min to ~2 min, which is now about the same as provisioning one instance.
  • Provisioning 200 AWS instances (from submission to idle) goes from ~11 min to ~4 min, which is now mostly limited by AWS RunInstances rate limits (5 rps).
  • Other instance processing was previously capped at 75 instances/min. It's now workers_num * processing_rate, typically 600-1200 instances/min for instance checks, depending on the network latency of the checks. With many resources, all processing including termination becomes ~10x faster.
  • Pipeline throughput can further be improved by tweaking pipeline parameters, e.g. workers_num.
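
The arithmetic behind these numbers can be sketched as follows. The function names are hypothetical, and the worker count and per-worker rates are illustrative values chosen only to reproduce the quoted 600-1200 instances/min range; they are not the PR's actual defaults. The 200 instances and 5 rps figures come from the benchmark above.

```python
def pipeline_throughput(workers_num: int, processing_rate: float) -> float:
    """Instances processed per minute: workers_num * processing_rate.

    processing_rate is the per-worker rate in instances/min.
    """
    return workers_num * processing_rate


def min_launch_seconds(instances: int, rate_limit_rps: float) -> float:
    """Lower bound on the time spent issuing launch calls under a rate limit."""
    return instances / rate_limit_rps


# Assumed: 10 workers, each handling 60-120 checks/min depending on latency.
low = pipeline_throughput(workers_num=10, processing_rate=60)
high = pipeline_throughput(workers_num=10, processing_rate=120)

# 200 launch calls at 5 requests per second need at least 40 seconds.
launch_floor = min_launch_seconds(instances=200, rate_limit_rps=5)
```

The launch floor is only a lower bound on the API calls themselves; the ~4 min total also includes the time instances take to boot and become idle.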

Notes:

  • No lock_timeout errors occur on SQLite even under very high write load, because pipelines use only quick sessions.
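
One way to read "quick sessions" is that slow work (such as a network check) happens outside any database transaction, so each transaction holds the SQLite write lock only briefly. The sketch below shows that pattern with the standard `sqlite3` module. It is an assumption about the general technique, not the PR's implementation; the `instances` table, `check_instance`, and `process_instance` are hypothetical.

```python
import sqlite3
import time
from typing import Optional


def check_instance(instance_id: int) -> str:
    # Stand-in for a slow network call; runs with no transaction open.
    time.sleep(0.01)
    return "idle"


def process_instance(conn: sqlite3.Connection, instance_id: int) -> Optional[str]:
    # Quick read: fetch the row and return immediately.
    row = conn.execute(
        "SELECT status FROM instances WHERE id = ?", (instance_id,)
    ).fetchone()
    if row is None:
        return None
    # Slow work happens here, outside any write transaction.
    new_status = check_instance(instance_id)
    # Quick write: `with conn` commits on success and rolls back on error.
    with conn:
        conn.execute(
            "UPDATE instances SET status = ? WHERE id = ?",
            (new_status, instance_id),
        )
    return new_status


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO instances VALUES (1, 'provisioning')")
conn.commit()
```

In this pattern the write lock is held only for the duration of the `UPDATE`, which is why many workers can share a single SQLite database without long waits on each other.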
