feat(cluster): auto cluster creation on server threshold by phot0n · Pull Request #6537 · frappe/press

phot0n · 2026-05-28T04:59:32Z

No description provided.

- Queue a background job when a cluster's running VM count reaches max_servers (checked on every VM transition to Running state) - Use auto_cluster_triggered flag (set atomically via db_set + commit) to prevent duplicate job queuing when multiple VMs go Running simultaneously - create_auto_cluster provisions a new peer cluster with: * 1 Proxy Server * N public App + DB Server pairs (N = auto_cluster_app_server_count, default 2, configured on the root cluster) - Naming: root mumbai -> mumbai-1 -> mumbai-2; re-trigger after archival -> mumbai-3 (global max, never reuses a number) - derived_cluster_from field stores the root cluster on all children, enabling sibling counting with a single indexed DB query - auto_cluster_app_server_count field (Int, default 2) shown only on root clusters, hidden on derived children - Validate that new servers cannot be added to a cluster where auto_cluster_triggered is set; users are directed to the new cluster Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-05-28T05:04:24Z

Greptile Summary

This PR adds automatic cluster provisioning when the number of running VMs in a cluster crosses a configurable threshold. A new ClusterCreation DocType orchestrates the workflow: copying cloud credentials to a derived cluster, optionally syncing VM images, and then standing up a proxy server and N app+DB server pairs in the new cluster.

VirtualMachine.on_update now triggers _create_derived_cluster_if_hit_threshold on every status → Running transition, using for_update=True to serialize concurrent calls and db_set to atomically set the guard flag within the same transaction.
ClusterCreation.create_cluster_record copies configuration from the source cluster but omits all cloud-provider networking fields (vpc_id, subnet_id, subnet_cidr_block, security_group_id, proxy_security_group_id, nat_security_group_id, route_table_id, etc.), which are required by the VirtualMachine fetch_from mechanism and the cloud-API provisioning calls for every non-Generic provider.

Confidence Score: 3/5

Not safe to merge until the derived cluster is seeded with cloud-provider networking fields — without them, every server provisioning call in the new cluster will fail at the cloud API layer for AWS EC2, Hetzner, DigitalOcean, and OCI.

The core triggering and guard logic (for_update lock, db_set flag, whitelist method) is sound, but create_cluster_record omits the networking fields (vpc_id, subnet_id, subnet_cidr_block, security_group_id, proxy_security_group_id, etc.) that every subsequent VM provisioning step relies on via fetch_from. The derived cluster will be created successfully, yet every create_server call inside it will fail — making the entire auto-provisioning feature non-functional for all major cloud providers in production.

press/press/doctype/cluster_creation/cluster_creation.py — the create_cluster_record task needs the cloud-provider networking fields added to cluster_data

Important Files Changed

Filename	Overview
press/press/doctype/cluster_creation/cluster_creation.py	New DocType that drives the auto-cluster workflow; missing cloud-provider networking fields in create_cluster_record will cause server provisioning to fail for every non-Generic cloud provider
press/press/doctype/virtual_machine/virtual_machine.py	Adds _create_derived_cluster_if_hit_threshold triggered on status→Running; uses for_update=True to serialize concurrent calls and db_set to write the flag atomically within the save transaction
press/press/doctype/cluster/cluster.py	Adds get_running_vm_count and create_derived_cluster methods; create_derived_cluster is @frappe.whitelist() without a write-permission guard (flagged in a prior thread)
press/press/doctype/server/server.py	Adds threshold guard in validate_cluster that blocks new server creation when auto_cluster_triggered=1; error string is not wrapped in _() for i18n (flagged in prior thread)
press/press/doctype/cluster_creation/cluster_creation.json	New DocType definition with standard fields; schema looks correct
press/press/doctype/cluster/cluster.json	Adds five new Cluster fields for auto-cluster configuration; field definitions are well-formed

Sequence Diagram

sequenceDiagram
    participant VM as VirtualMachine
    participant CC as Cluster (source)
    participant CCr as ClusterCreation
    participant NC as Cluster (derived)
    participant PS as Proxy Server
    participant SP as App+DB Server Pairs

    VM->>VM: on_update (status → Running)
    VM->>CC: "get_doc(for_update=True)"
    CC-->>VM: cluster doc (row-locked)
    VM->>VM: get_running_vm_count()
    alt "count >= max_servers and not triggered"
        VM->>CC: "db_set(auto_cluster_triggered=1)"
        VM->>CC: create_derived_cluster()
        CC->>CCr: insert ClusterCreation
        CCr->>CCr: after_insert → execute.run_as_workflow()
        CCr->>NC: create_cluster_record() [copies config, NOT networking fields]
        CCr->>NC: copy_images_if_needed()
        CCr->>PS: create_proxy_server()
        CCr->>CCr: wait_for_proxy_server()
        loop for each server pair
            CCr->>SP: create_server_pairs(i)
            CCr->>CCr: wait_for_server_pairs(i)
        end
        alt workflow success
            CCr->>CCr: "on_workflow_success → status=Success"
        else workflow failure
            CCr->>CCr: "on_workflow_failure → status=Failure + Incident"
        end
    end

_{Reviews (3): Last reviewed commit: "feat(cluster-creation): wait for server ..." | Re-trigger Greptile}

codecov · 2026-05-28T05:06:12Z

Codecov Report

❌ Patch coverage is 11.88811% with 126 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.06%. Comparing base (d27648a) to head (7f4a969).
⚠️ Report is 112 commits behind head on develop.

Files with missing lines	Patch %	Lines
...press/doctype/cluster_creation/cluster_creation.py	0.00%	114 Missing ⚠️
...s/press/doctype/virtual_machine/virtual_machine.py	50.00%	7 Missing ⚠️
press/press/doctype/cluster/cluster.py	50.00%	4 Missing ⚠️
press/press/doctype/server/server.py	75.00%	1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (11.88%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #6537       +/-   ##
============================================
+ Coverage    32.02%   50.06%   +18.04%     
============================================
  Files          919      958       +39     
  Lines        76179    79470     +3291     
  Branches       369      375        +6     
============================================
+ Hits         24393    39789    +15396     
+ Misses       51760    39653    -12107     
- Partials        26       28        +2

Flag	Coverage Δ
dashboard	`61.02% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…her than keeping the logic inline

… and take a lock on cluster doc

greptile-apps · 2026-05-29T08:09:39Z

+	def on_workflow_failure(self, workflow: "PressWorkflow"):
+		self.db_set({"status": "Failure", "end": now_datetime()})


Workflow failure leaves auto_cluster_triggered permanently set

on_workflow_failure only updates this document's status; it never resets auto_cluster_triggered to 0 on the source cluster. Once a ClusterCreation job fails for any reason (network error, quota exceeded, image copy timeout), the guard in _create_derived_cluster_if_hit_threshold will always short-circuit with return, making it impossible for the auto-cluster mechanism to fire again for that cluster without manual admin intervention. The handler must also call frappe.db.set_value("Cluster", self.source_cluster, "auto_cluster_triggered", 0) so that the next VM-reaches-threshold event can trigger a fresh attempt.

greptile-apps · 2026-05-29T08:09:39Z

+		for pair_index in range(1, num_pairs + 1):
+			db_server, _ = cluster.create_server(
+				"Database Server",
+				f"Public DB Server {pair_index}",
+				plan=db_plan,
+			)
+			cluster.database_server = db_server.name
+			cluster.create_server(
+				"Server",
+				f"Public App Server {pair_index}",
+				plan=app_plan,
+				public=True,
+			)
+			frappe.db.commit()


Non-idempotent task: partial commits cause duplicate server pairs on retry

frappe.db.commit() is called after each pair inside the loop. If the task fails after committing pair 1 but before committing pair 2, the workflow engine will retry the whole create_server_pairs task from the beginning — but there is no record of which pairs were already created. The retry starts pair_index at 1 again and creates a duplicate DB + App server for pair 1, resulting in orphaned infrastructure that was never tracked by the ClusterCreation record.

Either track completed pairs (e.g., store them in self.kv) and skip already-created ones at the start of each iteration, or remove the mid-loop commits and let Frappe commit the whole task atomically.

greptile-apps · 2026-05-29T08:09:41Z

+	@frappe.whitelist()
+	def create_derived_cluster(self):
+		doc = frappe.get_doc(
+			{
+				"doctype": "Cluster Creation",
+				"source_cluster": self.name,
+				"status": "Pending",
+			}
+		).insert(ignore_permissions=True)
+		return f"Queued {doc.name}"


@frappe.whitelist() without write-permission check on infrastructure-creating method

Frappe's doc-method whitelist mechanism only verifies that the caller has read permission on the Cluster document. Any authenticated user who can read a Cluster record — including non-admin operators — can POST to frappe.client.run_doc_method and trigger full cluster provisioning. The insert(ignore_permissions=True) inside then bypasses all permission checks on the resulting ClusterCreation document, so even the creation of that record escapes normal access control. Add an explicit check such as frappe.has_permission("Cluster", "write", self.name, throw=True) at the top of the method, or restrict the whitelist to an admin role with @frappe.whitelist(allow_guest=False) combined with a role check.

…on faliure

greptile-apps · 2026-06-01T07:28:05Z

+		cluster_data: dict = {
+			"doctype": "Cluster",
+			"name": new_name,
+			"title": source.title or root_name,
+			"status": "Active",
+			"cloud_provider": source.cloud_provider,
+			"region": source.region,
+			"availability_zone": source.availability_zone,
+			"ssh_key": source.ssh_key,
+			"public": source.public,
+			"hybrid": source.hybrid,
+			"beta": source.beta,
+			"team": source.team,
+			"country": source.country,
+			"default_app_server_plan": source.default_app_server_plan,
+			"default_db_server_plan": source.default_db_server_plan,
+			"default_app_server_plan_type": source.default_app_server_plan_type,
+			"default_db_server_plan_type": source.default_db_server_plan_type,
+			"by_default_select_unified_mode": source.by_default_select_unified_mode,
+			"has_arm_support": source.has_arm_support,
+			"has_unified_server_support": source.has_unified_server_support,
+			"has_add_on_storage_support": source.has_add_on_storage_support,
+			"enable_autoscaling": source.enable_autoscaling,
+			"enable_periodic_flush_table": source.enable_periodic_flush_table,
+			"flush_table_execution_hour": source.flush_table_execution_hour,
+			"disable_public_ips_for_servers": source.disable_public_ips_for_servers,
+			"enable_auto_cluster": source.enable_auto_cluster,
+			"max_servers": source.max_servers,
+			"derived_cluster_from": root_name,
+			"auto_cluster_app_server_count": source.auto_cluster_app_server_count or 1,
+		}
+
+		if source.cloud_provider == "AWS EC2":
+			cluster_data["aws_access_key_id"] = source.aws_access_key_id
+			cluster_data["aws_secret_access_key"] = source.get_password("aws_secret_access_key")
+		elif source.cloud_provider == "Hetzner":
+			cluster_data["hetzner_api_token"] = source.get_password("hetzner_api_token")
+		elif source.cloud_provider == "OCI":
+			cluster_data["oci_user"] = source.oci_user
+			cluster_data["oci_tenancy"] = source.oci_tenancy
+			cluster_data["oci_public_key"] = source.oci_public_key
+			cluster_data["oci_private_key"] = source.get_password("oci_private_key")
+		elif source.cloud_provider == "DigitalOcean":
+			cluster_data["digital_ocean_api_token"] = source.get_password("digital_ocean_api_token")
+		elif source.cloud_provider == "Frappe Compute":
+			cluster_data["frappe_compute_base_url"] = source.frappe_compute_base_url
+			cluster_data["frappe_compute_api_key"] = source.frappe_compute_api_key
+			cluster_data["frappe_compute_api_secret"] = source.get_password("frappe_compute_api_secret")
+
+		new_cluster = frappe.get_doc(cluster_data).insert(ignore_permissions=True)
+		self.db_set("new_cluster", new_cluster.name)


Missing cloud-provider networking fields breaks server provisioning in derived cluster

create_cluster_record copies application-level settings from the source cluster but omits every cloud-provider networking field: vpc_id, subnet_id, subnet_cidr_block, cidr_block, security_group_id, proxy_security_group_id, nat_security_group_id, network_acl_id, route_table_id, and monitoring_password. These are not provisioned by any subsequent task in the execute flow.

VirtualMachine declares subnet_id and subnet_cidr_block as fetch_from: "cluster.subnet_id" / fetch_from: "cluster.subnet_cidr_block" fields. On a programmatic insert(), Frappe resolves those fetch-from values from the linked Cluster record at save time. When the cluster fields are empty the VM ends up with empty subnet_id and empty subnet_cidr_block. The private-IP calculation at line 198 of virtual_machine.py then raises TypeError: argument should be a str…, not None, and the AWS run_instances call receives "SubnetId": "", causing an InvalidSubnetID.NotFound error from the API. The same pattern blocks provisioning on Hetzner (vpc_id), DigitalOcean (vpc_id), and OCI (security_group_id).

The derived cluster lives in the same AZ as the source and should reuse its existing VPC/subnet/security-group infrastructure. The fix is to add the missing fields to the cluster_data dict, analogous to how cloud-credential fields are already handled per cloud_provider.

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

phot0n added 2 commits May 29, 2026 13:33

refactor(cluster): create a auto derived cluster creation doctype rat…

7be33aa

…her than keeping the logic inline

refactor(virtual-machine): move derived cluster creation to on_update…

269ce6b

… and take a lock on cluster doc

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

feat(cluster-creation): wait for server creation and create incident …

7f4a969

…on faliure

greptile-apps Bot reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cluster): auto cluster creation on server threshold#6537

feat(cluster): auto cluster creation on server threshold#6537
phot0n wants to merge 4 commits into
frappe:developfrom
phot0n:auto-cluster

phot0n commented May 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 28, 2026 •

edited

Loading

Greptile Summary

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

greptile-apps Bot May 29, 2026

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		def on_workflow_failure(self, workflow: "PressWorkflow"):
		self.db_set({"status": "Failure", "end": now_datetime()})

Conversation

phot0n commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

phot0n commented May 28, 2026 •

edited

Loading

greptile-apps Bot commented May 28, 2026 •

edited

Loading

codecov Bot commented May 28, 2026 •

edited

Loading