feat(cluster): auto cluster creation on server threshold#6537
Conversation
- Queue a background job when a cluster's running VM count reaches
max_servers (checked on every VM transition to Running state)
- Use auto_cluster_triggered flag (set atomically via db_set + commit)
to prevent duplicate job queuing when multiple VMs go Running
simultaneously
- create_auto_cluster provisions a new peer cluster with:
* 1 Proxy Server
* N public App + DB Server pairs (N = auto_cluster_app_server_count,
default 2, configured on the root cluster)
- Naming: root mumbai -> mumbai-1 -> mumbai-2; re-trigger after
archival -> mumbai-3 (global max, never reuses a number)
- derived_cluster_from field stores the root cluster on all children,
enabling sibling counting with a single indexed DB query
- auto_cluster_app_server_count field (Int, default 2) shown only on
root clusters, hidden on derived children
- Validate that new servers cannot be added to a cluster where
auto_cluster_triggered is set; users are directed to the new cluster
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
| Filename | Overview |
|---|---|
| press/press/doctype/cluster_creation/cluster_creation.py | New DocType that drives the auto-cluster workflow; missing cloud-provider networking fields in create_cluster_record will cause server provisioning to fail for every non-Generic cloud provider |
| press/press/doctype/virtual_machine/virtual_machine.py | Adds _create_derived_cluster_if_hit_threshold triggered on status→Running; uses for_update=True to serialize concurrent calls and db_set to write the flag atomically within the save transaction |
| press/press/doctype/cluster/cluster.py | Adds get_running_vm_count and create_derived_cluster methods; create_derived_cluster is @frappe.whitelist() without a write-permission guard (flagged in a prior thread) |
| press/press/doctype/server/server.py | Adds threshold guard in validate_cluster that blocks new server creation when auto_cluster_triggered=1; error string is not wrapped in _() for i18n (flagged in prior thread) |
| press/press/doctype/cluster_creation/cluster_creation.json | New DocType definition with standard fields; schema looks correct |
| press/press/doctype/cluster/cluster.json | Adds five new Cluster fields for auto-cluster configuration; field definitions are well-formed |
Sequence Diagram
sequenceDiagram
participant VM as VirtualMachine
participant CC as Cluster (source)
participant CCr as ClusterCreation
participant NC as Cluster (derived)
participant PS as Proxy Server
participant SP as App+DB Server Pairs
VM->>VM: on_update (status → Running)
VM->>CC: "get_doc(for_update=True)"
CC-->>VM: cluster doc (row-locked)
VM->>VM: get_running_vm_count()
alt "count >= max_servers and not triggered"
VM->>CC: "db_set(auto_cluster_triggered=1)"
VM->>CC: create_derived_cluster()
CC->>CCr: insert ClusterCreation
CCr->>CCr: after_insert → execute.run_as_workflow()
CCr->>NC: create_cluster_record() [copies config, NOT networking fields]
CCr->>NC: copy_images_if_needed()
CCr->>PS: create_proxy_server()
CCr->>CCr: wait_for_proxy_server()
loop for each server pair
CCr->>SP: create_server_pairs(i)
CCr->>CCr: wait_for_server_pairs(i)
end
alt workflow success
CCr->>CCr: "on_workflow_success → status=Success"
else workflow failure
CCr->>CCr: "on_workflow_failure → status=Failure + Incident"
end
end
Reviews (3): Last reviewed commit: "feat(cluster-creation): wait for server ..." | Re-trigger Greptile
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (11.88%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #6537 +/- ##
============================================
+ Coverage 32.02% 50.06% +18.04%
============================================
Files 919 958 +39
Lines 76179 79470 +3291
Branches 369 375 +6
============================================
+ Hits 24393 39789 +15396
+ Misses 51760 39653 -12107
- Partials 26 28 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…her than keeping the logic inline
… and take a lock on cluster doc
| def on_workflow_failure(self, workflow: "PressWorkflow"): | ||
| self.db_set({"status": "Failure", "end": now_datetime()}) |
There was a problem hiding this comment.
Workflow failure leaves
auto_cluster_triggered permanently set
on_workflow_failure only updates this document's status; it never resets auto_cluster_triggered to 0 on the source cluster. Once a ClusterCreation job fails for any reason (network error, quota exceeded, image copy timeout), the guard in _create_derived_cluster_if_hit_threshold will always short-circuit with return, making it impossible for the auto-cluster mechanism to fire again for that cluster without manual admin intervention. The handler must also call frappe.db.set_value("Cluster", self.source_cluster, "auto_cluster_triggered", 0) so that the next VM-reaches-threshold event can trigger a fresh attempt.
| for pair_index in range(1, num_pairs + 1): | ||
| db_server, _ = cluster.create_server( | ||
| "Database Server", | ||
| f"Public DB Server {pair_index}", | ||
| plan=db_plan, | ||
| ) | ||
| cluster.database_server = db_server.name | ||
| cluster.create_server( | ||
| "Server", | ||
| f"Public App Server {pair_index}", | ||
| plan=app_plan, | ||
| public=True, | ||
| ) | ||
| frappe.db.commit() |
There was a problem hiding this comment.
Non-idempotent task: partial commits cause duplicate server pairs on retry
frappe.db.commit() is called after each pair inside the loop. If the task fails after committing pair 1 but before committing pair 2, the workflow engine will retry the whole create_server_pairs task from the beginning — but there is no record of which pairs were already created. The retry starts pair_index at 1 again and creates a duplicate DB + App server for pair 1, resulting in orphaned infrastructure that was never tracked by the ClusterCreation record.
Either track completed pairs (e.g., store them in self.kv) and skip already-created ones at the start of each iteration, or remove the mid-loop commits and let Frappe commit the whole task atomically.
| @frappe.whitelist() | ||
| def create_derived_cluster(self): | ||
| doc = frappe.get_doc( | ||
| { | ||
| "doctype": "Cluster Creation", | ||
| "source_cluster": self.name, | ||
| "status": "Pending", | ||
| } | ||
| ).insert(ignore_permissions=True) | ||
| return f"Queued {doc.name}" |
There was a problem hiding this comment.
@frappe.whitelist() without write-permission check on infrastructure-creating method
Frappe's doc-method whitelist mechanism only verifies that the caller has read permission on the Cluster document. Any authenticated user who can read a Cluster record — including non-admin operators — can POST to frappe.client.run_doc_method and trigger full cluster provisioning. The insert(ignore_permissions=True) inside then bypasses all permission checks on the resulting ClusterCreation document, so even the creation of that record escapes normal access control. Add an explicit check such as frappe.has_permission("Cluster", "write", self.name, throw=True) at the top of the method, or restrict the whitelist to an admin role with @frappe.whitelist(allow_guest=False) combined with a role check.
| cluster_data: dict = { | ||
| "doctype": "Cluster", | ||
| "name": new_name, | ||
| "title": source.title or root_name, | ||
| "status": "Active", | ||
| "cloud_provider": source.cloud_provider, | ||
| "region": source.region, | ||
| "availability_zone": source.availability_zone, | ||
| "ssh_key": source.ssh_key, | ||
| "public": source.public, | ||
| "hybrid": source.hybrid, | ||
| "beta": source.beta, | ||
| "team": source.team, | ||
| "country": source.country, | ||
| "default_app_server_plan": source.default_app_server_plan, | ||
| "default_db_server_plan": source.default_db_server_plan, | ||
| "default_app_server_plan_type": source.default_app_server_plan_type, | ||
| "default_db_server_plan_type": source.default_db_server_plan_type, | ||
| "by_default_select_unified_mode": source.by_default_select_unified_mode, | ||
| "has_arm_support": source.has_arm_support, | ||
| "has_unified_server_support": source.has_unified_server_support, | ||
| "has_add_on_storage_support": source.has_add_on_storage_support, | ||
| "enable_autoscaling": source.enable_autoscaling, | ||
| "enable_periodic_flush_table": source.enable_periodic_flush_table, | ||
| "flush_table_execution_hour": source.flush_table_execution_hour, | ||
| "disable_public_ips_for_servers": source.disable_public_ips_for_servers, | ||
| "enable_auto_cluster": source.enable_auto_cluster, | ||
| "max_servers": source.max_servers, | ||
| "derived_cluster_from": root_name, | ||
| "auto_cluster_app_server_count": source.auto_cluster_app_server_count or 1, | ||
| } | ||
|
|
||
| if source.cloud_provider == "AWS EC2": | ||
| cluster_data["aws_access_key_id"] = source.aws_access_key_id | ||
| cluster_data["aws_secret_access_key"] = source.get_password("aws_secret_access_key") | ||
| elif source.cloud_provider == "Hetzner": | ||
| cluster_data["hetzner_api_token"] = source.get_password("hetzner_api_token") | ||
| elif source.cloud_provider == "OCI": | ||
| cluster_data["oci_user"] = source.oci_user | ||
| cluster_data["oci_tenancy"] = source.oci_tenancy | ||
| cluster_data["oci_public_key"] = source.oci_public_key | ||
| cluster_data["oci_private_key"] = source.get_password("oci_private_key") | ||
| elif source.cloud_provider == "DigitalOcean": | ||
| cluster_data["digital_ocean_api_token"] = source.get_password("digital_ocean_api_token") | ||
| elif source.cloud_provider == "Frappe Compute": | ||
| cluster_data["frappe_compute_base_url"] = source.frappe_compute_base_url | ||
| cluster_data["frappe_compute_api_key"] = source.frappe_compute_api_key | ||
| cluster_data["frappe_compute_api_secret"] = source.get_password("frappe_compute_api_secret") | ||
|
|
||
| new_cluster = frappe.get_doc(cluster_data).insert(ignore_permissions=True) | ||
| self.db_set("new_cluster", new_cluster.name) |
There was a problem hiding this comment.
Missing cloud-provider networking fields breaks server provisioning in derived cluster
create_cluster_record copies application-level settings from the source cluster but omits every cloud-provider networking field: vpc_id, subnet_id, subnet_cidr_block, cidr_block, security_group_id, proxy_security_group_id, nat_security_group_id, network_acl_id, route_table_id, and monitoring_password. These are not provisioned by any subsequent task in the execute flow.
VirtualMachine declares subnet_id and subnet_cidr_block as fetch_from: "cluster.subnet_id" / fetch_from: "cluster.subnet_cidr_block" fields. On a programmatic insert(), Frappe resolves those fetch-from values from the linked Cluster record at save time. When the cluster fields are empty the VM ends up with empty subnet_id and empty subnet_cidr_block. The private-IP calculation at line 198 of virtual_machine.py then raises TypeError: argument should be a str…, not None, and the AWS run_instances call receives "SubnetId": "", causing an InvalidSubnetID.NotFound error from the API. The same pattern blocks provisioning on Hetzner (vpc_id), DigitalOcean (vpc_id), and OCI (security_group_id).
The derived cluster lives in the same AZ as the source and should reuse its existing VPC/subnet/security-group infrastructure. The fix is to add the missing fields to the cluster_data dict, analogous to how cloud-credential fields are already handled per cloud_provider.
No description provided.