Skip to content

feat(cluster): auto cluster creation on server threshold#6537

Draft
phot0n wants to merge 4 commits into
frappe:developfrom
phot0n:auto-cluster
Draft

feat(cluster): auto cluster creation on server threshold#6537
phot0n wants to merge 4 commits into
frappe:developfrom
phot0n:auto-cluster

Conversation

@phot0n

@phot0n phot0n commented May 28, 2026

Copy link
Copy Markdown
Member

No description provided.

- Queue a background job when a cluster's running VM count reaches
  max_servers (checked on every VM transition to Running state)
- Use auto_cluster_triggered flag (set atomically via db_set + commit)
  to prevent duplicate job queuing when multiple VMs go Running
  simultaneously
- create_auto_cluster provisions a new peer cluster with:
    * 1 Proxy Server
    * N public App + DB Server pairs (N = auto_cluster_app_server_count,
      default 2, configured on the root cluster)
- Naming: root mumbai -> mumbai-1 -> mumbai-2; re-trigger after
  archival -> mumbai-3 (global max, never reuses a number)
- derived_cluster_from field stores the root cluster on all children,
  enabling sibling counting with a single indexed DB query
- auto_cluster_app_server_count field (Int, default 2) shown only on
  root clusters, hidden on derived children
- Validate that new servers cannot be added to a cluster where
  auto_cluster_triggered is set; users are directed to the new cluster

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds automatic cluster provisioning when the number of running VMs in a cluster crosses a configurable threshold. A new ClusterCreation DocType orchestrates the workflow: copying cloud credentials to a derived cluster, optionally syncing VM images, and then standing up a proxy server and N app+DB server pairs in the new cluster.

  • VirtualMachine.on_update now triggers _create_derived_cluster_if_hit_threshold on every status → Running transition, using for_update=True to serialize concurrent calls and db_set to atomically set the guard flag within the same transaction.
  • ClusterCreation.create_cluster_record copies configuration from the source cluster but omits all cloud-provider networking fields (vpc_id, subnet_id, subnet_cidr_block, security_group_id, proxy_security_group_id, nat_security_group_id, route_table_id, etc.), which are required by the VirtualMachine fetch_from mechanism and the cloud-API provisioning calls for every non-Generic provider.

Confidence Score: 3/5

Not safe to merge until the derived cluster is seeded with cloud-provider networking fields — without them, every server provisioning call in the new cluster will fail at the cloud API layer for AWS EC2, Hetzner, DigitalOcean, and OCI.

The core triggering and guard logic (for_update lock, db_set flag, whitelist method) is sound, but create_cluster_record omits the networking fields (vpc_id, subnet_id, subnet_cidr_block, security_group_id, proxy_security_group_id, etc.) that every subsequent VM provisioning step relies on via fetch_from. The derived cluster will be created successfully, yet every create_server call inside it will fail — making the entire auto-provisioning feature non-functional for all major cloud providers in production.

press/press/doctype/cluster_creation/cluster_creation.py — the create_cluster_record task needs the cloud-provider networking fields added to cluster_data

Important Files Changed

Filename Overview
press/press/doctype/cluster_creation/cluster_creation.py New DocType that drives the auto-cluster workflow; missing cloud-provider networking fields in create_cluster_record will cause server provisioning to fail for every non-Generic cloud provider
press/press/doctype/virtual_machine/virtual_machine.py Adds _create_derived_cluster_if_hit_threshold triggered on status→Running; uses for_update=True to serialize concurrent calls and db_set to write the flag atomically within the save transaction
press/press/doctype/cluster/cluster.py Adds get_running_vm_count and create_derived_cluster methods; create_derived_cluster is @frappe.whitelist() without a write-permission guard (flagged in a prior thread)
press/press/doctype/server/server.py Adds threshold guard in validate_cluster that blocks new server creation when auto_cluster_triggered=1; error string is not wrapped in _() for i18n (flagged in prior thread)
press/press/doctype/cluster_creation/cluster_creation.json New DocType definition with standard fields; schema looks correct
press/press/doctype/cluster/cluster.json Adds five new Cluster fields for auto-cluster configuration; field definitions are well-formed

Sequence Diagram

sequenceDiagram
    participant VM as VirtualMachine
    participant CC as Cluster (source)
    participant CCr as ClusterCreation
    participant NC as Cluster (derived)
    participant PS as Proxy Server
    participant SP as App+DB Server Pairs

    VM->>VM: on_update (status → Running)
    VM->>CC: "get_doc(for_update=True)"
    CC-->>VM: cluster doc (row-locked)
    VM->>VM: get_running_vm_count()
    alt "count >= max_servers and not triggered"
        VM->>CC: "db_set(auto_cluster_triggered=1)"
        VM->>CC: create_derived_cluster()
        CC->>CCr: insert ClusterCreation
        CCr->>CCr: after_insert → execute.run_as_workflow()
        CCr->>NC: create_cluster_record() [copies config, NOT networking fields]
        CCr->>NC: copy_images_if_needed()
        CCr->>PS: create_proxy_server()
        CCr->>CCr: wait_for_proxy_server()
        loop for each server pair
            CCr->>SP: create_server_pairs(i)
            CCr->>CCr: wait_for_server_pairs(i)
        end
        alt workflow success
            CCr->>CCr: "on_workflow_success → status=Success"
        else workflow failure
            CCr->>CCr: "on_workflow_failure → status=Failure + Incident"
        end
    end
Loading

Reviews (3): Last reviewed commit: "feat(cluster-creation): wait for server ..." | Re-trigger Greptile

Comment thread press/press/doctype/cluster/cluster.py Outdated
Comment thread press/press/doctype/virtual_machine/virtual_machine.py Outdated
Comment thread press/press/doctype/virtual_machine/virtual_machine.py Outdated
Comment thread press/press/doctype/cluster/cluster.py
Comment thread press/press/doctype/cluster/cluster.py Outdated
Comment thread press/press/doctype/cluster/cluster.py Outdated
Comment thread press/press/doctype/server/server.py
@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 11.88811% with 126 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.06%. Comparing base (d27648a) to head (7f4a969).
⚠️ Report is 112 commits behind head on develop.

Files with missing lines Patch % Lines
...press/doctype/cluster_creation/cluster_creation.py 0.00% 114 Missing ⚠️
...s/press/doctype/virtual_machine/virtual_machine.py 50.00% 7 Missing ⚠️
press/press/doctype/cluster/cluster.py 50.00% 4 Missing ⚠️
press/press/doctype/server/server.py 75.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (11.88%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #6537       +/-   ##
============================================
+ Coverage    32.02%   50.06%   +18.04%     
============================================
  Files          919      958       +39     
  Lines        76179    79470     +3291     
  Branches       369      375        +6     
============================================
+ Hits         24393    39789    +15396     
+ Misses       51760    39653    -12107     
- Partials        26       28        +2     
Flag Coverage Δ
dashboard 61.02% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines +46 to +47
def on_workflow_failure(self, workflow: "PressWorkflow"):
self.db_set({"status": "Failure", "end": now_datetime()})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Workflow failure leaves auto_cluster_triggered permanently set

on_workflow_failure only updates this document's status; it never resets auto_cluster_triggered to 0 on the source cluster. Once a ClusterCreation job fails for any reason (network error, quota exceeded, image copy timeout), the guard in _create_derived_cluster_if_hit_threshold will always short-circuit with return, making it impossible for the auto-cluster mechanism to fire again for that cluster without manual admin intervention. The handler must also call frappe.db.set_value("Cluster", self.source_cluster, "auto_cluster_triggered", 0) so that the next VM-reaches-threshold event can trigger a fresh attempt.

Comment on lines +155 to +168
for pair_index in range(1, num_pairs + 1):
db_server, _ = cluster.create_server(
"Database Server",
f"Public DB Server {pair_index}",
plan=db_plan,
)
cluster.database_server = db_server.name
cluster.create_server(
"Server",
f"Public App Server {pair_index}",
plan=app_plan,
public=True,
)
frappe.db.commit()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-idempotent task: partial commits cause duplicate server pairs on retry

frappe.db.commit() is called after each pair inside the loop. If the task fails after committing pair 1 but before committing pair 2, the workflow engine will retry the whole create_server_pairs task from the beginning — but there is no record of which pairs were already created. The retry starts pair_index at 1 again and creates a duplicate DB + App server for pair 1, resulting in orphaned infrastructure that was never tracked by the ClusterCreation record.

Either track completed pairs (e.g., store them in self.kv) and skip already-created ones at the start of each iteration, or remove the mid-loop commits and let Frappe commit the whole task atomically.

Comment on lines +1795 to +1804
@frappe.whitelist()
def create_derived_cluster(self):
doc = frappe.get_doc(
{
"doctype": "Cluster Creation",
"source_cluster": self.name,
"status": "Pending",
}
).insert(ignore_permissions=True)
return f"Queued {doc.name}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security @frappe.whitelist() without write-permission check on infrastructure-creating method

Frappe's doc-method whitelist mechanism only verifies that the caller has read permission on the Cluster document. Any authenticated user who can read a Cluster record — including non-admin operators — can POST to frappe.client.run_doc_method and trigger full cluster provisioning. The insert(ignore_permissions=True) inside then bypasses all permission checks on the resulting ClusterCreation document, so even the creation of that record escapes normal access control. Add an explicit check such as frappe.has_permission("Cluster", "write", self.name, throw=True) at the top of the method, or restrict the whitelist to an admin role with @frappe.whitelist(allow_guest=False) combined with a role check.

Comment on lines +76 to +126
cluster_data: dict = {
"doctype": "Cluster",
"name": new_name,
"title": source.title or root_name,
"status": "Active",
"cloud_provider": source.cloud_provider,
"region": source.region,
"availability_zone": source.availability_zone,
"ssh_key": source.ssh_key,
"public": source.public,
"hybrid": source.hybrid,
"beta": source.beta,
"team": source.team,
"country": source.country,
"default_app_server_plan": source.default_app_server_plan,
"default_db_server_plan": source.default_db_server_plan,
"default_app_server_plan_type": source.default_app_server_plan_type,
"default_db_server_plan_type": source.default_db_server_plan_type,
"by_default_select_unified_mode": source.by_default_select_unified_mode,
"has_arm_support": source.has_arm_support,
"has_unified_server_support": source.has_unified_server_support,
"has_add_on_storage_support": source.has_add_on_storage_support,
"enable_autoscaling": source.enable_autoscaling,
"enable_periodic_flush_table": source.enable_periodic_flush_table,
"flush_table_execution_hour": source.flush_table_execution_hour,
"disable_public_ips_for_servers": source.disable_public_ips_for_servers,
"enable_auto_cluster": source.enable_auto_cluster,
"max_servers": source.max_servers,
"derived_cluster_from": root_name,
"auto_cluster_app_server_count": source.auto_cluster_app_server_count or 1,
}

if source.cloud_provider == "AWS EC2":
cluster_data["aws_access_key_id"] = source.aws_access_key_id
cluster_data["aws_secret_access_key"] = source.get_password("aws_secret_access_key")
elif source.cloud_provider == "Hetzner":
cluster_data["hetzner_api_token"] = source.get_password("hetzner_api_token")
elif source.cloud_provider == "OCI":
cluster_data["oci_user"] = source.oci_user
cluster_data["oci_tenancy"] = source.oci_tenancy
cluster_data["oci_public_key"] = source.oci_public_key
cluster_data["oci_private_key"] = source.get_password("oci_private_key")
elif source.cloud_provider == "DigitalOcean":
cluster_data["digital_ocean_api_token"] = source.get_password("digital_ocean_api_token")
elif source.cloud_provider == "Frappe Compute":
cluster_data["frappe_compute_base_url"] = source.frappe_compute_base_url
cluster_data["frappe_compute_api_key"] = source.frappe_compute_api_key
cluster_data["frappe_compute_api_secret"] = source.get_password("frappe_compute_api_secret")

new_cluster = frappe.get_doc(cluster_data).insert(ignore_permissions=True)
self.db_set("new_cluster", new_cluster.name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing cloud-provider networking fields breaks server provisioning in derived cluster

create_cluster_record copies application-level settings from the source cluster but omits every cloud-provider networking field: vpc_id, subnet_id, subnet_cidr_block, cidr_block, security_group_id, proxy_security_group_id, nat_security_group_id, network_acl_id, route_table_id, and monitoring_password. These are not provisioned by any subsequent task in the execute flow.

VirtualMachine declares subnet_id and subnet_cidr_block as fetch_from: "cluster.subnet_id" / fetch_from: "cluster.subnet_cidr_block" fields. On a programmatic insert(), Frappe resolves those fetch-from values from the linked Cluster record at save time. When the cluster fields are empty the VM ends up with empty subnet_id and empty subnet_cidr_block. The private-IP calculation at line 198 of virtual_machine.py then raises TypeError: argument should be a str…, not None, and the AWS run_instances call receives "SubnetId": "", causing an InvalidSubnetID.NotFound error from the API. The same pattern blocks provisioning on Hetzner (vpc_id), DigitalOcean (vpc_id), and OCI (security_group_id).

The derived cluster lives in the same AZ as the source and should reuse its existing VPC/subnet/security-group infrastructure. The fix is to add the missing fields to the cluster_data dict, analogous to how cloud-credential fields are already handled per cloud_provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant