openhcl_boot: handle multi numa private pool fallback #3312
chris-oo wants to merge 7 commits into microsoft:main
Conversation
Pull request overview
This PR updates OpenHCL boot-time private pool allocation to support multi-NUMA systems by falling back to allocating the VTL2 GPA pool across NUMA nodes when a single-node allocation (node 0) can’t satisfy the request. It also introduces a command-line option to force NUMA-split behavior and adds VMM tests to exercise the multi-range pool behavior across boot and servicing.
Changes:
- Add multi-NUMA private pool allocation logic with a “try node 0 first, else split” strategy.
- Extend persisted-state handling to accept multiple VTL2 GPA pool ranges (instead of a single range).
- Add new tests covering multi-NUMA pool split boot + servicing scenarios and add a new cmdline knob to force split mode.
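The split arithmetic behind the fallback can be sketched as follows. This is a standalone illustration of the "divide evenly, page-align, give the rounding remainder to the last node" math shown in the diff below, not the actual openhcl_boot code; `per_node_sizes` is a hypothetical helper name.

```rust
const HV_PAGE_SIZE: u64 = 4096;

// Standalone sketch of the per-node split math: divide the pool evenly
// across nodes, round each share down to a page boundary, and give the
// rounding remainder to the last node so the total is preserved.
fn per_node_sizes(pool_size_bytes: u64, num_nodes: u64) -> Vec<u64> {
    assert!(num_nodes > 0, "need at least one NUMA node");
    let per_node_size = (pool_size_bytes / num_nodes) & !(HV_PAGE_SIZE - 1);
    let last_node_size = pool_size_bytes - per_node_size * (num_nodes - 1);
    let mut sizes = vec![per_node_size; (num_nodes - 1) as usize];
    sizes.push(last_node_size);
    sizes
}

fn main() {
    // 10 pages across 3 nodes: 3 pages, 3 pages, then 4 pages (remainder).
    let sizes = per_node_sizes(10 * HV_PAGE_SIZE, 3);
    assert_eq!(sizes, vec![3 * HV_PAGE_SIZE, 3 * HV_PAGE_SIZE, 4 * HV_PAGE_SIZE]);
    // The split always sums back to the requested pool size.
    assert_eq!(sizes.iter().sum::<u64>(), 10 * HV_PAGE_SIZE);
}
```

Note that this math is exactly where the review comments below find edge cases: a zero `num_nodes` divides by zero, and a pool smaller than `num_nodes * HV_PAGE_SIZE` produces zero-sized per-node shares.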
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `openhcl/openhcl_boot/src/host_params/dt/mod.rs` | Implements the multi-NUMA fallback allocation behavior for the VTL2 private pool and updates persisted-state pool-range handling. |
| `openhcl/openhcl_boot/src/memory.rs` | Updates the address space builder to accept multiple existing pool ranges and adds unit tests around multi-range pool reservation/allocation behavior. |
| `openhcl/openhcl_boot/src/cmdline.rs` | Adds parsing for `OPENHCL_VTL2_GPA_POOL_NUMA=split` to force NUMA-split pool allocation. |
| `vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_uefi.rs` | Adds a boot test that forces NUMA split and validates the VM reaches VTL2-ready and supports inspect. |
| `vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs` | Adds a servicing test that forces NUMA split and validates servicing works with split pool configuration. |
```rust
let num_nodes = numa_nodes.len() as u64;
let per_node_size = (pool_size_bytes / num_nodes) & !(HV_PAGE_SIZE - 1);
// Give the remainder (due to rounding) to the last node.
let last_node_size = pool_size_bytes - per_node_size * (num_nodes - 1);
let mut remaining = pool_size_bytes;

for (i, vnode) in numa_nodes.iter().enumerate() {
    if remaining == 0 {
        break;
    }

    let is_last = i == numa_nodes.len() - 1;
```
allocate_private_pool can hit edge cases where numa_nodes is empty (division by zero in pool_size_bytes / num_nodes) or where per_node_size becomes 0 after page-aligning (e.g., pool smaller than num_nodes * HV_PAGE_SIZE). In the latter case, intermediate nodes get alloc_size == 0, and AddressSpaceManager::allocate() returns None for len == 0, causing an unnecessary panic. Consider guarding num_nodes > 0 and distributing in pages (e.g., cap node count to min(nodes, pool_pages) and ensure every attempted allocation is at least one page), or skipping zero-sized allocations.
Suggested change:

```diff
-let num_nodes = numa_nodes.len() as u64;
-let per_node_size = (pool_size_bytes / num_nodes) & !(HV_PAGE_SIZE - 1);
-// Give the remainder (due to rounding) to the last node.
-let last_node_size = pool_size_bytes - per_node_size * (num_nodes - 1);
-let mut remaining = pool_size_bytes;
-for (i, vnode) in numa_nodes.iter().enumerate() {
-    if remaining == 0 {
-        break;
-    }
-    let is_last = i == numa_nodes.len() - 1;
+if numa_nodes.is_empty() {
+    panic!(
+        "failed to allocate VTL2 pool: no NUMA nodes found in VTL2 RAM \
+        (pool total {pool_size_bytes:#x} bytes, \
+        enable_vtl2_gpa_pool={enable_vtl2_gpa_pool:?}, \
+        device_dma_page_count={device_dma_page_count:#x?}, \
+        vp_count={vp_count}, mem_size={mem_size:#x})"
+    );
+}
+// Split only across enough nodes to keep each non-final allocation at
+// least one page. For pools smaller than a page, use a single node.
+let pool_pages = pool_size_bytes / HV_PAGE_SIZE;
+let split_nodes = if pool_pages == 0 {
+    1usize
+} else {
+    core::cmp::min(numa_nodes.len(), pool_pages as usize)
+};
+let num_nodes = split_nodes as u64;
+let per_node_size = if split_nodes == 1 {
+    0
+} else {
+    (pool_size_bytes / num_nodes) & !(HV_PAGE_SIZE - 1)
+};
+// Give the remainder (due to rounding) to the last node.
+let last_node_size = pool_size_bytes - per_node_size * (num_nodes - 1);
+let mut remaining = pool_size_bytes;
+for (i, vnode) in numa_nodes.iter().take(split_nodes).enumerate() {
+    if remaining == 0 {
+        break;
+    }
+    let is_last = i == split_nodes - 1;
```
```rust
} else if arg.starts_with(VTL2_GPA_POOL_NUMA) {
    if let Some((_, arg)) = arg.split_once('=') {
        match arg {
            "split" => self.vtl2_gpa_pool_numa_split = true,
            _ => log::warn!("Unknown value for OPENHCL_VTL2_GPA_POOL_NUMA: {arg}"),
        }
    }
}
```
New command-line parsing for `OPENHCL_VTL2_GPA_POOL_NUMA=split` isn’t covered by the existing unit tests in this module. Adding a small test case that asserts `vtl2_gpa_pool_numa_split` toggles to true for `split` (and stays false for unknown values) would help prevent regressions in this parsing behavior.
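A test along the lines the reviewer suggests could look like this sketch. `parse_pool_numa` is a standalone stand-in that mirrors the parsing logic from the snippet above; it is not the real cmdline module's API, which would parse into the options struct instead.

```rust
// Standalone sketch of the suggested parsing test. `parse_pool_numa` is a
// hypothetical helper mirroring the logic in the diff: the flag turns on
// only for the exact value "split".
fn parse_pool_numa(arg: &str) -> bool {
    const VTL2_GPA_POOL_NUMA: &str = "OPENHCL_VTL2_GPA_POOL_NUMA";
    let mut vtl2_gpa_pool_numa_split = false;
    if arg.starts_with(VTL2_GPA_POOL_NUMA) {
        if let Some((_, value)) = arg.split_once('=') {
            match value {
                "split" => vtl2_gpa_pool_numa_split = true,
                _ => {} // the real code logs a warning for unknown values
            }
        }
    }
    vtl2_gpa_pool_numa_split
}

fn main() {
    // "split" toggles the flag on; unknown or missing values leave it off.
    assert!(parse_pool_numa("OPENHCL_VTL2_GPA_POOL_NUMA=split"));
    assert!(!parse_pool_numa("OPENHCL_VTL2_GPA_POOL_NUMA=bogus"));
    assert!(!parse_pool_numa("OPENHCL_VTL2_GPA_POOL_NUMA"));
}
```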
This is blocked waiting on OpenVMM to at least report different NUMA nodes for memory so we can test this. Looking at that now.
Force-pushed 989ce01 to 34b59f2, then 34b59f2 to 89a29d5.
```rust
} else {
    per_node_size
};
```
allocate_private_pool can compute per_node_size == 0 when pool_size_bytes < num_nodes * HV_PAGE_SIZE (or when num_nodes is large). In that case the non-last iterations pass alloc_size = 0 to address_space.allocate(), which returns None for zero-length allocations and triggers the panic path even though the total pool could still be allocated (e.g., entirely on the last node). Consider guarding against this by ensuring each attempted allocation is at least one page (or skipping zero-sized per-node allocations / falling back to a single-node allocation when the pool is too small to split).
Suggested change:

```rust
if alloc_size == 0 {
    continue;
}
```
```diff
-// Choose the memory layout of the VM.
-let mem_layout = MemoryLayout::new(
-    cfg.memory.mem_size,
-    &cfg.memory.mmio_gaps,
-    &cfg.memory.pci_ecam_gaps,
-    &cfg.memory.pci_mmio_gaps,
-    vtl2_range,
-)
+//
+// When numa_mem_sizes is set, distribute guest RAM across vNUMA nodes
+// for ACPI SRAT / FDT reporting.
+//
+// TODO: The vNUMA nodes reported are meant for test usage only, as they
+// are not aligned to any physical NUMA node. There is more work to do
+// to support useful vNUMA reporting.
+let mem_layout = if let Some(ref sizes) = cfg.memory.numa_mem_sizes {
+    MemoryLayout::new_with_numa(
+        sizes,
+        &cfg.memory.mmio_gaps,
+        &cfg.memory.pci_ecam_gaps,
+        &cfg.memory.pci_mmio_gaps,
+        vtl2_range,
+    )
+} else {
+    MemoryLayout::new(
+        cfg.memory.mem_size,
+        &cfg.memory.mmio_gaps,
+        &cfg.memory.pci_ecam_gaps,
+        &cfg.memory.pci_mmio_gaps,
+        vtl2_range,
+    )
+}
```
When cfg.memory.numa_mem_sizes is set, MemoryLayout::new_with_numa() derives the total RAM size from the sum of sizes, but other parts of this function still use cfg.memory.mem_size (e.g., earlier VTL2 range selection for Vtl2BaseAddressType::MemoryLayout, and the mem_size log). If mem_size and sum(numa_mem_sizes) diverge (possible via non-CLI config sources), the VM can end up with an inconsistent memory configuration. It would be safer to validate sum(sizes) == cfg.memory.mem_size here (or treat mem_size as derived and ignore the caller-provided value) and return an error on mismatch.
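The validation suggested here could look something like the sketch below. The field names (`mem_size`, `numa_mem_sizes`) follow the snippet above; the free function and its `String` error are illustrative, not the real code's error handling.

```rust
// Sketch of the suggested consistency check: when per-node sizes are
// provided, their sum must equal the caller-provided total memory size.
fn validate_numa_sizes(mem_size: u64, numa_mem_sizes: Option<&[u64]>) -> Result<(), String> {
    if let Some(sizes) = numa_mem_sizes {
        let total: u64 = sizes.iter().sum();
        if total != mem_size {
            return Err(format!(
                "sum of numa_mem_sizes ({total:#x}) does not match mem_size ({mem_size:#x})"
            ));
        }
    }
    Ok(())
}

fn main() {
    // 2 GiB + 2 GiB matches a 4 GiB total; a mismatch is rejected.
    assert!(validate_numa_sizes(4 << 30, Some(&[2 << 30, 2 << 30])).is_ok());
    assert!(validate_numa_sizes(4 << 30, Some(&[2 << 30, 1 << 30])).is_err());
    // No per-node sizes means nothing to cross-check.
    assert!(validate_numa_sizes(4 << 30, None).is_ok());
}
```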
```rust
/// per-NUMA-node guest RAM sizes (comma-separated, e.g. "2G,2G").
/// Distributes memory across vNUMA nodes for ACPI SRAT / FDT reporting.
/// Mutually exclusive with --memory.
///
/// TODO: This is informational topology only — backing pages are not pinned
/// to host NUMA nodes. This should change once we implement real numa
/// support.
#[clap(long, value_name = "SIZES", value_parser = parse_numa_memory, conflicts_with = "memory")]
pub numa_memory: Option<Vec<u64>>,
```
The PR introduces a new user-facing `--numa-memory` CLI flag, but the CLI reference doc in `Guide/src/reference/openvmm/management/cli.md` currently documents `--memory` and does not mention `--numa-memory`. Please update the Guide (or explicitly note in the PR description that docs will be updated in a follow-up) so users can discover and correctly use the new option.
Force-pushed 89a29d5 to 47b3fbf, then 47b3fbf to 9521145.
```rust
/// Test servicing an OpenHCL VM with a multi-NUMA pool split.
/// The pool is split across NUMA nodes via `OPENHCL_VTL2_GPA_POOL_NUMA=split`,
/// and pool ranges must be preserved identically across the service boundary.
#[openvmm_test(openhcl_linux_direct_x64[LATEST_LINUX_DIRECT_TEST_X64])]
async fn servicing_numa_private_pool<T: PetriVmmBackend>(
    config: PetriVmBuilder<T>,
    (igvm_file,): (ResolvedArtifact<impl petri_artifacts_common::tags::IsOpenhclIgvm>,),
) -> anyhow::Result<()> {
    let mut flags = config.default_servicing_flags();
    flags.override_version_checks = true;
    openhcl_servicing_core(
        config
            .with_processor_topology(ProcessorTopology {
                vp_count: 4,
                vps_per_socket: Some(2),
                ..Default::default()
            })
            .with_memory(MemoryConfig {
                numa_mem_sizes: Some(vec![2 * 1024 * 1024 * 1024, 2 * 1024 * 1024 * 1024]),
                ..Default::default()
            })
            .with_openhcl_command_line(
                "OPENHCL_VTL2_GPA_POOL_NUMA=split OPENHCL_ENABLE_VTL2_GPA_POOL=512 OPENHCL_DISABLE_NVME_KEEP_ALIVE=0",
            ),
```
The doc comment says this test validates that pool ranges are preserved identically across the service boundary, but the test body only exercises openhcl_servicing_core() (which checks inspect serialization) and doesn’t assert anything about private_pool_ranges before/after servicing. Either add an explicit check of the pool ranges across a restart, or soften the comment to match what the test actually verifies (e.g., that servicing succeeds when the pool is split).
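A direct check of the doc comment's claim would capture the pool ranges on either side of the service boundary and compare them. Sketched generically below: `PoolRange`, `ranges_preserved`, and both vectors are placeholders for whatever the real test would read via inspect before and after the service/restore cycle, not petri APIs.

```rust
// Generic sketch of the "ranges preserved across servicing" assertion the
// review asks for. `PoolRange` stands in for the real persisted pool range
// type; the two vectors stand in for inspect reads before and after
// servicing.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PoolRange {
    start: u64,
    end: u64,
    vnode: u32,
}

fn ranges_preserved(before: &[PoolRange], after: &[PoolRange]) -> bool {
    // Order matters: restored ranges must match exactly, not just as a set.
    before == after
}

fn main() {
    let before = vec![
        PoolRange { start: 0x1_0000_0000, end: 0x1_0010_0000, vnode: 0 },
        PoolRange { start: 0x2_0000_0000, end: 0x2_0010_0000, vnode: 1 },
    ];
    // After servicing, the restored ranges must be identical.
    let after = before.clone();
    assert!(ranges_preserved(&before, &after));
}
```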
On some systems, there is not enough RAM in NUMA node 0 to fulfill the private pool allocation request. Change the code to fall back to allocating the private pool across NUMA nodes when the old single-node behavior cannot be satisfied. This also opens up the possibility of adding NUMA awareness to dma_manager and drivers in the future.

Note that previous versions of openhcl_boot only support a single range for the private pool, but systems that would need multiple ranges shouldn't have been able to boot at all on those versions, so this change should be safe.