[WIP] vhdx: add pure-Rust VHDX parser and disk layer #3347

jstarks wants to merge 15 commits into microsoft:main from
Conversation
Add a pure-Rust VHDX implementation with full read/write support, write-ahead logging, crash consistency, free space management, trim, differencing disk chains, and a disk_backend integration layer. This replaces the Windows-only vhdmp kernel driver for VHDX files, enabling cross-platform VHDX support.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a cross-platform, pure-Rust VHDX implementation (parser + read/write + WAL pipeline) and wires it into OpenVMM’s layered disk stack so .vhdx works without the Windows-only kernel driver.
Changes:
- Introduces a new `vhdx` crate with VHDX format parsing, metadata/region/header handling, and a WAL-backed commit/apply pipeline.
- Adds a new `disklayer_vhdx` crate implementing `LayerIo` + resolver, including parent-locator chain walking.
- Updates OpenVMM resource registration and disk-opening logic to use the new VHDX backend for `.vhdx`.
Reviewed changes
Copilot reviewed 34 out of 41 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| vm/devices/storage/vhdx/src/sector_bitmap.rs | Sector bitmap cache + partially-present read resolution; adds unit tests. |
| vm/devices/storage/vhdx/src/region.rs | Region table parsing/validation + rewrite via WAL; adds tests. |
| vm/devices/storage/vhdx/src/metadata.rs | Metadata table parsing + item reads; adds tests. |
| vm/devices/storage/vhdx/src/lsn_watermark.rs | Async LSN/FSN watermark primitive with poisoning + tests. |
| vm/devices/storage/vhdx/src/log_task.rs | WAL logging task plumbing and tail advancement + tests. |
| vm/devices/storage/vhdx/src/log_permits.rs | Failable semaphore for pipeline backpressure + tests. |
| vm/devices/storage/vhdx/src/locator.rs | Parent locator parsing (UTF-16 KV table) + tests. |
| vm/devices/storage/vhdx/src/lib.rs | Crate root + module exports + AsyncFile trait. |
| vm/devices/storage/vhdx/src/known_meta.rs | Parses well-known metadata items + required-item validation + tests. |
| vm/devices/storage/vhdx/src/header.rs | Dual-header parsing + serialized header updates + tests. |
| vm/devices/storage/vhdx/src/format.rs | On-disk structs/constants + checksums + GUIDs + layout tests. |
| vm/devices/storage/vhdx/src/flush.rs | Flush sequencer implementation + tests. |
| vm/devices/storage/vhdx/src/error.rs | Error and corruption enums used across the crate. |
| vm/devices/storage/vhdx/src/create.rs | VHDX creation logic (headers/regions/metadata/BAT) + tests. |
| vm/devices/storage/vhdx/src/apply_task.rs | Apply task writing logged pages to final offsets. |
| vm/devices/storage/vhdx/Cargo.toml | Defines the new vhdx crate and dependencies. |
| vm/devices/storage/disklayer_vhdx/src/resolver.rs | Resource resolver for VhdxDiskLayerHandle. |
| vm/devices/storage/disklayer_vhdx/src/lib.rs | LayerIo implementation bridging vhdx ranges to buffers. |
| vm/devices/storage/disklayer_vhdx/src/io.rs | BlockingFile implementing vhdx::AsyncFile via blocking pool. |
| vm/devices/storage/disklayer_vhdx/src/chain.rs | Helpers to open single VHDX or auto-walk differencing chains. |
| vm/devices/storage/disklayer_vhdx/Cargo.toml | Defines the new disklayer crate + deps/tests. |
| vm/devices/storage/disk_backend_resources/src/layer.rs | Adds VhdxDiskLayerHandle resource type. |
| openvmm/openvmm_resources/src/lib.rs | Registers the new VHDX disklayer resolver. |
| openvmm/openvmm_resources/Cargo.toml | Adds disklayer_vhdx dependency. |
| openvmm/openvmm_helpers/src/disk.rs | Routes .vhdx open to the pure-Rust VHDX chain implementation. |
| openvmm/openvmm_helpers/Cargo.toml | Adds disklayer_vhdx dependency. |
| Cargo.toml | Adds workspace members + crc32c dependency. |
```rust
let sector_number = virtual_offset / self.logical_sector_size as u64;
let chunk_number = (sector_number / SECTORS_PER_CHUNK) as u32;
let sector_count = length as u64 / self.logical_sector_size as u64;
```
sector_count is computed with integer division, which silently truncates if length is not a multiple of logical_sector_size. That would cause resolve_partial_block_read to return ranges shorter than the requested length without error. Consider validating length % logical_sector_size == 0 here and returning VhdxIoErrorInner::UnalignedIo (or asserting this invariant if it is guaranteed by the caller) to avoid silent data loss.
Suggested change:

```rust
let logical_sector_size = self.logical_sector_size as u64;
if (length as u64) % logical_sector_size != 0 {
    return Err(VhdxIoErrorInner::UnalignedIo.into());
}
let sector_number = virtual_offset / logical_sector_size;
let chunk_number = (sector_number / SECTORS_PER_CHUNK) as u32;
let sector_count = length as u64 / logical_sector_size;
```
```rust
let sector_number = virtual_offset / self.logical_sector_size as u64;
let chunk_number = (sector_number / SECTORS_PER_CHUNK) as u32;
let sector_count = length as u64 / self.logical_sector_size as u64;
```
set_sector_bitmap_bits also truncates sector_count via integer division. If length is not sector-aligned, the tail bytes of the request will be ignored and the bitmap will be updated for fewer sectors than intended. Consider enforcing alignment (returning an error) or asserting this is always sector-aligned at the call site to prevent subtle metadata/data mismatches.
Suggested change:

```rust
let logical_sector_size = self.logical_sector_size as u64;
assert!(
    virtual_offset % logical_sector_size == 0,
    "virtual_offset {virtual_offset} must be aligned to logical sector size {logical_sector_size}"
);
assert!(
    length as u64 % logical_sector_size == 0,
    "length {length} must be aligned to logical sector size {logical_sector_size}"
);
let sector_number = virtual_offset / logical_sector_size;
let chunk_number = (sector_number / SECTORS_PER_CHUNK) as u32;
let sector_count = length as u64 / logical_sector_size;
```
```rust
// Reasonable depth limit to detect cycles or absurdly long chains.
const MAX_CHAIN_DEPTH: usize = 1024;

let mut paths: Vec<std::path::PathBuf> = vec![path.to_path_buf()];
let mut current_path = path.to_path_buf();

loop {
    if paths.len() > MAX_CHAIN_DEPTH {
        anyhow::bail!(
            "vhdx chain exceeds maximum depth of {} — possible cycle",
            MAX_CHAIN_DEPTH
        );
    }
```
Cycle detection relies solely on a max-depth limit, so a short cycle (e.g., A→B→A) will loop until depth 1024, doing repeated IO and producing a delayed, less actionable error. Consider tracking visited canonicalized paths (or raw paths if canonicalization is undesirable) in a HashSet and failing immediately on repeat to make cycle failures faster and more diagnosable.
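The `HashSet`-based early exit could look roughly like the sketch below. The `parent_of` closure and `walk_chain` function are hypothetical stand-ins for the crate's real parent-locator lookup and chain-walking code; only the cycle-detection pattern is the point.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Hypothetical sketch: walk a differencing chain, failing fast on a
/// repeated path instead of waiting for the depth limit. `parent_of`
/// stands in for the real parent-locator lookup.
fn walk_chain(
    start: &Path,
    parent_of: impl Fn(&Path) -> Option<PathBuf>,
) -> Result<Vec<PathBuf>, String> {
    const MAX_CHAIN_DEPTH: usize = 1024;
    let mut seen = HashSet::new();
    let mut paths = Vec::new();
    let mut current = start.to_path_buf();
    loop {
        // Fail immediately on a short cycle such as A -> B -> A.
        if !seen.insert(current.clone()) {
            return Err(format!("vhdx chain cycle at {}", current.display()));
        }
        paths.push(current.clone());
        if paths.len() > MAX_CHAIN_DEPTH {
            return Err("vhdx chain exceeds maximum depth".to_string());
        }
        match parent_of(&current) {
            Some(parent) => current = parent,
            None => return Ok(paths),
        }
    }
}
```

Canonicalizing before inserting into the set would additionally catch cycles expressed through symlinks or relative paths, at the cost of extra IO.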
```rust
// (for newly-allocated block padding).
let zeros = vec![0u8; length as usize];
self.file
    .write_at(file_offset, &zeros)
    .await
    .map_err(DiskError::Io)?;
```
This allocates a fresh Vec<u8> sized to length for every WriteRange::Zero. For large padding ranges, this can create significant transient allocations and memory pressure. Consider using a small reusable zero buffer (e.g., 64 KiB) and writing in chunks, or providing a zero-write helper in BlockingFile that streams zeros without allocating length bytes.
Suggested change:

```rust
// (for newly-allocated block padding) without
// allocating a buffer for the entire range.
let zeros = [0u8; 64 * 1024];
let mut remaining = length as usize;
let mut current_offset = file_offset;
while remaining != 0 {
    let chunk_len = remaining.min(zeros.len());
    self.file
        .write_at(current_offset, &zeros[..chunk_len])
        .await
        .map_err(DiskError::Io)?;
    remaining -= chunk_len;
    current_offset += chunk_len as u64;
}
```
```rust
while remaining_sectors > 0 {
    // Recompute bitmap page coordinates for current position.
    let cur_sector = current_virtual_offset / self.logical_sector_size as u64;
    let cur_chunk_sector = cur_sector % SECTORS_PER_CHUNK;
    let cur_page_number = cur_chunk_sector / SECTORS_PER_BITMAP_PAGE;
    let start_bit = cur_chunk_sector % SECTORS_PER_BITMAP_PAGE;
    let bits_in_this_page =
        std::cmp::min(start_bit + remaining_sectors, SECTORS_PER_BITMAP_PAGE);
```
When a data/unmapped run spans a bitmap-page boundary, this implementation will always end the run at the page boundary and start a new ReadRange in the next iteration. That can inflate the number of ranges (especially for large reads) and increase downstream overhead. Consider carrying the previous run state across iterations and coalescing adjacent runs of the same type when guest_offset is contiguous (and file_offset is contiguous for Data).
```rust
ranges.push(ReadRange::Unmapped {
    guest_offset: current_virtual_offset,
    length: unmapped_bytes as u32,
});
```
```rust
ranges.push(ReadRange::Data {
    guest_offset: current_virtual_offset,
    length: data_bytes as u32,
    file_offset,
});
```
```rust
    // Advance to next page.
    let sectors_processed = bits_in_this_page - start_bit;
    remaining_sectors -= sectors_processed;
}
```
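The suggested coalescing could be a small push helper that merges a new run into the previous one when the offsets line up. The `ReadRange` shape below follows the excerpts above, but the helper itself and its exact guard conditions are a hypothetical sketch, not the crate's code:

```rust
/// Simplified ReadRange mirroring the variants shown in the excerpts.
#[derive(Debug, PartialEq)]
enum ReadRange {
    Data { guest_offset: u64, length: u32, file_offset: u64 },
    Unmapped { guest_offset: u64, length: u32 },
}

/// Push `next`, extending the previous range instead when the two runs
/// are the same kind and contiguous (guest offsets for both kinds, and
/// file offsets as well for Data).
fn push_coalesced(ranges: &mut Vec<ReadRange>, next: ReadRange) {
    if let Some(last) = ranges.last_mut() {
        match (last, &next) {
            (
                ReadRange::Data { guest_offset, length, file_offset },
                ReadRange::Data { guest_offset: g2, length: l2, file_offset: f2 },
            ) if *guest_offset + *length as u64 == *g2
                && *file_offset + *length as u64 == *f2 =>
            {
                // Contiguous data run: just grow the previous range.
                *length += *l2;
                return;
            }
            (
                ReadRange::Unmapped { guest_offset, length },
                ReadRange::Unmapped { guest_offset: g2, length: l2 },
            ) if *guest_offset + *length as u64 == *g2 => {
                // Contiguous unmapped run: grow it too.
                *length += *l2;
                return;
            }
            _ => {}
        }
    }
    ranges.push(next);
}
```

With this helper, the per-page loop can keep its page-boundary structure while still emitting one range per logical run.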
> parser. Fixed VHD1 images also work on all platforms. Dynamic and
> differencing VHD1 files are **not** supported — convert them to VHDX
> first:
The Guide style typically prefers explicit shell language tags (e.g., bash) over shell for code fences to improve highlighting consistency. Consider changing the fence to bash (and ensuring line-wrapping stays within the style guide’s preferred width).
```rust
let sbm_mapping = self.bat.get_sector_bitmap_mapping(chunk_number);
assert!(
    sbm_mapping.bat_state() == BatEntryState::FullyPresent,
    "SBM for chunk {chunk_number} must be allocated for PartiallyPresent block"
);
```
These assert! checks can panic when opening/reading a malformed or adversarial VHDX file, which violates the project’s trust-boundary requirement of not panicking on untrusted input. Prefer returning a typed error (e.g., a corruption/runtime I/O error) instead of panicking, and apply the same change to the identical assert! in set_sector_bitmap_bits().
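The non-panicking shape could look like the sketch below. The enum and error names are simplified stand-ins for the crate's real `BatEntryState` and corruption error types; only the assert-to-`Result` conversion is the point.

```rust
/// Simplified stand-in for the crate's BAT entry state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatEntryState {
    NotPresent,
    FullyPresent,
}

/// Simplified stand-in for a typed corruption error.
#[derive(Debug, PartialEq)]
enum VhdxCorruption {
    MissingSectorBitmap { chunk_number: u32 },
}

/// Validate that the sector bitmap for a PartiallyPresent block is
/// allocated. A malformed or adversarial image can reach this state,
/// so return a typed error instead of asserting.
fn require_sector_bitmap(
    state: BatEntryState,
    chunk_number: u32,
) -> Result<(), VhdxCorruption> {
    if state != BatEntryState::FullyPresent {
        return Err(VhdxCorruption::MissingSectorBitmap { chunk_number });
    }
    Ok(())
}
```

The caller then propagates the error with `?`, surfacing the corruption to the I/O layer instead of crashing the process.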
```rust
/// caller's FSN after the in-progress flush finishes, a new flush is
/// issued.
///
/// Returns the FSN that was assigned to this flush request.
```
The docs say flush() returns “the FSN that was assigned”, but flush_until(..., None) returns the resolved/completed FSN (which can be higher than the caller’s assigned FSN under concurrency). Please update the doc comment (and/or method naming) to match the actual semantics, since callers may rely on the return value for ordering.
Suggested change:

```rust
/// Returns the completed FSN of the flush that satisfied this request.
/// Under concurrency this may be greater than the FSN originally assigned
/// to this caller, because a single completed flush can cover multiple
/// requests.
```
```rust
// Read from the VHDX file into a temporary buffer.
let mut data = vec![0u8; length as usize];
self.file
    .read_at(file_offset, &mut data)
    .await
    .map_err(DiskError::Io)?;
```
This creates an allocation per ReadRange::Data and then calls BlockingFile::read_at, which itself allocates internally (due to blocking::unblock needing 'static), resulting in avoidable double allocation/copy per range. Consider adding an internal BlockingFile API that returns an owned buffer (so the internal allocation can be reused instead of allocating here), or restructure the read path to minimize per-range allocations (e.g., reuse a scratch buffer for small reads).
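A possible shape for an owned-buffer API, sketched here with synchronous std file I/O in place of the real `BlockingFile`/`blocking::unblock` plumbing (the function name and signature are assumptions, and the sketch assumes a Unix target for `FileExt`):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // sketch assumes a Unix target

/// Hypothetical owned-buffer read: the caller hands over a Vec sized
/// to the read, gets it back filled. In the real code the body would
/// run inside blocking::unblock; owning `buf` satisfies the closure's
/// 'static bound without a second allocation and copy.
fn read_at_owned(file: &File, offset: u64, mut buf: Vec<u8>) -> io::Result<Vec<u8>> {
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```

The caller can then recycle the returned `Vec` across ranges, so only the first read of a given size pays for an allocation.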
```rust
// Read data from the request buffers.
let mut data = vec![0u8; length as usize];
buffers
    .subrange(buf_offset, length as usize)
    .reader()
    .read(&mut data)?;

// Write to the VHDX file at the resolved offset.
self.file
    .write_at(file_offset, &data)
    .await
    .map_err(DiskError::Io)?;
```
Similar to reads, this allocates a fresh Vec for every WriteRange::Data. On top of that, BlockingFile::write_at() currently clones the provided slice into another Vec to move into blocking::unblock, doubling memory traffic. Consider adding a BlockingFile::write_at_owned(Vec<u8>) (or similar) so this path can move data into the blocking closure without an extra copy.
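A `write_at_owned` could mirror the read-side sketch, again shown with synchronous std I/O standing in for the blocking-pool plumbing (the name and signature are assumptions; the sketch assumes a Unix target):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // sketch assumes a Unix target

/// Hypothetical BlockingFile::write_at_owned: taking the buffer by
/// value lets it move into the 'static blocking closure directly,
/// avoiding the extra Vec clone that a &[u8] signature forces.
fn write_at_owned(file: &File, offset: u64, data: Vec<u8>) -> io::Result<()> {
    // In the real code, `data` would move into blocking::unblock here.
    file.write_all_at(&data, offset)
}
```

Since the WriteRange path already builds an owned `Vec` from the request buffers, handing it off by value removes the second copy entirely.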
> - The crate currently provides a low-level I/O resolution API. A
>   higher-level `DiskIo` integration layer is needed to use it as a
>   drop-in storage backend in the OpenVMM pipeline.
> - Differencing disk chains require the caller to manage parent
>   resolution (the crate parses parent locators but does not open
>   parent files).
This section appears outdated relative to the PR: disklayer_vhdx provides an OpenVMM integration layer, and disklayer_vhdx::chain::open_vhdx_chain implements parent locator walking. Please update these bullets to reflect the new integration and clarify what remains the responsibility of the vhdx crate vs. the OpenVMM disk layer.
Suggested change:

> - The `vhdx` crate provides the low-level VHDX format
>   implementation and I/O resolution API. For OpenVMM integration,
>   `disklayer_vhdx` supplies the `DiskIo`-compatible backend layer
>   used in the storage pipeline.
> - For differencing disks, the `vhdx` crate parses parent locator
>   metadata, while the OpenVMM disk layer can walk and open parent
>   chains via `disklayer_vhdx::chain::open_vhdx_chain`.
> platforms via the pure-Rust [`vhdx`](../../reference/backends/vhdx.md)
> parser. Fixed VHD1 images also work on all platforms. Dynamic and
> differencing VHD1 files are **not** supported — convert them to VHDX
> first:
This claims OpenVMM uses the pure-Rust vhdx parser on all platforms, but the code changes in openvmm_helpers/src/disk.rs indicate Windows still uses the kernel-mode VHD path for .vhdx while Linux uses disklayer_vhdx. Please either adjust the docs to match the current platform behavior, or update the Windows open path to also use disklayer_vhdx so the statement is accurate.
Suggested change:

> platforms. On Linux, VHDX support uses the pure-Rust
> [`vhdx`](../../reference/backends/vhdx.md) parser via
> `disklayer_vhdx`. On Windows, `.vhdx` files currently use the native
> VHD open path instead. Fixed VHD1 images also work on all platforms.
> Dynamic and differencing VHD1 files are **not** supported — convert
> them to VHDX first: