nvme_driver: Added logging and updated panics to GuestIOFailure when there isn't enough memory #3359
gurasinghMS wants to merge 13 commits into microsoft:main from
Conversation
Pull request overview
This PR aims to prevent crashes in the NVMe disk backend when persistent reservation report handling requires more DMA-buffer memory than is available, by adding diagnostics and converting a previous panic into a recoverable error surfaced to the guest.
Changes:
- Added `RequestError::InsufficientMemory` and used it instead of panicking when DMA pool allocation fails in `issue_in`/`issue_out`.
- Added an `info` log when the reservation report buffer is resized to a larger size.
- Plumbed the new NVMe driver error up into `disk_backend::DiskError` via `disk_nvme` error mapping.
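The plumbing in the changes above can be sketched in isolation. The variant names come from the file list; the enum shapes and the `map_to_disk_error` helper are illustrative assumptions, not the PR's actual definitions.

```rust
// Hedged sketch of the error plumbing: a new driver-level variant that is
// returned instead of panicking, and its translation into the disk backend's
// error type. Enum shapes and `map_to_disk_error` are assumptions.

#[derive(Debug)]
enum RequestError {
    // Returned when DMA pool allocation fails in issue_in/issue_out.
    InsufficientMemory,
}

#[derive(Debug)]
enum DiskError {
    // Surfaced to the guest as an IO failure rather than a VM crash.
    InsufficientMemory,
}

// disk_nvme would translate driver errors into backend errors roughly so:
fn map_to_disk_error(e: RequestError) -> DiskError {
    match e {
        RequestError::InsufficientMemory => DiskError::InsufficientMemory,
    }
}

fn main() {
    let mapped = map_to_disk_error(RequestError::InsufficientMemory);
    assert!(matches!(mapped, DiskError::InsufficientMemory));
    println!("mapped: {mapped:?}");
}
```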
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| vm/devices/storage/disk_nvme/src/lib.rs | Maps the new NVMe driver InsufficientMemory error to a disk-backend error. |
| vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs | Introduces RequestError::InsufficientMemory and returns it on allocation failure instead of panicking. |
| vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs | Logs when the reservation report buffer is resized. |
| vm/devices/storage/disk_backend/src/lib.rs | Adds DiskError::InsufficientMemory (but currently also introduces a duplicate variant). |
```diff
     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .unwrap_or(return Err(RequestError::InsufficientMemory));
```
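One editorial note on the replacement line: in Rust, `unwrap_or` evaluates its argument eagerly, so `unwrap_or(return Err(...))` executes the `return` before the allocation result is ever inspected, failing even when allocation succeeds. A standalone model (the `alloc_bytes` signature here is an assumption; the real call is async and pool-backed) shows the behavior and an `ok_or`-based alternative:

```rust
// Models why `unwrap_or(return ...)` is a trap, and the idiomatic fix.
// `alloc_bytes` here is a stand-in returning Option, not the driver's API.

#[derive(Debug, PartialEq)]
enum RequestError {
    InsufficientMemory,
}

// Stand-in allocator: fails when the request exceeds pool capacity.
fn alloc_bytes(len: usize, capacity: usize) -> Option<usize> {
    if len <= capacity { Some(len) } else { None }
}

// Arguments are evaluated eagerly: `return Err(...)` runs before
// `unwrap_or` is called, so this function fails unconditionally.
fn issue_buggy(len: usize) -> Result<usize, RequestError> {
    let n = alloc_bytes(len, 4096).unwrap_or(return Err(RequestError::InsufficientMemory));
    Ok(n)
}

// Idiomatic: convert `None` into the error and propagate with `?`.
fn issue_fixed(len: usize) -> Result<usize, RequestError> {
    let n = alloc_bytes(len, 4096).ok_or(RequestError::InsufficientMemory)?;
    Ok(n)
}

fn main() {
    assert_eq!(issue_buggy(16), Err(RequestError::InsufficientMemory)); // always fails
    assert_eq!(issue_fixed(16), Ok(16));
    assert_eq!(issue_fixed(1 << 20), Err(RequestError::InsufficientMemory));
    println!("ok");
}
```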
@mattkur If we start reporting this as a guest IO error instead, I am worried it might just slip under the radar in the future (as in, no tickets will be generated when this is hit). If the device is not lying about the reservation report size, then future commands will continue to fail. What's our feedback loop in that case, so that we can investigate the device / why the report size was so large?
This failure occurs when we're out of memory and are unable to process commands. You are right that this may become more hidden, but we can expect two avenues:
(1) please work internally to enable monitoring of this error in production. This means we will receive incidents when this is sent in prod.
(2) if this breaks customers, they will complain too. Obviously it's much better if we notice this ourselves, but we will also get customer signals. Those customers are going to be much happier with IO errors than they will be with VM crashes.
```diff
     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .unwrap_or(return Err(RequestError::InsufficientMemory));
```
Why not include the amount of requested memory here?
(The log above will do a good job if we are in this path because we needed more than one page, but not if we fail to alloc a single page.)
- Good point! Looks like only the `PageAllocator` has access to the max number of pages, and `data.len()` is also converted to a number of pages. I added a `tracing::warn!`. Ok to leave the enum variant empty in that case?
- Can we even fail to alloc a single page? `alloc_pages` only fails under one condition: when the number of pages requested is more than the max. Once past that check, it either never returns at all or returns properly.
```diff
-        return None;
+        return Err(PageAllocationError {
+            requested: n,
+            available: self.max,
```
The new allocator error info is useful, but `available` is set to `self.max` even though `alloc_pages` explicitly reserves one page (so the largest satisfiable request is `self.max - 1`). Consider reporting the actual maximum-requestable pages (or rename the field to `capacity_pages`) to avoid misleading logs like `max_pages = e.available`.
```diff
-            available: self.max,
+            available: self.max.saturating_sub(1),
```
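The suggestion can be modeled in isolation. This sketch assumes `alloc_pages` holds one page back internally, as the comment states; the struct internals are illustrative, and the real allocator reportedly blocks rather than erroring once the bound check passes.

```rust
// Models the reviewer's point: if one page is reserved internally, the
// largest satisfiable request is max - 1, so the error should report that
// rather than `self.max`. Internals here are illustrative assumptions.

#[derive(Debug, PartialEq)]
struct PageAllocationError {
    requested: usize,
    available: usize,
}

struct PageAllocator {
    max: usize,
}

impl PageAllocator {
    fn alloc_pages(&self, n: usize) -> Result<usize, PageAllocationError> {
        // One page is held back, so n == max is already too large.
        if n >= self.max {
            return Err(PageAllocationError {
                requested: n,
                available: self.max.saturating_sub(1),
            });
        }
        // (Real allocator: waits for pages, then returns the allocation.)
        Ok(n)
    }
}

fn main() {
    let pool = PageAllocator { max: 8 };
    assert_eq!(pool.alloc_pages(7), Ok(7));
    let err = pool.alloc_pages(8).unwrap_err();
    assert_eq!(err.available, 7); // the real maximum-requestable count
    println!("ok");
}
```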
We saw some crashes surfacing because there wasn't enough memory for persistent reservations when `issue_out` is invoked. The current theory is that the device reports a really large size when the `Namespace::reservation_report_extended` function is invoked. We then attempt to dynamically allocate enough memory to read from the device, going beyond the default of 1 page. Since the requested size cannot fit in the PR memory, we crash.

This PR does 2 things:
1. Converts the panic on DMA allocation failure into a recoverable error surfaced to the guest.
2. Adds logging when the reservation report buffer is resized.
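The dynamic growth described here can be sketched as a page-count calculation plus a log when the default one-page buffer is outgrown. `PAGE`, `required_pages`, and the log text are illustrative assumptions, not the driver's actual code.

```rust
// Models sizing the reservation report buffer from the device-reported
// length, noting when it grows past the default single page.
const PAGE: usize = 4096;

// Round the reported byte length up to whole pages (minimum one page).
fn required_pages(reported_len: usize) -> usize {
    ((reported_len + PAGE - 1) / PAGE).max(1)
}

fn main() {
    assert_eq!(required_pages(100), 1);
    assert_eq!(required_pages(PAGE), 1);
    assert_eq!(required_pages(PAGE + 1), 2);

    let pages = required_pages(3 * PAGE + 1);
    if pages > 1 {
        // In the driver this would be a tracing::info! event.
        println!("resizing reservation report buffer to {pages} pages");
    }
    assert_eq!(pages, 4);
}
```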