
nvme_driver: Added logging and updated panics to GuestIOFailure when there isn't enough memory #3359

Open
gurasinghMS wants to merge 13 commits into microsoft:main from gurasinghMS:log-request-failures

Conversation

@gurasinghMS
Contributor

We saw some crashes surfacing because there wasn't enough memory in the persistent reservation (PR) pool when issue_out is invoked. The current theory is that the device reports a very large size when Namespace::reservation_report_extended is invoked. We then attempt to dynamically allocate enough memory to read from the device, going beyond the default of 1 page. Since the requested size cannot fit in the PR memory, we crash.
This PR does two things:

  1. Adds logging when the data buffer is resized so that we can track the resized data buffer size.
  2. Changes behavior from panic to guest IO failure when the reservation report cannot be read.
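A minimal sketch of the intended pattern (the types and pool shape here are hypothetical stand-ins, not the actual driver code): the panicking `.expect` on the DMA allocation is replaced with a recoverable `RequestError::InsufficientMemory` that can be surfaced to the guest.

```rust
// Hypothetical sketch: a pool whose capacity can be exceeded, and a
// request path that surfaces allocation failure as an IO error instead
// of panicking. Names mirror the PR, but the types are stand-ins.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    InsufficientMemory,
}

pub struct PagePool {
    pub max_pages: usize,
}

impl PagePool {
    const PAGE_SIZE: usize = 4096;

    /// Returns None when the request cannot fit in the pool, which is
    /// the failure mode described in the PR discussion.
    pub fn alloc_bytes(&self, len: usize) -> Option<Vec<u8>> {
        let pages = (len + Self::PAGE_SIZE - 1) / Self::PAGE_SIZE;
        if pages > self.max_pages {
            None
        } else {
            Some(vec![0u8; len])
        }
    }
}

/// Before the change this path panicked via `.expect(...)`; now the
/// allocation failure is mapped to a guest-visible error instead.
pub fn issue_out(pool: &PagePool, data: &[u8]) -> Result<(), RequestError> {
    let _buf = pool
        .alloc_bytes(data.len())
        .ok_or(RequestError::InsufficientMemory)?;
    // ... build the PRP entries and submit the command here ...
    Ok(())
}

fn main() {
    let pool = PagePool { max_pages: 1 };
    // A request within the 1-page cap succeeds.
    assert!(issue_out(&pool, &[0u8; 512]).is_ok());
    // An oversized reservation report now fails gracefully.
    assert_eq!(
        issue_out(&pool, &[0u8; 2 * 4096]),
        Err(RequestError::InsufficientMemory)
    );
}
```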

Copilot AI review requested due to automatic review settings April 22, 2026 21:06
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to prevent crashes in the NVMe disk backend when persistent reservation report handling requires more DMA-buffer memory than is available, by adding diagnostics and converting a previous panic into a recoverable error surfaced to the guest.

Changes:

  • Added RequestError::InsufficientMemory and used it instead of panicking when DMA pool allocation fails in issue_in/issue_out.
  • Added an info-level log when the reservation report buffer is grown beyond its previous size.
  • Plumbed the new NVMe driver error up into disk_backend::DiskError via disk_nvme error mapping.
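The plumbing in the last bullet can be sketched as a straightforward mapping (variant names other than InsufficientMemory are hypothetical; the real enums have many more variants):

```rust
// Hypothetical sketch of mapping the driver-level error into the
// disk-backend error, as the disk_nvme mapping does for the new variant.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    InsufficientMemory,
    Aborted, // stand-in for the other driver error variants
}

#[derive(Debug, PartialEq)]
pub enum DiskError {
    InsufficientMemory,
    Io, // stand-in for the other backend error variants
}

// The new driver variant maps one-to-one onto the new backend variant.
pub fn map_request_error(err: RequestError) -> DiskError {
    match err {
        RequestError::InsufficientMemory => DiskError::InsufficientMemory,
        RequestError::Aborted => DiskError::Io,
    }
}

fn main() {
    assert_eq!(
        map_request_error(RequestError::InsufficientMemory),
        DiskError::InsufficientMemory
    );
    assert_eq!(map_request_error(RequestError::Aborted), DiskError::Io);
}
```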

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File: vm/devices/storage/disk_nvme/src/lib.rs
  Maps the new NVMe driver InsufficientMemory error to a disk-backend error.
File: vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs
  Introduces RequestError::InsufficientMemory and returns it on allocation failure instead of panicking.
File: vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs
  Logs when the reservation report buffer is resized.
File: vm/devices/storage/disk_backend/src/lib.rs
  Adds DiskError::InsufficientMemory (but currently also introduces a duplicate variant).

Comment thread vm/devices/storage/disk_backend/src/lib.rs
Comment thread vm/devices/storage/disk_backend/src/lib.rs
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Outdated
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Outdated
     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .ok_or(RequestError::InsufficientMemory)?;
Contributor Author


@mattkur If we start reporting this as a guest IO error instead, I am worried it might just slip under the radar in the future (as in, no tickets will be generated when this is hit). If the device is not lying about the reservation report size, then future commands will continue to fail. What's our feedback loop in that case, so that we can investigate the device and why the report size was so large?

Contributor


This failure occurs when we're out of memory and are unable to process commands. You are right that this may become more hidden, but we can expect two avenues:

(1) Please work internally to enable monitoring of this error in production. This means we will receive incidents when this is sent in prod.
(2) If this breaks customers, they will complain too. Obviously it's much better if we notice this ourselves, but we will also get customer signals. Those customers are going to be much happier with IO errors than they would be with VM crashes.

     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .ok_or(RequestError::InsufficientMemory)?;
Contributor


Why not include the amount of requested memory here?

Contributor


(The log above will do a good job if we are in this path because we needed more than one page, but not if we fail to alloc a single page.)

Contributor Author


  1. Good point! It looks like only the PageAllocator has access to the max number of pages, and data.len() is also converted to a page count there. I added a tracing::warn!. Is it OK to leave the enum variant empty in that case?

  2. Can we even fail to alloc a single page? alloc_pages only fails under one condition: when the number of pages requested is more than the max. Past that check, it either never returns at all or returns successfully.
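The claim in point 2 can be illustrated with a minimal allocator sketch (hypothetical types; the real PageAllocator awaits until enough pages are free rather than returning immediately):

```rust
// Hypothetical sketch: the only synchronous failure path is a request
// that exceeds the pool maximum. Any smaller request eventually succeeds
// (in the real driver, by waiting; here, immediately).
#[derive(Debug, PartialEq)]
pub struct PageAllocationError {
    pub requested: usize,
    pub available: usize,
}

pub struct PageAllocator {
    pub max: usize,
}

impl PageAllocator {
    pub fn alloc_pages(&self, n: usize) -> Result<Vec<usize>, PageAllocationError> {
        if n > self.max {
            // The request can never fit, so fail up front with context.
            return Err(PageAllocationError {
                requested: n,
                available: self.max,
            });
        }
        // The real driver would await free pages here before returning.
        Ok((0..n).collect())
    }
}

fn main() {
    let alloc = PageAllocator { max: 4 };
    // A single page within the cap never fails synchronously.
    assert!(alloc.alloc_pages(1).is_ok());
    // Only an over-max request produces an error, with both sizes attached.
    assert_eq!(
        alloc.alloc_pages(5),
        Err(PageAllocationError { requested: 5, available: 4 })
    );
}
```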

Copilot AI review requested due to automatic review settings April 22, 2026 21:52
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Outdated
Comment thread vm/devices/user_driver/src/page_allocator.rs Outdated
Comment thread vm/devices/user_driver/src/page_allocator.rs Outdated
Comment thread vm/devices/storage/disk_backend/src/lib.rs Outdated
Copilot AI review requested due to automatic review settings April 22, 2026 22:56
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Outdated
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs
-        return None;
+        return Err(PageAllocationError {
+            requested: n,
+            available: self.max,
+        });

Copilot AI Apr 22, 2026


The new allocator error info is useful, but available is set to self.max even though alloc_pages explicitly reserves one page (so the largest satisfiable request is self.max - 1). Consider reporting the actual maximum-requestable pages (or rename the field to capacity_pages) to avoid misleading logs like max_pages = e.available.

Suggested change:
-available: self.max,
+available: self.max.saturating_sub(1),
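The distinction Copilot raises can be checked directly (values here are illustrative): if one page is reserved internally, self.max is the pool capacity but max - 1 is the largest satisfiable request, and saturating_sub keeps the reported value well defined even for a degenerate zero-capacity pool.

```rust
fn main() {
    let max: usize = 8;
    // With one page reserved, the largest request that can ever be
    // satisfied is one less than the pool capacity.
    assert_eq!(max.saturating_sub(1), 7);
    // saturating_sub avoids usize underflow (which would panic in debug
    // builds with a plain subtraction) for an empty pool.
    assert_eq!(0usize.saturating_sub(1), 0);
}
```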

Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs
Copilot AI review requested due to automatic review settings April 22, 2026 23:11
@gurasinghMS gurasinghMS marked this pull request as ready for review April 22, 2026 23:12
@gurasinghMS gurasinghMS requested review from a team as code owners April 22, 2026 23:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Outdated
Copilot AI review requested due to automatic review settings April 22, 2026 23:29
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Comment thread vm/devices/user_driver/src/page_allocator.rs
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs
Comment thread vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs