nvme_driver: Added logging and updated panics to GuestIOFailure when there isn't enough memory #3359
gurasinghMS wants to merge 13 commits into microsoft:main from
Conversation
Pull request overview
This PR aims to prevent crashes in the NVMe disk backend when persistent reservation report handling requires more DMA-buffer memory than is available, by adding diagnostics and converting a previous panic into a recoverable error surfaced to the guest.
Changes:
- Added `RequestError::InsufficientMemory` and used it instead of panicking when DMA pool allocation fails in `issue_in`/`issue_out`.
- Added an `info` log when the reservation report buffer is resized to a larger size.
- Plumbed the new NVMe driver error up into `disk_backend::DiskError` via `disk_nvme` error mapping.
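The plumbing in the changes above can be sketched in isolation. The variant names come from the file list; the enum shapes and the `map_to_disk_error` helper are illustrative assumptions, not the PR's actual definitions.

```rust
// Hedged sketch of the error plumbing: a new driver-level variant that is
// returned instead of panicking, and its translation into the disk backend's
// error type. Enum shapes and `map_to_disk_error` are assumptions.

#[derive(Debug)]
enum RequestError {
    // Returned when DMA pool allocation fails in issue_in/issue_out.
    InsufficientMemory,
}

#[derive(Debug)]
enum DiskError {
    // Surfaced to the guest as an IO failure rather than a VM crash.
    InsufficientMemory,
}

// disk_nvme would translate driver errors into backend errors roughly so:
fn map_to_disk_error(e: RequestError) -> DiskError {
    match e {
        RequestError::InsufficientMemory => DiskError::InsufficientMemory,
    }
}

fn main() {
    let mapped = map_to_disk_error(RequestError::InsufficientMemory);
    assert!(matches!(mapped, DiskError::InsufficientMemory));
    println!("mapped: {mapped:?}");
}
```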
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| vm/devices/storage/disk_nvme/src/lib.rs | Maps the new NVMe driver InsufficientMemory error to a disk-backend error. |
| vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs | Introduces RequestError::InsufficientMemory and returns it on allocation failure instead of panicking. |
| vm/devices/storage/disk_nvme/nvme_driver/src/namespace.rs | Logs when the reservation report buffer is resized. |
| vm/devices/storage/disk_backend/src/lib.rs | Adds DiskError::InsufficientMemory (but currently also introduces a duplicate variant). |
```diff
     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .unwrap_or(return Err(RequestError::InsufficientMemory));
```
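One editorial note on the replacement line: in Rust, `unwrap_or` evaluates its argument eagerly, so `unwrap_or(return Err(...))` executes the `return` before the allocation result is ever inspected, failing even when allocation succeeds. A standalone model (the `alloc_bytes` signature here is an assumption; the real call is async and pool-backed) shows the behavior and an `ok_or`-based alternative:

```rust
// Models why `unwrap_or(return ...)` is a trap, and the idiomatic fix.
// `alloc_bytes` here is a stand-in returning Option, not the driver's API.

#[derive(Debug, PartialEq)]
enum RequestError {
    InsufficientMemory,
}

// Stand-in allocator: fails when the request exceeds pool capacity.
fn alloc_bytes(len: usize, capacity: usize) -> Option<usize> {
    if len <= capacity { Some(len) } else { None }
}

// Arguments are evaluated eagerly: `return Err(...)` runs before
// `unwrap_or` is called, so this function fails unconditionally.
fn issue_buggy(len: usize) -> Result<usize, RequestError> {
    let n = alloc_bytes(len, 4096).unwrap_or(return Err(RequestError::InsufficientMemory));
    Ok(n)
}

// Idiomatic: convert `None` into the error and propagate with `?`.
fn issue_fixed(len: usize) -> Result<usize, RequestError> {
    let n = alloc_bytes(len, 4096).ok_or(RequestError::InsufficientMemory)?;
    Ok(n)
}

fn main() {
    assert_eq!(issue_buggy(16), Err(RequestError::InsufficientMemory)); // always fails
    assert_eq!(issue_fixed(16), Ok(16));
    assert_eq!(issue_fixed(1 << 20), Err(RequestError::InsufficientMemory));
    println!("ok");
}
```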
@mattkur If we start reporting this as a guest IO error instead, I am worried it might just slip under the radar in the future (as in, no tickets will be generated when this is hit). If the device is not lying about the reservation report size, then future commands will continue to fail. What's our feedback loop in that case, so that we can investigate the device / why the report size was so large?
This failure occurs when we're out of memory and are unable to process commands. You are right that this may become more hidden, but we can expect two avenues:
(1) please work internally to enable monitoring of this error in production. This means we will receive incidents when this is sent in prod.
(2) if this breaks customers, they will complain too. Obviously it's much better if we notice this ourselves, but we will also get customer signals. Those customers are going to be much happier with IO errors than they will be with VM crashes.
```diff
     .alloc_bytes(data.len())
     .await
-    .expect("pool cap is >= 1 page");
+    .unwrap_or(return Err(RequestError::InsufficientMemory));
```
Why not include the amount of requested memory here?
(The log above will do a good job if we are in this path because we needed more than one page, but not if we fail to alloc a single page.)
- Good point! Looks like only the `PageAllocator` has access to the max number of pages, and `data.len()` is also converted to a number of pages. I added a `tracing::warn!`. Ok to leave the enum variant empty in that case?
- Can we even fail to alloc a single page? `alloc_pages` only fails under one condition: when the number of pages requested is more than the max. Once past that check, it either never returns at all or returns properly.
```diff
-        return None;
+        return Err(PageAllocationError {
+            requested: n,
+            available: self.max,
```
The new allocator error info is useful, but `available` is set to `self.max` even though `alloc_pages` explicitly reserves one page (so the largest satisfiable request is `self.max - 1`). Consider reporting the actual maximum-requestable pages (or rename the field to `capacity_pages`) to avoid misleading logs like `max_pages = e.available`.
```diff
-            available: self.max,
+            available: self.max.saturating_sub(1),
```
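The suggestion can be modeled in isolation. This sketch assumes `alloc_pages` holds one page back internally, as the comment states; the struct internals are illustrative, and the real allocator reportedly blocks rather than erroring once the bound check passes.

```rust
// Models the reviewer's point: if one page is reserved internally, the
// largest satisfiable request is max - 1, so the error should report that
// rather than `self.max`. Internals here are illustrative assumptions.

#[derive(Debug, PartialEq)]
struct PageAllocationError {
    requested: usize,
    available: usize,
}

struct PageAllocator {
    max: usize,
}

impl PageAllocator {
    fn alloc_pages(&self, n: usize) -> Result<usize, PageAllocationError> {
        // One page is held back, so n == max is already too large.
        if n >= self.max {
            return Err(PageAllocationError {
                requested: n,
                available: self.max.saturating_sub(1),
            });
        }
        // (Real allocator: waits for pages, then returns the allocation.)
        Ok(n)
    }
}

fn main() {
    let pool = PageAllocator { max: 8 };
    assert_eq!(pool.alloc_pages(7), Ok(7));
    let err = pool.alloc_pages(8).unwrap_err();
    assert_eq!(err.available, 7); // the real maximum-requestable count
    println!("ok");
}
```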
We saw some crashes surfacing because there wasn't enough memory for persistent reservations when `issue_out` is invoked. The current theory is that the device reports a really large size when the `Namespace::reservation_report_extended` function is invoked. We then attempt to dynamically allocate enough memory to read from the device, going beyond the default of 1 page. Since the requested size cannot fit in the PR memory, we crash.

This PR does 2 things:
1. Converts the panic on DMA allocation failure into a recoverable error surfaced to the guest.
2. Adds logging when the reservation report buffer is resized.
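The dynamic growth described here can be sketched as a page-count calculation plus a log when the default one-page buffer is outgrown. `PAGE`, `required_pages`, and the log text are illustrative assumptions, not the driver's actual code.

```rust
// Models sizing the reservation report buffer from the device-reported
// length, noting when it grows past the default single page.
const PAGE: usize = 4096;

// Round the reported byte length up to whole pages (minimum one page).
fn required_pages(reported_len: usize) -> usize {
    ((reported_len + PAGE - 1) / PAGE).max(1)
}

fn main() {
    assert_eq!(required_pages(100), 1);
    assert_eq!(required_pages(PAGE), 1);
    assert_eq!(required_pages(PAGE + 1), 2);

    let pages = required_pages(3 * PAGE + 1);
    if pages > 1 {
        // In the driver this would be a tracing::info! event.
        println!("resizing reservation report buffer to {pages} pages");
    }
    assert_eq!(pages, 4);
}
```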