1. Summary
DRBD 9.3.1 with transport: tcp and net { tls yes; } on Linux 6.18.24
triggers a kernel WARNING in tls_sw_sendmsg, followed by BUG: Bad page state
reports in unrelated subsystems (XFS log buffers) and force-io-failures( no -> yes )
cascades across DRBD resources, on all peers within a 3-minute window. The
faulting page state is consistent with a compound (head) page that lost a
refcount during a MSG_SPLICE_PAGES send through kernel TLS. We have not
measured DRBD 9.2.16 on the same hardware, but code inspection (see section 7)
shows that 9.2.16 lacks the dtt_send_bio(MSG_SPLICE_PAGES) and compound-page
allocation paths introduced in 9.3.1, so the same trigger should not exercise
the same interaction.
2. Environment
| Item | Value |
| --- | --- |
| Kernel | 6.18.24-talos, built with Clang 22.1.2 + LLD 22.1.2 (ThinLTO) |
| Talos Linux | 1.13.0 (released 2026-04-27) |
| DRBD module | 9.3.1 (api:2/proto:118-124, transport:21) |
| Storage stack | ZFS thin zvol -> LUKS -> DRBD -> XFS |
| Network | DRBD over IPv6 mesh, three control-plane nodes (n1, n2, n3) |
3. DRBD configuration (per resource, generated by LINSTOR 1.33.1)
```
options { on-no-data-accessible suspend-io; on-no-quorum suspend-io; quorum off; }
net { cram-hmac-alg sha1; rr-conflict retry-connect; tls yes; verify-alg crc32c; }
disk { rs-discard-granularity 1048576; }
```
4. Reproduction trigger we observed
The cluster ran on Talos 1.13.0 (DRBD 9.3.1, kernel 6.18.24) for 8 to 9 days
without any kernel WARNs in dmesg. The first cascade was driven by a DRBD
peer reconnect/renegotiation, with the following sequence:

| Time UTC | Event |
| --- | --- |
| 19:43:51 | The cluster operator applied a Kubernetes manifest that re-created the LINSTOR StorageClass objects (no parameter changes, but the apply caused LINSTOR to reconcile). |
| 20:07:46 | The LINSTOR controller logged No common DRBD verify algorithm found for 'pvc-...', clearing prop for several resources, then auto-resolved with Drbd-auto-verify-Algo ... automatically set to sha512. This change requires a DRBD peer reconnect/renegotiation. |
| 20:08:14 | First WARN at tls_sw_sendmsg on n1. |
| 20:09:00 | Same WARN on n2. |
| 20:08:27 onward | n3 logs 30+ BUG: Bad page state reports while unmounting/remounting affected XFS volumes. |
| 20:09 onward | DRBD force-io-failures( no -> yes ) propagates to multiple resources; XFS shuts down the volumes; an HA-replicated application using bbolt-on-XFS storage loses one of its peers. |
We are disclosing the exact trigger in case the reproduction turns out to depend
on the specific verify-alg renegotiation. From our reading of the code, any
DRBD reconnect or sustained large write under tls yes should be sufficient,
but you are better placed to confirm.
5. Stack trace 1: WARN at tls_sw_sendmsg (representative; identical on all three nodes)
```
WARNING: CPU: 3 PID: 12436 at ./include/linux/mm.h:1395 tls_sw_sendmsg+0x57d/0xaa0 [tls]
Modules linked in: tls drbd_transport_tcp(O) drbd(O) dm_thin_pool dm_persistent_data
dm_bio_prison zfs(O) ... nvme ... 6.18.24-talos #1 PREEMPT(none)
RIP: 0010:tls_sw_sendmsg+0x57d/0xaa0 [tls]
Call Trace:
 __sock_sendmsg+0x3d/0x90
 sock_sendmsg+0xf4/0x140
 dtt_send_page+0x174/0x270 [drbd_transport_tcp]
 dtt_send_bio+0xc7/0x110 [drbd_transport_tcp]
 drbd_send_dblock+0x703/0x830 [drbd]
 process_one_request+0x1fd/0x350 [drbd]
 drbd_sender+0x113/0x7b0 [drbd]
 drbd_thread_setup+0x8b/0x2e0 [drbd]
 kthread+0x201/0x260
Comm: drbd_s_pvc-2aeb
```
6. Stack trace 2: BUG Bad page state (n3, repeated ~30 times in seconds)
```
BUG: Bad page state in process umount pfn:0x620e80
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8b75a0e83600
head: order:0 mapcount:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
flags: 0x10000000000040(head|node=0|zone=2)
page dumped because: nonzero _refcount
Call Trace:
 __free_frozen_pages+0x63a/0x650
 xlog_dealloc_log+0x40/0x80
 xfs_unmountfs+0xe8/0x140
 xfs_fs_put_super+0x3f/0x110
 generic_shutdown_super+0x7c/0x120
 kill_block_super+0x1b/0x40
 xfs_kill_sb+0x12/0x20
 cleanup_mnt+0x13a/0x190
```
7. Code-level differential between 9.2.16 and 9.3.1 (hypothesis source)
This section is the basis for our hypothesis; we have not run an empirical
A/B on the same hardware.
Comparing drbd-9.2.16...drbd-9.3.1, the relevant 9.3.x changes that land in
drbd_transport_tcp's send path are:

- 032b7c37 "drbd: use compound pages to optimize large I/O" introduces multi-order page allocation in DRBD.
- eed6170c "drbd: Rename send_zc_bio() to send_bio(, MSG_SPLICE_PAGES)" makes the TCP transport send via MSG_SPLICE_PAGES.
- The TCP transport now uses compound_order(page) directly to size sends (visible in the dtt_recv_bio rewrite), confirming that compound pages flow through this code path.
- 344e52ae "drbd: disable compound page allocation on kernels without multi-page bvec" adds a guard, but only for older kernels. Linux 6.18.24 has multi-page bvec, so the guard does not suppress the path for us.
DRBD 9.2.16 has none of the above. It allocates 4 KiB pages via
drbd_alloc_page_chain and sends them one page at a time via the older
send_zc_bio path, which does not use MSG_SPLICE_PAGES.
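To make the interaction concrete, here is a minimal sketch of the two send styles; it is illustrative only, not DRBD code. The kernel symbols (bvec_set_page, iov_iter_bvec, sock_sendmsg, kernel_sendmsg, MSG_SPLICE_PAGES) are real; both function names are ours, and the copy-based variant is shown only as a conservative alternative, not as a claim about how 9.2.16's send_zc_bio actually works.

```c
/*
 * Illustrative sketch only -- not DRBD source. Shows (a) a splice-style send
 * that hands a whole, possibly compound, page to the socket so the protocol
 * layer (kTLS sw tx in our case) manages page references itself, and (b) a
 * copy-based send that never passes page references across the socket.
 */
#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* (a) splice-style, as the 9.3.x TCP transport reportedly does */
static int send_spliced(struct socket *sock, struct page *page,
			unsigned int offset, size_t len)
{
	struct bio_vec bvec;
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES | MSG_NOSIGNAL,
	};

	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	/* with kTLS attached this lands in tls_sw_sendmsg() */
	return sock_sendmsg(sock, &msg);
}

/* (b) copy-based fallback: no page refcounts are handed to the TLS layer */
static int send_copied(struct socket *sock, void *data, size_t len)
{
	struct kvec iov = { .iov_base = data, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };

	return kernel_sendmsg(sock, &msg, &iov, 1, len);
}
```

In (a), the splice contract is that the protocol takes and later drops its own references on the supplied page; the hypothesis in section 8 is that tls_sw_sendmsg gets that bookkeeping wrong for compound heads.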
8. Hypothesis
The combination of:

- DRBD 9.3.1 allocating compound (head) pages in the TCP transport,
- passing them to sock_sendmsg(MSG_SPLICE_PAGES), and
- the socket having kTLS attached (tls yes in the DRBD net config)

This combination leads tls_sw_sendmsg to drop a refcount on the wrong page during
fragmentation, leaving the head with entire_mapcount=1 and refcount=1.
The same class of issue motivated the upstream sendpages_ok() checks for
nvme-tcp (Ubuntu LP #2093871, kernel commit a1d2aa48c6).
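For context, the upstream helpers referenced above are small; paraphrased from include/linux/net.h as we recall them (please verify against the exact kernel tree), they only reject slab-backed or zero-refcount pages:

```c
/* Paraphrased from include/linux/net.h on recent kernels; check the exact
 * source before relying on this. */
static inline bool sendpage_ok(struct page *page)
{
	return !PageSlab(page) && page_count(page) >= 1;
}

/* Added for nvme-tcp: validate every 4 KiB page covered by the send. */
static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
{
	struct page *p = page + (offset >> PAGE_SHIFT);
	size_t count = 0;

	while (count < len) {
		if (!sendpage_ok(p))
			return false;
		p++;
		count += PAGE_SIZE;
	}
	return true;
}
```

Note that a healthy compound head would pass these checks, so a DRBD-side guard probably needs the additional kTLS condition described in section 9 rather than sendpages_ok() alone.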
9. Suggested fix paths (you are better placed to choose)
- Add a sendpages_ok()-style validation in dtt_send_bio / dtt_send_page before calling sock_sendmsg(MSG_SPLICE_PAGES), falling back to a per-page split for compound pages, and falling back unconditionally for sockets with kTLS attached (tls_get_ctx() probe); a rough sketch follows this list.
- Disable compound page allocation in the TCP transport whenever the socket has kTLS attached, regardless of the multi-page bvec capability.
- Backport the 9.3.2 RDMA dtr_send_page page-leak fix to the TCP transport if the underlying refcount mishandling has the same shape.
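As an untested illustration of the first option, the guard we had in mind is roughly the following. dtt_socket_has_ktls() and dtt_splice_is_safe() are hypothetical placeholder names, not existing DRBD functions; tls_get_ctx(), sendpages_ok() and MSG_SPLICE_PAGES are the existing kernel pieces.

```c
/* Hypothetical sketch of fix option 1 -- not a tested patch. */
#include <linux/net.h>
#include <net/sock.h>
#include <net/tls.h>

/* The report suggests a tls_get_ctx() probe. A robust implementation should
 * also confirm that the attached ULP really is "tls" before trusting
 * icsk_ulp_data; this sketch assumes it is. */
static bool dtt_socket_has_ktls(struct sock *sk)
{
	return tls_get_ctx(sk) != NULL;
}

/* Decide whether MSG_SPLICE_PAGES may be used for this page range; otherwise
 * the caller would fall back to a per-page (or copy-based) send. */
static bool dtt_splice_is_safe(struct sock *sk, struct page *page,
			       size_t len, size_t offset)
{
	if (dtt_socket_has_ktls(sk))
		return false;		/* unconditional fallback under kTLS */
	return sendpages_ok(page, len, offset);
}
```

The fallback itself could reuse a copy-based send like the one sketched in section 7, or split the compound page into order-0 chunks before splicing.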
10. Security framing (no CVE requested)
We are not requesting a CVE assignment because we will not publish exploit
details, but we want to make sure the impact framing is clear:
| Axis | Severity |
| --- | --- |
| Confidentiality | Low. Leaked compound pages we observed were reused by the XFS log allocator and immediately panicked; we have no evidence of data disclosure. A theoretical use-after-free path exists but would require precise allocator-timing control. |
| Integrity | Low to Medium. XFS shutdowns triggered by force-io-failures recover via journal replay on remount. One HA peer's bbolt mmap took a SIGBUS and required application-level re-replication from a healthy peer; no silent data loss was observed. |
| Availability | Medium to High. Confirmed cluster-wide impact within a single 3-minute window. Repeats on every DRBD reconnect/renegotiation while tls yes and DRBD 9.3.1 are in place. |
Any caller that exercises the DRBD send path with kTLS attached can trigger
the WARN. This includes ordinary write-heavy in-cluster workloads when DRBD
peers reconnect (network blip, peer restart, satellite restart, LINSTOR
property change). Pre-authentication external network triggering is not
plausible because DRBD performs cram-hmac-alg authentication before
application-layer traffic, and the bug is on the application-layer send
path, not the handshake.
11. Mitigation pending the fix
We are not yet downgrading; we are observing for a week and gating a Talos
1.13.0 -> 1.12.7 rollback (which would replace DRBD 9.3.1 with 9.2.16 on
the same kernel) on whether we hit further recurrences. We can capture
additional traces, run a debug or instrumented build, or test a candidate
patch on a non-prod replica of the cluster on request. Full kernel logs
covering the cascade window from all three nodes are available.