1. Summary
DRBD 9.3.1 with transport: tcp and net { tls yes; } on Linux 6.18.24
triggers a kernel WARNING in tls_sw_sendmsg, followed by BUG: Bad page state
reports in unrelated subsystems (XFS log buffers) and force-io-failures( no -> yes )
cascades across DRBD resources, on all peers within a 3-minute window. The
faulting page state is consistent with a compound (head) page that lost a
refcount during a MSG_SPLICE_PAGES send through kernel TLS. We have not
measured DRBD 9.2.16 on the same hardware, but code inspection (see section 7)
shows that 9.2.16 lacks the dtt_send_bio(MSG_SPLICE_PAGES) and compound-page
allocation paths introduced in 9.3.1, so the same trigger should not exercise
the same interaction.
2. Environment
| Item | Value |
| --- | --- |
| Kernel | 6.18.24-talos, built with Clang 22.1.2 + LLD 22.1.2 (ThinLTO) |
| Talos Linux | 1.13.0 (released 2026-04-27) |
| DRBD module | 9.3.1 (api:2/proto:118-124, transport:21) |
| Storage stack | ZFS thin zvol -> LUKS -> DRBD -> XFS |
| Network | DRBD over IPv6 mesh, three control-plane nodes (n1, n2, n3) |
3. DRBD configuration (per resource, generated by LINSTOR 1.33.1)
```
options { on-no-data-accessible suspend-io; on-no-quorum suspend-io; quorum off; }
net { cram-hmac-alg sha1; rr-conflict retry-connect; tls yes; verify-alg crc32c; }
disk { rs-discard-granularity 1048576; }
```
4. Reproduction trigger we observed
The cluster ran on Talos 1.13.0 (DRBD 9.3.1, kernel 6.18.24) for 8 to 9 days
without any kernel WARNs in dmesg. The first cascade was driven by a DRBD
peer reconnect/renegotiation, with the following sequence:

| Time UTC | Event |
| --- | --- |
| 19:43:51 | The cluster operator applied a Kubernetes manifest that re-created the LINSTOR StorageClass objects (no parameter changes, but the apply caused LINSTOR to reconcile). |
| 20:07:46 | The LINSTOR controller logged No common DRBD verify algorithm found for 'pvc-...', clearing prop for several resources, then auto-resolved with Drbd-auto-verify-Algo ... automatically set to sha512. This change requires a DRBD peer reconnect/renegotiation. |
| 20:08:14 | First WARN at tls_sw_sendmsg on n1. |
| 20:09:00 | Same WARN on n2. |
| 20:08:27 onward | n3 logs 30+ BUG: Bad page state reports while unmounting/remounting affected XFS volumes. |
| 20:09 onward | DRBD force-io-failures( no -> yes ) propagates to multiple resources; XFS shuts down the volumes; an HA-replicated application using bbolt-on-XFS storage loses one of its peers. |
We are disclosing the exact trigger in case the reproduction turns out to depend
on the specific verify-alg renegotiation. From our reading of the code, any
DRBD reconnect or sustained large write under tls yes should be sufficient,
but you are better placed to confirm.
5. Stack trace 1: WARN at tls_sw_sendmsg (representative; identical on all three nodes)
```
WARNING: CPU: 3 PID: 12436 at ./include/linux/mm.h:1395 tls_sw_sendmsg+0x57d/0xaa0 [tls]
Modules linked in: tls drbd_transport_tcp(O) drbd(O) dm_thin_pool dm_persistent_data
dm_bio_prison zfs(O) ... nvme ... 6.18.24-talos #1 PREEMPT(none)
RIP: 0010:tls_sw_sendmsg+0x57d/0xaa0 [tls]
Call Trace:
 __sock_sendmsg+0x3d/0x90
 sock_sendmsg+0xf4/0x140
 dtt_send_page+0x174/0x270 [drbd_transport_tcp]
 dtt_send_bio+0xc7/0x110 [drbd_transport_tcp]
 drbd_send_dblock+0x703/0x830 [drbd]
 process_one_request+0x1fd/0x350 [drbd]
 drbd_sender+0x113/0x7b0 [drbd]
 drbd_thread_setup+0x8b/0x2e0 [drbd]
 kthread+0x201/0x260
Comm: drbd_s_pvc-2aeb
```
6. Stack trace 2: BUG Bad page state (n3, repeated ~30 times in seconds)
```
BUG: Bad page state in process umount pfn:0x620e80
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8b75a0e83600
head: order:0 mapcount:0 entire_mapcount:1 nr_pages_mapped:0 pincount:0
flags: 0x10000000000040(head|node=0|zone=2)
page dumped because: nonzero _refcount
Call Trace:
 __free_frozen_pages+0x63a/0x650
 xlog_dealloc_log+0x40/0x80
 xfs_unmountfs+0xe8/0x140
 xfs_fs_put_super+0x3f/0x110
 generic_shutdown_super+0x7c/0x120
 kill_block_super+0x1b/0x40
 xfs_kill_sb+0x12/0x20
 cleanup_mnt+0x13a/0x190
```
7. Code-level differential between 9.2.16 and 9.3.1 (hypothesis source)
This section is the basis for our hypothesis; we have not run an empirical
A/B on the same hardware.
Comparing drbd-9.2.16...drbd-9.3.1, the relevant 9.3.x changes that land in
drbd_transport_tcp's send path are:

- 032b7c37 "drbd: use compound pages to optimize large I/O" introduces multi-order page allocation in DRBD.
- eed6170c "drbd: Rename send_zc_bio() to send_bio(, MSG_SPLICE_PAGES)" makes the TCP transport send via MSG_SPLICE_PAGES.
- The TCP transport now uses compound_order(page) directly to size sends (visible in the dtt_recv_bio rewrite), confirming that compound pages flow through this code path.
- 344e52ae "drbd: disable compound page allocation on kernels without multi-page bvec" adds a guard, but only for older kernels. Linux 6.18.24 has multi-page bvec, so the guard does not suppress the path for us.
DRBD 9.2.16 has none of the above. It allocates 4 KiB pages via
drbd_alloc_page_chain and sends them one page at a time via the older
send_zc_bio path, which does not use MSG_SPLICE_PAGES.
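To make the interaction concrete, here is a minimal sketch of the two send styles; it is illustrative only, not DRBD code. The kernel symbols (bvec_set_page, iov_iter_bvec, sock_sendmsg, kernel_sendmsg, MSG_SPLICE_PAGES) are real; both function names are ours, and the copy-based variant is shown only as a conservative alternative, not as a claim about how 9.2.16's send_zc_bio actually works.

```c
/*
 * Illustrative sketch only -- not DRBD source. Shows (a) a splice-style send
 * that hands a whole, possibly compound, page to the socket so the protocol
 * layer (kTLS sw tx in our case) manages page references itself, and (b) a
 * copy-based send that never passes page references across the socket.
 */
#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

/* (a) splice-style, as the 9.3.x TCP transport reportedly does */
static int send_spliced(struct socket *sock, struct page *page,
			unsigned int offset, size_t len)
{
	struct bio_vec bvec;
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES | MSG_NOSIGNAL,
	};

	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	/* with kTLS attached this lands in tls_sw_sendmsg() */
	return sock_sendmsg(sock, &msg);
}

/* (b) copy-based fallback: no page refcounts are handed to the TLS layer */
static int send_copied(struct socket *sock, void *data, size_t len)
{
	struct kvec iov = { .iov_base = data, .iov_len = len };
	struct msghdr msg = { .msg_flags = MSG_NOSIGNAL };

	return kernel_sendmsg(sock, &msg, &iov, 1, len);
}
```

In (a), the splice contract is that the protocol takes and later drops its own references on the supplied page; the hypothesis in section 8 is that tls_sw_sendmsg gets that bookkeeping wrong for compound heads.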
8. Hypothesis
The combination of:

- DRBD 9.3.1 allocating compound (head) pages in the TCP transport,
- passing them to sock_sendmsg(MSG_SPLICE_PAGES), and
- the socket having kTLS attached (tls yes in the DRBD net config)

This combination leads tls_sw_sendmsg to drop a refcount on the wrong page during
fragmentation, leaving the head with entire_mapcount=1 and refcount=1.
The same class of issue motivated the upstream sendpages_ok() checks for
nvme-tcp (Ubuntu LP #2093871, kernel commit a1d2aa48c6).
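For context, the upstream helpers referenced above are small; paraphrased from include/linux/net.h as we recall them (please verify against the exact kernel tree), they only reject slab-backed or zero-refcount pages:

```c
/* Paraphrased from include/linux/net.h on recent kernels; check the exact
 * source before relying on this. */
static inline bool sendpage_ok(struct page *page)
{
	return !PageSlab(page) && page_count(page) >= 1;
}

/* Added for nvme-tcp: validate every 4 KiB page covered by the send. */
static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
{
	struct page *p = page + (offset >> PAGE_SHIFT);
	size_t count = 0;

	while (count < len) {
		if (!sendpage_ok(p))
			return false;
		p++;
		count += PAGE_SIZE;
	}
	return true;
}
```

Note that a healthy compound head would pass these checks, so a DRBD-side guard probably needs the additional kTLS condition described in section 9 rather than sendpages_ok() alone.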
9. Suggested fix paths (you are better placed to choose)
- Add a sendpages_ok()-style validation in dtt_send_bio / dtt_send_page before calling sock_sendmsg(MSG_SPLICE_PAGES), falling back to a per-page split for compound pages, and falling back unconditionally for sockets with kTLS attached (tls_get_ctx() probe); a rough sketch follows this list.
- Disable compound page allocation in the TCP transport whenever the socket has kTLS attached, regardless of the multi-page bvec capability.
- Backport the 9.3.2 RDMA dtr_send_page page-leak fix to the TCP transport if the underlying refcount mishandling has the same shape.
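As an untested illustration of the first option, the guard we had in mind is roughly the following. dtt_socket_has_ktls() and dtt_splice_is_safe() are hypothetical placeholder names, not existing DRBD functions; tls_get_ctx(), sendpages_ok() and MSG_SPLICE_PAGES are the existing kernel pieces.

```c
/* Hypothetical sketch of fix option 1 -- not a tested patch. */
#include <linux/net.h>
#include <net/sock.h>
#include <net/tls.h>

/* The report suggests a tls_get_ctx() probe. A robust implementation should
 * also confirm that the attached ULP really is "tls" before trusting
 * icsk_ulp_data; this sketch assumes it is. */
static bool dtt_socket_has_ktls(struct sock *sk)
{
	return tls_get_ctx(sk) != NULL;
}

/* Decide whether MSG_SPLICE_PAGES may be used for this page range; otherwise
 * the caller would fall back to a per-page (or copy-based) send. */
static bool dtt_splice_is_safe(struct sock *sk, struct page *page,
			       size_t len, size_t offset)
{
	if (dtt_socket_has_ktls(sk))
		return false;		/* unconditional fallback under kTLS */
	return sendpages_ok(page, len, offset);
}
```

The fallback itself could reuse a copy-based send like the one sketched in section 7, or split the compound page into order-0 chunks before splicing.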
10. Security framing (no CVE requested)
We are not requesting a CVE assignment because we will not publish exploit
details, but we want to make sure the impact framing is clear:
| Axis | Severity |
| --- | --- |
| Confidentiality | Low. Leaked compound pages we observed were reused by the XFS log allocator and immediately panicked; we have no evidence of data disclosure. A theoretical use-after-free path exists but would require precise allocator-timing control. |
| Integrity | Low to Medium. XFS shutdowns triggered by force-io-failures recover via journal replay on remount. One HA peer's bbolt mmap took a SIGBUS and required application-level re-replication from a healthy peer; no silent data loss was observed. |
| Availability | Medium to High. Confirmed cluster-wide impact within a single 3-minute window. Repeats on every DRBD reconnect/renegotiation while tls yes and DRBD 9.3.1 are in place. |
Any caller that exercises the DRBD send path with kTLS attached can trigger
the WARN. This includes ordinary write-heavy in-cluster workloads when DRBD
peers reconnect (network blip, peer restart, satellite restart, LINSTOR
property change). Pre-authentication external network triggering is not
plausible because DRBD performs cram-hmac-alg authentication before
application-layer traffic, and the bug is on the application-layer send
path, not the handshake.
11. Mitigation pending the fix
We are not yet downgrading; we are observing for a week and gating a Talos
1.13.0 -> 1.12.7 rollback (which would replace DRBD 9.3.1 with 9.2.16 on
the same kernel) on whether we hit further recurrences. We can capture
additional traces, run a debug or instrumented build, or test a candidate
patch on a non-prod replica of the cluster on request. Full kernel logs
covering the cascade window from all three nodes are available.