fix(compositor): export composite-write fence for cross-device frame ordering#146
Conversation
Code Review by Qodo
1. Stale timeline support flag
|
|
Warning Review limit reached
More reviews will be available in 11 minutes and 16 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR redesigns the project policy framework from a flat v1.3 checklist to a v2.0 layered invariant system with five policy domains (Process, Boundary, Resource, Sync, Liveness), while simultaneously implementing DRM timeline-based synchronization in the compositor to enforce the sync policy constraints. The changes establish formal architectural contracts and enable kernel-driven sync point progression. ChangesPolicy Framework & Compositor Timeline Sync
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| // Own a composite-OUTPUT signal timeline so every exported frame carries a producer-write | ||
| // fence (P5), independent of whether the client bound wp_linux_drm_syncobj_v1. Best-effort: | ||
| // absence degrades to the status-quo implicit path (see compositor_present.cpp). | ||
| if (renderer->features.timeline && drm_fd >= 0) { | ||
| present_signal_timeline = wlr_drm_syncobj_timeline_create(drm_fd); | ||
| present_timeline_supported = (present_signal_timeline != nullptr); | ||
| if (present_timeline_supported) { | ||
| GOGGLES_LOG_INFO("Compositor: composite-write signal timeline enabled"); | ||
| } else { | ||
| GOGGLES_LOG_WARN( | ||
| "Compositor: signal timeline creation failed; implicit signal sync fallback"); | ||
| } | ||
| } else { | ||
| GOGGLES_LOG_WARN("Compositor: renderer lacks features.timeline; composite-write fence " | ||
| "export unavailable (implicit signal sync fallback)"); | ||
| } |
There was a problem hiding this comment.
1. Stale timeline support flag 🐞 Bug ☼ Reliability
CompositorState::create_compositor() only updates present_timeline_supported when (renderer->features.timeline && drm_fd >= 0) is true, while teardown() clears present_signal_timeline without resetting the flag. If the compositor is stopped/started and the second init can’t create a timeline, render_surface_to_frame() may treat timeline support as enabled and pass/use a null present_signal_timeline, risking a crash.
Agent Prompt
### Issue description
`present_timeline_supported` can become stale across compositor restarts: `teardown()` nulls `present_signal_timeline` but does not reset `present_timeline_supported`, and `create_compositor()` only assigns `present_timeline_supported` inside one branch. This can leave `present_timeline_supported == true` while `present_signal_timeline == nullptr`, which is then used in `render_surface_to_frame()`.
### Issue Context
`CompositorServer` exposes public `start()`/`stop()` and `start()` re-runs the initialization sequence on the same `CompositorState` instance.
### Fix Focus Areas
- src/compositor/compositor_core.cpp[186-216]
- src/compositor/compositor_core.cpp[393-411]
- src/compositor/compositor_present.cpp[657-667]
### Implementation notes
- In `create_compositor()`, proactively reset the composite-write timeline state before feature checks:
- `present_timeline_supported = false;`
- `present_signal_timeline = nullptr;` (and/or unref any existing timeline defensively)
- optionally reset `present_signal_point = 0` when creating a new timeline.
- In the `else` path (no support), explicitly set `present_timeline_supported = false`.
- In `teardown()`, also set `present_timeline_supported = false` (and optionally reset `present_signal_point`).
- In `render_surface_to_frame()`, harden the condition to avoid null use (e.g., require both `present_timeline_supported` and `present_signal_timeline != nullptr`).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
…ordering When a client did not bind wp_linux_drm_syncobj_v1 (basic GL/SW clients, many XWayland), the compositor exported each frame's DMA-BUF to the goggles Vulkan consumer with an empty sync_fd and published it anyway. The consumer's submit then had no primitive ordering the foreign compositor device's composite write before goggles sampled the buffer, so under GPU contention it could read a torn/partial/stale frame (P5-1). When a fence was exported it came from the root surface's acquire timeline -- the client's input point, not the composite output write -- and the comment claiming otherwise was inaccurate (P5-2). Have the compositor own a composite-OUTPUT signal timeline, pass it into the present buffer pass when renderer->features.timeline is supported, and export the materialized point as the frame's sync_fd. The consumer path is unchanged: the existing eSyncFd/eTemporary import now waits on the correct composite-write dma_fence regardless of whether the client uses explicit sync. When the renderer lacks timeline support, behavior degrades to the prior implicit-sync path (strictly no worse than before). Builds clean; full test suite passes 12/12. On AMD/radeonsi GLES2 the gate is active and the fence is exported on every frame (0 not-materialized over 29 frames). Tear-free-under-contention validation (RenderDoc/pixel-diff) remains for hardware testing.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
docs/policies/liveness.md (1)
9-23: ⚡ Quick winConsider aligning terminology with main policy document.
This rule uses "shader pipeline" terminology (lines 12, 16, 20) while the corresponding rule in
docs/project_policies.md(lines 363-375) uses "filter chain" and "adapters". For example:
- liveness.md line 12: "active shader pipeline"
- project_policies.md line 364: "active_slot always contains a valid, recordable filter chain"
The concepts are the same, but inconsistent naming across policy documents could cause confusion during code review or self-checks.
Suggested terminology alignment
Consider updating this file to use "filter chain" consistently with the main policy document, or alternatively update both documents to standardize on one term. "Filter chain" appears more specific to the codebase architecture (FilterChainController, filter chain slots, etc.).
Example:
-**Safety:** The active shader pipeline must always be recordable (or empty). +**Safety:** The active filter chain must always be recordable (or empty).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/policies/liveness.md` around lines 9 - 23, Update LIVE.render.001 to use the same terminology as the main policy: replace "shader pipeline" and related phrases in the liveness.md text with "filter chain" (or whichever term you choose to standardize) and ensure references align with the main-doc identifiers like "active_slot" and component names such as FilterChainController/filter chain slots; alternatively, change the main policy to match "shader pipeline" but make the change consistently in both LIVE.render.001 and the main policy (project_policies) so that symbols like "active_slot", "filter chain", and any mentions of adapters/slots are identical across documents.docs/policies/boundary.md (1)
65-77: ⚖️ Poor tradeoffEnforcement gap acknowledged but not addressed.
The "Escapes" section explicitly states that semgrep only catches one specific pattern and "all other unchecked results escape the hard gate," yet the rule doesn't propose a solution. Consider either: (a) expanding the static analysis coverage with additional patterns, (b) requiring a wrapper type that forces explicit handling (e.g.,
[[nodiscard]]on a result wrapper), or (c) documenting this as a known limitation with manual review dependency.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/policies/boundary.md` around lines 65 - 77, The policy BOUND.render.002 notes semgrep only detects the static_cast<void>(waitIdle()) pattern leaving other vk::Result returns unchecked; address this by updating the policy to require explicit handling for all vk::Result returns: either (1) expand static analysis patterns to include other common call shapes (e.g., direct calls returning vk::Result, assignments, chained calls), (2) mandate using a wrapper or type annotated with [[nodiscard]] (or a Result<T> wrapper) so the compiler enforces handling, or (3) add a clear “known limitation” section instructing reviewers to manually verify every use of vk::Result and to prefer VK_TRY(call, code, msg) for propagation; reference BOUND.render.002, vk::Result, VK_TRY and the static_cast<void>(waitIdle()) example when you modify the doc to implement one of these remedies.docs/policies/sync.md (1)
51-63: ⚡ Quick winSilent message drop may lose critical configuration.
The rule specifies "Silent drop on overflow (no deadlock risk)" for the control channel, but doesn't address how the sender detects dropped messages or how critical configuration changes (e.g., target FPS, cursor visibility) are handled if lost. Consider adding guidance on either: (a) message prioritization (critical config vs. ephemeral input), (b) sender-side feedback mechanism for dropped messages, or (c) making critical settings use a separate reliable channel. The current specification prevents deadlock but may cause user-visible configuration loss under load.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/policies/sync.md` around lines 51 - 63, The policy SYNC.compositor-main.002 currently mandates a bounded, non-blocking control channel with silent drop on overflow but omits guidance for detecting or preserving critical configuration updates; update the policy text for SYNC.compositor-main.002 to require either (a) prioritized messages (mark messages like "target FPS" and "cursor visibility" as high-priority and ensure they preempt or reserve capacity), (b) sender-side feedback (a lightweight ACK/NACK or drop-counter so senders can detect dropped control messages and retry), or (c) a separate reliable path for critical settings (a small reliable, bounded channel or atomic with defined memory ordering for config updates), and add a short note describing which messages must be considered “critical” and how implementers should choose among prioritization, feedback, or separate reliable channel to avoid silent loss.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/policies/boundary.md`:
- Around line 51-63: Update BOUND.render.001 to explicitly require that the
compositor validates plane count before exporting ExternalImageFrame and
returns/logs an error if the buffer is multi-plane: in the production path where
ExternalImageFrame is created (the compositor export boundary), check the
frame's plane count and reject/export-fail multi-plane DMA-BUFs, include the
validation failure in logs and in the error/Result returned to the consumer;
keep the ownership semantics (use UniqueFd::dup() for fd and sync_fd when
present) and document that consumers should still guard against receiving
multi-plane buffers but can rely on compositor validation per BOUND.render.001.
In `@docs/policies/process.md`:
- Around line 23-35: Update PROC.app.002 to explicitly state the ownership
transfer for compiled shader artifacts: clarify that JobSystem threads may
perform compilation and produce intermediate bytecode/IR, but the construction
of final Vulkan objects (e.g., VkShaderModule, VkPipeline) and any Vulkan
device/state access must occur on the main thread (the sole Vulkan submitter);
require the handoff to happen over the typed message channel mechanism
(JobSystem -> Main thread) and cross-reference RES.render.003 (hot-reload
lifecycle) and SYNC.compositor-main for the handoff/synchronization pattern so
readers know where to find the lifecycle and messaging details.
In `@docs/policies/resource.md`:
- Around line 37-51: Update RES.render.003 to remove the contradiction between
Safety and Free: either (A) move "Async mechanism (futures, continuations, job
system)" from Free into Locked and state that continuation-based async is
required (adjust Safety to assert "must use continuation-based async; disallow
ad-hoc active/pending/retired state machines"), or (B) soften Safety to warn
about pitfalls of ad-hoc state-machine/manual frame counting (describe specific
risks: blocking render loop, retire-timing bugs) while leaving Async mechanism
in Free; update the Safety paragraph and Free list accordingly and reference
RES.render.003, the "Safety" and "Free" sections in the doc to ensure wording is
consistent.
---
Nitpick comments:
In `@docs/policies/boundary.md`:
- Around line 65-77: The policy BOUND.render.002 notes semgrep only detects the
static_cast<void>(waitIdle()) pattern leaving other vk::Result returns
unchecked; address this by updating the policy to require explicit handling for
all vk::Result returns: either (1) expand static analysis patterns to include
other common call shapes (e.g., direct calls returning vk::Result, assignments,
chained calls), (2) mandate using a wrapper or type annotated with [[nodiscard]]
(or a Result<T> wrapper) so the compiler enforces handling, or (3) add a clear
“known limitation” section instructing reviewers to manually verify every use of
vk::Result and to prefer VK_TRY(call, code, msg) for propagation; reference
BOUND.render.002, vk::Result, VK_TRY and the static_cast<void>(waitIdle())
example when you modify the doc to implement one of these remedies.
In `@docs/policies/liveness.md`:
- Around line 9-23: Update LIVE.render.001 to use the same terminology as the
main policy: replace "shader pipeline" and related phrases in the liveness.md
text with "filter chain" (or whichever term you choose to standardize) and
ensure references align with the main-doc identifiers like "active_slot" and
component names such as FilterChainController/filter chain slots; alternatively,
change the main policy to match "shader pipeline" but make the change
consistently in both LIVE.render.001 and the main policy (project_policies) so
that symbols like "active_slot", "filter chain", and any mentions of
adapters/slots are identical across documents.
In `@docs/policies/sync.md`:
- Around line 51-63: The policy SYNC.compositor-main.002 currently mandates a
bounded, non-blocking control channel with silent drop on overflow but omits
guidance for detecting or preserving critical configuration updates; update the
policy text for SYNC.compositor-main.002 to require either (a) prioritized
messages (mark messages like "target FPS" and "cursor visibility" as
high-priority and ensure they preempt or reserve capacity), (b) sender-side
feedback (a lightweight ACK/NACK or drop-counter so senders can detect dropped
control messages and retry), or (c) a separate reliable path for critical
settings (a small reliable, bounded channel or atomic with defined memory
ordering for config updates), and add a short note describing which messages
must be considered “critical” and how implementers should choose among
prioritization, feedback, or separate reliable channel to avoid silent loss.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ba93e48c-6e6d-445d-b480-894c81651a18
📒 Files selected for processing (11)
ROADMAP.mddocs/policies/boundary.mddocs/policies/liveness.mddocs/policies/process.mddocs/policies/resource.mddocs/policies/sync.mddocs/project_policies.mdfilter-chainsrc/compositor/compositor_core.cppsrc/compositor/compositor_present.cppsrc/compositor/compositor_state.hpp
| ### BOUND.render.001 — DMA-BUF frame ownership transfer | ||
|
|
||
| **Guard:** Modifying `ExternalImageFrame`, its production in the compositor, or its consumption in the render pipeline. | ||
| **Safety:** DMA-BUF handle ownership transfers via `UniqueFd::dup()` — producer retains its copy, consumer owns the duplicate. `sync_fd` (if present) transfers with the frame via the same dup mechanism. Receiver must not assume `sync_fd` is always present. Single-plane DMA-BUF only. | ||
|
|
||
| <details> | ||
|
|
||
| **Violation:** Double-close on DMA-BUF fd (shared instead of duplicated). Vulkan import fails on multi-plane buffer. | ||
| **Escapes:** Ownership transfer is semantic — no tool verifies that `dup()` is used rather than raw fd copy. | ||
| **Locked:** Ownership model (dup, not transfer). Single-plane requirement. | ||
| **Free:** Adding fields to the frame type. Internal rendering that produces the buffer. | ||
|
|
||
| </details> |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Clarify multi-plane DMA-BUF handling.
The rule states "Single-plane DMA-BUF only" as a locked constraint but doesn't specify whether the compositor should validate plane count before export or whether the consumer should reject multi-plane imports. Given that Vulkan import can fail on multi-plane buffers (per the Violation note), consider adding explicit validation at the production boundary with logging/error propagation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/policies/boundary.md` around lines 51 - 63, Update BOUND.render.001 to
explicitly require that the compositor validates plane count before exporting
ExternalImageFrame and returns/logs an error if the buffer is multi-plane: in
the production path where ExternalImageFrame is created (the compositor export
boundary), check the frame's plane count and reject/export-fail multi-plane
DMA-BUFs, include the validation failure in logs and in the error/Result
returned to the consumer; keep the ownership semantics (use UniqueFd::dup() for
fd and sync_fd when present) and document that consumers should still guard
against receiving multi-plane buffers but can rely on compositor validation per
BOUND.render.001.
| ### PROC.app.002 — Thread ownership is structural | ||
|
|
||
| **Guard:** Modifying which thread creates, accesses, or destroys resources. Adding new thread-affine operations. Introducing new cross-thread communication. | ||
| **Safety:** Each thread owns a set of types that cannot be accessed from other threads. Main thread owns: SDL window, Vulkan submission, ImGui, render loop. Compositor thread owns: wlroots event loop, all `wlr_*` objects, Wayland protocol handling. JobSystem threads own: async shader compilation only (no Vulkan submission, no wlroots access). Cross-thread communication uses typed message channels, not shared mutable state protected by convention. | ||
|
|
||
| <details> | ||
|
|
||
| **Violation:** Data race on wlroots objects from main thread. Vulkan submission from compositor thread. Callback registered on one thread executing on another without synchronization. | ||
| **Escapes:** Thread affinity is a runtime property. No static tool can verify that a function is only called from a specific thread. The goal is to make thread boundaries visible in the architecture (separate types per thread, messages at boundaries), not just documented in comments. | ||
| **Locked:** Main thread as sole Vulkan submitter. Compositor thread as sole wlroots accessor. Communication through typed channels only. | ||
| **Free:** Which specific types each thread owns. Internal threading within a subsystem (if fully encapsulated). | ||
|
|
||
| </details> |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Clarify ownership transfer for compiled shader artifacts.
The rule states that JobSystem threads own "async shader compilation only" but doesn't specify the ownership transfer mechanism for compiled Vulkan pipeline objects back to the main thread (which owns Vulkan submission per this rule). Cross-reference with RES.render.003, which describes hot-reload lifecycle but doesn't explicitly address cross-thread ownership transfer. Consider adding a note about which thread constructs the final Vulkan objects or referencing SYNC.compositor-main rules for the handoff mechanism.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/policies/process.md` around lines 23 - 35, Update PROC.app.002 to
explicitly state the ownership transfer for compiled shader artifacts: clarify
that JobSystem threads may perform compilation and produce intermediate
bytecode/IR, but the construction of final Vulkan objects (e.g., VkShaderModule,
VkPipeline) and any Vulkan device/state access must occur on the main thread
(the sole Vulkan submitter); require the handoff to happen over the typed
message channel mechanism (JobSystem -> Main thread) and cross-reference
RES.render.003 (hot-reload lifecycle) and SYNC.compositor-main for the
handoff/synchronization pattern so readers know where to find the lifecycle and
messaging details.
7ebcd5c to
44a526c
Compare
The committed lock referenced libxkbcommon 1.13.1, which no longer matches the manifest resolution, so `pixi install --locked` (run by CI's Setup Pixi on a clean checkout) reports "lock-file not up-to-date with the workspace" and every CI job fails before building. This pre-exists on main and is unrelated to the code change; included here so this PR's CI can pass. Regenerated with the CI-pinned pixi v0.65.0; only a transitive patch bump.
Problem
The compositor composites a target app's frame on its own wlroots renderer (GLES2/EGL by default), exports it as a DMA-BUF, and goggles imports + samples it on a separate Vulkan
VkDevice. Cross-device write→read ordering must be carried by an OS primitive (sync_file / dma_fence) — there is no shared queue.wp_linux_drm_syncobj_v1(basic GL/SW, many XWayland),frame.sync_fdwas left empty and the frame exported anyway, so the consumer's submit had no producer-ordering primitive — under GPU contention it can sample a torn/partial/stale frame. The lonedevice.waitIdle()is on goggles' device and cannot order the compositor device's write.acquire_timeline(client input), not the composite output write; the comment claiming otherwise was inaccurate.Fix
The compositor owns a composite-OUTPUT
wlr_drm_syncobj_timeline, gated onrenderer->features.timeline; the present pass is begun with{signal_timeline, ++point}, and after submit a non-blockingWAIT_AVAILABLEcheck +export_sync_filepopulatesframe.sync_fd(UniqueFd-owned). The old root-surface acquire-fence export is removed; the release-timeline signalling is kept. Consumer unchanged — the existingeSyncFd/eTemporaryimport now waits on the correct composite-write dma_fence. When the renderer lacks timeline support, behavior degrades to the prior implicit-sync path (no worse than before).Verification
ctest --preset test12/12.composite-write signal timeline enabled), fence exported on every frame (0 not-materialized over 29 frames).Notes
Split out of the original combined branch: the documentation/policy changes now live in #147. Adds three compositor-thread-owned
CompositorStatemembers; no new threads/deps/CI structure changes.The
pixi.locksync commit fixes a pre-existing CI breakage onmain(pixi install --lockedrejects the committed lock on a clean checkout —libxkbcommon 1.13.1no longer matches the manifest). Transitive patch bump only; identical change also on #147.🤖 Generated with Claude Code