
feat: add agent recovery merge workflow #99

Merged
ZingerLittleBee merged 60 commits into main from agent-recovery-merge
Apr 16, 2026

Conversation


ZingerLittleBee (Owner) commented Apr 16, 2026

Summary

  • add recovery job persistence, rebind lifecycle orchestration, and guarded write-freeze handling on the server
  • implement recovery history merge helpers plus candidate/job APIs and browser recovery websocket payloads
  • add recovery workflow UI on the server detail page and document the new recovery endpoints/flow

Test Plan

  • cargo test -p serverbee-server recovery_merge -- --nocapture
  • cargo test -p serverbee-server start_recovery_merge_sends_rebind_identity_command -- --nocapture
  • cargo test -p serverbee-server wrong_source_rebind_identity_failure_does_not_broadcast_recovery_update -- --nocapture
  • cargo test -p serverbee-server test_recovery_job_get_requires_admin_and_start_creates_job --test integration -- --exact --nocapture
  • bun x vitest run src/hooks/use-api.test.tsx src/hooks/use-servers-ws.test.ts src/stores/recovery-jobs-store.test.ts src/components/server/recovery-merge-dialog.test.tsx src/routes/_authed/servers/$id.test.tsx
  • bun --cwd apps/web run typecheck
  • cargo check -p serverbee-server
  • cargo check -p serverbee-common

Known Issue

  • cargo test --workspace still has one pre-existing unrelated failing test: test_server_detail_returns_runtime_capability_fields.

Summary by CodeRabbit

  • New Features

    • Admin "Recover Agent" workflow on offline server pages with candidate selection, start/resume actions, staged progress and status badge updates; real-time job updates over sockets.
    • New admin API endpoints to list recovery candidates, start a recovery-merge, and fetch recovery job details.
  • Localization

    • Added English and Chinese UI strings for recovery flow, progress stages, warnings, and lifecycle messages.
  • Documentation

    • Updated API reference and added server recovery setup and design docs (EN/ZH).


vercel Bot commented Apr 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: server-bee-docs
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): Apr 16, 2026 2:53pm


coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

Adds an admin-driven "agent recovery merge" feature: backend persistence and APIs, agent token rebind protocol, write-freeze coordination, server-side merge logic, frontend UI/state/hooks/tests, OpenAPI/schema changes, database migration, and cross-cutting tests across agent, server, and web.

Changes

Cohort / File(s) Summary
Documentation & Specs
docs/superpowers/plans/2026-04-16-agent-recovery-merge.md, docs/superpowers/specs/2026-04-16-agent-recovery-merge-design.md
New design and implementation plan documents describing end-to-end recovery merge workflow, protocol messages, DB schema, services, API surface, UI, and tests.
API Reference & Server Docs (EN/CN)
apps/docs/content/docs/en/api-reference.mdx, apps/docs/content/docs/cn/api-reference.mdx, apps/docs/content/docs/en/server.mdx, apps/docs/content/docs/cn/server.mdx
Added three recovery endpoints to admin API table and new "Recovering a Reinstalled Agent" server documentation sections.
Frontend UI & State
apps/web/src/components/server/recovery-merge-dialog.tsx, apps/web/src/stores/recovery-jobs-store.ts, apps/web/src/hooks/use-api.ts, apps/web/src/routes/_authed/servers/$id.tsx
New RecoveryMergeDialog component, Zustand recovery jobs store, React Query hooks (useRecoveryCandidates, useRecoveryJob) and startRecoveryMerge, integrated into server detail page with admin-only action and job badge.
Frontend Tests & Mocks
apps/web/src/components/server/recovery-merge-dialog.test.tsx, apps/web/src/hooks/use-api.test.tsx, apps/web/src/hooks/use-servers-ws.test.ts, apps/web/src/routes/_authed/servers/$id.test.tsx, apps/web/src/stores/recovery-jobs-store.test.ts
Comprehensive tests for dialog behavior, API hooks, websocket recovery messages, route-level presence of action, and recovery store operations.
Frontend i18n & API types
apps/web/src/locales/en/servers.json, apps/web/src/locales/zh/servers.json, apps/web/src/lib/api-schema.ts, apps/web/src/lib/api-types.ts, apps/web/openapi.json
Added i18n strings and TypeScript/OpenAPI types/schemas for recovery candidate/job/start endpoints and related enums.
Frontend WS handling
apps/web/src/hooks/use-servers-ws.ts
Extended WS payload types to include recoveries and added hydration/update helpers to populate recovery jobs store.
Server DB entity & migration
crates/server/src/entity/recovery_job.rs, crates/server/src/migration/m20260416_000017_create_recovery_job.rs, crates/server/src/entity/mod.rs, crates/server/src/migration/mod.rs
Added recovery_job SeaORM entity and migration with unique constraints and SQLite triggers to enforce single-running-job invariants.
Server services
crates/server/src/service/recovery_job.rs, crates/server/src/service/recovery_lock.rs, crates/server/src/service/recovery_merge.rs, crates/server/src/service/traffic.rs, crates/server/src/service/db_error.rs
New services: RecoveryJobService (persist/manage jobs), RecoveryLockService (write-freeze gate), RecoveryMergeService (validation, job lifecycle, merge/rewrites, token rotation), DB error helpers, and traffic-history merge helpers.
Server API routes & OpenAPI
crates/server/src/router/api/server_recovery.rs, crates/server/src/router/api/mod.rs, crates/server/src/openapi.rs
New admin-only routes: GET /api/servers/{target_id}/recovery-candidates, GET /api/servers/recovery-jobs/{job_id}, POST /api/servers/{target_id}/recover-merge; OpenAPI schema registrations added.
Server WS & agent handling
crates/server/src/router/ws/agent.rs, crates/server/src/router/ws/browser.rs, crates/server/src/service/agent_manager.rs
Added RebindIdentity/ack/fail handling, recovery snapshot/broadcast for browser WS, admin-only recovery inclusion, and gating of DB/audit writes behind recovery lock.
Server state & write gating
crates/server/src/state.rs, crates/server/src/task/record_writer.rs
Added recovery_lock to AppState; record/traffic write paths now respect recovery_lock.writes_allowed_for(...).
Agent: token persistence & rebind
crates/agent/src/rebind.rs, crates/agent/src/register.rs, crates/agent/src/config.rs, crates/agent/src/main.rs, crates/agent/Cargo.toml, crates/agent/src/reporter.rs
Agent gains atomic token persistence API (persist_rebind_token), register uses token-env guard and persistence, reporter handles RebindIdentity by persisting token and triggering reconnect; Windows-specific dependency added.
Protocol & common types
crates/common/src/protocol.rs, crates/common/src/constants.rs
Added recovery DTO/enums and agent/server message variants, browser recoveries payloads; bumped PROTOCOL_VERSION 3→4 and added tests.
Server integration tests & other updates
crates/server/tests/integration.rs, various test adjustments, small unrelated adjustments (e.g., crates/server/src/router/api/agent.rs capability constant, crates/agent/src/file_manager.rs sort tweak)
New integration tests for endpoints and access control; multiple test updates and small refactors across repo to support feature.

Sequence Diagram(s)

sequenceDiagram
    actor Admin as Admin (Browser)
    participant WebUI as Web UI
    participant Server as Server API
    participant DB as Database
    participant AgentNew as Replacement Agent
    participant AgentOld as Original Agent (offline)

    Admin->>WebUI: Open recovery dialog for target
    WebUI->>Server: GET /api/servers/{target}/recovery-candidates
    Server->>DB: Query online servers & running jobs
    DB-->>Server: Candidates
    Server-->>WebUI: Candidate list

    Admin->>WebUI: Select candidate, POST /recover-merge
    WebUI->>Server: POST /api/servers/{target}/recover-merge {source_id}
    Server->>DB: Begin txn, rotate target token, create recovery_job (running/rebinding)
    DB-->>Server: Job created
    Server->>AgentNew: Send ServerMessage::RebindIdentity {job_id, token}
    Server-->>WebUI: Return RecoveryJobResponse

    AgentNew->>AgentNew: Persist token locally
    AgentNew->>Server: Reconnect as target (new identity)
    AgentNew->>Server: Send AgentMessage::RebindIdentityAck {job_id}
    Server->>DB: Update job -> awaiting_target_online
    Server->>DB: Freeze writes for target (recovery_lock)
    Server->>DB: Merge history (source→target)
    DB-->>Server: Merge complete
    Server->>DB: Update job -> succeeded
    Server->>WebUI: Broadcast recovery update (FullSync/Update with recoveries)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I nudge the token, soft and sly,

rebinding paws beneath the sky.
Old IDs wake, new hops begin,
histories stitch, the merge folds in.
A carrot cheer — recovery wins!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 51.84%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title 'feat: add agent recovery merge workflow' accurately summarizes the primary change: a complete recovery-merge feature that lets offline servers be recovered by rebinding replacement agents.



coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/server/src/task/record_writer.rs (1)

45-120: ⚠️ Potential issue | 🟠 Major

Snapshot the recovery-write gate once per server iteration.

writes_allowed_for(server_id) is checked separately before each DB write and after awaited I/O. If the freeze flips mid-iteration, the same report can partially persist — for example, a metric row gets written while the traffic state for that same sample is skipped. That breaks the atomic boundary this lock is meant to create.

🧩 Suggested fix
         let mut count = 0;
         for (server_id, report) in &reports {
+            let writes_allowed = state.recovery_lock.writes_allowed_for(server_id);
+
             // Save metrics record
-            if state.recovery_lock.writes_allowed_for(server_id) {
+            if writes_allowed {
                 if let Err(e) = RecordService::save_report(&state.db, server_id, report).await {
                     tracing::error!("Failed to save record for {server_id}: {e}");
                 } else {
                     count += 1;
                 }
@@
-                    if state.recovery_lock.writes_allowed_for(server_id) {
+                    if writes_allowed {
                         if let Err(e) = TrafficService::upsert_state(
                             &state.db,
                             server_id,
                             curr_in,
@@
-                if state.recovery_lock.writes_allowed_for(server_id) {
+                if writes_allowed {
                     if let Err(e) = TrafficService::upsert_hourly(
                         &state.db,
                         server_id,
                         hour,
@@
-            if state.recovery_lock.writes_allowed_for(server_id) {
+            if writes_allowed {
                 if let Err(e) =
                     TrafficService::upsert_state(&state.db, server_id, curr_in, curr_out).await
                 {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/task/record_writer.rs` around lines 45 - 120, The loop
checks state.recovery_lock.writes_allowed_for(server_id) repeatedly and after
awaits, allowing the freeze to flip mid-iteration; snapshot the gate once per
server iteration by calling writes_allowed_for(server_id) at the top of the loop
(e.g., let writes_allowed = state.recovery_lock.writes_allowed_for(server_id))
and then replace all subsequent calls to
state.recovery_lock.writes_allowed_for(server_id) within that iteration with the
captured writes_allowed boolean when deciding whether to call
RecordService::save_report, TrafficService::upsert_state, and
TrafficService::upsert_hourly so all writes for the same report are consistently
allowed or skipped.
🧹 Nitpick comments (7)
apps/web/src/hooks/use-servers-ws.test.ts (1)

215-225: Consider extracting the duplicate makeQueryClient helper.

This helper is identical to the one defined at lines 113-123 in the handleWsMessage upgrade messages test suite. Extract it to module scope to reduce duplication.

♻️ Suggested refactor
+function makeQueryClient() {
+  const cache = new Map<string, unknown>()
+  return {
+    setQueryData: (key: unknown[], value: unknown | ((prev: unknown) => unknown)) => {
+      const cacheKey = JSON.stringify(key)
+      const prev = cache.get(cacheKey)
+      const next = typeof value === 'function' ? (value as (prev: unknown) => unknown)(prev) : value
+      cache.set(cacheKey, next)
+    }
+  }
+}
+
 describe('handleWsMessage upgrade messages', () => {
-  function makeQueryClient() {
-    const cache = new Map<string, unknown>()
-    return {
-      setQueryData: (key: unknown[], value: unknown | ((prev: unknown) => unknown)) => {
-        const cacheKey = JSON.stringify(key)
-        const prev = cache.get(cacheKey)
-        const next = typeof value === 'function' ? (value as (prev: unknown) => unknown)(prev) : value
-        cache.set(cacheKey, next)
-      }
-    }
-  }
   // ... tests
 })

 describe('handleWsMessage recovery messages', () => {
-  function makeQueryClient() {
-    const cache = new Map<string, unknown>()
-    return {
-      setQueryData: (key: unknown[], value: unknown | ((prev: unknown) => unknown)) => {
-        const cacheKey = JSON.stringify(key)
-        const prev = cache.get(cacheKey)
-        const next = typeof value === 'function' ? (value as (prev: unknown) => unknown)(prev) : value
-        cache.set(cacheKey, next)
-      }
-    }
-  }
   // ... tests
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/src/hooks/use-servers-ws.test.ts` around lines 215 - 225, The
duplicate makeQueryClient helper is defined twice; extract it to module scope
and reuse it to remove duplication: move the makeQueryClient function (the
factory that returns setQueryData using a Map and JSON.stringify keying) out of
the individual tests into a single top-level helper in this test module (or into
a shared test-utils file and import it), then replace the duplicate inline
definitions in the "handleWsMessage upgrade messages" test suite and the current
test with calls to the extracted makeQueryClient so both tests reference the
same function.
apps/web/src/components/server/recovery-merge-dialog.test.tsx (2)

68-72: Consider using consistent assertion style for button disabled state.

The current approach uses getAttribute('disabled') === '' and hasAttribute('disabled') === false asymmetrically. Using the toBeDisabled() / toBeEnabled() matchers from @testing-library/jest-dom would be cleaner.

♻️ Suggested improvement
     const button = screen.getByText('Start Recovery').closest('button')
-    expect(button?.getAttribute('disabled')).toBe('')
+    expect(button).toBeDisabled()

     fireEvent.click(screen.getByText('Source'))
-    expect(button?.hasAttribute('disabled')).toBe(false)
+    expect(button).toBeEnabled()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/src/components/server/recovery-merge-dialog.test.tsx` around lines
68 - 72, Replace the asymmetric DOM-attribute assertions for the "Start
Recovery" button with jest-dom matchers: instead of checking
button?.getAttribute('disabled') and hasAttribute('disabled'), call
expect(screen.getByText('Start Recovery').closest('button')).toBeDisabled()
before clicking and expect(...).toBeEnabled() after clicking the "Source"
option; ensure `@testing-library/jest-dom` is available in the test setup so
toBeDisabled/toBeEnabled work.

52-53: Consider using toBeInTheDocument() for clearer presence assertions.

Using toBeDefined() works but toBeInTheDocument() from @testing-library/jest-dom is more idiomatic and provides better failure messages for DOM element assertions.

♻️ Suggested improvement
-    expect(screen.getByText('Source')).toBeDefined()
-    expect(screen.getByText('same remote address')).toBeDefined()
+    expect(screen.getByText('Source')).toBeInTheDocument()
+    expect(screen.getByText('same remote address')).toBeInTheDocument()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/src/components/server/recovery-merge-dialog.test.tsx` around lines
52 - 53, Replace the generic presence assertions using toBeDefined() with the
idiomatic DOM matcher toBeInTheDocument(): update the two assertions that call
expect(screen.getByText('Source')).toBeDefined() and
expect(screen.getByText('same remote address')).toBeDefined() to use
expect(...).toBeInTheDocument(); ensure the test environment has
`@testing-library/jest-dom` installed and imported (or configured in setupTests)
so toBeInTheDocument is available for use with screen.getByText.
apps/web/src/lib/api-schema.ts (1)

29-64: Prefer generated OpenAPI aliases for the recovery DTOs.

This file is meant to stay a thin wrapper around components['schemas']. Hand-writing the recovery request/response shapes here creates a second contract that can silently drift from the server the next time the OpenAPI types are regenerated. Please move these into the generated schema and re-export aliases once the spec is updated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/src/lib/api-schema.ts` around lines 29 - 64, Replace the handwritten
DTOs with aliases to the generated OpenAPI schemas: remove the explicit
interfaces/types RecoveryCandidateResponse, RecoveryJobStatus, RecoveryJobStage,
RecoveryJobResponse, and StartRecoveryRequest and instead export type aliases
that reference the corresponding entries in the generated components['schemas']
(e.g. export type RecoveryCandidateResponse =
components["schemas"]["RecoveryCandidateResponse"]; etc.) so this file becomes a
thin re-export layer; after the spec is regenerated, update the alias names to
match the generated schema keys and ensure all imports/exports in
apps/web/src/lib/api-schema.ts point to the generated types.
crates/server/src/router/api/server_recovery.rs (1)

643-644: Duplicate set_protocol_version call.

Line 644 duplicates the call on line 643 with the same arguments. This is harmless but unnecessary.

Remove duplicate line
         state.agent_manager.set_protocol_version("source-1", 3);
-        state.agent_manager.set_protocol_version("source-1", 3);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/router/api/server_recovery.rs` around lines 643 - 644, The
call to state.agent_manager.set_protocol_version("source-1", 3) is duplicated;
remove the redundant second invocation so the method is only called once (leave
the first call intact) to avoid unnecessary repeated work in the initialization
path where set_protocol_version is used.
crates/server/src/service/recovery_merge.rs (2)

886-892: Redundant nullable parameter handling.

Both branches of the if nullable statement produce identical results. The nullable parameter appears unused in its intended purpose.

Simplify to single branch
-            let value = if nullable {
-                rewritten.unwrap_or_else(|| "[]".to_string()).into()
-            } else {
-                rewritten
-                    .unwrap_or_else(|| "[]".to_string())
-                    .into()
-            };
+            let value = rewritten.unwrap_or_else(|| "[]".to_string()).into();

If nullable was intended to handle NULL vs empty array differently, the logic should be:

let value = if nullable && rewritten.as_deref() == Some("[]") {
    sea_orm::Value::String(None) // NULL for empty array when nullable
} else {
    rewritten.unwrap_or_else(|| "[]".to_string()).into()
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/service/recovery_merge.rs` around lines 886 - 892, The `if
nullable` branches are redundant; replace the conditional block that sets
`value` with a single assignment that uses rewritten.unwrap_or_else(||
"[]".to_string()).into() (remove the `if nullable` and the duplicate branches).
Locate the `value` binding where `nullable` and `rewritten` are used and
collapse it to the single-expression assignment so `nullable` is not ignored; if
you actually meant to treat empty arrays as SQL NULL, implement the alternative
logic using rewritten.as_deref() == Some("[]") to produce a NULL variant instead
of the stringified "[]".

42-50: Duplicate helper functions with recovery_job.rs.

is_unique_violation and is_active_recovery_conflict are duplicated from recovery_job.rs. Consider extracting these to a shared utility module to maintain consistency.

Example extraction

Create a shared utility in crates/server/src/service/mod.rs or a dedicated db_utils.rs:

// In a shared location
pub(crate) fn is_unique_violation(err: &sea_orm::DbErr) -> bool {
    let message = err.to_string();
    message.contains("UNIQUE constraint failed") || message.contains("UNIQUE")
}

pub(crate) fn is_active_recovery_conflict(err: &sea_orm::DbErr) -> bool {
    is_unique_violation(err) || err.to_string().contains("recovery_job_active_conflict")
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/service/recovery_merge.rs` around lines 42 - 50, The two
functions is_unique_violation and is_active_recovery_conflict are duplicated;
extract them into a shared utility (e.g., a new service-level module like
db_utils or inside service::mod) as pub(crate) functions, move the
implementations exactly as in the comment (is_active_recovery_conflict should
call is_unique_violation and check "recovery_job_active_conflict"), then replace
the local definitions in recovery_merge.rs and recovery_job.rs with imports from
that shared module and remove the duplicated code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/web/src/components/server/recovery-merge-dialog.tsx`:
- Around line 59-66: The dialog must become read-only whenever currentJob is
non-null: locate the JSX that renders candidate selection controls and the start
action (e.g., the candidate selection inputs/components, the start button that
triggers startRecoveryMutation, and any editable fields between the blocks
around currentJob and lines covering 87-115 and 124-140) and disable or replace
them with non-interactive text when currentJob exists; specifically, guard the
candidate selection UI and the button that calls startRecoveryMutation with a
conditional using currentJob (or pass a disabled prop) so inputs are
non-editable and the start action is not invoked while currentJob is active, and
ensure any other actionable UI in the same component (selection lists, radio
buttons, dropdowns) is similarly disabled or hidden when currentJob is present.
- Around line 49-50: The dialog's candidate list (inside
DialogContent/DialogHeader in the RecoveryMergeDialog component) can grow past
the viewport; wrap the candidate list/body (the block rendering candidates
around lines where the list is rendered, approx. lines 87–115) in the shadcn
ScrollArea component, import ScrollArea, and give it a constrained max-height
(e.g., a viewport-relative max-h) so the dialog header/warning/actions stay
visible; replace any native overflow-auto/overflow-y-auto with this ScrollArea
wrapper around the list.

In `@apps/web/src/routes/_authed/servers/$id.tsx`:
- Line 588: recoveryJob.stage is rendered raw in the Badge (recoveryJob and
Badge); map the enum to a localized label before rendering by converting
recoveryJob.stage into a translation key or display string (e.g., via an
existing i18n helper like t()/translate() or a stageToLabel mapping) and pass
that label to <Badge> instead of the raw enum; update the rendering logic around
the recoveryJob check so it computes the translatedStage and uses <Badge
variant="secondary">{translatedStage}</Badge>.

In `@apps/web/src/stores/recovery-jobs-store.ts`:
- Around line 4-40: Add a hydration flag to the recovery jobs store: extend
RecoveryJobsState with a boolean hydrated and a setter (e.g., setHydrated)
initialized false in useRecoveryJobsStore, flip hydrated to true inside setJobs
(and/or expose setHydrated so the WS sync handler can mark hydration after
full_sync), and keep existing methods (setJob, clearJob, getJob) unchanged;
consumers (e.g., routes/_authed/servers/$id.tsx) should gate UI/actions on
get().hydrated === true before treating absence of getJob(id) as “no running
job.”

In `@crates/agent/src/rebind.rs`:
- Around line 8-24: The rewrite of token content in render_token_content
currently loses a trailing newline because existing.lines() drops a final '\n'
and lines.join("\n") doesn't restore it; update render_token_content to detect
whether existing.ends_with('\n') and, after building the new content from lines
(using the same logic with is_table_header and is_token_line), append a single
'\n' if the original had one so the function returns a string preserving the
original trailing newline. (A Rust sketch follows this list.)
- Around line 128-132: The error message currently formats "failed to atomically
replace {path} with {temp_path}" but the source/destination are reversed
compared to the Unix version; update the format invocation that uses
path.display() and temp_path.display() so the message reads "failed to
atomically replace {temp_path} with {path}" (swap the order of path.display()
and temp_path.display() in the format call) in the function containing that
format! invocation in rebind.rs.

In `@crates/server/src/migration/m20260416_000017_create_recovery_job.rs`:
- Around line 16-79: The recovery_job table relies on literal "running" strings
in the partial indexes and triggers (idx_recovery_job_target_running,
idx_recovery_job_source_running, trg_recovery_job_running_insert,
trg_recovery_job_running_update), so add a CHECK constraint on the CREATE TABLE
for status (and similarly for stage) to enumerate the allowed values (including
the exact 'running' literal used by the indexes/triggers) so the DB rejects
mismatched casing/typos; modify the CREATE TABLE statement for recovery_job to
include "CHECK (status IN (...))" and "CHECK (stage IN (...))" with the
canonical state machine values used by the service, and then keep the existing
partial-index and trigger logic unchanged. (A sketch follows this list.)

In `@crates/server/src/router/ws/browser.rs`:
- Around line 257-266: The FullSync handler is currently including recoveries
(produced by recovery_snapshot) in every BrowserMessage::FullSync payload which
leaks admin-only recovery details to any authenticated socket; update the
handler to not include recoveries by default and instead only attach them when
the authenticated principal has an admin role (or move recovery_snapshot into a
separate admin-only WS feed). Concretely: remove passing recoveries into
BrowserMessage::FullSync where ServerService::list_servers and
state.upgrade_tracker.snapshot() are assembled, and add a role check using your
request/session principal (e.g., inspect the auth context on state or the
current connection) to conditionally call recovery_snapshot and include its
result only for admins; alternatively implement a new admin-only
message/endpoint that returns recovery_snapshot to authorized sockets. Ensure
the same gating change is applied to the other FullSync/related spots noted (the
other occurrences around the same handler).
- Around line 369-383: recovery_snapshot currently swallows DB errors and
returns an empty Vec, causing broadcast_recovery_update to falsely send
recoveries: Some([]); change recovery_snapshot to return
Option<Vec<RecoveryJobDto>> (or Result<Vec<...>, _>) so that on DB error you log
the error and return None instead of Vec::new(), and update
broadcast_recovery_update to await recovery_snapshot(state) and skip sending the
BrowserMessage when it returns None (or handle Err by not broadcasting) —
reference recovery_snapshot and broadcast_recovery_update to locate where to
change the return type, error handling, and the send branch. (A sketch follows this list.)

In `@docs/superpowers/plans/2026-04-16-agent-recovery-merge.md`:
- Around line 166-184: The persist_rebind_token function currently writes to a
fixed .tmp file without fsyncs and can collide with concurrent writers; update
it to use the durable helper pattern from crates/agent/src/rebind.rs: create a
unique temporary file (e.g. use a random/suffixed tmp name or tempfile), write
the joined lines to that temp file via an OpenOptions write handle, call
sync_all() on the temp file to flush contents to disk, atomically rename the
temp into place with std::fs::rename, then open the parent directory and call
sync_all() on it to ensure the rename is durable; keep the same logic for
replacing or appending the token line in persist_rebind_token but replace the
fixed tmp usage with this durable temp-write + file and parent fsync sequence. (A sketch follows this list.)
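
For the rebind.rs trailing-newline comment above, a minimal sketch of the fix. It is simplified to a single predicate: is_token_line here is a hypothetical stand-in for the real is_table_header/is_token_line helpers in crates/agent/src/rebind.rs. The essential step is restoring the final '\n' that str::lines() drops:

fn is_token_line(line: &str) -> bool {
    // Hypothetical stand-in for the real token-line predicate.
    line.trim_start().starts_with("token")
}

fn render_token_content(existing: &str, token_line: &str) -> String {
    let mut replaced = false;
    let mut lines: Vec<String> = existing
        .lines()
        .map(|line| {
            if !replaced && is_token_line(line) {
                replaced = true;
                token_line.to_string()
            } else {
                line.to_string()
            }
        })
        .collect();
    if !replaced {
        lines.push(token_line.to_string());
    }
    let mut out = lines.join("\n");
    // str::lines() drops a final '\n' and join("\n") does not restore it,
    // so put it back when the original content ended with one.
    if existing.ends_with('\n') {
        out.push('\n');
    }
    out
}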
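
For the recovery_job migration comment, a sketch of the CHECK constraints issued as raw SQLite from a sea-orm migration. The column list and the status/stage value sets are assumptions pieced together from states this PR mentions (running, succeeded, failed; rebinding, awaiting_target_online, merging); the real migration should use the service's canonical state-machine values:

use sea_orm_migration::prelude::*;

async fn create_recovery_job_table(manager: &SchemaManager<'_>) -> Result<(), DbErr> {
    manager
        .get_connection()
        .execute_unprepared(
            "CREATE TABLE recovery_job (
                id TEXT PRIMARY KEY,
                target_server_id TEXT NOT NULL,
                source_server_id TEXT NOT NULL,
                status TEXT NOT NULL
                    CHECK (status IN ('running', 'succeeded', 'failed')),
                stage TEXT NOT NULL
                    CHECK (stage IN ('rebinding', 'awaiting_target_online', 'merging'))
            )",
        )
        .await?;
    Ok(())
}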
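
For the browser.rs recovery_snapshot comment, a sketch of the Option-returning shape. AppState, RecoveryJobDto, and broadcast_recovery_update are the PR's own names; list_active is a hypothetical query helper standing in for the real lookup:

async fn recovery_snapshot(state: &AppState) -> Option<Vec<RecoveryJobDto>> {
    match RecoveryJobService::list_active(&state.db).await {
        Ok(jobs) => Some(jobs.into_iter().map(RecoveryJobDto::from).collect()),
        Err(e) => {
            tracing::error!("failed to load recovery snapshot: {e}");
            None
        }
    }
}

async fn broadcast_recovery_update(state: &AppState) {
    // On a DB error, skip the broadcast entirely instead of sending
    // recoveries: Some([]) and wiping client-side job state.
    let Some(recoveries) = recovery_snapshot(state).await else {
        return;
    };
    // ... build and send the BrowserMessage carrying `recoveries` ...
}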
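
And for the persist_rebind_token comment, a sketch of the durable temp-write sequence: unique temp file, fsync the contents, atomic rename, fsync the parent directory. This is the Unix-flavored pattern; Windows needs a different directory-sync strategy, which is presumably why the PR adds a Windows-specific dependency:

use std::fs::{self, File, OpenOptions};
use std::io::Write;
use std::path::Path;

fn write_token_file_atomically(path: &Path, content: &str) -> std::io::Result<()> {
    // A unique temp name avoids collisions between concurrent writers.
    let temp_path = path.with_extension(format!("tmp.{}", std::process::id()));
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(&temp_path)?;
    file.write_all(content.as_bytes())?;
    // Flush file contents to disk before renaming.
    file.sync_all()?;
    drop(file);
    // Atomically replace the destination.
    fs::rename(&temp_path, path)?;
    // fsync the parent directory so the rename itself is durable.
    if let Some(parent) = path.parent() {
        File::open(parent)?.sync_all()?;
    }
    Ok(())
}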


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf5541bf-9e44-4dbf-b8a3-a95d8f1c0114

📥 Commits

Reviewing files that changed from the base of the PR and between de53c60 and 8e6a5ad.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (45)
  • apps/docs/content/docs/cn/api-reference.mdx
  • apps/docs/content/docs/cn/server.mdx
  • apps/docs/content/docs/en/api-reference.mdx
  • apps/docs/content/docs/en/server.mdx
  • apps/web/src/components/server/recovery-merge-dialog.test.tsx
  • apps/web/src/components/server/recovery-merge-dialog.tsx
  • apps/web/src/hooks/use-api.test.tsx
  • apps/web/src/hooks/use-api.ts
  • apps/web/src/hooks/use-servers-ws.test.ts
  • apps/web/src/hooks/use-servers-ws.ts
  • apps/web/src/lib/api-schema.ts
  • apps/web/src/locales/en/servers.json
  • apps/web/src/locales/zh/servers.json
  • apps/web/src/routes/_authed/servers/$id.test.tsx
  • apps/web/src/routes/_authed/servers/$id.tsx
  • apps/web/src/stores/recovery-jobs-store.test.ts
  • apps/web/src/stores/recovery-jobs-store.ts
  • crates/agent/Cargo.toml
  • crates/agent/src/config.rs
  • crates/agent/src/main.rs
  • crates/agent/src/rebind.rs
  • crates/agent/src/register.rs
  • crates/agent/src/reporter.rs
  • crates/common/src/constants.rs
  • crates/common/src/protocol.rs
  • crates/server/src/entity/mod.rs
  • crates/server/src/entity/recovery_job.rs
  • crates/server/src/migration/m20260416_000017_create_recovery_job.rs
  • crates/server/src/migration/mod.rs
  • crates/server/src/openapi.rs
  • crates/server/src/router/api/mod.rs
  • crates/server/src/router/api/server_recovery.rs
  • crates/server/src/router/ws/agent.rs
  • crates/server/src/router/ws/browser.rs
  • crates/server/src/service/agent_manager.rs
  • crates/server/src/service/mod.rs
  • crates/server/src/service/recovery_job.rs
  • crates/server/src/service/recovery_lock.rs
  • crates/server/src/service/recovery_merge.rs
  • crates/server/src/service/traffic.rs
  • crates/server/src/state.rs
  • crates/server/src/task/record_writer.rs
  • crates/server/tests/integration.rs
  • docs/superpowers/plans/2026-04-16-agent-recovery-merge.md
  • docs/superpowers/specs/2026-04-16-agent-recovery-merge-design.md


coderabbitai Bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (2)
apps/web/src/stores/recovery-jobs-store.ts (1)

18-23: Unify setJob keying with the payload to avoid accidental mismatches.

Line 18 accepts targetServerId separately from job, while Line 37 in setJobs keys by job.target_server_id. Consider deriving the key from job.target_server_id in setJob too, so callers cannot store a job under the wrong server id.

♻️ Suggested refactor
 interface RecoveryJobsState {
-  setJob: (targetServerId: string, job: RecoveryJobResponse) => void
+  setJob: (job: RecoveryJobResponse) => void
 }

 export const useRecoveryJobsStore = create<RecoveryJobsState>()((set, get) => ({
@@
-  setJob: (targetServerId: string, job: RecoveryJobResponse) => {
+  setJob: (job: RecoveryJobResponse) => {
     set((state) => {
       const next = new Map(state.jobs)
-      next.set(targetServerId, job)
+      next.set(job.target_server_id, job)
       return { jobs: next }
     })
   },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/src/stores/recovery-jobs-store.ts` around lines 18 - 23, Change
setJob so it derives the Map key from the job payload itself instead of taking a
separate targetServerId to avoid mismatches: inside the setJob implementation
(function setJob) use job.target_server_id as the key when creating/updating the
Map (consistent with setJobs which keys by job.target_server_id), and update the
function signature/callsites if needed to accept only the job (or validate that
provided targetServerId matches job.target_server_id) so callers cannot store a
job under the wrong server id.
apps/web/openapi.json (1)

7049-7063: Tighten the LatestAgentVersionResponse contract.

Line 7049 currently makes every field optional, so {} is a valid payload and generated clients have to handle undefined as well as null. If this endpoint always serializes these keys, mark them as required and keep nullability only where it is intentional.

🔧 Minimal schema tightening
       "LatestAgentVersionResponse": {
         "type": "object",
+        "required": ["error", "released_at", "version"],
         "properties": {
           "error": {
             "type": ["string", "null"]
           },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/openapi.json` around lines 7049 - 7063, The schema for
LatestAgentVersionResponse currently allows an empty object because none of the
properties are required; update the OpenAPI schema so that the keys "error",
"released_at", and "version" are listed as required on
LatestAgentVersionResponse (keeping "null" in the type only if the server
intentionally emits null for a field), and ensure the types remain accurate
(e.g., "released_at" retains "format": "date-time" but only allows "string" or
"null" if intended). Locate the LatestAgentVersionResponse object in the spec
and add a required array with ["error","released_at","version"], removing
optionality only where not intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/web/openapi.json`:
- Around line 12-15: The OpenAPI spec adds endpoints using tags "agent" and
"server-recovery" (for example the path "/api/agent/latest-version" uses tag
"agent") but the top-level "tags" array is missing updated metadata and has no
entry for "server-recovery"; update the document's top-level "tags" section to
include an entry for "agent" with an accurate description that reflects all
agent-related groups (not just registration) and add a new "server-recovery" tag
entry with a descriptive summary (optionally include x-order or similar ordering
metadata to maintain docs ordering) so generated docs show correct descriptions
and order for these endpoints.

In `@crates/server/src/router/api/server_recovery.rs`:
- Around line 152-160: The candidate list includes agents that are online/idle
but below the rebind protocol minimum, causing later rejection by
RecoveryMergeService::validate_start_request(); update the filter chain (around
server::Entity::find() ... .map(|source| build_candidate_response(...))) to also
exclude agents that don't meet the rebind-minimum check by querying the agent
manager (e.g. call something like
state.agent_manager.supports_rebind_minimum(&source.id) or compare
state.agent_manager.protocol_version(&source.id) against the
REBIND_MIN_PROTOCOL_VERSION constant) so only agents that can actually start
recovery are included.
- Around line 250-275: Capture the current target token before calling
RecoveryMergeService::rotate_target_token_on_txn (store it in a local variable),
and if sender.send(...) fails after txn.commit(), restore the previous token in
the database by calling a recovery method (add or reuse a function such as
RecoveryMergeService::restore_target_token_on_txn or a new method that sets the
target token back using state.db) and await it before marking the job failed
with RecoveryJobService::mark_failed; ensure you still propagate the original
send error as AppError::Internal and handle any restore errors (log/propagate)
so the target can continue reconnecting with its original secret. (A sketch follows this list.)
- Around line 98-117: The OpenAPI #[utoipa::path] response bodies must be
changed to use the ApiResponse<T> wrapper: replace any occurrence of body =
Vec<RecoveryCandidateResponse> with body =
ApiResponse<Vec<RecoveryCandidateResponse>> and replace body =
RecoveryJobResponse with body = ApiResponse<RecoveryJobResponse> in the
server-recovery route attributes so the documented response shape matches the
actual return type Result<Json<ApiResponse<T>>, AppError>; look for the
attributes that reference RecoveryCandidateResponse and RecoveryJobResponse in
this module and update them to use ApiResponse<...>.

In `@crates/server/src/service/db_error.rs`:
- Around line 3-11: The current is_active_recovery_conflict(err: &DbErr)
incorrectly treats any UNIQUE violation as an active recovery conflict because
is_unique_violation(err) matches generic "UNIQUE"; update the logic in
is_active_recovery_conflict to only check for the explicit trigger string
"recovery_job_active_conflict" (i.e., remove the is_unique_violation(err)
branch) OR alternatively narrow is_unique_violation to match only the specific
recovery_jobs constraint name used by the DB trigger; adjust the functions
is_unique_violation and/or is_active_recovery_conflict accordingly so active
recovery conflicts are detected solely by the trigger-specific identifier or the
precise constraint name. (A sketch follows this list.)

In `@crates/server/src/service/recovery_merge.rs`:
- Around line 363-368: The code casts source.protocol_version to u32 directly
which wraps negative values (e.g. -1 -> huge u32) and can bypass the
REBIND_IDENTITY_MIN_PROTOCOL_VERSION check; fix by treating negative persisted
protocol versions as absent before the v4 gate: compute a persisted_protocol
option (e.g., if source.protocol_version < 0 then None else
Some(source.protocol_version as u32)) and use that as the fallback for
state.agent_manager.get_protocol_version(&source.id).unwrap_or(...) so negative
values do not get cast to large u32s and cannot satisfy the minimum-version
check. (A sketch follows this list.)
- Around line 643-661: The current DELETE removes every target row with
timestamps between the source min and max, causing loss of target-only samples;
change the deletion to only remove target rows that actually conflict with
source timestamps. Replace the DELETE that uses the min/max window (the
db.execute call building the DELETE with {time_column} >=/...<= MIN/MAX) with a
query that deletes rows from the target server where an actual source row exists
at the same timestamp (e.g., use an IN (SELECT {time_column} FROM {table} WHERE
server_id = $2) or a correlated EXISTS subquery) so only direct timestamp
conflicts between source_server_id and target_server_id are removed before the
UPDATE that sets server_id for source rows. (A sketch follows this list.)

In `@crates/server/src/task/record_writer.rs`:
- Around line 46-47: The code currently snapshots writes_allowed with
state.recovery_lock.writes_allowed_for(server_id) and reuses it across awaits,
allowing a TOCTOU if freeze()/release() changes the lock mid-iteration; change
the logic to call state.recovery_lock.writes_allowed_for(server_id) immediately
before each actual DB write (i.e., inside the loop/just prior to invoking the
write operation) and abort or handle the write if the fresh check disallows it,
removing reliance on the earlier writes_allowed variable and ensuring the check
happens after awaits and right before the write.

In `@docs/superpowers/plans/2026-04-16-agent-recovery-merge.md`:
- Around line 76-78: The plan's description mistakenly states the store is keyed
by `target_server_id` and `job_id`; update the text for `recovery-jobs-store` to
reflect the actual implementation which keys entries only by `target_server_id`
(remove `job_id` from the key description) so the plan matches the
`recovery-jobs-store` behavior.
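
For the server_recovery.rs send-failure comment above, a sketch of the compensation path. restore_target_token is hypothetical (the comment suggests adding it), create_running is an illustrative name, and the other signatures approximate the PR's services rather than reproduce them exactly:

let previous_token = target.token.clone();
let txn = state.db.begin().await?;
let new_token = RecoveryMergeService::rotate_target_token_on_txn(&txn, &target).await?;
let job = RecoveryJobService::create_running(&txn, &target.id, &source.id).await?;
txn.commit().await?;

if let Err(send_err) = sender.send(ServerMessage::RebindIdentity {
    job_id: job.id.clone(),
    token: new_token,
}) {
    // Best-effort rollback so the target agent can keep reconnecting
    // with its original secret.
    if let Err(restore_err) =
        RecoveryMergeService::restore_target_token(&state.db, &target.id, &previous_token).await
    {
        tracing::error!("failed to restore target token: {restore_err}");
    }
    RecoveryJobService::mark_failed(&state.db, &job.id, "rebind send failed").await?;
    return Err(AppError::Internal(send_err.to_string()));
}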
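
For the db_error.rs comment, the narrowed check is small enough to show in full; only the trigger's explicit marker counts as an active-recovery conflict:

use sea_orm::DbErr;

pub(crate) fn is_active_recovery_conflict(err: &DbErr) -> bool {
    // Match only the SQLite trigger's explicit abort message, not
    // arbitrary UNIQUE violations from unrelated tables.
    err.to_string().contains("recovery_job_active_conflict")
}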
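
For the recovery_merge.rs protocol-version comment, a self-contained sketch of the guarded fallback. The constant value 4 is an assumption inferred from the PROTOCOL_VERSION 3→4 bump noted in the walkthrough:

const REBIND_IDENTITY_MIN_PROTOCOL_VERSION: u32 = 4; // assumed from the 3→4 bump

fn rebind_protocol_ok(live: Option<u32>, persisted: i32) -> bool {
    // A negative persisted protocol_version means "unknown"; casting it
    // straight to u32 would wrap to a huge value and pass the gate.
    let fallback = if persisted < 0 {
        None
    } else {
        Some(persisted as u32)
    };
    live.or(fallback)
        .map(|v| v >= REBIND_IDENTITY_MIN_PROTOCOL_VERSION)
        .unwrap_or(false)
}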
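
And for the history-merge DELETE comment, a sketch of the narrowed deletion using a subquery on matching timestamps. The {table}/{time_column} interpolation and $-style placeholders follow the comment's own naming; treat the exact sea-orm call shape as illustrative:

use sea_orm::{ConnectionTrait, DatabaseConnection, DbErr, Statement};

async fn delete_conflicting_target_rows(
    db: &DatabaseConnection,
    table: &str,
    time_column: &str,
    target_server_id: &str,
    source_server_id: &str,
) -> Result<(), DbErr> {
    // Remove only target rows whose timestamp collides with a source row,
    // preserving target-only samples outside the conflict set.
    let sql = format!(
        "DELETE FROM {table} WHERE server_id = $1 AND {time_column} IN \
         (SELECT {time_column} FROM {table} WHERE server_id = $2)"
    );
    db.execute(Statement::from_sql_and_values(
        db.get_database_backend(),
        &sql,
        [target_server_id.into(), source_server_id.into()],
    ))
    .await?;
    Ok(())
}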

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b19501b8-df72-49fe-bf9c-61dda2939f80

📥 Commits

Reviewing files that changed from the base of the PR and between 8e6a5ad and 8bd2bbf.

📒 Files selected for processing (28)
  • apps/web/openapi.json
  • apps/web/src/components/server/recovery-merge-dialog.test.tsx
  • apps/web/src/components/server/recovery-merge-dialog.tsx
  • apps/web/src/hooks/use-api.ts
  • apps/web/src/hooks/use-servers-ws.test.ts
  • apps/web/src/hooks/use-servers-ws.ts
  • apps/web/src/lib/api-schema.ts
  • apps/web/src/lib/api-types.ts
  • apps/web/src/locales/en/servers.json
  • apps/web/src/locales/zh/servers.json
  • apps/web/src/routes/_authed/servers/$id.test.tsx
  • apps/web/src/routes/_authed/servers/$id.tsx
  • apps/web/src/stores/recovery-jobs-store.test.ts
  • apps/web/src/stores/recovery-jobs-store.ts
  • crates/agent/src/file_manager.rs
  • crates/agent/src/rebind.rs
  • crates/server/src/migration/m20260416_000017_create_recovery_job.rs
  • crates/server/src/router/api/agent.rs
  • crates/server/src/router/api/server.rs
  • crates/server/src/router/api/server_recovery.rs
  • crates/server/src/router/ws/browser.rs
  • crates/server/src/service/db_error.rs
  • crates/server/src/service/mod.rs
  • crates/server/src/service/recovery_job.rs
  • crates/server/src/service/recovery_merge.rs
  • crates/server/src/service/traffic.rs
  • crates/server/src/task/record_writer.rs
  • docs/superpowers/plans/2026-04-16-agent-recovery-merge.md
✅ Files skipped from review due to trivial changes (4)
  • crates/server/src/router/api/server.rs
  • apps/web/src/locales/zh/servers.json
  • crates/server/src/migration/m20260416_000017_create_recovery_job.rs
  • apps/web/src/locales/en/servers.json
🚧 Files skipped from review as they are similar to previous changes (11)
  • apps/web/src/hooks/use-servers-ws.test.ts
  • apps/web/src/components/server/recovery-merge-dialog.test.tsx
  • crates/server/src/service/mod.rs
  • apps/web/src/stores/recovery-jobs-store.test.ts
  • apps/web/src/hooks/use-servers-ws.ts
  • apps/web/src/lib/api-schema.ts
  • apps/web/src/routes/_authed/servers/$id.test.tsx
  • apps/web/src/routes/_authed/servers/$id.tsx
  • crates/agent/src/rebind.rs
  • crates/server/src/service/traffic.rs
  • crates/server/src/router/ws/browser.rs

Comment thread apps/web/openapi.json
Comment on lines +12 to +15
"/api/agent/latest-version": {
"get": {
"tags": ["agent"],
"operationId": "latest_version",

⚠️ Potential issue | 🟡 Minor

Add top-level tag metadata for the new endpoint groups.

These changes expand agent beyond registration and introduce server-recovery, but the top-level tags section still only describes agent as registration and does not define server-recovery at all. The spec will still parse, but the generated docs lose the right description/order for these new APIs.

📝 Suggested tag metadata update
   {
     "name": "agent",
-    "description": "Agent registration"
+    "description": "Agent registration and version metadata"
   },
+  {
+    "name": "server-recovery",
+    "description": "Admin-driven recovery merge workflow"
+  },

Also applies to: 3552-3553, 4125-4126


Comment on lines +98 to +117
#[utoipa::path(
get,
path = "/api/servers/{target_id}/recovery-candidates",
params(
("target_id" = String, Path, description = "Original offline server id")
),
responses(
(status = 200, description = "Recommended recovery candidates", body = Vec<RecoveryCandidateResponse>),
(status = 401, description = "Authentication required", body = crate::error::ErrorBody),
(status = 403, description = "Admin required", body = crate::error::ErrorBody),
(status = 404, description = "Target server not found", body = crate::error::ErrorBody),
(status = 409, description = "Target must be offline and not already in a running recovery job", body = crate::error::ErrorBody),
),
security(
("session_cookie" = []),
("api_key" = []),
("bearer_token" = [])
),
tag = "server-recovery"
)]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does utoipa support documenting generic wrapper response bodies like ApiResponse<T> for Axum handlers returning Json<ApiResponse<T>>, and what annotation pattern is recommended?

💡 Result:

Yes. Utoipa supports documenting generic wrapper response bodies like ApiResponse<T> for Axum handlers returning Json<ApiResponse<T>>: derive #[utoipa::ToSchema] on the generic ApiResponse<T> struct (with T: ToSchema), and explicitly reference the concrete instantiation in the #[utoipa::path(...)] macro's responses using a type alias defined via #[aliases(...)]. Recommended annotation pattern:

1. Define the generic wrapper:

   #[derive(utoipa::ToSchema)]
   pub struct ApiResponse<T> {
       // fields wrapping T, e.g. data: T, status: String, etc.
   }

2. Define a type alias for the concrete use via #[aliases] on the generic type:

   #[derive(utoipa::ToSchema)]
   #[aliases(ApiResponseMyType = ApiResponse<MyType>)]
   pub struct ApiResponse<T> { ... }

3. In the handler annotation, use the alias in body:

   #[derive(utoipa::ToSchema)]
   struct MyType { ... }

   #[utoipa::path(
       get,
       path = "/example",
       responses(
           (status = 200, body = ApiResponseMyType)
       )
   )]
   async fn handler() -> Json<ApiResponse<MyType>> { ... }

This instantiates the schema as ApiResponse_MyType (generic params are inlined by default), registers it via auto-collection, and references it correctly. Schemas for MyType are recursively collected; for reuse across handlers, define aliases for each T. Utoipa does not automatically infer response bodies from Axum handler return types like Json<ApiResponse<T>>; responses(...) must be explicitly specified in #[utoipa::path]. Schemas are auto-collected recursively from body references when ToSchema is derived.

🏁 Script executed:

cat -n crates/server/src/router/api/server_recovery.rs | sed -n '98,120p'

Repository: ZingerLittleBee/ServerBee

Length of output: 1209


🏁 Script executed:

cat -n crates/server/src/router/api/server_recovery.rs | sed -n '173,195p'

Repository: ZingerLittleBee/ServerBee

Length of output: 1087


🏁 Script executed:

cat -n crates/server/src/router/api/server_recovery.rs | sed -n '203,230p'

Repository: ZingerLittleBee/ServerBee

Length of output: 1515


🏁 Script executed:

# Check if ApiResponse has ToSchema and aliases defined
rg -A 10 "struct ApiResponse" crates/server/src/

Repository: ZingerLittleBee/ServerBee

Length of output: 602


🏁 Script executed:

# Check the actual handler return types
rg -B 2 "async fn" crates/server/src/router/api/server_recovery.rs | head -40

Repository: ZingerLittleBee/ServerBee

Length of output: 853


🏁 Script executed:

# Check if ApiResponse has ToSchema attribute
rg -B 5 "struct ApiResponse" crates/server/src/error.rs | head -20

🏁 Script executed:

# Check for any type aliases or #[aliases] in the file
rg "aliases" crates/server/src/router/api/server_recovery.rs

🏁 Script executed:

# Look at the actual return signature of list_candidates
rg -A 3 "^async fn list_candidates" crates/server/src/router/api/server_recovery.rs

🏁 Script executed:

# Verify the responses() body declarations match actual return types for all three handlers
cat -n crates/server/src/router/api/server_recovery.rs | sed -n '118,130p'

🏁 Script executed:

cat -n crates/server/src/router/api/server_recovery.rs | sed -n '192,200p'

🏁 Script executed:

cat -n crates/server/src/router/api/server_recovery.rs | sed -n '225,235p'

Update OpenAPI response bodies to wrap with ApiResponse<T>.

All three handlers return Result<Json<ApiResponse<T>>, AppError>, but the #[utoipa::path] response body declarations are bare types without the wrapper. This causes OpenAPI documentation and generated clients to expect the wrong JSON structure.

Update lines 105, 180, and 211:

  • Line 105: body = Vec<RecoveryCandidateResponse> → body = ApiResponse<Vec<RecoveryCandidateResponse>>
  • Line 180: body = RecoveryJobResponse → body = ApiResponse<RecoveryJobResponse>
  • Line 211: body = RecoveryJobResponse → body = ApiResponse<RecoveryJobResponse>

Since ApiResponse<T> already derives #[utoipa::ToSchema], these changes will correctly document the wrapper shape and ensure generated clients deserialize { data: ... } as specified in the coding guideline.
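
As a concrete illustration, the corrected annotation for the candidates route could look like the sketch below. It assumes ApiResponse<T> is importable in this module and that the project's utoipa version accepts generic instantiations directly in body =; on older utoipa releases the #[aliases(...)] pattern from the analysis above is needed instead.

#[utoipa::path(
    get,
    path = "/api/servers/{target_id}/recovery-candidates",
    params(
        ("target_id" = String, Path, description = "Original offline server id")
    ),
    responses(
        // Body now documents the wrapper the handler actually returns:
        // Result<Json<ApiResponse<Vec<RecoveryCandidateResponse>>>, AppError>
        (status = 200, description = "Recommended recovery candidates", body = ApiResponse<Vec<RecoveryCandidateResponse>>),
        (status = 401, description = "Authentication required", body = crate::error::ErrorBody),
        (status = 403, description = "Admin required", body = crate::error::ErrorBody),
        (status = 404, description = "Target server not found", body = crate::error::ErrorBody),
    ),
    tag = "server-recovery"
)]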

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/router/api/server_recovery.rs` around lines 98 - 117: The
OpenAPI #[utoipa::path] response bodies must be changed to use the
ApiResponse<T> wrapper: replace any occurrence of body =
Vec<RecoveryCandidateResponse> with body =
ApiResponse<Vec<RecoveryCandidateResponse>> and replace body =
RecoveryJobResponse with body = ApiResponse<RecoveryJobResponse> in the
server-recovery route attributes so the documented response shape matches the
actual return type Result<Json<ApiResponse<T>>, AppError>; look for the
attributes that reference RecoveryCandidateResponse and RecoveryJobResponse in
this module and update them to use ApiResponse<...>.

Comment on lines +152 to +160
let mut candidates = server::Entity::find()
.filter(server::Column::Id.ne(target_id.as_str()))
.all(&state.db)
.await?
.into_iter()
.filter(|source| state.agent_manager.is_online(&source.id))
.filter(|source| !active_server_ids.contains(&source.id))
.map(|source| build_candidate_response(&target, &source))
.collect::<Vec<_>>();

⚠️ Potential issue | 🟡 Minor

Filter out candidates that can never pass recovery start.

This list only checks online/active-job state. Agents below the rebind protocol minimum still appear here, but RecoveryMergeService::validate_start_request() rejects them later, so the picker can surface impossible recovery choices.
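
A minimal sketch of the extra filter, assuming the agent manager's get_protocol_version accessor and the REBIND_IDENTITY_MIN_PROTOCOL_VERSION constant (both referenced later in this PR) are importable here:

let mut candidates = server::Entity::find()
    .filter(server::Column::Id.ne(target_id.as_str()))
    .all(&state.db)
    .await?
    .into_iter()
    .filter(|source| state.agent_manager.is_online(&source.id))
    .filter(|source| !active_server_ids.contains(&source.id))
    // Hide agents whose live protocol version cannot pass the rebind gate,
    // so the picker never offers a choice validate_start_request() would reject.
    .filter(|source| {
        state
            .agent_manager
            .get_protocol_version(&source.id)
            .is_some_and(|v| v >= REBIND_IDENTITY_MIN_PROTOCOL_VERSION)
    })
    .map(|source| build_candidate_response(&target, &source))
    .collect::<Vec<_>>();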

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/router/api/server_recovery.rs` around lines 152 - 160: The
candidate list includes agents that are online/idle but below the rebind
protocol minimum, causing later rejection by
RecoveryMergeService::validate_start_request(); update the filter chain (around
server::Entity::find() ... .map(|source| build_candidate_response(...))) to also
exclude agents that don't meet the rebind-minimum check by querying the agent
manager (e.g. call something like
state.agent_manager.supports_rebind_minimum(&source.id) or compare
state.agent_manager.protocol_version(&source.id) against the
REBIND_MIN_PROTOCOL_VERSION constant) so only agents that can actually start
recovery are included.

Comment on lines +250 to +275
let txn = state.db.begin().await?;
let job = RecoveryMergeService::start_on_txn(&txn, target_id, source_server_id).await?;
if let Err(error) =
RecoveryMergeService::validate_dispatch_preconditions(state, target_id, source_server_id)
.await
{
txn.rollback().await?;
return Err(error);
}
let token = RecoveryMergeService::rotate_target_token_on_txn(&txn, target_id).await?;
txn.commit().await?;

if let Err(error) = sender
.send(ServerMessage::RebindIdentity {
job_id: job.job_id.clone(),
target_server_id: target_id.to_string(),
token,
})
.await
{
let message = format!("Failed to dispatch RebindIdentity to source agent: {error}");
RecoveryJobService::mark_failed(&state.db, &job.job_id, RECOVERY_STAGE_REBINDING, &message)
.await?;
return Err(AppError::Internal(format!(
"Failed to dispatch RebindIdentity to source agent: {error}"
)));

⚠️ Potential issue | 🟠 Major

Restore the old target token when dispatch never leaves the server.

The new token is committed before send(), and the failure path only marks the job failed. If that channel is already closed, no agent knows the new token and the original target can no longer reconnect with its existing secret.
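
A sketch of that shape; current_target_token_on_txn and restore_target_token are hypothetical helpers (the PR only exposes rotate_target_token_on_txn, so both would need to be added):

let txn = state.db.begin().await?;
let job = RecoveryMergeService::start_on_txn(&txn, target_id, source_server_id).await?;
// (dispatch-precondition validation elided for brevity)
// Hypothetical helper: remember the token the target currently holds.
let previous_token = RecoveryMergeService::current_target_token_on_txn(&txn, target_id).await?;
let token = RecoveryMergeService::rotate_target_token_on_txn(&txn, target_id).await?;
txn.commit().await?;

if let Err(error) = sender
    .send(ServerMessage::RebindIdentity {
        job_id: job.job_id.clone(),
        target_server_id: target_id.to_string(),
        token,
    })
    .await
{
    // The command never left the server, so no agent learned the new token.
    // Hypothetical helper: write the captured token back so the original
    // target can keep reconnecting with the secret it already holds.
    if let Err(restore_error) =
        RecoveryMergeService::restore_target_token(&state.db, target_id, &previous_token).await
    {
        tracing::error!("Failed to restore previous target token: {restore_error}");
    }
    let message = format!("Failed to dispatch RebindIdentity to source agent: {error}");
    RecoveryJobService::mark_failed(&state.db, &job.job_id, RECOVERY_STAGE_REBINDING, &message)
        .await?;
    return Err(AppError::Internal(message));
}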

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/router/api/server_recovery.rs` around lines 250 - 275:
Capture the current target token before calling
RecoveryMergeService::rotate_target_token_on_txn (store it in a local variable),
and if sender.send(...) fails after txn.commit(), restore the previous token in
the database by calling a recovery method (add or reuse a function such as
RecoveryMergeService::restore_target_token_on_txn or a new method that sets the
target token back using state.db) and await it before marking the job failed
with RecoveryJobService::mark_failed; ensure you still propagate the original
send error as AppError::Internal and handle any restore errors (log/propagate)
so the target can continue reconnecting with its original secret.

Comment on lines +3 to +11
pub(crate) fn is_unique_violation(err: &DbErr) -> bool {
let message = err.to_string();
message.contains("UNIQUE constraint failed") || message.contains("UNIQUE")
}

pub(crate) fn is_active_recovery_conflict(err: &DbErr) -> bool {
let message = err.to_string();
is_unique_violation(err) || message.contains("recovery_job_active_conflict")
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C3 "is_unique_violation|is_active_recovery_conflict" crates/server/src
rg -n -C5 "recovery_job_active_conflict|UNIQUE constraint failed" crates/server/src/migration crates/server/src/service

Narrow conflict detection; current UNIQUE match is too broad.

Line 5 (message.contains("UNIQUE")) can classify any UNIQUE constraint violation as an active recovery conflict via line 10, causing incorrect 409 Conflict responses and masking real failures. The database trigger explicitly raises 'recovery_job_active_conflict' as the error signal; the secondary is_unique_violation pathway should target only recovery_job-specific constraints or be removed if not required.

Suggested fix:
Narrow to specific recovery_job UNIQUE constraints or remove the is_unique_violation check from line 10 if the trigger's explicit error message is the only signal needed for active recovery conflicts:

Option A: Target specific constraints
 pub(crate) fn is_unique_violation(err: &DbErr) -> bool {
     let message = err.to_string();
-    message.contains("UNIQUE constraint failed") || message.contains("UNIQUE")
+    message.contains("UNIQUE constraint failed: recovery_job.target_server_id")
+        || message.contains("UNIQUE constraint failed: recovery_job.source_server_id")
 }
Option B: Rely on explicit trigger error only
 pub(crate) fn is_active_recovery_conflict(err: &DbErr) -> bool {
     let message = err.to_string();
-    is_unique_violation(err) || message.contains("recovery_job_active_conflict")
+    message.contains("recovery_job_active_conflict")
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/service/db_error.rs` around lines 3 - 11: The current
is_active_recovery_conflict(err: &DbErr) incorrectly treats any UNIQUE violation
as an active recovery conflict because is_unique_violation(err) matches generic
"UNIQUE"; update the logic in is_active_recovery_conflict to only check for the
explicit trigger string "recovery_job_active_conflict" (i.e., remove the
is_unique_violation(err) branch) OR alternatively narrow is_unique_violation to
match only the specific recovery_jobs constraint name used by the DB trigger;
adjust the functions is_unique_violation and/or is_active_recovery_conflict
accordingly so active recovery conflicts are detected solely by the
trigger-specific identifier or the precise constraint name.

Comment on lines +363 to +368
let protocol_version = state
.agent_manager
.get_protocol_version(&source.id)
.unwrap_or(source.protocol_version as u32);

if protocol_version < REBIND_IDENTITY_MIN_PROTOCOL_VERSION {

⚠️ Potential issue | 🟠 Major

Reject negative persisted protocol versions before the v4 gate.

source.protocol_version as u32 wraps negative rows into huge values, so a bad/migrated -1 will satisfy the minimum check and let an unsupported source into the recovery flow.
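
The wrap is plain Rust cast semantics, independent of this codebase:

fn main() {
    // `as` between integers reinterprets bits rather than checking the value,
    // so a persisted -1 becomes u32::MAX and clears any minimum-version gate.
    assert_eq!(-1i32 as u32, u32::MAX);
    // u32::try_from makes the failure explicit, which the fix below leans on.
    assert_eq!(u32::try_from(-1i32).ok(), None);
}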

🛠️ Suggested fix
         let protocol_version = state
             .agent_manager
             .get_protocol_version(&source.id)
-            .unwrap_or(source.protocol_version as u32);
+            .or_else(|| u32::try_from(source.protocol_version).ok())
+            .unwrap_or(0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/service/recovery_merge.rs` around lines 363 - 368: The code
casts source.protocol_version to u32 directly which wraps negative values (e.g.
-1 -> huge u32) and can bypass the REBIND_IDENTITY_MIN_PROTOCOL_VERSION check;
fix by treating negative persisted protocol versions as absent before the v4
gate: compute a persisted_protocol option (e.g., if source.protocol_version < 0
then None else Some(source.protocol_version as u32)) and use that as the
fallback for state.agent_manager.get_protocol_version(&source.id).unwrap_or(...)
so negative values do not get cast to large u32s and cannot satisfy the
minimum-version check.

Comment on lines +643 to +661
db.execute(Statement::from_sql_and_values(
db.get_database_backend(),
format!(
"DELETE FROM {table} \
WHERE server_id = $1 \
AND (SELECT MIN({time_column}) FROM {table} WHERE server_id = $2) IS NOT NULL \
AND {time_column} >= (SELECT MIN({time_column}) FROM {table} WHERE server_id = $2) \
AND {time_column} <= (SELECT MAX({time_column}) FROM {table} WHERE server_id = $2)"
),
[target_server_id.into(), source_server_id.into()],
))
.await?;

db.execute(Statement::from_sql_and_values(
db.get_database_backend(),
format!("UPDATE {table} SET server_id = $1 WHERE server_id = $2"),
[target_server_id.into(), source_server_id.into()],
))
.await?;

⚠️ Potential issue | 🔴 Critical

Don't delete the whole source min/max window.

This removes every target row between the first and last source timestamp. Sparse source data will erase target-only samples that happen to fall in the gaps, which is permanent merge-time data loss. For example, if the source only recorded samples at 10:00 and 12:00, a target-only sample at 11:00 is deleted even though no source row ever conflicts with it.

🛠️ Suggested fix
         db.execute(Statement::from_sql_and_values(
             db.get_database_backend(),
             format!(
-                "DELETE FROM {table} \
-                 WHERE server_id = $1 \
-                 AND (SELECT MIN({time_column}) FROM {table} WHERE server_id = $2) IS NOT NULL \
-                 AND {time_column} >= (SELECT MIN({time_column}) FROM {table} WHERE server_id = $2) \
-                 AND {time_column} <= (SELECT MAX({time_column}) FROM {table} WHERE server_id = $2)"
+                "DELETE FROM {table} AS target \
+                 WHERE target.server_id = $1 \
+                 AND EXISTS ( \
+                     SELECT 1 FROM {table} AS source \
+                     WHERE source.server_id = $2 \
+                     AND source.{time_column} = target.{time_column} \
+                 )"
             ),
             [target_server_id.into(), source_server_id.into()],
         ))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/service/recovery_merge.rs` around lines 643 - 661: The
current DELETE removes every target row with timestamps between the source min
and max, causing loss of target-only samples; change the deletion to only remove
target rows that actually conflict with source timestamps. Replace the DELETE
that uses the min/max window (the db.execute call building the DELETE with
{time_column} >=/...<= MIN/MAX) with a query that deletes rows from the target
server where an actual source row exists at the same timestamp (e.g., use an IN
(SELECT {time_column} FROM {table} WHERE server_id = $2) or a correlated EXISTS
subquery) so only direct timestamp conflicts between source_server_id and
target_server_id are removed before the UPDATE that sets server_id for source
rows.

Comment on lines +943 to +977
if source.fingerprint.is_some() {
server::Entity::update_many()
.col_expr(server::Column::Fingerprint, Expr::value(None::<String>))
.col_expr(server::Column::UpdatedAt, Expr::value(Utc::now()))
.filter(server::Column::Id.eq(source.id.clone()))
.exec(db)
.await?;
}

let target = server::Entity::find_by_id(target_server_id)
.one(db)
.await?
.ok_or_else(|| AppError::NotFound("Server not found".to_string()))?;

let mut active: server::ActiveModel = target.into();
active.cpu_name = sea_orm::Set(source.cpu_name.clone());
active.cpu_cores = sea_orm::Set(source.cpu_cores);
active.cpu_arch = sea_orm::Set(source.cpu_arch.clone());
active.os = sea_orm::Set(source.os.clone());
active.kernel_version = sea_orm::Set(source.kernel_version.clone());
active.mem_total = sea_orm::Set(source.mem_total);
active.swap_total = sea_orm::Set(source.swap_total);
active.disk_total = sea_orm::Set(source.disk_total);
active.ipv4 = sea_orm::Set(source.ipv4.clone());
active.ipv6 = sea_orm::Set(source.ipv6.clone());
active.region = sea_orm::Set(source.region.clone());
active.country_code = sea_orm::Set(source.country_code.clone());
active.virtualization = sea_orm::Set(source.virtualization.clone());
active.agent_version = sea_orm::Set(source.agent_version.clone());
active.protocol_version = sea_orm::Set(source.protocol_version);
active.features = sea_orm::Set(source.features.clone());
active.last_remote_addr = sea_orm::Set(source.last_remote_addr.clone());
active.fingerprint = sea_orm::Set(source.fingerprint.clone());
active.updated_at = sea_orm::Set(Utc::now());
active.update(db).await?;

⚠️ Potential issue | 🟠 Major

Move the source fingerprint clear after the target write succeeds.

This helper mutates the source row first. If the target lookup or update fails, recovery loses the only persisted fingerprint without ever transferring it.

🛠️ Suggested fix
-        if source.fingerprint.is_some() {
-            server::Entity::update_many()
-                .col_expr(server::Column::Fingerprint, Expr::value(None::<String>))
-                .col_expr(server::Column::UpdatedAt, Expr::value(Utc::now()))
-                .filter(server::Column::Id.eq(source.id.clone()))
-                .exec(db)
-                .await?;
-        }
-
         let target = server::Entity::find_by_id(target_server_id)
             .one(db)
             .await?
             .ok_or_else(|| AppError::NotFound("Server not found".to_string()))?;
@@
         active.fingerprint = sea_orm::Set(source.fingerprint.clone());
         active.updated_at = sea_orm::Set(Utc::now());
         active.update(db).await?;
+
+        if source.fingerprint.is_some() {
+            server::Entity::update_many()
+                .col_expr(server::Column::Fingerprint, Expr::value(None::<String>))
+                .col_expr(server::Column::UpdatedAt, Expr::value(Utc::now()))
+                .filter(server::Column::Id.eq(source.id.clone()))
+                .exec(db)
+                .await?;
+        }

Comment on lines +46 to +47
let writes_allowed = state.recovery_lock.writes_allowed_for(server_id);


⚠️ Potential issue | 🟠 Major

Re-check recovery lock at each write to avoid TOCTOU on freeze/release

writes_allowed is snapshotted once at Line 46, then reused after multiple awaits. If freeze()/release() happens mid-iteration, subsequent writes can violate current lock state. Re-evaluate lock immediately before each DB write.

Proposed fix
-            let writes_allowed = state.recovery_lock.writes_allowed_for(server_id);
-
             // Save metrics record
-            if writes_allowed {
+            if state.recovery_lock.writes_allowed_for(server_id) {
                 if let Err(e) = RecordService::save_report(&state.db, server_id, report).await {
                     tracing::error!("Failed to save record for {server_id}: {e}");
                 } else {
                     count += 1;
                 }
             } else {
                 tracing::info!("Skipping recovery-frozen record write for {server_id}");
             }
@@
-                    if writes_allowed {
+                    if state.recovery_lock.writes_allowed_for(server_id) {
                         if let Err(e) = TrafficService::upsert_state(
                             &state.db,
                             server_id,
@@
-                if writes_allowed {
+                if state.recovery_lock.writes_allowed_for(server_id) {
                     if let Err(e) = TrafficService::upsert_hourly(
                         &state.db,
                         server_id,
@@
-            if writes_allowed {
+            if state.recovery_lock.writes_allowed_for(server_id) {
                 if let Err(e) =
                     TrafficService::upsert_state(&state.db, server_id, curr_in, curr_out).await
                 {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/server/src/task/record_writer.rs` around lines 46 - 47: The code
currently snapshots writes_allowed with
state.recovery_lock.writes_allowed_for(server_id) and reuses it across awaits,
allowing a TOCTOU if freeze()/release() changes the lock mid-iteration; change
the logic to call state.recovery_lock.writes_allowed_for(server_id) immediately
before each actual DB write (i.e., inside the loop/just prior to invoking the
write operation) and abort or handle the write if the fresh check disallows it,
removing reliance on the earlier writes_allowed variable and ensuring the check
happens after awaits and right before the write.

Comment on lines +76 to +78
- Create: `apps/web/src/stores/recovery-jobs-store.ts`
Holds live recovery job state keyed by `target_server_id` and `job_id`.
- Create: `apps/web/src/stores/recovery-jobs-store.test.ts`

⚠️ Potential issue | 🟡 Minor

Correct the store key description in the plan.

Line 77 says the store is keyed by target_server_id and job_id, but the implemented store is keyed only by target_server_id. Please update this line to avoid future implementation drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-16-agent-recovery-merge.md` around lines 76 -
78: The plan's description mistakenly states the store is keyed by
`target_server_id` and `job_id`; update the text for `recovery-jobs-store` to
reflect the actual implementation which keys entries only by `target_server_id`
(remove `job_id` from the key description) so the plan matches the
`recovery-jobs-store` behavior.

@ZingerLittleBee ZingerLittleBee merged commit 1838a4f into main Apr 16, 2026
10 checks passed
@ZingerLittleBee ZingerLittleBee deleted the agent-recovery-merge branch April 16, 2026 15:05