
feat(db) resource version cas #1292

Open
derekwaynecarr wants to merge 1 commit into NVIDIA:main from derekwaynecarr:feat/db-resource-version-cas


@derekwaynecarr
Collaborator

@derekwaynecarr derekwaynecarr commented May 9, 2026

Summary

Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.

Core infrastructure:

  • Add resource_version field to ObjectMeta proto (uint64)
  • Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT)
  • Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional)
  • Add PersistenceError::Conflict variant for version mismatch
  • Add Store::put_if() and Store::delete_if() CAS methods
  • Add Store::update_message_cas() with bounded retry for mutations
  • Implement CAS operations for both SQLite and Postgres backends
  • Hydrate resource_version on all typed reads (defaults to 1 for backfill)
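The conditional-write semantics listed above can be illustrated with a small in-memory sketch. This is not the PR's implementation: `MemStore`, the `put_if` signature, the unit-like `Conflict` variant, and the choice to report a failed `MustCreate` as `Conflict` are all assumptions made for illustration. Only the names `WriteCondition`, its three variants, `PersistenceError::Conflict`, and `put_if` come from the summary.

```rust
use std::collections::HashMap;

/// Precondition attached to a write (variant names taken from the PR summary).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteCondition {
    MustCreate,
    MatchResourceVersion(u64),
    Unconditional,
}

/// Assumed shape: the real error type may carry more detail.
#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    Conflict,
}

/// Hypothetical in-memory stand-in for the objects table:
/// key -> (resource_version, payload).
#[derive(Default)]
pub struct MemStore {
    rows: HashMap<String, (u64, Vec<u8>)>,
}

impl MemStore {
    /// Writes `payload` only if `cond` holds for the current row.
    /// Returns the new resource_version on success.
    pub fn put_if(
        &mut self,
        key: &str,
        payload: Vec<u8>,
        cond: WriteCondition,
    ) -> Result<u64, PersistenceError> {
        let current = self.rows.get(key).map(|(version, _)| *version);
        let allowed = match (cond, current) {
            (WriteCondition::MustCreate, None) => true,
            (WriteCondition::MustCreate, Some(_)) => false,
            (WriteCondition::MatchResourceVersion(want), Some(have)) => want == have,
            (WriteCondition::MatchResourceVersion(_), None) => false,
            (WriteCondition::Unconditional, _) => true,
        };
        if !allowed {
            return Err(PersistenceError::Conflict);
        }
        // New rows start at 1; updates bump the stored version by one.
        let next = current.map_or(1, |version| version + 1);
        self.rows.insert(key.to_string(), (next, payload));
        Ok(next)
    }
}
```

In this sketch a writer holding a stale version gets `Conflict` rather than silently overwriting a newer row, which is the lost-update case the summary targets.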

Migrations:

  • Migrate policy mutations to CAS (draft operations, settings)
  • Migrate provider updates to CAS (credentials, config merging)
  • Migrate sandbox updates to CAS (phase transitions, status reconciliation)
  • Migrate compute status updates to CAS (driver watch event handling)

Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.

gRPC handlers map PersistenceError::Conflict to ABORTED status code
to signal clients to retry with fresh data. Server-side retries use
bounded retry (5 attempts) with fresh reads on each iteration.
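A rough sketch of the retry and status-mapping behavior described above, under stated assumptions: `update_with_retry`, `to_grpc_code`, `GrpcCode`, and the `Other` error variant are hypothetical names for illustration, and `GrpcCode` is a local stand-in for whatever status type the gRPC library provides. Only the five-attempt bound, the fresh read per iteration, and the Conflict-to-ABORTED mapping come from the description.

```rust
/// Assumed error shape; only `Conflict` is named in the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    Conflict,
    Other(String),
}

/// Local stand-in for a gRPC status code type.
#[derive(Debug, PartialEq, Eq)]
pub enum GrpcCode {
    Aborted,
    Internal,
}

/// A version conflict becomes ABORTED so clients know to re-read and retry.
pub fn to_grpc_code(err: &PersistenceError) -> GrpcCode {
    match err {
        PersistenceError::Conflict => GrpcCode::Aborted,
        PersistenceError::Other(_) => GrpcCode::Internal,
    }
}

/// Retry bound stated in the PR description.
pub const MAX_CAS_ATTEMPTS: usize = 5;

/// Read-modify-write loop: every attempt re-reads the object and its
/// version, applies the mutation, and writes conditionally on that version.
pub fn update_with_retry<T, R, M, W>(
    mut read: R,
    mut mutate: M,
    mut write: W,
) -> Result<u64, PersistenceError>
where
    R: FnMut() -> Result<(T, u64), PersistenceError>,
    M: FnMut(&mut T),
    W: FnMut(&T, u64) -> Result<u64, PersistenceError>,
{
    for _ in 0..MAX_CAS_ATTEMPTS {
        let (mut value, version) = read()?;
        mutate(&mut value);
        match write(&value, version) {
            // Another writer won the race: start over from a fresh read.
            Err(PersistenceError::Conflict) => continue,
            other => return other,
        }
    }
    // Attempts exhausted: surface the conflict to the caller.
    Err(PersistenceError::Conflict)
}
```

Because the mutation closure runs again on each attempt, it is applied to the freshly read value rather than a stale copy. When all attempts fail, the caller would map the remaining `Conflict` through `to_grpc_code`.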

Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.

Related Issue

Fixes #1255

Changes

Testing

  • [x] mise run pre-commit passes
  • [x] Unit tests added/updated
  • [x] E2E tests added/updated (if applicable)

Checklist

  • [x] Follows Conventional Commits
  • [x] Commits are signed off (DCO)
  • [x] Architecture docs updated (if applicable)

@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 9, 2026 13:59
@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@derekwaynecarr derekwaynecarr changed the title from "Feat/db resource version cas" to "feat(db) resource version cas" May 9, 2026
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 30e8540 to 46c93c3 May 10, 2026 01:38
Comment thread crates/openshell-server/src/grpc/provider.rs Outdated
Comment thread crates/openshell-server/src/grpc/provider.rs Outdated
Comment thread crates/openshell-server/src/grpc/policy.rs Outdated
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 46c93c3 to 4931e83 May 14, 2026 05:20
@derekwaynecarr
Collaborator Author

@johntmyers PTAL. Updated per prior feedback, and also updated all proto paths that required CAS enablement. The recently merged draft chunk work is not covered right now; I would like to leave that as a follow-on.

Comment thread crates/openshell-server/src/grpc/provider.rs
Comment thread crates/openshell-server/src/compute/mod.rs
Comment thread crates/openshell-server/src/compute/mod.rs Outdated
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 4931e83 to 4456536 May 16, 2026 01:06
Add resource_version-based optimistic concurrency control to the
persistence layer. Every write now requires an explicit WriteCondition
(MustCreate or MatchResourceVersion), enforced at compile time by
gating unconditional put/put_message behind #[cfg(test)].

- Add WriteCondition enum and put_if for conditional writes
- Add update_message_cas for atomic read-modify-write operations
- Add list_messages/list_messages_with_selector helpers that hydrate
  resource_version from authoritative DB rows
- Convert all production write paths to CAS-aware methods
- Gate put/put_message behind #[cfg(test)] to prevent non-CAS writes
- Use structured PersistenceError::UniqueViolation matching instead of
  string matching for duplicate detection
- Hydrate resource_version from WriteResult directly on creates,
  eliminating unnecessary read-after-write round trips

Signed-off-by: Derek Carr <decarr@redhat.com>
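
Two points from the commit message above can be sketched together: gating the unconditional `put` behind `#[cfg(test)]`, and hydrating `resource_version` from the write result on create. Everything here beyond the names `WriteCondition`, `put_if`, `put`, `WriteResult`, `PersistenceError::Conflict`, and `PersistenceError::UniqueViolation` is an assumption for illustration. That includes the `Sandbox` struct, the `WriteResult` field, the `create_sandbox` helper, and the choice to return `UniqueViolation` when `MustCreate` hits an existing row.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteCondition {
    MustCreate,
    MatchResourceVersion(u64),
}

#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    Conflict,
    UniqueViolation,
}

/// Hypothetical shape: the real WriteResult may carry other fields.
#[derive(Debug, PartialEq, Eq)]
pub struct WriteResult {
    pub resource_version: u64,
}

/// Hypothetical stored object, reduced to a name and a version.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Sandbox {
    pub name: String,
    pub resource_version: u64,
}

#[derive(Default)]
pub struct Store {
    rows: HashMap<String, Sandbox>,
}

impl Store {
    /// The only write path compiled into non-test builds.
    pub fn put_if(
        &mut self,
        obj: &Sandbox,
        cond: WriteCondition,
    ) -> Result<WriteResult, PersistenceError> {
        let next = match (cond, self.rows.get(&obj.name)) {
            (WriteCondition::MustCreate, None) => 1,
            (WriteCondition::MustCreate, Some(_)) => {
                return Err(PersistenceError::UniqueViolation)
            }
            (WriteCondition::MatchResourceVersion(v), Some(row))
                if row.resource_version == v => v + 1,
            (WriteCondition::MatchResourceVersion(_), _) => {
                return Err(PersistenceError::Conflict)
            }
        };
        let mut stored = obj.clone();
        stored.resource_version = next;
        self.rows.insert(obj.name.clone(), stored);
        Ok(WriteResult { resource_version: next })
    }

    /// Unconditional write, compiled only for tests, so production code
    /// that calls it fails to build.
    #[cfg(test)]
    pub fn put(&mut self, obj: &Sandbox) {
        self.rows.insert(obj.name.clone(), obj.clone());
    }
}

/// Create path: the version comes straight from the write result,
/// with no read-after-write round trip.
pub fn create_sandbox(store: &mut Store, name: &str) -> Result<Sandbox, PersistenceError> {
    let mut sandbox = Sandbox { name: name.to_string(), resource_version: 0 };
    let written = store.put_if(&sandbox, WriteCondition::MustCreate)?;
    sandbox.resource_version = written.resource_version;
    Ok(sandbox)
}
```

Matching on a structured variant such as `UniqueViolation` lets a caller detect duplicates without inspecting error strings, which is the change the commit message describes.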
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 4456536 to d65c359 May 16, 2026 01:21
@derekwaynecarr
Collaborator Author

All prior comments are resolved.

  • All create/update paths for all objects that have an ObjectMeta with resource version have been updated.
  • All list and gets should properly hydrate resource version as well.
  • Updated arch docs.

@derekwaynecarr
Collaborator Author

/ok to test d65c359

@derekwaynecarr
Collaborator Author

failure due to flaky test, fix #1417



Development

Successfully merging this pull request may close these issues.

feat(gateway): add DB-backed resource_version CAS for stored objects

2 participants