fix: support Curvine LanceDB clone and conditional writes#824
Conversation
| } | ||
| } | ||
|
|
||
| async fn release_object_write_lock(&self, guard: &ConditionalWriteLock) -> OsResult<()> { |
There was a problem hiding this comment.
The lock acquisition/release pair has two correctness gaps that LanceDB conditional writes will hit in production:
-
No ownership check on release.
release_object_write_lockunconditionally sendsLockType::UnLockagainstguard.path. If process A acquired the lock, hit the 120 s timeout doing nothing (or stalled on GC), and process B subsequently re-acquired the same lock and started its critical section, process A'sreleasecall will clear B's lock and let C charge in. The conditional-update etag check on disk still narrows the race window, but the staging+rename sequence is no longer the only writer between B'shead()and B's rename(). Consider gating release on the sameowner_id/pidtuple that acquired the lock (server-side compare-and-clear), or returning the previous lock state and refusing to clear a foreign owner. -
No TTL / lease, no crash recovery.
conditional_write_lockwritesacquire_time: 0and there's no heartbeat or expiry. If the LanceDB process iskill -9'd (OOM, deploy roll, machine reboot) mid-put_update, the advisory lock at/.curvine/lancedb/locks/<md5>stays held forever from Curvine's point of view. Every subsequent conditional write to that key will hitTimed out waiting for Curvine object write lockafter 120 s and never recover, and the lock files accumulate at/.curvine/lancedb/locks/with no GC. At minimum this needs:- A reasonable TTL/lease on the lock (so server reaps abandoned holders).
- A non-zero
acquire_timeso server-side reaping has a basis. - A periodic cleanup path for the lock-file directory, or a documented operator runbook for the leak.
These are blockers for using PutMode::Update as a real "this is safe to commit metadata on" primitive, which is the whole point of the conditional path.
There was a problem hiding this comment.
Checked this against the Curvine server lock implementation and I do not think the two core risks apply to the current code path.
- Release is owner-scoped:
LockMeta::set_flockremoves bykey = client_id + owner_id, and this facade stores the same generatedowner_idin the guard used for unlock. A stale A unlock cannot remove B’s later lock because B has a different owner key. - TTL/acquire_time are server-side:
LockMeta::set_flockoverwritesacquire_time = LocalTime::mills(), andMasterFilesystem::set_lockpassesmaster.lock_expire_time_ms()(default 5m). The client sends 0, but that is not persisted for acquired locks.
The lock files themselves may remain as marker files under /.curvine/lancedb/locks, but the held lock state is stored in inode lock metadata and is expired by the existing Curvine lock path. I am leaving this unchanged in this PR.
| .cloned() | ||
| .or_else(|| env::var(ClusterConf::ENV_CONF_FILE).ok()) | ||
| .ok_or_else(missing_curvine_config_error)?; | ||
| Ok(format!("{CURVINE_SCHEME}${conf_path}")) |
There was a problem hiding this comment.
Two concerns about embedding the raw config-file path in the store prefix:
-
Cache key fragility. Lance uses the prefix as the dedup key in the
ObjectStoreRegistry. Two callers pointing at the same Curvine cluster via different conf paths —/etc/curvine/cluster.tomlvs~/.curvine/cluster.toml, or absolute vs symlinked, or the conf file in two working dirs of the same binary — will get two distinctCurvineObjectStoreinstances, double the Curvine client connections, and double thecurvine_session()allocations. For an integration like this where the intent is "one store per cluster", consider deriving the prefix from the cluster's master endpoint(s) parsed out of the conf (or a hash of the resolved cluster identity) rather than the raw file path. -
Mandatory config at prefix-calculation time. Today
calculate_object_store_prefixis called by Lance on essentially every URI parse, including paths that go through the registry but don't ultimately need a Curvine connection (read-only manifest probes, listing, etc.). Making the function fail with"Missing Curvine cluster configuration"when the env var isn't set is a hard coupling that surprised me readingclone_tableinconnection.rs; consider falling back to a stablecurvine$<authority>/<path>identity when the conf isn't resolvable, and only failing innew_storewhere the configuration is actually used.
There was a problem hiding this comment.
Fixed in d6f4199d.
calculate_object_store_prefix no longer embeds the raw config-file path and no longer fails just because config is absent during prefix calculation. With a config file it parses the Curvine cluster config and derives the registry identity from the resolved master endpoints; without config it falls back to a stable URI-derived identity and leaves the real configuration failure to new_store.
I also added facade tests for both behaviors.
| store: CURVINE_SCHEME, | ||
| source: e.to_string().into(), | ||
| })?; | ||
| if self.dir_contains_visible_file(&child).await? { |
There was a problem hiding this comment.
dir_contains_visible_file does a full recursive directory walk for every top-level subdirectory found during list_with_delimiter, which inflates the RPC count significantly:
list_with_delimiter(None)on a workspace with N top-level directories and on average M files per subtree will issue O(N * depth)list_statusRPCs and walk O(N * M) files just to decide whether the directory should appear as acommon_prefix. LanceDB callslist_with_delimiteron hot paths (manifest fanout, version table discovery), so this scales poorly as the dataset count in a Curvine workspace grows.- The walk also has no caching, so back-to-back
list_with_delimitercalls pay the full cost again.
A couple of alternatives that keep the "don't return empty / staging-only prefix" semantics without the full walk:
- Stop the recursion as soon as the first user-visible file is found — the current
Box::pinrecursion already does this, but it still has to do the full walk for genuinely-empty staging-only directories (which is the common case during multipart in-flight). A shallower test ("the subtree contains at least one file that isn't.curvine/...") is cheap if Curvine exposes ahas_visible_descendantserver-side primitive, or by special-casing the multipart/lock root paths (which are the only known "filler" sources right now) and skipping them at the listing level rather than after the fact. - Alternatively, drop the visibility pruning entirely and let LanceDB see the empty prefix — Lance's manifest discovery code can already tolerate prefixes that don't contain a
.lancemanifest.
There was a problem hiding this comment.
Fixed in d6f4199d.
list_with_delimiter no longer calls dir_contains_visible_file, so it does not recursively walk each candidate directory. It now filters only known internal Curvine LanceDB directories (/.curvine/lancedb/multipart and /.curvine/lancedb/locks) and otherwise returns the immediate directory as a common_prefix, matching normal delimiter semantics and avoiding the hot-path O(N*M) scan.
584ebe8 to
2d63c0e
Compare
What changed
clone_tablewiring forcurvine://LanceDB databasesPutMode::Updatewith Curvine-backed conditional write locking and eTag checksWhy
This is PR 4 in the LanceDB on Curvine stack. It builds on PR #823 by adding the LanceDB semantics that require conditional object updates and correct URI/path handling across clone operations.
Stack
Validation
cargo fmt --allcargo clippy -p curvine-lancedb-rs --all-targets --all-features -- -D warningscargo test -p curvine-lancedb-rscargo clippy -p curvine-tests --test lancedb_object_store_e2e -- -D warnings