Skip to content

feat(retrieval): Phase 3 — upsert / delete#241

Open
jamby77 wants to merge 1 commit into
masterfrom
feature/retrieval-sdk-phase3-upsert-delete
Open

feat(retrieval): Phase 3 — upsert / delete#241
jamby77 wants to merge 1 commit into
masterfrom
feature/retrieval-sdk-phase3-upsert-delete

Conversation

@jamby77

@jamby77 jamby77 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Phase 3 — upsert / delete

Stacked on #240 (Phase 2) — base is the Phase 2 branch, not master.

Adds document write + delete to Retriever, with embedFn and dimension inference wired in.

What's new

  • embedFn constructor option; resolveDims() = schema.vector.dims or a one-time cached embedFn('probe'). createIndex now infers dims via embedFn when the schema omits them (deferred from Phase 2); the explicit-dims path is unchanged.
  • upsert(entries) — embeds text, validates vector length against resolved dims, HSET {name}:{id} with stringified fields, encodeFloat32 vector, and the reserved __text field. Rejects entries whose fields collide with the reserved __text or the vector field name.
  • delete(ids)DEL the derived {name}:{id} keys in one call. Both methods no-op on empty input.
  • Exports: EmbedFn, UpsertEntry, TEXT_FIELD, and resolveVectorFieldName (so Phase 4 query shares the text-field name and vector-field resolution).

Tests

12 unit tests (mocked client + fake embedFn), full package suite 54/54 green, tsc --noEmit + prettier clean. Includes a cache test asserting embedFn is probed exactly once across multiple entries.

Review-driven changes

A pre-PR code review surfaced and fixed: the reserved-field collision guard, exporting TEXT_FIELD/resolveVectorFieldName for Phase 4, and the cache-coverage test.

Deferred (not in this phase)

  • Batch embedding: upsert embeds sequentially. Bulk-ingestion parallelism needs bounded concurrency (rate-limit safety) — belongs with Phase 5 (observability/perf), not a naive Promise.all.
  • EmbedFn type is duplicated with semantic-cache; consolidating into packages/shared is a separate cross-package change.
  • Real-Valkey coverage (key-prefix/index-match, round-trip) lands in Phase 6 integration.

Note

Medium Risk
New Valkey write paths and embedding validation affect ingestion correctness; behavior is well covered by mocked unit tests but not yet by real-Valkey integration.

Overview
Adds document write and delete to Retriever, plus optional embedFn and shared dimension resolution for index creation and upserts.

createIndex now calls resolveDims() before FT.CREATE: use schema.vector.dims when set, otherwise a one-time cached embedFn('probe') (errors if neither is available).

upsert(entries) embeds each entry’s text, checks vector length against resolved dims, and **HSET**s {name}:{id} with stringified custom fields, encodeFloat32 on the vector field, and reserved __text. User fields cannot use __text or the configured vector field name; invalid batches fail before any HSET.

delete(ids) issues a single DEL for derived keys. Empty upsert / delete inputs are no-ops.

Public API additions: EmbedFn, UpsertEntry, TEXT_FIELD, and exported resolveVectorFieldName. New upsert-delete.test.ts covers inference, upsert/delete behavior, validation, and dim caching.

Reviewed by Cursor Bugbot for commit c6860e8. Bugbot is set up for automated code reviews on this repo. Configure here.

Base automatically changed from feature/retrieval-sdk-phase2-index-lifecycle to master June 15, 2026 16:51
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase3-upsert-delete branch from 3d23af1 to ebdfea4 Compare June 15, 2026 16:53

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ebdfea4. Configure here.

Comment thread packages/retrieval/src/retriever.ts
Comment thread packages/retrieval/src/retriever.ts
Comment thread packages/retrieval/src/retriever.ts
- Add embedFn to Retriever; resolve dims from schema or an embedFn probe (cached)
- createIndex infers dims via embedFn when the schema omits them
- upsert: embed text, HSET {name}:{id} with fields, encoded vector, and __text
- delete: DEL derived keys; both no-op on empty input
- Export resolveVectorFieldName from ft-create for reuse
- Export EmbedFn and UpsertEntry types

@KIvanow KIvanow left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, approving. Three non-blocking notes worth a follow-up:

  1. upsert isn't atomic at the wire level. Validation is all-or-nothing, but the HSET loop issues N separate calls, so if the connection drops mid-batch, entries 1..k-1 are already written. Fine for idempotent re-runs, but there's no MULTI. Matches the deferred batch/transaction note.

  2. HSET is merge, not replace. Re-upserting an id with fewer fields leaves stale fields from the prior write. If callers expect "replace document" semantics this surprises them. Probably intended as partial update, but worth documenting.

  3. id isn't validated. An empty id yields a key equal to the index prefix (docs:), with no guard. Edge case only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants