
feat: tiered data storage #7

Open
enigbe wants to merge 46 commits into main from 2025-10-tiered-data-storage

Conversation

@enigbe (Owner) commented Oct 21, 2025

What this PR does:

We introduce TierStore, a KVStore implementation that manages data across
three distinct storage layers.

The layers are:

  1. Primary: The main/remote data store.
  2. Ephemeral: A secondary store for non-critical, easily rebuildable data
    (e.g., network graph). This tier improves latency by using a local
    KVStore designed for fast access.
  3. Backup: A tertiary store for disaster recovery. Backup operations are sent
    asynchronously/lazily to avoid blocking primary store operations.

We also allow Node to be configured with these stores: callers can set
exponential back-off parameters as well as backup and ephemeral stores, and
then build the Node with TierStore's primary store. These configuration
options also extend to our foreign interface, allowing bindings targets to
build the Node with their own ffi::KVStore implementations.

A sample Python implementation is added and tested.
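To make the tier routing concrete, here is a minimal sketch of the idea,
assuming a simplified synchronous store trait. `SimpleStore`, `TieredSketch`,
and the namespace strings are illustrative stand-ins, not TierStore's actual
API:

```rust
use std::io;

pub trait SimpleStore: Send + Sync {
    fn write(&self, namespace: &str, key: &str, value: &[u8]) -> io::Result<()>;
}

pub struct TieredSketch<S: SimpleStore> {
    primary: S,
    ephemeral: Option<S>,
}

impl<S: SimpleStore> TieredSketch<S> {
    // Rebuildable cache data (e.g., network graph, scorer) is routed to
    // the fast local ephemeral tier when one is configured.
    fn is_ephemeral(namespace: &str) -> bool {
        matches!(namespace, "network_graph" | "scorer")
    }

    pub fn write(&self, namespace: &str, key: &str, value: &[u8]) -> io::Result<()> {
        match (&self.ephemeral, Self::is_ephemeral(namespace)) {
            (Some(eph), true) => eph.write(namespace, key, value),
            _ => self.primary.write(namespace, key, value),
        }
    }
}
```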

Additionally, we add comprehensive testing for TierStore by introducing:

  1. Unit tests for TierStore core functionality.
  2. Integration tests for Node built with tiered storage.
  3. Python FFI tests for foreign ffi::KVStore implementations.

Concerns

It is worth considering how retry logic is handled, especially given the
potential for nested retries. TierStore ships with a basic retry mechanism by
default, but some KVStore implementations come with retries baked in
(e.g., VssStore) and thus have no need for the wrapper store's own logic.
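To illustrate why nested retries are a concern, here is a small hypothetical
back-off model (all names and parameters are invented for the example): each
outer attempt can trigger a full inner retry cycle, so worst-case attempts
multiply rather than add.

```rust
use std::time::Duration;

struct Backoff {
    base: Duration,
    max_retries: u32,
}

impl Backoff {
    // Exponential back-off: base * 2^attempt.
    fn delay(&self, attempt: u32) -> Duration {
        self.base * 2u32.saturating_pow(attempt)
    }
}

// If a wrapped store (e.g. a VSS-backed one) already retries internally,
// wrapping it again multiplies the worst-case attempt count.
fn worst_case_attempts(outer: &Backoff, inner: &Backoff) -> u32 {
    (outer.max_retries + 1) * (inner.max_retries + 1)
}
```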

Resolve Node.js 20 deprecation warnings by updating all GitHub Actions
to their latest major versions supporting Node.js 24.

Co-Authored-By: HAL 9000
tankyleo and others added 18 commits April 15, 2026 15:55
This makes it consistent with the argument used in ldk-server's open
channel API.

Also rename a few more variables for consistency.
…erve

Add tests and improve documentation for zero reserve channels
Extract the repeated "acquire write lock on `node_metrics`, mutate a
field or two, then write the encoded struct to the kv-store" idiom into
a single helper in `io::utils`. As a side effect, `write_node_metrics`
is inlined into the helper.

Co-Authored-By: HAL 9000
Previously, after each successful Rapid Gossip Sync update the
background task wrote `latest_rgs_snapshot_timestamp` to the persisted
`NodeMetrics` immediately, while the network graph itself was only
flushed to disk later by LDK's background processor. A crash in that
window left the on-disk metric ahead of the on-disk graph — on restart
we'd resume RGS from the newer timestamp and permanently skip the
updates that were never persisted together with the graph.

Instead, seed the RGS start timestamp from
`NetworkGraph::get_last_rapid_gossip_sync_timestamp`, which is part of
the graph's own serialized state and therefore lands on disk atomically
with the channel updates it describes. The same source now backs the
RGS timestamp reported via `NodeStatus::latest_rgs_snapshot_timestamp`,
so the reported value always matches what's reflected in the graph.
Worst case after a crash is that we refetch the snapshots since the
last persisted graph — an idempotent operation — rather than silently
losing them.

The `latest_rgs_snapshot_timestamp` field is retired from `NodeMetrics`,
and TLV slot 6 is kept readable for backwards compatibility via LDK's
`legacy` TLV grammar. Old persisted records still deserialize; new
records no longer carry slot 6. The dead "reset RGS timestamp on
gossip-source switch" block in the P2P builder branch also goes away,
since the graph's timestamp remains the correct resume point across a
P2P→RGS switch.

Co-Authored-By: HAL 9000
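A sketch of the new resume point. `get_last_rapid_gossip_sync_timestamp` is
LDK's real `NetworkGraph` accessor; the wrapper function here is illustrative:

```rust
use lightning::routing::gossip::NetworkGraph;
use lightning::util::logger::Logger;
use std::ops::Deref;

// The timestamp is part of the graph's own serialized state, so it lands
// on disk atomically with the channel updates it describes. A fresh graph
// yields `None`; starting from 0 requests a full snapshot.
fn rgs_start_timestamp<L: Deref>(graph: &NetworkGraph<L>) -> u32
where
    L::Target: Logger,
{
    graph.get_last_rapid_gossip_sync_timestamp().unwrap_or(0)
}
```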
Introduce new configuration parameters to manage Human-Readable Name
(HRN) resolution and DNSSEC validation behavior.

These settings allow users to define custom resolution preferences
for BOLT12 offer lookups. Moving these parameters
into the central configuration struct ensures that node behavior is
customizable at runtime and consistent across different network
environments. This abstraction is necessary to support diverse DNSSEC
requirements without hard-coding resolution logic.
Inject specialized resolution capabilities into OnionMessenger to
support outbound payments and third-party resolution services.

This change refines the previous resolution logic by allowing the node
to act as a robust BIP 353 participant. If configured as a service
provider, the node utilizes a Domain Resolver to handle requests for
other participants. Otherwise, it uses an HRN Resolver specifically for
initiating its own outbound payments. Providing these as optional
parameters in the Node constructor ensures the logic matches the
node's designated role in the ecosystem.
Introduce a comprehensive test case to verify the full lifecycle of a
payment initiated via a Human Readable Name (HRN).

This test ensures that the integration between HRN parsing, BIP 353
resolution, and BOLT12 offer execution is functioning correctly within
the node. By asserting that an encoded URI can be successfully resolved
to a valid offer and subsequently paid, we validate the reliability of
the resolution pipeline and ensure that recent architectural changes
to the OnionMessenger and Node configuration work in unison.
Update the GitHub Actions workflow to include coverage for the new
hrn_tests feature across multiple build configurations.

This ensures that the DNSSEC override logic is validated in both
standard Rust and UniFFI-enabled environments. Including these flags in
CI prevents regressions where testing-specific code might break the
primary build or fail to compile due to type mismatches between the
LDK and FFI wrappers.

Testing both feature combinations (with and without UniFFI) guarantees
that the abstraction for HumanReadableName remains consistent across
all supported platforms and integration layers.
…esolving-hrns

Add support for resolving BIP 353 Human-Readable Names
Read the RGS sync timestamp from the network graph
This commit:

Adds `TierStore`, a tiered `KVStore`/`KVStoreSync` implementation that
routes node persistence across three storage roles:

- a primary store for durable, authoritative data
- an optional backup store for a second durable copy of primary-backed data
- an optional ephemeral store for rebuildable cached data such as the
  network graph and scorer

TierStore routes ephemeral cache data to the ephemeral store when
configured, while durable data remains primary+backup. Reads and lists
do not consult the backup store during normal operation.

For primary+backup writes and removals, this implementation treats the
backup store as part of the persistence success path rather than as a
best-effort background mirror. Earlier designs used asynchronous backup
queueing to avoid blocking the primary path, but that weakens the
durability contract by allowing primary success to be reported before
backup persistence has completed. TierStore now issues primary and backup
operations together and only returns success once both complete.

This gives callers a clearer persistence guarantee when a backup store is
configured: acknowledged primary+backup mutations have been attempted
against both durable stores. The tradeoff is that dual-store operations
are not atomic across stores, so an error may still be returned after one
store has already been updated.

TierStore also implements `KVStoreSync` in terms of dedicated synchronous
helpers that call the wrapped stores' sync interfaces directly. This
preserves the inner stores' synchronous semantics instead of routing sync
operations through a previously held async runtime.

Additionally, adds unit coverage for the current contract, including:
- basic read/write/remove/list persistence
- routing of ephemeral data away from the primary store
- backup participation in the foreground success path for writes and removals
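A minimal sketch of the foreground primary+backup success path described
above, against a simplified async store trait. `TinyStore` and the wiring are
illustrative, not the PR's actual `DynStore` plumbing:

```rust
use std::io;

#[allow(async_fn_in_trait)]
pub trait TinyStore {
    async fn write(&self, namespace: &str, key: &str, value: &[u8]) -> io::Result<()>;
}

pub async fn write_durable<S: TinyStore>(
    primary: &S,
    backup: Option<&S>,
    namespace: &str,
    key: &str,
    value: &[u8],
) -> io::Result<()> {
    // Success is only reported once the primary write *and* the backup
    // write (when a backup store is configured) have completed.
    primary.write(namespace, key, value).await?;
    if let Some(backup) = backup {
        // Not atomic across stores: a failure here surfaces as an error
        // even though the primary has already been updated.
        backup.write(namespace, key, value).await?;
    }
    Ok(())
}
```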
Add native builder support for tiered storage by introducing
`TierStoreConfig`, backup and ephemeral store configuration methods,
and a `build_with_dynstore` path that wraps the provided store as the
primary tier and applies any configured secondary tiers.

This makes tiered storage a builder concern while keeping the Rust
`build_with_store` API ergonomic for native callers.

Note: The temporary `dead_code` allowance will be removed in a
follow-up commit once the new tier-store builder APIs are exercised.
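A hypothetical shape of the builder flow, with every type a stand-in; only
`TierStoreConfig`, `set_backup_store`, and `build_with_dynstore` are names
taken from this PR:

```rust
pub struct StoreHandle; // stand-in for a boxed KVStore

#[derive(Default)]
pub struct TierConfigSketch {
    backup: Option<StoreHandle>,
    ephemeral: Option<StoreHandle>,
}

impl TierConfigSketch {
    pub fn set_backup_store(mut self, store: StoreHandle) -> Self {
        self.backup = Some(store);
        self
    }

    pub fn set_ephemeral_store(mut self, store: StoreHandle) -> Self {
        self.ephemeral = Some(store);
        self
    }

    // Wrap the caller-provided store as the primary tier and apply any
    // configured secondary tiers, mirroring the `build_with_dynstore` idea.
    pub fn build(self, primary: StoreHandle) -> (StoreHandle, Option<StoreHandle>, Option<StoreHandle>) {
        (primary, self.backup, self.ephemeral)
    }
}
```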
improve TierStore documentation
silently log backup failure
enigbe added 3 commits April 27, 2026 09:09
This is an exploratory commit to:

- Introduce BackupMode with BestEffortBackup and SemiSync semantics,
  plus a locally persisted backup retry queue for failed backup writes
  and removals.
- Add serialization for pending backup ops, deterministic
  queue encoding, durable enqueue semantics for SemiSync, and a retry
  task skeleton with backoff and stale-op detection.

Also update TierStore and queue docs to reflect the new backup-mode
model and at-least-once cleanup semantics.
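A sketch of the backup-mode split. The variant names follow the commit text
(`BestEffortBackup`, `SemiSync`); the failure handling and pending-op shape
are illustrative:

```rust
pub enum BackupMode {
    /// Backup failures are only logged.
    BestEffortBackup,
    /// Failed backup intents are durably enqueued for at-least-once retry.
    SemiSync,
}

// Hypothetical pending-op shape; a `None` value marks a removal.
pub struct PendingOpSketch {
    pub namespace: String,
    pub key: String,
    pub value: Option<Vec<u8>>,
}

fn on_backup_failure(mode: &BackupMode, op: PendingOpSketch, queue: &mut Vec<PendingOpSketch>) {
    match mode {
        // Best effort: log and move on.
        BackupMode::BestEffortBackup => eprintln!("backup op failed; dropping"),
        // SemiSync: durably record the failed intent for later retry.
        BackupMode::SemiSync => queue.push(op),
    }
}
```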
Replace TierStoreInner's raw backup DynStore with BackupStore, which
holds the backup store plus an optional BackupRetryQueue.

Update write/remove backup result handling to accept PendingBackupOp
and enqueue failed backup operations when a retry queue is present;
otherwise only log the backup failure.

This makes the configured backup semantics explicit: best-effort mode
logs backup failures, while semisync mode requires durably recording
failed backup intents for later retry.

Adjust set_backup_store, builder wiring, and docs for the new backup
configuration shape.
Thread BackupMode through TierStoreConfig and update backup-store
configuration to distinguish best-effort backup writes from semisync
behavior.

Build a local retry queue store for SemiSync during tier-store
construction, retain the concrete TierStore on Node, and spawn the
background backup retry task during Node::start() with shutdown
integration via stop_sender.

Also update TierStore backup result handling to enqueue concrete
PendingBackupOp values for durable retry, and refresh the related
backup and retry-task documentation.
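Illustrative shape of the background retry task: replay queued backup ops
with capped exponential back-off and exit when shutdown is signaled. Only the
`stop_sender`/receiver idea comes from the commit text; the rest is a sketch:

```rust
use std::time::Duration;
use tokio::sync::watch;

async fn backup_retry_task(mut stop_receiver: watch::Receiver<()>) {
    let mut delay = Duration::from_secs(1);
    loop {
        tokio::select! {
            // Node::stop() flips the watch channel; exit cleanly.
            _ = stop_receiver.changed() => break,
            _ = tokio::time::sleep(delay) => {
                // Replay pending backup ops here; on persistent failure,
                // back off exponentially up to a cap.
                delay = (delay * 2).min(Duration::from_secs(60));
            }
        }
    }
}
```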
enigbe added 4 commits April 27, 2026 09:51
Refactor tier_store unit tests around shared filesystem-backed test
fixtures and add failure-injection helpers for backup and retry stores.

Cover best-effort backup failure handling, semisync retry enqueueing,
queue-persist failure behavior, retry-task replay of queued writes and
removes, remove-not-found idempotence, retry-queue restart reload,
dedup behavior across replacement sequences, stale snapshot skipping,
and pending-op serialization round-trips.
Extend the native store plumbing to support key enumeration for store
migration by wiring `MigratableKVStore` through `DynStoreWrapper` and
implementing it for LDK-native backends, including `SqliteStore`,
`VssStore`, and tiered storage, plus the in-repo test stores needed to
keep the migration path exercised.

This lays the plumbing for backup restoration and reconciliation by
making migration a first-class capability of the stores ldk-node
natively owns and configures, ensuring they can exhaustively enumerate
their persisted keys.
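A sketch of why exhaustive key enumeration matters for migration. The trait
is loosely modeled on LDK's `MigratableKVStore`; treat the exact signatures
as assumed:

```rust
use std::io;

pub trait EnumerableStore {
    fn list_all_keys(&self) -> io::Result<Vec<(String, String, String)>>;
    fn read(&self, pns: &str, sns: &str, key: &str) -> io::Result<Vec<u8>>;
    fn write(&self, pns: &str, sns: &str, key: &str, value: &[u8]) -> io::Result<()>;
}

// Exhaustive enumeration is what makes store migration possible: every
// (primary_namespace, secondary_namespace, key) triple can be copied.
pub fn copy_all<S: EnumerableStore, T: EnumerableStore>(source: &S, target: &T) -> io::Result<()> {
    for (pns, sns, key) in source.list_all_keys()? {
        let value = source.read(&pns, &sns, &key)?;
        target.write(&pns, &sns, &key, &value)?;
    }
    Ok(())
}
```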
Introduce a restore path that copies durable state from a configured
backup store into an empty primary store before normal node
initialization.

This commit:
- Adds a recovery module to define the durable restore scope, filter known
  durable keys, restore them into the primary store, and detect when a
  primary already contains durable state.
- Wires this into the builder by separating backup restore from wallet recovery,
  adding restore-specific build errors, and running restore before any normal
  startup reads.

Also covers the new logic with unit tests, plus an integration test that
exercises backup population, restore into a fresh primary, and successful
node boot with preserved identity.
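A hypothetical sketch of the restore gate, reusing the same simplified store
trait as the migration sketch above; the durable-scope filter is invented for
illustration:

```rust
use std::io;

pub trait EnumerableStore {
    fn list_all_keys(&self) -> io::Result<Vec<(String, String, String)>>;
    fn read(&self, pns: &str, sns: &str, key: &str) -> io::Result<Vec<u8>>;
    fn write(&self, pns: &str, sns: &str, key: &str, value: &[u8]) -> io::Result<()>;
}

// Illustrative durable-scope filter: ephemeral caches are rebuildable
// and stay out of the restore scope.
fn is_durable(pns: &str) -> bool {
    !matches!(pns, "network_graph" | "scorer")
}

pub fn restore_from_backup<S: EnumerableStore>(primary: &S, backup: &S) -> io::Result<()> {
    // Never overwrite a primary that already contains durable state.
    let primary_has_durable = primary
        .list_all_keys()?
        .iter()
        .any(|(pns, _, _)| is_durable(pns));
    if primary_has_durable {
        return Ok(());
    }
    for (pns, sns, key) in backup.list_all_keys()? {
        if is_durable(&pns) {
            let value = backup.read(&pns, &sns, &key)?;
            primary.write(&pns, &sns, &key, &value)?;
        }
    }
    Ok(())
}
```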
Delete unused enqueue_async method on the backup retry queue