
feat: data set lifecycle job#588

Merged
silent-cipher merged 25 commits into main from feat/data-set-deletion-job
Jun 11, 2026

Conversation

@silent-cipher

@silent-cipher silent-cipher commented Jun 1, 2026

Collaborator

Summary

Adds a new data_set_lifecycle_check job that exercises the full createDataSet → terminateService lifecycle in a single self-contained run. It creates a throwaway empty data set, terminates it, and waits for the transaction receipt. The job runs regardless of how many data sets a provider already has.

The job is calibration-only by default (DATASET_LIFECYCLE_CHECK_ENABLED=false on mainnet).

Changes

  • jobs.service.ts — new data_set_lifecycle_check job type: scheduling, worker registration, singleton-per-SP enforcement
  • data-set-lifecycle.service.ts — runLifecycleCheck: creates a tagged empty data set (dealbotLifecycleCheck metadata key), calls terminateService, and waits for the transaction receipt
  • check-metrics.service.ts — new dataSetLifecycleCheckStatus and dataSetLifecycleCheckMs metrics
  • app.config.ts / .env.example — three new config variables (see below)
  • Docs — docs/checks/data-set-lifecycle-check.md, updated environment-variables.md, jobs.md, checks/README.md, events-and-metrics.md
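The create, terminate, and wait-for-receipt flow described for runLifecycleCheck can be sketched roughly as follows. This is an illustration only: the LifecycleClient interface, its method signatures, the result shape, and the "true" metadata value are all assumptions made for the example and are not taken from the PR's code or from synapse-sdk.

```typescript
// Hypothetical client surface. The real dealbot / synapse-sdk signatures may differ.
interface LifecycleClient {
  createDataSet(metadata: Record<string, string>): Promise<{ dataSetId: number }>;
  terminateService(dataSetId: number): Promise<{ txHash: string }>;
  waitForReceipt(txHash: string): Promise<{ status: "success" | "reverted" }>;
}

type Stage = "create" | "terminate" | "receipt";

type LifecycleResult =
  | { ok: true; dataSetId: number; durationMs: number }
  | { ok: false; stage: Stage; error: string; durationMs: number };

async function runLifecycleCheck(
  client: LifecycleClient,
  now: () => number = Date.now,
): Promise<LifecycleResult> {
  const start = now();
  let stage: Stage = "create";
  try {
    // Tag the throwaway data set so leaked ones can be found later.
    // The metadata value "true" is an assumption; only the key name comes from the PR.
    const { dataSetId } = await client.createDataSet({ dealbotLifecycleCheck: "true" });

    stage = "terminate";
    const { txHash } = await client.terminateService(dataSetId);

    stage = "receipt";
    const receipt = await client.waitForReceipt(txHash);
    if (receipt.status !== "success") {
      throw new Error(`termination tx ${txHash} did not succeed`);
    }
    return { ok: true, dataSetId, durationMs: now() - start };
  } catch (err) {
    // Record which stage failed so the failure can be classified by the caller.
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, stage, error, durationMs: now() - start };
  }
}
```

Returning a result object instead of throwing keeps the metric-recording code in one place; whether the actual service does this is not stated in the PR.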

New Config Variables

  • DATASET_LIFECYCLE_CHECK_ENABLED – default: true (calibration) / false (mainnet); enables or disables the job
  • DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR – check rate per provider
  • DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS – default: 600 seconds; maximum job runtime before forced abort
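As a rough illustration of how these three variables might be resolved, here is a minimal sketch. The function name, the config shape, and the fallbackChecksPerSpPerHour parameter are hypothetical; the PR does not state a default for the per-hour rate, so the caller supplies one here. Only the variable names, the network-dependent default for the enabled flag, and the 600-second timeout default come from the list above.

```typescript
type Network = "calibration" | "mainnet";

interface LifecycleCheckConfig {
  enabled: boolean;
  checksPerSpPerHour: number;
  jobTimeoutSeconds: number;
}

// Parse a positive number from an env value, falling back when it is unset.
function positiveNumber(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || n <= 0) {
    throw new Error(`${name} must be a positive number, got "${raw}"`);
  }
  return n;
}

// Hypothetical loader. `fallbackChecksPerSpPerHour` is a placeholder because
// the PR description does not give a default for the rate.
function loadLifecycleCheckConfig(
  env: Record<string, string | undefined>,
  network: Network,
  fallbackChecksPerSpPerHour: number,
): LifecycleCheckConfig {
  const enabledRaw = env["DATASET_LIFECYCLE_CHECK_ENABLED"];
  // Unset means: on for calibration, off for mainnet. An explicit value wins,
  // so mainnet can still opt in manually.
  const enabled =
    enabledRaw === undefined ? network === "calibration" : enabledRaw.trim().toLowerCase() === "true";

  return {
    enabled,
    checksPerSpPerHour: positiveNumber(
      env["DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR"],
      fallbackChecksPerSpPerHour,
      "DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR",
    ),
    jobTimeoutSeconds: positiveNumber(
      env["DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS"],
      600,
      "DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS",
    ),
  };
}
```

Passing the environment in as a plain record, rather than reading a global, keeps the sketch easy to test.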

Closes #586

@FilOzzy FilOzzy added this to FOC Jun 1, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Jun 1, 2026
@silent-cipher silent-cipher changed the base branch from main to docs/data-set-creation-design-doc June 1, 2026 18:24
@silent-cipher silent-cipher self-assigned this Jun 1, 2026
@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FOC Jun 2, 2026
Base automatically changed from docs/data-set-creation-design-doc to main June 3, 2026 06:18

Copilot AI left a comment

Contributor

Pull request overview

Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.

Changes:

  • Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
  • Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
  • Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.

Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
@silent-cipher silent-cipher changed the title docs: data set deletion job design documentation docs: data set termination job design documentation Jun 3, 2026
@rjan90 rjan90 marked this pull request as ready for review June 3, 2026 14:40
@rjan90 rjan90 moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Jun 3, 2026
@BigLep

BigLep commented Jun 3, 2026

Contributor

Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

@BigLep BigLep left a comment

Contributor

Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.

I don't think we should say that this closes #586

We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.

Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
@silent-cipher

Collaborator Author

> I don't think we should say that this closes #586

I was planning to include implementation in this same PR.

@BigLep BigLep left a comment

Contributor

I'm happy to take another look 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.

(This now concludes the review I started with #588 (review))

Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
Comment thread docs/data-set-termination.md Outdated
@BigLep

BigLep commented Jun 3, 2026

Contributor

> Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03.

Post GA item created: FilOzone/filecoin-services#503

@BigLep

BigLep commented Jun 4, 2026

Contributor

2026-06-04 notes/decisions from verbal parking lot after standup:

Participants: @BigLep, @silent-cipher, @SgtPooki

We discussed the current design where DATA_SET_TERMINATION_MIN_INDEX partitions slots into "golden" (stable for retrieval) and "recyclable" ranges. The concern is that the recyclable slots aren't truly separate from the data storage/retrieval checks — data_set_creation and deal checks may still add pieces to them, so they're neither fully golden nor fully throwaway. The current approach sits in an awkward middle ground.

Decision: simplify to a standalone create-then-terminate canary job.

Instead of the slot-partitioning approach, we agreed to:

  1. Decouple entirely from golden data sets. The termination canary should not recycle slots used by data_set_creation, data storage checks, or retrieval checks. It's a separate, lightweight job.
  2. Create and immediately terminate in a single job invocation. The job creates a new data set, confirms success, then calls terminateService on it. No pieces are added. No interaction with MIN_NUM_DATASETS_FOR_CHECKS or existing slots.
  3. No DB tracking. In the happy path, the data set is created and terminated within one job run. No new rows in the data sets table. Use a recognizable metadata key on the created data set so that if termination fails and data sets accumulate, they can be identified on-chain for manual cleanup.
  4. Accept possible resource leakage. If creation succeeds but termination fails, we'll leak data sets. This is an acceptable trade-off for simplicity — the metadata tagging gives us a way to query for and clean up any leaked data sets if necessary.
  5. Calibration-only by default. Should not run on mainnet by default, but don't hard-block mainnet — allow manual opt-in.
  6. Chaos monkey deferred. Randomly killing golden data sets (to test resilience of the full lifecycle) is a good future goal but out of scope for now. We want the simplest canary that catches createDataSet/terminateService regressions.
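Points 3 and 4 rely on the metadata tag to find data sets that were created but never terminated. A minimal sketch of that cleanup query, assuming the data sets have already been fetched into memory: the OnChainDataSet shape and its terminated flag are hypothetical, and only the dealbotLifecycleCheck key name comes from this PR's description.

```typescript
// Hypothetical in-memory view of a provider's data sets.
interface OnChainDataSet {
  dataSetId: number;
  terminated: boolean;
  metadata: Record<string, string>;
}

// Metadata key named in the PR description for tagging canary data sets.
const LIFECYCLE_CHECK_METADATA_KEY = "dealbotLifecycleCheck";

// A data set counts as leaked when it carries the canary tag but is still live.
function findLeakedLifecycleDataSets(dataSets: readonly OnChainDataSet[]): OnChainDataSet[] {
  return dataSets.filter(
    (ds) =>
      Object.prototype.hasOwnProperty.call(ds.metadata, LIFECYCLE_CHECK_METADATA_KEY) &&
      !ds.terminated,
  );
}
```

Because the canary keeps no database rows, a filter like this over on-chain state would be the only way to enumerate leftovers for manual cleanup.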

This is a meaningful pivot from the current design — sorry for not catching it sooner in review. The existing work on the spec and slot-management logic is appreciated, but this approach should result in simpler code and docs.

@BigLep BigLep marked this pull request as draft June 4, 2026 16:25
@BigLep BigLep moved this from 🔎 Awaiting review to ⌨️ In Progress in FOC Jun 4, 2026
@SgtPooki

SgtPooki commented Jun 5, 2026

Collaborator

> It was working locally. And I think it's a decent amount of time for the job. But if you agree, I can increase it to 10 mins.

yeah lets do 10 mins.. its just the max before forced abort, and i'd rather give cleanup room to finish than exit too early.

> I was also thinking about creating an empty data set, but then I saw your doc comment above createDataSetWithPiece: "empty data sets is being removed from Curio and synapse-sdk".

good catch.. i went and double checked and my doc comment is stale. createDataset was removed in FilOzone/pdp#201 but restored in FilOzone/pdp#219 specifically to allow empty data sets. curio still supports it (POST /pdp/data-sets, no pieces) and synapse-core still exports createDataSet. so empty is possible.. lets drop the seed piece per the parking-lot decision, and i'll fix that comment in a followup.
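The "max before forced abort" behaviour discussed in this comment can be illustrated with a generic timeout wrapper. This is a sketch of the general pattern, not the job runner's actual mechanism; withJobTimeout is a hypothetical helper, and the work function is assumed to honour the AbortSignal it is given.

```typescript
// Run `work`, rejecting and signalling abort if it exceeds `timeoutMs`.
async function withJobTimeout<T>(
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();

  // The Promise executor runs synchronously, so rejectTimeout is set before use.
  let rejectTimeout: (reason: Error) => void = () => {};
  const timeout = new Promise<never>((_resolve, reject) => {
    rejectTimeout = reject;
  });

  const timer = setTimeout(() => {
    controller.abort(); // tell cooperative work to stop
    rejectTimeout(new Error(`job exceeded ${timeoutMs} ms and was aborted`));
  }, timeoutMs);

  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after an early finish
  }
}
```

With a 600-second setting the call would look like `withJobTimeout(600 * 1000, (signal) => doCheck(signal))`, where doCheck is a placeholder. Note that Promise.race only stops waiting; the work itself stops only if it checks the signal.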

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@silent-cipher

Collaborator Author

> lets drop the seed piece per the parking-lot decision

Done. Now creating an empty data set.

Also, reverted all deal.service.ts changes and moved them to a separate DataSetLifecycle module.

@BigLep BigLep left a comment

Contributor

Main feedback item is to consider doing both createDataSet and createDataSetAddPiece (sp - can't remember). I would hate for this canary to pass while the main operation that users actually use fails...

Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
@BigLep BigLep moved this from ⌨️ In Progress to ✔️ Approved by reviewer in FOC Jun 7, 2026
@BigLep BigLep moved this from ✔️ Approved by reviewer to ⌨️ In Progress in FOC Jun 8, 2026

@BigLep BigLep left a comment

Contributor

Thanks for iterating!

Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/data-set-lifecycle-check.md Outdated
Comment thread docs/checks/events-and-metrics.md Outdated
Comment thread docs/environment-variables.md Outdated
Comment thread docs/environment-variables.md Outdated
@silent-cipher

silent-cipher commented Jun 9, 2026

Collaborator Author

Incorporated all existing feedback and I think we're fully aligned on the data_set_lifecycle job now. I've tested the changes and they look good. I'll wait for a green light from @SgtPooki.

dataSetLifecycleCheckStatus{checkType="dataSetLifecycleCheck",providerId="2",providerName="ezpdpz-calib2",providerStatus="approved",value="success",app="dealbot",network="calibration"} 1
dataSetLifecycleCheckStatus{checkType="dataSetLifecycleCheck",providerId="5",providerName="Mongo2Stor",providerStatus="approved",value="failure.other",app="dealbot",network="calibration"} 1
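The two samples above carry the outcome in a value label. A minimal sketch of how an outcome could be mapped to that label and rendered as a Prometheus-style text sample follows. Both functions are hypothetical helpers, and only the two label values seen above ("success" and "failure.other") are modelled; the real service may define more.

```typescript
// Only the two outcomes visible in the samples above are modelled here.
type LifecycleStatus = "success" | "failure.other";

// Hypothetical mapping: no error means success, anything else is "failure.other".
function toStatus(error: unknown): LifecycleStatus {
  return error === undefined || error === null ? "success" : "failure.other";
}

// Render one sample line in Prometheus text exposition style.
function renderSample(name: string, labels: Record<string, string>, value: number): string {
  // Label values escape backslash, double quote, and newline.
  const escape = (v: string): string =>
    v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
  const body = Object.keys(labels)
    .map((key) => `${key}="${escape(labels[key])}"`)
    .join(",");
  return `${name}{${body}} ${value}`;
}
```

For example, `renderSample("dataSetLifecycleCheckStatus", { providerId: "2", value: toStatus(undefined) }, 1)` yields a line in the same shape as the first sample, with fewer labels.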

@BigLep BigLep moved this from ⌨️ In Progress to ✔️ Approved by reviewer in FOC Jun 9, 2026

@SgtPooki SgtPooki left a comment

Collaborator

looks good to me, lets get this out on calibration.. going to wait for images to build from #571 before I merge this and interrupt the image builds though

@silent-cipher silent-cipher merged commit 37ddd2e into main Jun 11, 2026
9 checks passed
@silent-cipher silent-cipher deleted the feat/data-set-deletion-job branch June 11, 2026 06:13
@github-project-automation github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Jun 11, 2026

Projects

Status: 🎉 Done

Development

Successfully merging this pull request may close these issues.

Periodic dataset deletion job in calibration to canary createDataSet flow

6 participants