feat: data set lifecycle job#588
Pull request overview
Adds a new design document describing a proposed calibration-only data_set_termination pg-boss job intended to periodically terminate a managed dataset slot so the existing data_set_creation job recreates it, keeping the on-chain dataset lifecycle continuously exercised as a canary.
Changes:
- Introduces a detailed design/spec for a new data_set_termination job, including scheduling, handler algorithm, and idempotency expectations.
- Documents proposed configuration knobs and operational constraints (calibration-only gating, canary window sizing, rate constraints vs creation).
- Outlines observability expectations and BetterStack dashboard questions for validating the termination→creation loop.
Note to self: create backlog item for calibration lockup period adjustment (8 hours vs 30 days). I'll do this later 2026-06-03. |
Note: I submitted this prematurely. I am still reviewing and will send another review when I'm done reading through.
I don't think we should say that this closes #586
We still need to do the implementation work and we should make sure we have visibility on this job on the internal dealbot dashboard.
I was planning to include implementation in this same PR. |
BigLep
left a comment
I'm happy to take another look on 2026-06-04, but hopefully this gives enough direction to give confidence about starting implementation.
(This now concludes the review I started with #588 (review))
Post GA item created: FilOzone/filecoin-services#503 |
2026-06-04 notes/decisions from verbal parking lot after standup:
Participants: @BigLep, @silent-cipher, @SgtPooki
We discussed the current design where
Decision: simplify to a standalone create-then-terminate canary job. Instead of the slot-partitioning approach, we agreed to:
This is a meaningful pivot from the current design — sorry for not catching it sooner in review. The existing work on the spec and slot-management logic is appreciated, but this approach should result in simpler code and docs. |
yeah, let's do 10 mins. It's just the max before forced abort, and I'd rather give cleanup room to finish than exit too early.
good catch. I went and double-checked, and my doc comment is stale. |
Done. Now creating empty data set. Also, reverted all |
BigLep
left a comment
My main feedback is to consider exercising both createDataSet and createDataSetAddPiece (sp - can't remember). I would hate for this canary to pass while the main operation that users actually rely on fails...
Incorporated all existing feedback and I think we're fully aligned on the data_set_lifecycle job now. I've tested the changes and they look good. I'll wait for a green light from @SgtPooki. |
Summary
Adds a new `data_set_lifecycle_check` job that exercises the full `createDataSet` → `terminateService` lifecycle in a single self-contained run. It creates a throwaway empty data set, terminates it, and waits for transaction receipt. This runs regardless of how many datasets a provider already has.

The job is calibration-only by default (`DATASET_LIFECYCLE_CHECK_ENABLED=false` on mainnet).

Changes
- `jobs.service.ts`: new `data_set_lifecycle_check` job type (scheduling, worker registration, singleton-per-SP enforcement)
- `data-set-lifecycle.service.ts`: `runLifecycleCheck` creates a tagged empty data set (`dealbotLifecycleCheck` metadata key), calls `terminateService`, and waits for the transaction receipt
- `check-metrics.service.ts`: new `dataSetLifecycleCheckStatus` and `dataSetLifecycleCheckMs` metrics
- `app.config.ts` / `.env.example`: three new config variables (see below)
- `docs/checks/data-set-lifecycle-check.md`, updated `environment-variables.md`, `jobs.md`, `checks/README.md`, `events-and-metrics.md`

New Config Variables
- `DATASET_LIFECYCLE_CHECK_ENABLED`: default `true` (calibration) / `false` (mainnet); enables or disables the job
- `DATASET_LIFECYCLE_CHECKS_PER_SP_PER_HOUR`: check rate per provider
- `DATA_SET_LIFECYCLE_CHECK_JOB_TIMEOUT_SECONDS`: default 600 seconds; max job runtime before forced abort

Closes #586
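
For readers skimming the description, here is a minimal sketch of the create, terminate, wait-for-receipt flow that `runLifecycleCheck` is described as performing. It is illustrative only and not the code in this PR: the `LifecycleClient` interface, its method signatures, the `"true"` metadata value, the `LifecycleResult` shape, and the `withTimeout` helper are all assumptions made for the example. Only the `dealbotLifecycleCheck` metadata key and the 600-second default come from the description above.

```typescript
// Hypothetical client surface; the real service's dependencies may look different.
interface LifecycleClient {
  createDataSet(metadata: Record<string, string>): Promise<{ dataSetId: number }>;
  terminateService(dataSetId: number): Promise<{ txHash: string }>;
  waitForReceipt(txHash: string): Promise<{ success: boolean }>;
}

// Assumed result shape: a status plus a duration, mirroring the two new metrics.
type LifecycleResult = {
  status: "success" | "failure";
  durationMs: number;
  error?: string;
};

// Rejects if `work` has not settled within `timeoutMs`.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function runLifecycleCheck(
  client: LifecycleClient,
  timeoutSeconds = 600, // default taken from the PR description
): Promise<LifecycleResult> {
  const startedAt = Date.now();
  try {
    await withTimeout(
      (async () => {
        // 1. Create a throwaway empty data set, tagged so it can be identified later.
        const { dataSetId } = await client.createDataSet({ dealbotLifecycleCheck: "true" });
        // 2. Terminate it right away.
        const { txHash } = await client.terminateService(dataSetId);
        // 3. Wait for the termination transaction receipt.
        const receipt = await client.waitForReceipt(txHash);
        if (!receipt.success) throw new Error(`termination tx ${txHash} failed`);
      })(),
      timeoutSeconds * 1000,
    );
    return { status: "success", durationMs: Date.now() - startedAt };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: "failure", durationMs: Date.now() - startedAt, error: message };
  }
}
```

In this sketch any failure, including a timeout, is folded into a `failure` result instead of being rethrown, so a caller could record status and duration in one place. Whether the actual job does this or lets pg-boss handle the error is not stated in the description.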