fix: 0.17.2 — sqs_consumer secrets wait, FIFO packaging, canvas updating UX#384
Merged
Root cause of the ECS Deployment Circuit Breaker failure on fresh deploys of quiltdata/deployment#2357: main() called config.get_benchling_secrets() unconditionally at startup. When BenchlingSecret is created empty (by design, populated later by the external config script), the fetch raises SecretsManagerError, the process exits, and Essential=True cascades to repeated task failures until the circuit breaker fires. The HTTP container survives the same condition via _runtime_guard_response (serves 503s while degraded). The consumer had no equivalent safety net.

Pre-start readiness wait: signal handlers are installed before wait_for_ready_config() loops with bounded exponential backoff (30s -> 300s) until get_benchling_secrets() + apply_benchling_secrets() succeed. SIGTERM short-circuits the wait via asyncio.Event. The task reaches steady state, and the consumer starts processing as soon as the config script populates the secret -- no Essential=False tradeoff, no ordering constraint between stack deploy and config script.

Also caps MaxNumberOfMessages at min(10, concurrency) so tail messages no longer sit in the semaphore backlog eating visibility timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the bare 'Processing...' placeholder canvas with the full navigation layout (Browse/Update buttons + package header + footer), where the footer reads 'Updating...' until the background workflow delivers the real 'Updated at' timestamp. The Update Package button is disabled while updating to prevent a second concurrent export; Browse Package stays enabled so the previous package version remains reachable on re-exports. Skips the linked-packages Athena query on the initial render to keep it fast on the request thread.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
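The button/footer states above can be sketched as plain data. This is purely illustrative: the real implementation talks to the Benchling canvas API, and the block shapes and helper name here are hypothetical stand-ins.

```python
from typing import Optional

def build_canvas_blocks(updating: bool, updated_at: Optional[str]) -> list:
    """Full navigation layout; the footer shows 'Updating...' until the
    background workflow delivers the real 'Updated at' timestamp."""
    footer = "Updating..." if updating else f"Updated at {updated_at}"
    return [
        {"type": "header", "text": "Package"},
        # Browse stays enabled so the previous package version stays reachable
        {"type": "button", "id": "browse", "text": "Browse Package", "enabled": True},
        # Update is disabled while updating, preventing a second concurrent export
        {"type": "button", "id": "update", "text": "Update Package", "enabled": not updating},
        {"type": "footer", "text": footer},
    ]
```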
Webhook handlers used to fire-and-forget by spawning daemon threads, which let entry.created and canvas.created events for the same entry race to write entry.json. The .canvas_id S3 sidecar masked one symptom of that race (canvas_id loss), but the deeper issue was no per-entry sequencing.

Replace the daemon-thread "queue" with a real FIFO SQS queue keyed by MessageGroupId=entry_id. A new BenchlingPackagingConsumer sidecar drains the queue and runs EntryPackager.execute_workflow. Sequential processing per entry makes the .canvas_id sidecar unnecessary; a late entry event simply reads canvas_id from any existing entry.json.

- Add PackagingRequestQueue.fifo (+ DLQ) in CDK with 40-min visibility timeout and content-based dedup
- New packaging_publisher module: webhook handlers enqueue here
- New packaging_consumer module + Fargate sidecar that drains the queue
- Refactor SqsConsumer into a BaseSqsConsumer + two subclasses
- Delete _save_canvas_id, _load_canvas_id, _load_canvas_id_sidecar, and execute_workflow_async; existing .canvas_id files in S3 become harmless
- Move synchronous send_updating_canvas into the canvas webhook handler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
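The publisher side of the scheme above can be sketched as follows. The `MessageGroupId=entry_id` keying and content-based dedup come from the PR; the helper name, queue URL, and payload shape are assumptions for illustration (the real code lives in the packaging_publisher module).

```python
import json

def build_packaging_message(queue_url: str, entry_id: str, event: dict) -> dict:
    """Build kwargs for boto3's sqs.send_message(**kwargs).

    MessageGroupId=entry_id gives per-entry FIFO ordering: all events for
    one entry are delivered sequentially, while different entries can be
    processed in parallel. With content-based deduplication enabled on the
    queue, no explicit MessageDeduplicationId is required."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"entry_id": entry_id, "event": event}),
        "MessageGroupId": entry_id,  # serializes all events for this entry
    }
```

A webhook handler would then call `sqs.send_message(**build_packaging_message(...))` and return immediately, leaving ordering to SQS instead of daemon threads.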
Adds entries for the canvas Updating-layout fix and the FIFO SQS packaging sequencing that replaces the .canvas_id sidecar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Three independent user-visible fixes landing as 0.17.2.
1. `sqs_consumer` waits for Benchling secrets instead of crashing (ddf648f)

Addresses the 🔴 "CI `deploy_dev_stack` failing" blocker sir-sigurd flagged on quiltdata/deployment#2357. Root cause lives here in the image, not in the deployment template.

**Before:** `sqs_consumer.main()` called `config.get_benchling_secrets()` unconditionally at startup. On a fresh deploy where `BenchlingSecret` is created empty by design (populated later by the external config script), the fetch raised `SecretsManagerError` → process exited → `Essential=True` cascaded → ECS Deployment Circuit Breaker tripped → `UPDATE_ROLLBACK_COMPLETE`. The HTTP container is immune because it records the failure via `_runtime_guard_response` and keeps serving 503s. The sidecar had no equivalent safety net.

**After:** a pre-start readiness wait (`docker/src/sqs_consumer.py`). Signal handlers are installed before `wait_for_ready_config()` loops with bounded exponential backoff (30s → 300s) until `get_benchling_secrets()` + `apply_benchling_secrets()` succeed. `SIGTERM` short-circuits the wait via `asyncio.Event` so ECS can stop the container cleanly even during the wait. Catches both `SecretsManagerError` (secret empty/malformed) and `ValueError` (env var missing).

Also caps `MaxNumberOfMessages` at `min(10, concurrency)` — default concurrency is 5; batching up to 10 would leave the 6th–10th messages sitting in the semaphore backlog eating visibility timeout.
2. Canvas shows full layout with "Updating..." during export (beb3b3c)

Replaces the bare "Processing..." placeholder canvas with the full navigation layout (Browse/Update buttons + package header + footer), where the footer reads "Updating..." until the background workflow delivers the real "Updated at" timestamp. Update Package is disabled while updating to prevent a second concurrent export; Browse Package stays enabled so the previous package version remains reachable on re-exports. Skips the linked-packages Athena query on the initial render to keep it fast on the request thread.
3. FIFO SQS per-entry sequencing replaces the `.canvas_id` sidecar (4e31324)

Webhook handlers used to fire-and-forget by spawning daemon threads, which let `entry.created` and `canvas.created` events for the same entry race to write `entry.json`. The `.canvas_id` S3 sidecar masked one symptom of that race (canvas_id loss), but the deeper issue was no per-entry sequencing.

Replace the daemon-thread "queue" with a real FIFO SQS queue keyed by `MessageGroupId=entry_id`. A new `BenchlingPackagingConsumer` sidecar drains the queue and runs `EntryPackager.execute_workflow`. Sequential processing per entry makes the `.canvas_id` sidecar unnecessary; a late entry event simply reads `canvas_id` from any existing `entry.json`.

- Add `PackagingRequestQueue.fifo` (+ DLQ) in CDK with 40-min visibility timeout and content-based dedup
- New `packaging_publisher` module: webhook handlers enqueue here
- New `packaging_consumer` module + Fargate sidecar that drains the queue
- Refactor `SqsConsumer` into a `BaseSqsConsumer` + two subclasses
- Delete `_save_canvas_id`, `_load_canvas_id`, `_load_canvas_id_sidecar`, and `execute_workflow_async`; existing `.canvas_id` files in S3 become harmless orphans
- Move synchronous `send_updating_canvas` into the canvas webhook handler
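The `SqsConsumer` → `BaseSqsConsumer` split might look roughly like this. Only the class names and the `min(10, concurrency)` cap come from the PR; the method names, constructor, and synchronous polling shape are assumptions for illustration.

```python
class BaseSqsConsumer:
    """Owns the poll loop; subclasses supply handle_message (assumed split)."""

    def __init__(self, client, queue_url: str, concurrency: int = 5):
        self.client = client
        self.queue_url = queue_url
        # Cap the batch at the worker count so tail messages don't sit in
        # a semaphore backlog eating visibility timeout.
        self.max_messages = min(10, concurrency)

    def poll_once(self) -> int:
        resp = self.client.receive_message(
            QueueUrl=self.queue_url,
            MaxNumberOfMessages=self.max_messages,
        )
        handled = 0
        for msg in resp.get("Messages", []):
            self.handle_message(msg)
            self.client.delete_message(
                QueueUrl=self.queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
            handled += 1
        return handled

    def handle_message(self, message: dict) -> None:
        raise NotImplementedError


class BenchlingPackagingConsumer(BaseSqsConsumer):
    """Drains PackagingRequestQueue.fifo and runs the packaging workflow."""

    def __init__(self, client, queue_url, packager, **kw):
        super().__init__(client, queue_url, **kw)
        self.packager = packager

    def handle_message(self, message: dict) -> None:
        self.packager.execute_workflow(message["Body"])
```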
Test plan

- `npm test` — lint + typecheck + TS + Python unit tests
- `npm run version:tag:dev` triggers CI image build
- `npm run test:dev` against deployed image
- `deploy_dev_stack` no longer trips circuit breaker on fresh deploy
- `entry.created` + `canvas.created` replay: confirm `canvas_id` preserved in `entry.json`

🤖 Generated with Claude Code