fix: 0.17.2 — sqs_consumer secrets wait, FIFO packaging, canvas updating UX#384

Merged
drernie merged 7 commits into main from fix/sqs-consumer-graceful-degradation
Apr 16, 2026
Conversation


@drernie drernie commented Apr 15, 2026

Summary

Three independent user-visible fixes landing as 0.17.2.

1. sqs_consumer waits for Benchling secrets instead of crashing (ddf648f)

Addresses the 🔴 "CI deploy_dev_stack failing" blocker sir-sigurd flagged on quiltdata/deployment#2357. Root cause lives here in the image, not in the deployment template.

Before: sqs_consumer.main() called config.get_benchling_secrets() unconditionally at startup. On a fresh deploy where BenchlingSecret is created empty by design (populated later by the external config script), the fetch raised SecretsManagerError → process exited → Essential=True cascaded → ECS Deployment Circuit Breaker tripped → UPDATE_ROLLBACK_COMPLETE.

The HTTP container is immune because it records the failure via _runtime_guard_response and keeps serving 503s. The sidecar had no equivalent safety net.

  • Pre-start readiness wait (docker/src/sqs_consumer.py). Signal handlers are installed before wait_for_ready_config() loops with bounded exponential backoff (30s → 300s) until get_benchling_secrets() + apply_benchling_secrets() succeed. SIGTERM short-circuits the wait via asyncio.Event so ECS can stop the container cleanly even during the wait. Catches both SecretsManagerError (secret empty/malformed) and ValueError (env var missing). The task reaches steady state and the consumer starts processing as soon as the config script populates the secret: no Essential=False tradeoff, and no ordering constraint between stack deploy and config script.
  • MaxNumberOfMessages = min(10, concurrency) — default concurrency is 5; batching up to 10 would leave the 6th–10th messages sitting in the semaphore backlog, eating visibility timeout.
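A minimal sketch of the readiness wait described above. This is illustrative, not the actual module code: get_secrets/apply_secrets stand in for config.get_benchling_secrets/apply_benchling_secrets, and SecretsManagerError is a stand-in class.

```python
import asyncio
import signal

class SecretsManagerError(Exception):
    """Stand-in for the error raised when the secret is empty or malformed."""

async def wait_for_ready_config(get_secrets, apply_secrets, stop: asyncio.Event,
                                initial: float = 30.0, cap: float = 300.0) -> bool:
    """Block startup until secrets can be fetched and applied.

    Retries with bounded exponential backoff (initial -> cap). Returns
    False if the stop event fires mid-wait (e.g. SIGTERM during a fresh
    deploy where the secret is still empty), so the task exits cleanly
    instead of tripping the ECS Deployment Circuit Breaker.
    """
    delay = initial
    while not stop.is_set():
        try:
            apply_secrets(get_secrets())  # may raise SecretsManagerError / ValueError
            return True                   # config ready; start consuming
        except (SecretsManagerError, ValueError):
            try:
                # Sleep, but wake immediately if SIGTERM sets the event.
                await asyncio.wait_for(stop.wait(), timeout=delay)
            except asyncio.TimeoutError:
                pass
            delay = min(delay * 2, cap)
    return False

def install_signal_handlers(stop: asyncio.Event) -> None:
    """Installed *before* the wait so ECS can stop the container mid-wait."""
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
```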

2. Canvas shows full layout with "Updating..." during export (beb3b3c)

Replaces the bare "Processing..." placeholder canvas with the full navigation layout (Browse/Update buttons + package header + footer), where the footer reads "Updating..." until the background workflow delivers the real "Updated at" timestamp. Update Package is disabled while updating to prevent a second concurrent export; Browse Package stays enabled so the previous package version remains reachable on re-exports. Skips the linked-packages Athena query on the initial render to keep it fast on the request thread.

3. FIFO SQS per-entry sequencing replaces .canvas_id sidecar (4e31324)

Webhook handlers used to fire-and-forget by spawning daemon threads, which let entry.created and canvas.created events for the same entry race to write entry.json. The .canvas_id S3 sidecar masked one symptom of that race (canvas_id loss), but the deeper issue was no per-entry sequencing.

Replace the daemon-thread "queue" with a real FIFO SQS queue keyed by MessageGroupId=entry_id. A new BenchlingPackagingConsumer sidecar container drains the queue and runs EntryPackager.execute_workflow. Sequential processing per entry makes the .canvas_id sidecar file unnecessary; a late entry event simply reads canvas_id from any existing entry.json.
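The publisher side might look roughly like this. A hypothetical sketch, not the packaging_publisher API: the function name and message shape are illustrative, and sqs_client would be a boto3 SQS client in production.

```python
import json

def enqueue_packaging_request(sqs_client, queue_url: str,
                              entry_id: str, event: dict) -> None:
    """Enqueue a packaging request on the FIFO queue (publisher side).

    MessageGroupId=entry_id makes SQS deliver all events for one entry
    strictly in order, which is what removes the entry.json write race.
    With content-based deduplication enabled on the queue, no explicit
    MessageDeduplicationId is needed.
    """
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"entry_id": entry_id, "event": event}),
        MessageGroupId=entry_id,  # per-entry FIFO ordering key
    )
```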

  • Add PackagingRequestQueue.fifo (+ DLQ) in CDK with 40-min visibility timeout and content-based dedup
  • New packaging_publisher module: webhook handlers enqueue here
  • New packaging_consumer module + Fargate sidecar that drains the queue
  • Refactor SqsConsumer into a BaseSqsConsumer + two subclasses
  • Delete _save_canvas_id, _load_canvas_id, _load_canvas_id_sidecar, and execute_workflow_async; existing .canvas_id files in S3 become harmless orphans
  • Move synchronous send_updating_canvas into the canvas webhook handler
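The consumer side's receive/process cycle, with the min(10, concurrency) batching cap from fix 1, can be sketched as follows. Hypothetical names: receive and process stand in for the SQS receive call and EntryPackager.execute_workflow.

```python
import asyncio

async def drain_batch(receive, process, concurrency: int = 5) -> int:
    """One receive/process cycle of the packaging consumer (illustrative).

    The batch is capped at min(10, concurrency): SQS hard-limits a single
    receive to 10 messages, and fetching more than we can process at once
    would leave tail messages burning visibility timeout in a local backlog.
    """
    batch_size = min(10, concurrency)
    messages = receive(max_messages=batch_size)
    sem = asyncio.Semaphore(concurrency)

    async def guarded(msg):
        async with sem:
            await process(msg)  # e.g. run the packaging workflow

    await asyncio.gather(*(guarded(m) for m in messages))
    return len(messages)
```

Note that per-entry ordering needs no code here: SQS itself withholds the next message of a MessageGroupId until the previous one is deleted.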

Test plan

  • npm test — lint + typecheck + TS + Python unit tests
  • npm run version:tag:dev triggers CI image build
  • npm run test:dev against deployed image
  • Verify on quiltdata/deployment#2357 that deploy_dev_stack no longer trips circuit breaker on fresh deploy
  • Manual canvas export: confirm "Updating..." footer with Update disabled / Browse enabled
  • Manual concurrent entry.created + canvas.created replay: confirm canvas_id preserved in entry.json

🤖 Generated with Claude Code

drernie and others added 6 commits April 15, 2026 08:36
@drernie drernie changed the title from "fix(sqs_consumer): wait for Benchling secrets instead of crashing" to "fix: 0.17.2 — sqs_consumer secrets wait, FIFO packaging, canvas updating UX" on Apr 16, 2026
Adds entries for the canvas Updating-layout fix and the FIFO SQS
packaging sequencing that replaces the .canvas_id sidecar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@drernie drernie merged commit b64c9db into main Apr 16, 2026
3 checks passed
@drernie drernie deleted the fix/sqs-consumer-graceful-degradation branch April 16, 2026 00:44