What happened?
When a worker process restarts, or when multiple instances run in parallel, workers fail to register with this error:
StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" }))
The workers never recover — they loop on the error until manual intervention.
This breaks two common scenarios for us:
1. Rolling deploys (AWS ECS). During a rolling deploy, the new task starts before the old task is stopped, so both run simultaneously for 30s–2min. The new task cannot register because the old task's Postgres session still holds the advisory lock. The deploy produces minutes of error logs until the old task finally shuts down.
2. Horizontal scaling. Running the same task with 2+ instances fails the same way — only one instance successfully registers, the others loop indefinitely. The docs' horizontal scaling example (./my-worker & ./my-worker &) does not work as written with PostgresStorage.
After digging into the source, this is caused by the advisory-lock check in the register SQL:
ON CONFLICT (id) DO UPDATE SET ...
WHERE pg_try_advisory_lock(hashtext(workers.id));
from https://github.com/apalis-dev/apalis-postgres/blob/main/queries/worker/register.sql
The previous session still holds the lock, so pg_try_advisory_lock returns false, zero rows are updated, and the Rust layer returns WORKER_ALREADY_EXISTS.
Expected behavior
One of the following should be true, and clearly documented:
- Worker names can be stable (
email-worker) and restart/scale cleanly, or
- Worker names must be unique per running instance (
email-worker-{instance_id}), and the docs should say so explicitly — including for rolling deploys, not just scaling.
The current docs recommend "a unique, descriptive name" with the example email-worker-us-east, which doesn't make the per-instance requirement clear.
reference : https://apalis.dev/docs/guides/workers/introduction#best-practices
Steps to reproduce
- Start a process with a worker named
email-worker using PostgresStorage.
- Before stopping it, start a second process with the same name.
- The second process fails with
WORKER_ALREADY_EXISTS.
Same result if you stop the first process and restart it quickly — the old Postgres session hasn't fully closed yet.
Minimal code example
use apalis::prelude::*;
use apalis_postgres::PostgresStorage;
use sqlx::PgPool;
use tokio::signal::unix::{signal, SignalKind};
use std::time::Duration;
#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct Email { to: String }
#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct Sms { to: String }
async fn send_email(_: Email) -> Result<(), BoxDynError> {
Ok(())
}
async fn send_sms(_: Sms) -> Result<(), BoxDynError> {
Ok(())
}
async fn shutdown_signal() {
let mut sigterm = signal(SignalKind::terminate()).unwrap();
let mut sigint = signal(SignalKind::interrupt()).unwrap();
tokio::select! {
_ = sigterm.recv() => {},
_ = sigint.recv() => {},
}
}
#[tokio::main]
async fn main() -> Result<(), BoxDynError> {
let pool = PgPool::connect(&std::env::var("DATABASE_URL")?).await?;
PostgresStorage::setup(&pool).await?;
let email_storage = PostgresStorage::new(&pool);
let sms_storage = PostgresStorage::new(&pool);
Monitor::new()
.register(
WorkerBuilder::new("email-worker")
.backend(email_storage)
.build_fn(send_email),
)
.register(
WorkerBuilder::new("sms-worker")
.backend(sms_storage)
.build_fn(send_sms),
)
.shutdown_timeout(Duration::from_secs(30))
.run_with_signal(shutdown_signal())
.await?;
Ok(())
}
Version
1.0.0-rc.x
Environment
apalis 1.0.0-rc.4
apalis-postgres 1.0.0-rc.4
- PostgreSQL 16
- Rust 1.91
- Deployment: AWS ECS (Fargate)
Relevant log output
worker errored; restarting worker=email-worker error=StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" })) attempt=0 shutting_down=true
worker errored; restarting worker=sms-worker error=StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" })) attempt=0 shutting_down=true
Additional context
Questions that would help clarify the intended usage:
- Is this expected behavior, or a lifecycle bug?
- What is the recommended pattern for zero-downtime rolling deploys?
- Should the docs be updated to clearly state that worker names must be unique per instance (including across deploy overlap), and show how to derive them from environment?
What happened?
When a worker process restarts, or when multiple instances run in parallel, workers fail to register with this error:
The workers never recover — they loop on the error until manual intervention.
This breaks two common scenarios for us:
1. Rolling deploys (AWS ECS). During a rolling deploy, the new task starts before the old task is stopped, so both run simultaneously for 30s–2min. The new task cannot register because the old task's Postgres session still holds the advisory lock. The deploy produces minutes of error logs until the old task finally shuts down.
2. Horizontal scaling. Running the same task with 2+ instances fails the same way — only one instance successfully registers, the others loop indefinitely. The docs' horizontal scaling example (
./my-worker & ./my-worker &) does not work as written withPostgresStorage.After digging into the source, this is caused by the advisory-lock check in the register SQL:
from https://github.com/apalis-dev/apalis-postgres/blob/main/queries/worker/register.sql
The previous session still holds the lock, so
pg_try_advisory_lockreturns false, zero rows are updated, and the Rust layer returnsWORKER_ALREADY_EXISTS.Expected behavior
One of the following should be true, and clearly documented:
email-worker) and restart/scale cleanly, oremail-worker-{instance_id}), and the docs should say so explicitly — including for rolling deploys, not just scaling.The current docs recommend "a unique, descriptive name" with the example
email-worker-us-east, which doesn't make the per-instance requirement clear.reference : https://apalis.dev/docs/guides/workers/introduction#best-practices
Steps to reproduce
email-workerusingPostgresStorage.WORKER_ALREADY_EXISTS.Same result if you stop the first process and restart it quickly — the old Postgres session hasn't fully closed yet.
Minimal code example
Version
1.0.0-rc.x
Environment
apalis1.0.0-rc.4apalis-postgres1.0.0-rc.4Relevant log output
Additional context
Questions that would help clarify the intended usage: