Skip to content

[Bug]: WORKER_ALREADY_EXISTS on restart — should worker names be unique per instance? #21

@TheLingeringWill

Description

@TheLingeringWill

What happened?

When a worker process restarts, or when multiple instances run in parallel, workers fail to register with this error:

StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" }))

The workers never recover — they loop on the error until manual intervention.

This breaks two common scenarios for us:

1. Rolling deploys (AWS ECS). During a rolling deploy, the new task starts before the old task is stopped, so both run simultaneously for 30s–2min. The new task cannot register because the old task's Postgres session still holds the advisory lock. The deploy produces minutes of error logs until the old task finally shuts down.

2. Horizontal scaling. Running the same task with 2+ instances fails the same way — only one instance successfully registers, the others loop indefinitely. The docs' horizontal scaling example (./my-worker & ./my-worker &) does not work as written with PostgresStorage.

After digging into the source, this is caused by the advisory-lock check in the register SQL:

ON CONFLICT (id) DO UPDATE SET ...
WHERE pg_try_advisory_lock(hashtext(workers.id));

from https://github.com/apalis-dev/apalis-postgres/blob/main/queries/worker/register.sql

The previous session still holds the lock, so pg_try_advisory_lock returns false, zero rows are updated, and the Rust layer returns WORKER_ALREADY_EXISTS.

Expected behavior

One of the following should be true, and clearly documented:

  • Worker names can be stable (email-worker) and restart/scale cleanly, or
  • Worker names must be unique per running instance (email-worker-{instance_id}), and the docs should say so explicitly — including for rolling deploys, not just scaling.

The current docs recommend "a unique, descriptive name" with the example email-worker-us-east, which doesn't make the per-instance requirement clear.

reference : https://apalis.dev/docs/guides/workers/introduction#best-practices

Steps to reproduce

  1. Start a process with a worker named email-worker using PostgresStorage.
  2. Before stopping it, start a second process with the same name.
  3. The second process fails with WORKER_ALREADY_EXISTS.

Same result if you stop the first process and restart it quickly — the old Postgres session hasn't fully closed yet.

Minimal code example

use apalis::prelude::*;
use apalis_postgres::PostgresStorage;
use sqlx::PgPool;
use tokio::signal::unix::{signal, SignalKind};
use std::time::Duration;

#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct Email { to: String }

#[derive(Clone, serde::Serialize, serde::Deserialize)]
struct Sms { to: String }

async fn send_email(_: Email) -> Result<(), BoxDynError> {
    Ok(())
}

async fn send_sms(_: Sms) -> Result<(), BoxDynError> {
    Ok(())
}

async fn shutdown_signal() {
    let mut sigterm = signal(SignalKind::terminate()).unwrap();
    let mut sigint = signal(SignalKind::interrupt()).unwrap();
    tokio::select! {
        _ = sigterm.recv() => {},
        _ = sigint.recv() => {},
    }
}

#[tokio::main]
async fn main() -> Result<(), BoxDynError> {
    let pool = PgPool::connect(&std::env::var("DATABASE_URL")?).await?;
    PostgresStorage::setup(&pool).await?;

    let email_storage = PostgresStorage::new(&pool);
    let sms_storage = PostgresStorage::new(&pool);

    Monitor::new()
        .register(
            WorkerBuilder::new("email-worker")
                .backend(email_storage)
                .build_fn(send_email),
        )
        .register(
            WorkerBuilder::new("sms-worker")
                .backend(sms_storage)
                .build_fn(send_sms),
        )
        .shutdown_timeout(Duration::from_secs(30))
        .run_with_signal(shutdown_signal())
        .await?;

    Ok(())
}

Version

1.0.0-rc.x

Environment

  • apalis 1.0.0-rc.4
  • apalis-postgres 1.0.0-rc.4
  • PostgreSQL 16
  • Rust 1.91
  • Deployment: AWS ECS (Fargate)

Relevant log output

worker errored; restarting worker=email-worker error=StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" })) attempt=0 shutting_down=true
worker errored; restarting worker=sms-worker error=StreamError(Io(Custom { kind: AddrInUse, error: "WORKER_ALREADY_EXISTS" })) attempt=0 shutting_down=true

Additional context

Questions that would help clarify the intended usage:

  1. Is this expected behavior, or a lifecycle bug?
  2. What is the recommended pattern for zero-downtime rolling deploys?
  3. Should the docs be updated to clearly state that worker names must be unique per instance (including across deploy overlap), and show how to derive them from environment?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions