
Orchestration silently hangs after CallActivityWithRetry retry when next yield issues new tasks (TaskEventId reuse?) #603

@ppinto-afk

Summary

Two production orchestrations have entered a permanently stuck state via the same trigger: an activity scheduled via call_activity_with_retry times out and is retried by RetryAbleTask; the retry succeeds; a subsequent task_all([N new activities]) is yielded; all N activities run to completion at the framework level; and then the orchestrator never wakes up again. No errors are logged. The instance reports RuntimeStatus: Running indefinitely (one of our hung instances is now 15 days old).

Environment

  • azure-functions-durable Python SDK: 1.5.0 (latest)
  • Microsoft.Azure.WebJobs.Extensions.DurableTask host extension: 3.10.2
  • Hosting: Azure Functions on Linux, Python 3.13, Premium plan
  • Storage backend: Azure Storage (default AzureWebJobsStorage)
  • Replay schema: default

Pattern (simplified pseudocode)

# yields below are translated to context.call_activity_with_retry / context.task_all
gatekeeper_result = yield context.call_activity_with_retry("activity_gatekeeper", ...)

# 3 tasks started in parallel; task_any([*handles, flush_timer]) flushes the schedule, then we wait for 2 of them:
handles = [
    context.call_activity_with_retry("activity_summariser", retry_opts, ...),
    context.call_activity_with_retry("activity_segmenter",  retry_opts, ...),
    context.call_sub_orchestrator("orchestrate_overall_feedback", ...),
]
yield context.task_any([*handles, context.create_timer(context.current_utc_datetime)])  # flush
summary, segments = yield context.task_all([handles[0], handles[1]])

# Trigger: activity_summariser internally times out (asyncio.wait_for raises).
# RetryAbleTask schedules a retry (first_retry_interval_in_milliseconds=1000, max_number_of_attempts=3).
# Retry succeeds. Generator resumes.

segment_tasks = [context.call_activity_with_retry("activity_process_segment_safe", retry_opts, ...) for _ in range(13)]
segment_results = yield context.task_all(segment_tasks)
# ^ never returns. Orchestrator silent forever.

Observed behavior in framework logs

For a hung instance:

Present?  Event
✓         (Activity) scheduled for the original 4 tasks (gatekeeper + 3 from the parallel start)
✓         (Activity) scheduled for the summariser retry
✗         (Activity) scheduled for any of the 13 new task_all tasks
✓         (Activity) Started for the retry and for all 13 new tasks (workers pulled them from the work-item queue)
✓         (Activity) Completed. State: Completed for the retry and for all 13 new tasks
✗         Any orchestrator state event (Started / Awaited / Completed / Failed) after the failed-summariser Awaited

The work-item queue messages were dispatched and processed, but the corresponding TaskScheduledEvent records appear to have never been persisted to history.

Last orchestrator events for one stuck instance

SeqNum 294: activity_summariser     Failed       TaskEventId 1   23:07:52  (timeout)
SeqNum 295: orchestrate_*           Awaited      (parent reacts to failure)  23:07:52
SeqNum 296: orchestrate_*           Awaited      23:07:52
SeqNum 297: activity_summariser     Started      TaskEventId 6   23:07:52  (retry)
SeqNum 310: activity_summariser     Completed    TaskEventId 6   23:10:04  (retry succeeded)
SeqNum 311-323: 13× activity_process_segment_safe Started, TaskEventIds 5, 7, 8, 6, 11, 9, 10, 13, 12, 15, 14, 16, 17  — all at 23:10:04
SeqNum 336-348: 13× activity_process_segment_safe Completed       23:12:47 → 23:20:52
[Then no further log events for the instance — ever.]

TaskEventId reuse

In the excerpt above, TaskEventId: 6 is used twice within the same orchestration:

  1. The summariser retry (Started at SeqNum 297, Completed at SeqNum 310)
  2. One of the 13 segment tasks (Started at SeqNum 314)

Tracing the SDK:

  • DurableOrchestrationContext._add_to_open_tasks (models/DurableOrchestrationContext.py:721-729) assigns task.id = self._sequence_number; self._sequence_number += 1.
  • RetryAbleTask.try_set_value (models/Task.py:498-522) handles retry by generating an internal "rescheduled task" via _generate_task + _add_to_open_tasks, which consumes a _sequence_number slot.

Hypothesis (not verified): the retry's internal "rescheduled task" consumes ids 5 and 6 from _sequence_number on first execution, but on a subsequent replay where the user yields task_all([13 fresh tasks]), those 13 tasks get ids 5..17 reassigned from _sequence_number=5. Ids 5 and 6 then collide with the retry's previously-allocated internal tasks. Subsequent reconciliation between the orchestrator's expected handles and history events fails silently.
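
To make the hypothesis concrete, here is a toy model of the suspected counter divergence (this is not SDK code; the allocation counts are back-filled from the TaskEventIds observed above):

class Counter:
    """Stand-in for DurableOrchestrationContext._sequence_number."""
    def __init__(self):
        self.n = 0
    def next_id(self):
        i, self.n = self.n, self.n + 1
        return i

# First execution: 5 ids for gatekeeper + 3 parallel tasks + flush timer,
# then the retry machinery allocates its internal rescheduled tasks.
first = Counter()
initial_ids = [first.next_id() for _ in range(5)]    # ids 0..4
retry_ids   = [first.next_id() for _ in range(2)]    # ids 5, 6 (retry internals)

# Replay: if the rescheduled-task allocation isn't repeated, the 13 fresh
# task_all tasks are handed ids starting at 5 ...
replay = Counter()
_ = [replay.next_id() for _ in range(5)]             # ids 0..4 again
segment_ids = [replay.next_id() for _ in range(13)]  # ids 5..17

# ... so ids 5 and 6 now refer to two different tasks across executions.
assert sorted(set(retry_ids) & set(segment_ids)) == [5, 6]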

I'd appreciate a pointer on whether this is a known issue or is already fixed on main.

Reproduction conditions

The trigger we keep hitting:

  1. An activity using call_activity_with_retry times out (an LLM streaming HTTP call hangs at the TCP read level, exceeding the asyncio.wait_for timeout).
  2. RetryAbleTask schedules a retry — the retry succeeds quickly.
  3. The orchestrator immediately yields task_all([many new tasks]).

Two of two instances that hit this pattern in the last 30 days hung in exactly this way (the instances were 14 days apart). Other concurrent orchestrations on the same host that didn't hit the retry path completed normally, which points away from infrastructure or scaling issues.
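
In the meantime, the minimal repro would look roughly like this (untested sketch; flaky_activity and noop_activity are placeholder activities, with flaky_activity assumed to raise on its first attempt only, e.g. keyed off an external flag):

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    retry_opts = df.RetryOptions(
        first_retry_interval_in_milliseconds=1000,
        max_number_of_attempts=3,
    )
    # Step 1: an activity that fails on attempt 1 and succeeds on the retry.
    yield context.call_activity_with_retry("flaky_activity", retry_opts, None)
    # Step 2: immediately fan out many fresh tasks after the retry succeeded.
    tasks = [
        context.call_activity_with_retry("noop_activity", retry_opts, i)
        for i in range(13)
    ]
    # Step 3: expected to return all 13 results; observed to hang forever.
    return (yield context.task_all(tasks))

main = df.Orchestrator.create(orchestrator_function)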

Expected behavior

Either:

  1. The retry mechanism shouldn't share a _sequence_number namespace with subsequent user-issued tasks; or
  2. If task ids collide, the SDK should detect the inconsistency during replay and raise NonDeterministicWorkflowError rather than silently hanging (rough sketch of such a check below).
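
For (2), the shape of check we have in mind is roughly the following (a toy sketch, not the SDK's actual structure; NonDeterministicWorkflowError here is a local stand-in for whatever error type fits):

class NonDeterministicWorkflowError(Exception):
    """Local stand-in; not necessarily an existing SDK type."""

def check_task_identity(open_tasks: dict[int, str], event_id: int, event_name: str) -> None:
    # If history maps id `event_id` to a differently-named task than the one
    # currently open under that id, fail loudly instead of hanging.
    expected = open_tasks.get(event_id)
    if expected is not None and expected != event_name:
        raise NonDeterministicWorkflowError(
            f"task id {event_id} is open as '{expected}' but history shows '{event_name}'"
        )

# With the collision above this raises instead of mismatching silently:
# check_task_identity({6: "activity_summariser"}, 6, "activity_process_segment_safe")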

Workarounds being applied locally

  • Wrap the retry-prone activity in an outer "safe" activity that handles retries internally and never bubbles a retry up to the orchestrator.
  • Add a watchdog context.create_timer(now + 45m) alongside the task_all (via task_any) so the orchestrator fails fast instead of hanging indefinitely (sketch below).
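
The watchdog looks roughly like this (simplified sketch; assumes the segment_tasks list from the pattern above, and the 45-minute budget is our own choice, not a framework default):

from datetime import timedelta

fan_out = context.task_all(segment_tasks)
watchdog = context.create_timer(context.current_utc_datetime + timedelta(minutes=45))
winner = yield context.task_any([fan_out, watchdog])
if winner == watchdog:
    # Fail fast with a visible error instead of sitting in Running forever.
    raise Exception("segment fan-out exceeded the 45-minute watchdog")
watchdog.cancel()
segment_results = fan_out.result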

Happy to reduce this to a minimal repro if it would help.
