Summary
Two production orchestrations have entered a permanently-stuck state via the same trigger: an activity scheduled via call_activity_with_retry times out and is retried by RetryAbleTask, the retry succeeds, a subsequent task_all([N new activities]) is yielded, all N activities run to completion at the framework level — and then the orchestrator never wakes up again. No errors are logged. The instance reports RuntimeStatus: Running indefinitely (one of our hung instances is now 15 days old).
Environment
- azure-functions-durable Python SDK: 1.5.0 (latest)
- Microsoft.Azure.WebJobs.Extensions.DurableTask host extension: 3.10.2
- Hosting: Azure Functions on Linux, Python 3.13, Premium plan
- Storage backend: Azure Storage (default AzureWebJobsStorage)
- Replay schema: default
Pattern (simplified pseudocode)
# yields below are translated to context.call_activity_with_retry / context.task_all
gatekeeper_result = yield call_activity_with_retry("activity_gatekeeper", ...)
# 3 tasks started in parallel via task_any([handles, flush_timer]); we wait for 2:
handles = [
    context.call_activity_with_retry("activity_summariser", retry_opts, ...),
    context.call_activity_with_retry("activity_segmenter", retry_opts, ...),
    context.call_sub_orchestrator("orchestrate_overall_feedback", ...),
]
yield context.task_any([*handles, context.create_timer(context.current_utc_datetime)]) # flush
summary, segments = yield context.task_all([handles[0], handles[1]])
# Trigger: activity_summariser internally times out (asyncio.wait_for raises).
# RetryAbleTask schedules a retry (first_interval_ms=1000, max_number_of_attempts=3).
# Retry succeeds. Generator resumes.
segment_tasks = [context.call_activity_with_retry("activity_process_segment_safe", retry_opts, ...) for _ in range(13)]
segment_results = yield context.task_all(segment_tasks)
# ^ never returns. Orchestrator silent forever.
Observed behavior in framework logs
For a hung instance:
| Log presence | Event |
|---|---|
| ✅ | (Activity) scheduled for the original 4 tasks (gatekeeper + 3 from the parallel start) |
| ❌ | (Activity) scheduled for the summariser retry |
| ❌ | (Activity) scheduled for any of the 13 new task_all tasks |
| ✅ | (Activity) Started for the retry and for all 13 new tasks (workers pulled them from the work-item queue) |
| ✅ | (Activity) Completed. State: Completed for the retry and for all 13 new tasks |
| ❌ | Any orchestrator state event (Started / Awaited / Completed / Failed) after the failed-summariser Awaited |
The work-item queue messages were dispatched and processed, but the corresponding TaskScheduledEvent records appear to have never been persisted to history.
Last orchestrator events for one stuck instance
SeqNum 294: activity_summariser Failed TaskEventId 1 23:07:52 (timeout)
SeqNum 295: orchestrate_* Awaited (parent reacts to failure) 23:07:52
SeqNum 296: orchestrate_* Awaited 23:07:52
SeqNum 297: activity_summariser Started TaskEventId 6 23:07:52 (retry)
SeqNum 310: activity_summariser Completed TaskEventId 6 23:10:04 (retry succeeded)
SeqNum 311-323: 13× activity_process_segment_safe Started, TaskEventIds 5, 7, 8, 6, 11, 9, 10, 13, 12, 15, 14, 16, 17 — all at 23:10:04
SeqNum 336-348: 13× activity_process_segment_safe Completed 23:12:47 → 23:20:52
[Then no further log events for the instance — ever.]
TaskEventId reuse
In the excerpt above, TaskEventId: 6 is used twice within the same orchestration:
- The summariser retry (Started at SeqNum 297, Completed at SeqNum 310)
- One of the 13 segment tasks (Started at SeqNum 314)
Tracing the SDK:
- DurableOrchestrationContext._add_to_open_tasks (models/DurableOrchestrationContext.py:721-729) assigns task.id = self._sequence_number; self._sequence_number += 1.
- RetryAbleTask.try_set_value (models/Task.py:498-522) handles retry by generating an internal "rescheduled task" via _generate_task + _add_to_open_tasks, which consumes a _sequence_number slot.
Hypothesis (not verified): the retry's internal "rescheduled task" consumes ids 5 and 6 from _sequence_number on first execution, but on a subsequent replay where the user yields task_all([13 fresh tasks]), those 13 tasks get ids 5..17 reassigned from _sequence_number=5. Ids 5 and 6 then collide with the retry's previously-allocated internal tasks. Subsequent reconciliation between the orchestrator's expected handles and history events fails silently.
I'd appreciate any pointer on whether this is a known issue or already fixed on main.
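To make the arithmetic in the hypothesis concrete, here is a toy model of an id counter that assigns the current value and then increments. It is a sketch under assumptions, not the SDK's code: `ToyContext` and `add_task` are hypothetical stand-ins for the behaviour described for `_add_to_open_tasks`, the observed ids are copied from the log excerpt above, and the retry ids (5 and 6) come from the unverified hypothesis.

```python
class ToyContext:
    """Hypothetical stand-in for the id counter; not the SDK class."""

    def __init__(self, start: int) -> None:
        self._sequence_number = start

    def add_task(self) -> int:
        # Mirrors the described behaviour: assign the current value, then increment.
        task_id = self._sequence_number
        self._sequence_number += 1
        return task_id


# TaskEventIds of the 13 segment activities, copied from the log excerpt.
OBSERVED_SEGMENT_IDS = [5, 7, 8, 6, 11, 9, 10, 13, 12, 15, 14, 16, 17]
# Ids the retry is hypothesised to have consumed (assumption, not verified).
RETRY_IDS = [5, 6]

# Case A: the counter still accounts for the retry's two slots.
with_retry = ToyContext(start=5)
retry_ids = [with_retry.add_task() for _ in RETRY_IDS]
ids_after_retry = [with_retry.add_task() for _ in range(13)]

# Case B: the counter restarts at 5 as if the retry never allocated anything.
without_retry = ToyContext(start=5)
ids_ignoring_retry = [without_retry.add_task() for _ in range(13)]

# Ids that would be handed out twice in case B.
collisions = sorted(set(RETRY_IDS) & set(ids_ignoring_retry))

print("ids if retry slots are counted:", ids_after_retry)
print("ids if retry slots are ignored:", ids_ignoring_retry)
print("matches observed log ids:", sorted(OBSERVED_SEGMENT_IDS) == ids_ignoring_retry)
print("ids shared with the retry:", collisions)
```

In this model the observed ids line up with case B (segment ids starting at 5) rather than case A (starting at 7). That is consistent with the hypothesis but does not prove it.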
Reproduction conditions
The trigger we keep hitting:
- An activity using call_activity_with_retry times out (an LLM streaming HTTP call hangs at the TCP read level, exceeding asyncio.wait_for).
- RetryAbleTask schedules a retry, and the retry succeeds quickly.
- The orchestrator immediately yields task_all([many new tasks]).
Two of two instances that hit this pattern in the last 30 days have hung in this exact way (the instances were 14 days apart). Other concurrent orchestrations on the same host that didn't hit the retry path completed normally, which makes an infrastructure or scaling cause unlikely.
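For context on the first condition, the sketch below shows how a stalled read surfaces as an exception once `asyncio.wait_for` gives up. It is a stand-in under assumptions, not our activity code: `hung_stream_read`, `activity_body` and `run_once` are hypothetical names, the hang is simulated with `asyncio.sleep`, and the timeout is shortened so the example finishes quickly.

```python
import asyncio


async def hung_stream_read() -> str:
    # Simulates a streaming HTTP read that stalls (hypothetical stand-in).
    await asyncio.sleep(3600)
    return "never reached"


async def activity_body(timeout_s: float) -> str:
    # wait_for cancels the inner coroutine and raises on timeout; the
    # exception propagates out of the activity unless it is caught here.
    return await asyncio.wait_for(hung_stream_read(), timeout=timeout_s)


def run_once(timeout_s: float = 0.05) -> str:
    try:
        return asyncio.run(activity_body(timeout_s))
    except asyncio.TimeoutError:
        # In a real activity the exception would not be caught here: the
        # activity would fail and the orchestrator-side retry policy would
        # decide whether to reschedule it.
        return "timed out"


if __name__ == "__main__":
    print(run_once())
```

The example catches the exception only so that it can report the outcome; in our case the timeout propagates and the activity is reported as failed.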
Expected behavior
Either:
- The retry mechanism shouldn't share a _sequence_number namespace with subsequent user-issued tasks; or
- If task ids collide, the SDK should detect the inconsistency during replay and raise NonDeterministicWorkflowError rather than silently hanging.
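As an illustration of the second option, the sketch below shows the kind of fail-fast check we have in mind: refuse to hand out an id that was already assigned earlier in the same orchestration. Everything here is hypothetical (`TaskIdLedger`, `DuplicateTaskIdError`, and the way tasks are named); it is not how the SDK tracks tasks, only a minimal model of detecting the collision instead of hanging.

```python
class DuplicateTaskIdError(RuntimeError):
    """Hypothetical error type; stands in for a non-determinism error."""


class TaskIdLedger:
    """Toy record of every id assigned so far (not the SDK's structure)."""

    def __init__(self) -> None:
        self._assigned: dict[int, str] = {}

    def assign(self, task_id: int, name: str) -> None:
        existing = self._assigned.get(task_id)
        if existing is not None:
            # Fail loudly instead of silently reusing the earlier task's id.
            raise DuplicateTaskIdError(
                f"task id {task_id} already used by {existing!r}; "
                f"cannot assign it to {name!r}"
            )
        self._assigned[task_id] = name


ledger = TaskIdLedger()
ledger.assign(6, "activity_summariser (retry)")

message = ""
try:
    # Same id as the retry, as seen in the log excerpt for one segment task.
    ledger.assign(6, "activity_process_segment_safe")
except DuplicateTaskIdError as exc:
    message = str(exc)
    print(message)
```

The ledger keeps ids of completed tasks as well as open ones, because in our logs the retry had already completed before the segment task with the same id started.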
Workarounds being applied locally
- Wrap the retry-prone activity in an outer "safe" activity that handles retries internally and never bubbles a retry up to the orchestrator.
- Add a watchdog context.create_timer(now + 45m) inside the task_all so the orchestrator fails fast instead of hanging indefinitely.
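One way to read the watchdog workaround is to race the `task_all` against a durable timer using `task_any`. The sketch below assumes that reading. `wait_with_watchdog` and its parameters are hypothetical names, and it relies on `context.task_all`, `context.task_any`, `context.create_timer`, `context.current_utc_datetime`, the timer's `cancel()` and the task's `result` behaving as documented for the Python SDK. Whether the timer itself is unaffected by the suspected id collision is untested.

```python
from datetime import timedelta


def wait_with_watchdog(context, tasks, timeout=timedelta(minutes=45)):
    """Generator helper: yields one task_any and returns the task_all results.

    Intended to be used from an orchestrator generator as:
        results = yield from wait_with_watchdog(context, segment_tasks)
    """
    # Use the orchestration clock, not the wall clock, to stay replay-safe.
    deadline = context.current_utc_datetime + timeout
    watchdog = context.create_timer(deadline)
    all_done = context.task_all(tasks)

    winner = yield context.task_any([all_done, watchdog])

    if winner == all_done:
        # Cancel the timer so the orchestration is not kept alive by it.
        watchdog.cancel()
        return all_done.result

    # Watchdog fired first: fail the orchestration instead of waiting forever.
    raise TimeoutError(f"task_all did not complete within {timeout}")
```

Raising a plain `TimeoutError` is an arbitrary choice for the sketch; any exception that fails the orchestration visibly would serve the same purpose.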
Happy to reduce this to a minimal repro if it would help.