
Orchestration silently hangs after CallActivityWithRetry retry when next yield issues new tasks (TaskEventId reuse?) #603

@ppinto-afk

Summary

Two production orchestrations have entered a permanently stuck state via the same trigger: an activity scheduled via call_activity_with_retry times out and is retried by RetryAbleTask; the retry succeeds; a subsequent task_all([N new activities]) is yielded; all N activities run to completion at the framework level; and then the orchestrator never wakes up again. No errors are logged. The instance reports RuntimeStatus: Running indefinitely (one of our hung instances is now 15 days old).

Environment

  • azure-functions-durable Python SDK: 1.5.0 (latest)
  • Microsoft.Azure.WebJobs.Extensions.DurableTask host extension: 3.10.2
  • Hosting: Azure Functions on Linux, Python 3.13, Premium plan
  • Storage backend: Azure Storage (default AzureWebJobsStorage)
  • Replay schema: default

Pattern (simplified pseudocode)

# yields below are translated to context.call_activity_with_retry / context.task_all
gatekeeper_result = yield context.call_activity_with_retry("activity_gatekeeper", ...)

# 3 tasks started in parallel; task_any([*handles, flush_timer]) flushes the schedule, then we wait for 2 of them:
handles = [
    context.call_activity_with_retry("activity_summariser", retry_opts, ...),
    context.call_activity_with_retry("activity_segmenter",  retry_opts, ...),
    context.call_sub_orchestrator("orchestrate_overall_feedback", ...),
]
yield context.task_any([*handles, context.create_timer(context.current_utc_datetime)])  # flush
summary, segments = yield context.task_all([handles[0], handles[1]])

# Trigger: activity_summariser internally times out (asyncio.wait_for raises).
# RetryAbleTask schedules a retry (first_retry_interval_in_milliseconds=1000, max_number_of_attempts=3).
# Retry succeeds. Generator resumes.

segment_tasks = [context.call_activity_with_retry("activity_process_segment_safe", retry_opts, ...) for _ in range(13)]
segment_results = yield context.task_all(segment_tasks)
# ^ never returns. Orchestrator silent forever.

Observed behavior in framework logs

For a hung instance:

Present?  Event
✓         (Activity) scheduled for the original 4 tasks (gatekeeper + 3 from the parallel start)
✓         (Activity) scheduled for the summariser retry
✗         (Activity) scheduled for any of the 13 new task_all tasks
✓         (Activity) Started for the retry and for all 13 new tasks (workers pulled them from the work-item queue)
✓         (Activity) Completed. State: Completed for the retry and for all 13 new tasks
✗         Any orchestrator state event (Started / Awaited / Completed / Failed) after the failed-summariser Awaited

The work-item queue messages were dispatched and processed, but the corresponding TaskScheduledEvent records appear to have never been persisted to history.

Last orchestrator events for one stuck instance

SeqNum 294: activity_summariser     Failed       TaskEventId 1   23:07:52  (timeout)
SeqNum 295: orchestrate_*           Awaited      (parent reacts to failure)  23:07:52
SeqNum 296: orchestrate_*           Awaited      23:07:52
SeqNum 297: activity_summariser     Started      TaskEventId 6   23:07:52  (retry)
SeqNum 310: activity_summariser     Completed    TaskEventId 6   23:10:04  (retry succeeded)
SeqNum 311-323: 13× activity_process_segment_safe Started, TaskEventIds 5, 7, 8, 6, 11, 9, 10, 13, 12, 15, 14, 16, 17  — all at 23:10:04
SeqNum 336-348: 13× activity_process_segment_safe Completed       23:12:47 → 23:20:52
[Then no further log events for the instance — ever.]

TaskEventId reuse

In the excerpt above, TaskEventId: 6 is used twice within the same orchestration:

  1. The summariser retry (Started at SeqNum 297, Completed at SeqNum 310)
  2. One of the 13 segment tasks (Started at SeqNum 314)

Tracing the SDK:

  • DurableOrchestrationContext._add_to_open_tasks (models/DurableOrchestrationContext.py:721-729) assigns task.id = self._sequence_number; self._sequence_number += 1.
  • RetryAbleTask.try_set_value (models/Task.py:498-522) handles retry by generating an internal "rescheduled task" via _generate_task + _add_to_open_tasks, which consumes a _sequence_number slot.

Hypothesis (not verified): the retry's internal "rescheduled task" consumes ids 5 and 6 from _sequence_number on first execution, but on a subsequent replay where the user yields task_all([13 fresh tasks]), those 13 tasks get ids 5..17 reassigned from _sequence_number=5. Ids 5 and 6 then collide with the retry's previously-allocated internal tasks. Subsequent reconciliation between the orchestrator's expected handles and history events fails silently.
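
To make the hypothesis concrete, here is a toy model of the suspected counter divergence (this is not SDK code; the allocation counts are back-filled from the TaskEventIds observed above):

class Counter:
    """Stand-in for DurableOrchestrationContext._sequence_number."""
    def __init__(self):
        self.n = 0
    def next_id(self):
        i, self.n = self.n, self.n + 1
        return i

# First execution: 5 ids for gatekeeper + 3 parallel tasks + flush timer,
# then the retry machinery allocates its internal rescheduled tasks.
first = Counter()
initial_ids = [first.next_id() for _ in range(5)]    # ids 0..4
retry_ids   = [first.next_id() for _ in range(2)]    # ids 5, 6 (retry internals)

# Replay: if the rescheduled-task allocation isn't repeated, the 13 fresh
# task_all tasks are handed ids starting at 5 ...
replay = Counter()
_ = [replay.next_id() for _ in range(5)]             # ids 0..4 again
segment_ids = [replay.next_id() for _ in range(13)]  # ids 5..17

# ... so ids 5 and 6 now refer to two different tasks across executions.
assert sorted(set(retry_ids) & set(segment_ids)) == [5, 6]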

I'd appreciate a pointer on whether this is a known issue or is already fixed on main.

Reproduction conditions

The trigger we keep hitting:

  1. An activity using call_activity_with_retry times out (an LLM streaming HTTP call hangs at the TCP read level, exceeding the asyncio.wait_for timeout).
  2. RetryAbleTask schedules a retry — the retry succeeds quickly.
  3. The orchestrator immediately yields task_all([many new tasks]).

Two of two instances that hit this pattern in the last 30 days hung in exactly this way (the instances were 14 days apart). Other concurrent orchestrations on the same host that didn't hit the retry path completed normally, which points away from infrastructure or scaling issues.
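
In the meantime, the minimal repro would look roughly like this (untested sketch; flaky_activity and noop_activity are placeholder activities, with flaky_activity assumed to raise on its first attempt only, e.g. keyed off an external flag):

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    retry_opts = df.RetryOptions(
        first_retry_interval_in_milliseconds=1000,
        max_number_of_attempts=3,
    )
    # Step 1: an activity that fails on attempt 1 and succeeds on the retry.
    yield context.call_activity_with_retry("flaky_activity", retry_opts, None)
    # Step 2: immediately fan out many fresh tasks after the retry succeeded.
    tasks = [
        context.call_activity_with_retry("noop_activity", retry_opts, i)
        for i in range(13)
    ]
    # Step 3: expected to return all 13 results; observed to hang forever.
    return (yield context.task_all(tasks))

main = df.Orchestrator.create(orchestrator_function)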

Expected behavior

Either:

  1. The retry mechanism shouldn't share a _sequence_number namespace with subsequent user-issued tasks; or
  2. If task ids collide, the SDK should detect the inconsistency during replay and raise NonDeterministicWorkflowError rather than silently hanging (rough sketch of such a check below).
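
For (2), the shape of check we have in mind is roughly the following (a toy sketch, not the SDK's actual structure; NonDeterministicWorkflowError here is a local stand-in for whatever error type fits):

class NonDeterministicWorkflowError(Exception):
    """Local stand-in; not necessarily an existing SDK type."""

def check_task_identity(open_tasks: dict[int, str], event_id: int, event_name: str) -> None:
    # If history maps id `event_id` to a differently-named task than the one
    # currently open under that id, fail loudly instead of hanging.
    expected = open_tasks.get(event_id)
    if expected is not None and expected != event_name:
        raise NonDeterministicWorkflowError(
            f"task id {event_id} is open as '{expected}' but history shows '{event_name}'"
        )

# With the collision above this raises instead of mismatching silently:
# check_task_identity({6: "activity_summariser"}, 6, "activity_process_segment_safe")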

Workarounds being applied locally

  • Wrap the retry-prone activity in an outer "safe" activity that handles retries internally and never bubbles a retry up to the orchestrator.
  • Add a watchdog context.create_timer(now + 45m) alongside the task_all (via task_any) so the orchestrator fails fast instead of hanging indefinitely (sketch below).
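
The watchdog looks roughly like this (simplified sketch; assumes the segment_tasks list from the pattern above, and the 45-minute budget is our own choice, not a framework default):

from datetime import timedelta

fan_out = context.task_all(segment_tasks)
watchdog = context.create_timer(context.current_utc_datetime + timedelta(minutes=45))
winner = yield context.task_any([fan_out, watchdog])
if winner == watchdog:
    # Fail fast with a visible error instead of sitting in Running forever.
    raise Exception("segment fan-out exceeded the 45-minute watchdog")
watchdog.cancel()
segment_results = fan_out.result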

Happy to reduce this to a minimal repro if it would help.
