Preserve durable trace context in SQL history#306
Conversation
Keep orchestration replay span identity and sub-orchestration client span IDs when round-tripping SQL trace context so exported spans stitch correctly across continuations and nested sub-orchestrations. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
overall LGTM, the failed CI test might be flaky. I run into this failure often as well. |
… test Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Preserves DurableTask trace/span identity when SQL persists history by extending the TraceContext encoding (without a schema change) and adding regression tests that validate stable span IDs and correct parent/child hierarchies across replays/continuations and sub-orchestrations.
Changes:
- Extend SQL
TraceContextserialization/parsing to preserveDistributedTraceContext.Id/SpanIdand sub-orchestrationClientSpanId. - Rehydrate
ClientSpanIdwhen materializingSubOrchestrationInstanceCreatedEvent. - Add unit + integration tests that lock down round-tripping and emitted Activity parent/child relationships.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/DurableTask.SqlServer.Tests/Unit/SqlUtilsTraceContextTests.cs | Adds unit coverage for extended/legacy trace-context round-tripping and ClientSpanId persistence. |
| test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs | Adds integration regressions that validate exported Activity graphs have stable orchestration span IDs and no missing parents. |
| src/DurableTask.SqlServer/SqlUtils.cs | Implements the extended trace-context encoding/decoding and sub-orchestration ClientSpanId hydration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tions - Reserve line 1 of every TraceContext payload for traceparent (empty for sub-orchestration-only payloads) so all callers share a single parsing contract. Avoids the risk of treating '@clientspanid=...' as a malformed W3C traceparent if a sub-orchestration row is ever read through the general DistributedTraceContext path. - Centralize TraceContext parsing into a single ParseTraceContext helper. GetDistributedTraceContextFromReader and GetSubOrchestrationClientSpanId now share one parser, which extracts traceparent, tracestate, id, spanid, and clientspanid in one pass. Reduces drift over time. - Backward compatibility: the parser still accepts the legacy single-line '@clientspanid=...' format that was written by earlier builds, so an upgrade does not break histories already in production databases. Added a dedicated regression test for the legacy payload shape. - Tighten the sub-orchestration / activity client-span assertions in the TraceContextFlowCorrectly integration test to use Assert.Single rather than LastOrDefault. The test schedules exactly one of each, so any duplicate emission should now fail the regression deterministically. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cgillum
left a comment
There was a problem hiding this comment.
Please fix the analyzer warnings in test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs.
…er warnings
Rolling-upgrade safety for the TraceContext column (fanyirobin/cgillum review):
The previous version of this PR introduced an '@tracestate=' prefix on line 2
of the TraceContext payload when Id/SpanId fields were present. That created
a real rolling-upgrade hazard: an older worker reading a new-format row would
see '@tracestate=foo' on line 2 and store that literal string as the
tracestate value, then propagate it through OpenTelemetry.
The new wire format is fully backward compatible in both directions without
any version flag:
Line 1: traceparent (unchanged)
Line 2: tracestate, RAW with no prefix — matches the legacy format exactly.
When tracestate is empty but Id/SpanId follow, line 2 is an empty
placeholder so newer @-prefixed lines never land in the tracestate
slot.
Line 3+: @id=..., @spanid=..., @clientspanid=... — older readers stop at
line 2 and ignore these.
Sub-orchestration ClientSpanId payloads now use:
line 1: empty
line 2: empty placeholder (reserved tracestate slot)
line 3: @clientspanid=...
so older readers, which look at parts[0]/parts[1] only, never see an @-prefix
line where they expect tracestate.
Added two unit tests that lock down the wire-format contract:
- GetTraceContext_NewWriterPlacesRawTraceStateOnLineTwo_ForOldReaderCompat
- GetTraceContext_NewWriterEmitsEmptyTraceStateLine_WhenOnlyIdAndSpanIdSet
Analyzer warnings (cgillum review):
Fixed xUnit2031 warnings on Orchestrations.cs by using the
Assert.Single(collection, predicate) overload instead of
Assert.Single(collection.Where(predicate)).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the review @cgillum! Pushed a68f9d1 to address both pieces of feedback: Analyzer warnings (xUnit2031) on Rolling-upgrade safety for the The new layout is fully backward-compatible in both directions without any version flag:
Locked down with two new unit tests:
All 6 |
[Copilot] Rolling-upgrade
|
Reverts the prior empty-placeholder wire format and replaces it with the
approach suggested by cgillum in the PR review.
PROBLEM WITH THE PRIOR APPROACH
The pre-PR (legacy) reader uses:
text.Split('\n', count: 2, StringSplitOptions.RemoveEmptyEntries)
RemoveEmptyEntries strips empty lines BEFORE the count cap is applied, and
count: 2 leaves the entire remainder of the string in parts[1]. This means
the previous attempt's wire format
traceparent\n\n@id=ID\n@spanid=SPAN
was read by old workers as:
parts[0] = 'traceparent'
parts[1] = '@id=ID\n@spanid=SPAN' <-- entire multi-line blob lands in
TraceState and gets propagated
through OpenTelemetry
The empty-placeholder line did not survive RemoveEmptyEntries, and any
@-prefixed continuation line was reachable by parts[1]. The earlier 'old
reader compat' tests used plain Split('\n') and therefore failed to
exercise the real legacy semantics.
NEW WIRE FORMAT
Carry the extended fields (Id, SpanId, ClientSpanId) inside the W3C
tracestate as a private vendor key named 'durabletask-mssql':
traceparent\ndurabletask-mssql=id:...;span:...[;client:...][,user-tracestate]
W3C tracestate explicitly requires unknown vendor keys to be preserved and
propagated unmodified, so:
- Old reader on a new row reads a clean single-line traceparent and a
clean single-line tracestate (no embedded newlines). It propagates the
whole tracestate, vendor key included, through Activity.TraceStateString
exactly as if any other vendor had inserted that entry.
- New reader on an old row sees no 'durabletask-mssql=' entry, leaves the
tracestate alone, and returns null for Id/SpanId/ClientSpanId — same as
current behavior for any history that predates this fix.
- New reader on a new row separates the vendor key from any user tracestate
entries, restoring exact field values on round-trip.
- Rows with no Id/SpanId emit a wire format byte-for-byte identical to the
pre-PR code.
Sub-orchestration rows previously had a NULL TraceContext column on main
and the legacy reader never calls GetTraceContext for them, so the wire
format there is a single-line 'durabletask-mssql=client:<spanId>' that is
only ever parsed by the new GetSubOrchestrationClientSpanId helper.
TESTS
Reworked the unit tests so they simulate the *actual* legacy split
(Split('\n', 2, RemoveEmptyEntries)) and assert that:
- parts[0] is a clean W3C traceparent with no embedded newlines or vendor
noise
- parts[1] (where present) is a clean single-line W3C tracestate value with
no embedded newlines and no '@'-prefixed garbage
Added theory-based coverage for the new reader to verify the vendor key is
extracted regardless of position within the user tracestate (and that user
entries are preserved in order with the vendor key stripped).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@cgillum You were 100% right — I verified by reading the pre-PR reader directly. It uses
I empirically confirmed the failure mode with a small console repro, then pushed Why this is rolling-upgrade safe (and now actually proven, not asserted):
The new tests simulate the real legacy split ( Sub-orchestration rows had 11/11 unit tests pass locally. Thanks for the catch. |
Summary
DistributedTraceContextClientSpanIdso parent-side client spans can be recreated with the same span ID the child execution points toWhy this is needed
DurableTask.SqlServerflattens history-event trace data into theTraceContextSQL column instead of serializing the full data contract graph. Before this change, that projection only preserved W3Ctraceparentand optionaltracestate.That was not enough for the Activity-based tracing contract in
DurableTask.Core:DistributedTraceContext.IdandDistributedTraceContext.SpanIdare used to restore the same orchestration execution span across reloads/replays/continuations.SubOrchestrationInstanceCreatedEvent.ClientSpanIdis used to recreate the parent-side client span for a sub-orchestration with the exact span ID that the child orchestration server span references as its parent.When SQL dropped those values:
parentSpanIdthat the parent side never actually emitted, which shows up as a missing parent/orphaned span in trace viewersWhat changed
TraceContextencoding to preserve extra durable trace fields without a schema change@tracestate=@id=@spanid=@clientspanid=ClientSpanIdwhen materializingSubOrchestrationInstanceCreatedEventDistributedTraceContextround-trippingClientSpanIdround-trippingParentSpanIdpointing to a missing span in the captured Activity graphScope note
This change fixes newly persisted SQL histories. It does not retroactively backfill older in-flight histories that were already stored in the legacy trace-context format before the provider change.
Validation
Sample screenshot for illustration of the problem - span's with incorrect parents ID.