Skip to content

Preserve durable trace context in SQL history#306

Open
chandramouleswaran wants to merge 7 commits into
microsoft:mainfrom
chandramouleswaran:fix/sql-trace-context-stitching
Open

Preserve durable trace context in SQL history#306
chandramouleswaran wants to merge 7 commits into
microsoft:mainfrom
chandramouleswaran:fix/sql-trace-context-stitching

Conversation

@chandramouleswaran
Copy link
Copy Markdown

@chandramouleswaran chandramouleswaran commented Apr 16, 2026

Summary

  • preserve orchestration replay span identity when SQL round-trips DistributedTraceContext
  • preserve sub-orchestration ClientSpanId so parent-side client spans can be recreated with the same span ID the child execution points to
  • add regression tests for continuation stability and listener-observed parent/child hierarchy

Why this is needed

DurableTask.SqlServer flattens history-event trace data into the TraceContext SQL column instead of serializing the full data contract graph. Before this change, that projection only preserved W3C traceparent and optional tracestate.

That was not enough for the Activity-based tracing contract in DurableTask.Core:

  • DistributedTraceContext.Id and DistributedTraceContext.SpanId are used to restore the same orchestration execution span across reloads/replays/continuations.
  • SubOrchestrationInstanceCreatedEvent.ClientSpanId is used to recreate the parent-side client span for a sub-orchestration with the exact span ID that the child orchestration server span references as its parent.

When SQL dropped those values:

  • later activity-completion and timer spans could parent themselves to orchestration span IDs that no longer matched the replayed orchestration span visible in the exported trace
  • child sub-orchestration spans could carry a parentSpanId that the parent side never actually emitted, which shows up as a missing parent/orphaned span in trace viewers

What changed

  • extend the SQL TraceContext encoding to preserve extra durable trace fields without a schema change
    • @tracestate=
    • @id=
    • @spanid=
    • @clientspanid=
  • keep the legacy trace-context format readable for backward compatibility
  • rehydrate ClientSpanId when materializing SubOrchestrationInstanceCreatedEvent
  • add unit tests for:
    • extended DistributedTraceContext round-tripping
    • legacy compatibility
    • sub-orchestration ClientSpanId round-tripping
  • add integration tests for:
    • stable orchestration span IDs across continuations
    • nested sub-orchestration hierarchies
    • repeated sibling sub-orchestrations
    • no emitted ParentSpanId pointing to a missing span in the captured Activity graph

Scope note

This change fixes newly persisted SQL histories. It does not retroactively backfill older in-flight histories that were already stored in the legacy trace-context format before the provider change.

Validation

dotnet test .\test\DurableTask.SqlServer.Tests\DurableTask.SqlServer.Tests.csproj --nologo -v minimal --filter "FullyQualifiedName~SqlUtilsTraceContextTests|FullyQualifiedName~TraceContextFlowCorrectly|FullyQualifiedName~TraceContextMaintainsStableOrchestrationSpanAcrossContinuations|FullyQualifiedName~ActivityListenerCapturesNestedSubOrchestrationHierarchyWithoutMissingParents|FullyQualifiedName~ActivityListenerCapturesRepeatedSubOrchestrationsWithoutMissingParents"

Sample screenshot for illustration of the problem - span's with incorrect parents ID.

image

Keep orchestration replay span identity and sub-orchestration client span IDs when round-tripping SQL trace context so exported spans stitch correctly across continuations and nested sub-orchestrations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread src/DurableTask.SqlServer/SqlUtils.cs Outdated
Comment thread test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs
@fanyirobin
Copy link
Copy Markdown
Contributor

overall LGTM, the failed CI test might be flaky. I run into this failure often as well.

@cgillum cgillum requested a review from Copilot May 7, 2026 19:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Preserves DurableTask trace/span identity when SQL persists history by extending the TraceContext encoding (without a schema change) and adding regression tests that validate stable span IDs and correct parent/child hierarchies across replays/continuations and sub-orchestrations.

Changes:

  • Extend SQL TraceContext serialization/parsing to preserve DistributedTraceContext.Id / SpanId and sub-orchestration ClientSpanId.
  • Rehydrate ClientSpanId when materializing SubOrchestrationInstanceCreatedEvent.
  • Add unit + integration tests that lock down round-tripping and emitted Activity parent/child relationships.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
test/DurableTask.SqlServer.Tests/Unit/SqlUtilsTraceContextTests.cs Adds unit coverage for extended/legacy trace-context round-tripping and ClientSpanId persistence.
test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs Adds integration regressions that validate exported Activity graphs have stable orchestration span IDs and no missing parents.
src/DurableTask.SqlServer/SqlUtils.cs Implements the extended trace-context encoding/decoding and sub-orchestration ClientSpanId hydration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/DurableTask.SqlServer/SqlUtils.cs
Comment thread src/DurableTask.SqlServer/SqlUtils.cs Outdated
Comment thread src/DurableTask.SqlServer/SqlUtils.cs Outdated
Comment thread src/DurableTask.SqlServer/SqlUtils.cs Outdated
Comment thread test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs Outdated
Comment thread test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs Outdated
…tions

- Reserve line 1 of every TraceContext payload for traceparent (empty for
  sub-orchestration-only payloads) so all callers share a single parsing
  contract. Avoids the risk of treating '@clientspanid=...' as a malformed
  W3C traceparent if a sub-orchestration row is ever read through the
  general DistributedTraceContext path.

- Centralize TraceContext parsing into a single ParseTraceContext helper.
  GetDistributedTraceContextFromReader and GetSubOrchestrationClientSpanId
  now share one parser, which extracts traceparent, tracestate, id, spanid,
  and clientspanid in one pass. Reduces drift over time.

- Backward compatibility: the parser still accepts the legacy single-line
  '@clientspanid=...' format that was written by earlier builds, so an
  upgrade does not break histories already in production databases. Added
  a dedicated regression test for the legacy payload shape.

- Tighten the sub-orchestration / activity client-span assertions in the
  TraceContextFlowCorrectly integration test to use Assert.Single rather
  than LastOrDefault. The test schedules exactly one of each, so any
  duplicate emission should now fail the regression deterministically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the analyzer warnings in test/DurableTask.SqlServer.Tests/Integration/Orchestrations.cs.

…er warnings

Rolling-upgrade safety for the TraceContext column (fanyirobin/cgillum review):

The previous version of this PR introduced an '@tracestate=' prefix on line 2
of the TraceContext payload when Id/SpanId fields were present. That created
a real rolling-upgrade hazard: an older worker reading a new-format row would
see '@tracestate=foo' on line 2 and store that literal string as the
tracestate value, then propagate it through OpenTelemetry.

The new wire format is fully backward compatible in both directions without
any version flag:

  Line 1: traceparent (unchanged)
  Line 2: tracestate, RAW with no prefix — matches the legacy format exactly.
          When tracestate is empty but Id/SpanId follow, line 2 is an empty
          placeholder so newer @-prefixed lines never land in the tracestate
          slot.
  Line 3+: @id=..., @spanid=..., @clientspanid=... — older readers stop at
           line 2 and ignore these.

Sub-orchestration ClientSpanId payloads now use:
  line 1: empty
  line 2: empty placeholder (reserved tracestate slot)
  line 3: @clientspanid=...
so older readers, which look at parts[0]/parts[1] only, never see an @-prefix
line where they expect tracestate.

Added two unit tests that lock down the wire-format contract:
  - GetTraceContext_NewWriterPlacesRawTraceStateOnLineTwo_ForOldReaderCompat
  - GetTraceContext_NewWriterEmitsEmptyTraceStateLine_WhenOnlyIdAndSpanIdSet

Analyzer warnings (cgillum review):

Fixed xUnit2031 warnings on Orchestrations.cs by using the
Assert.Single(collection, predicate) overload instead of
Assert.Single(collection.Where(predicate)).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chandramouleswaran
Copy link
Copy Markdown
Author

Thanks for the review @cgillum! Pushed a68f9d1 to address both pieces of feedback:

Analyzer warnings (xUnit2031) on Orchestrations.cs — switched the two Assert.Single(collection.Where(predicate)) calls to Assert.Single(collection, predicate) as the analyzer recommends. Orchestrations.cs is now warning-free.

Rolling-upgrade safety for the TraceContext column (this came up from @fanyirobin's review of the previous revision) — the prior wire format added an @tracestate= prefix on line 2, which would have caused older workers reading new-format rows to store the literal @tracestate=foo string as the tracestate value and propagate it through OpenTelemetry.

The new layout is fully backward-compatible in both directions without any version flag:

  • Line 1: traceparent (unchanged)
  • Line 2: tracestate, raw with no prefix — matches the legacy format. When tracestate is empty but Id/SpanId follow, line 2 is an empty placeholder so the newer @-prefixed lines never land in the tracestate slot.
  • Line 3+: @id=, @spanid=, @clientspanid= — older readers stop at line 2 and ignore these.

Locked down with two new unit tests:

  • GetTraceContext_NewWriterPlacesRawTraceStateOnLineTwo_ForOldReaderCompat
  • GetTraceContext_NewWriterEmitsEmptyTraceStateLine_WhenOnlyIdAndSpanIdSet

All 6 SqlUtilsTraceContextTests pass locally.

@cgillum
Copy link
Copy Markdown
Member

cgillum commented May 13, 2026

[Copilot] Rolling-upgrade tracestate corruption is not actually fixed

Thanks for the iteration on the wire format. Unfortunately I don't think the latest revision closes the rolling-upgrade hazard — the design assumes the pre-PR reader will preserve empty placeholder lines, but it won't, because that reader uses RemoveEmptyEntries:

// pre-PR reader
string[] parts = text.Split(new[] { '\n' }, count: 2, StringSplitOptions.RemoveEmptyEntries);

Example that still breaks

For an event with Id / SpanId set and TraceState = "vendor=value", the new writer emits:

00-<trace>-<span>-01\nvendor=value\n@id=...\n@spanid=...

A worker on the previous release reads this with Split('\n', 2, RemoveEmptyEntries) and gets:

  • parts[0] = "00-<trace>-<span>-01"
  • parts[1] = "vendor=value\n@id=...\n@spanid=..."

That entire multi-line string lands in TraceState, flows into Activity.TraceStateString, and gets propagated through OpenTelemetry — the same W3C-non-compliant value we were trying to avoid. The "no-tracestate but Id/SpanId set" case is even worse: RemoveEmptyEntries drops the empty line 2, so parts[1] becomes "@id=...\n@spanid=...".

For sub-orchestration payloads ("\n\n@clientspanid=..."), both leading empty lines get removed, leaving parts[0] = "@clientspanid=..." which an old reader would assign to TraceParent — a regression vs. behavior on main.

Why the new tests don't catch this

The two new "old-reader compat" tests simulate the legacy reader as payload.Split('\n'), but the actual pre-PR code uses Split('\n', 2, RemoveEmptyEntries). Those produce different results on these exact payloads:

// GetTraceContext_NewWriterPlacesRawTraceStateOnLineTwo_ForOldReaderCompat
string[] parts = payload.Split('\n');           // not what the old reader does
Assert.Equal("vendor=value", parts[1]);         // passes under plain split
Assert.DoesNotContain(parts[1], "@");           // would FAIL with the real legacy split

Same issue in GetTraceContext_NewWriterEmitsEmptyTraceStateLine_WhenOnlyIdAndSpanIdSet — the empty placeholder it asserts on simply doesn't survive RemoveEmptyEntries. The tests pass, but they're not exercising the legacy semantics they claim to.

A more faithful test would inline the literal pre-PR call:

string[] parts = payload.Split(new[] { '\n' }, 2, StringSplitOptions.RemoveEmptyEntries);

…and assert that parts[0] is a valid W3C traceparent and parts[1] (if present) is a valid W3C tracestate — no embedded newlines, no @-prefixed internals.

Suggestion

Newline-delimited extension fields can't be made backwards-safe by adding empty padding, because the existing reader collapses empties. A couple of approaches that would actually work without a version flag:

  • Carry the new fields inside tracestate as a private vendor key (e.g. dtmssql=id:...;span:...;client:...). Old readers grab the whole tracestate string unchanged and forward it; W3C tracestate is explicitly designed to allow unknown vendor keys.
  • Move the wire format off newlines entirely (single line, escape-aware key=value list, or length-prefixed). Old readers would see the entire blob as traceparent — still wrong, but easy to detect and fall back to null rather than emit malformed telemetry.

Happy to help iterate on either direction. The core blocker is just that any encoding which leaves @-prefixed text reachable by parts[0] or parts[1] of the old Split('\n', 2, RemoveEmptyEntries) will leak into emitted spans during a rolling upgrade.

Reverts the prior empty-placeholder wire format and replaces it with the
approach suggested by cgillum in the PR review.

PROBLEM WITH THE PRIOR APPROACH
The pre-PR (legacy) reader uses:
    text.Split('\n', count: 2, StringSplitOptions.RemoveEmptyEntries)

RemoveEmptyEntries strips empty lines BEFORE the count cap is applied, and
count: 2 leaves the entire remainder of the string in parts[1]. This means
the previous attempt's wire format

    traceparent\n\n@id=ID\n@spanid=SPAN

was read by old workers as:
    parts[0] = 'traceparent'
    parts[1] = '@id=ID\n@spanid=SPAN'   <-- entire multi-line blob lands in
                                             TraceState and gets propagated
                                             through OpenTelemetry

The empty-placeholder line did not survive RemoveEmptyEntries, and any
@-prefixed continuation line was reachable by parts[1]. The earlier 'old
reader compat' tests used plain Split('\n') and therefore failed to
exercise the real legacy semantics.

NEW WIRE FORMAT
Carry the extended fields (Id, SpanId, ClientSpanId) inside the W3C
tracestate as a private vendor key named 'durabletask-mssql':

    traceparent\ndurabletask-mssql=id:...;span:...[;client:...][,user-tracestate]

W3C tracestate explicitly requires unknown vendor keys to be preserved and
propagated unmodified, so:

- Old reader on a new row reads a clean single-line traceparent and a
  clean single-line tracestate (no embedded newlines). It propagates the
  whole tracestate, vendor key included, through Activity.TraceStateString
  exactly as if any other vendor had inserted that entry.
- New reader on an old row sees no 'durabletask-mssql=' entry, leaves the
  tracestate alone, and returns null for Id/SpanId/ClientSpanId — same as
  current behavior for any history that predates this fix.
- New reader on a new row separates the vendor key from any user tracestate
  entries, restoring exact field values on round-trip.
- Rows with no Id/SpanId emit a wire format byte-for-byte identical to the
  pre-PR code.

Sub-orchestration rows previously had a NULL TraceContext column on main
and the legacy reader never calls GetTraceContext for them, so the wire
format there is a single-line 'durabletask-mssql=client:<spanId>' that is
only ever parsed by the new GetSubOrchestrationClientSpanId helper.

TESTS
Reworked the unit tests so they simulate the *actual* legacy split
(Split('\n', 2, RemoveEmptyEntries)) and assert that:

- parts[0] is a clean W3C traceparent with no embedded newlines or vendor
  noise
- parts[1] (where present) is a clean single-line W3C tracestate value with
  no embedded newlines and no '@'-prefixed garbage

Added theory-based coverage for the new reader to verify the vendor key is
extracted regardless of position within the user tracestate (and that user
entries are preserved in order with the vendor key stripped).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chandramouleswaran
Copy link
Copy Markdown
Author

@cgillum You were 100% right — I verified by reading the pre-PR reader directly. It uses Split('\n', count: 2, StringSplitOptions.RemoveEmptyEntries), and the previous attempt completely fell apart under those semantics:

  • The empty placeholder line on row 2 was stripped by RemoveEmptyEntries.
  • count: 2 left the entire remainder of the string in parts[1], so @id=...\n@spanid=... was landing wholesale in TraceState.
  • The previous "old reader compat" tests used a plain Split('\n') and therefore failed to exercise the actual legacy semantics — your call-out on that was exactly correct.

I empirically confirmed the failure mode with a small console repro, then pushed a2f69da taking your first suggestion: carry the extended fields inside the W3C tracestate as a private vendor key named durabletask-mssql.

traceparent\ndurabletask-mssql=id:...;span:...[;client:...][,user-tracestate]

Why this is rolling-upgrade safe (and now actually proven, not asserted):

Writer Reader Outcome
New Old parts[0] is a clean traceparent, parts[1] is a clean single-line tracestate that includes our vendor key. W3C explicitly requires unknown vendor keys to be propagated, so it flows through Activity.TraceStateString unchanged. No newlines, no @ garbage.
Old New No durabletask-mssql= entry to extract, tracestate is left alone, Id/SpanId come back null — identical behavior to current main.
New New Vendor key parsed, user tracestate preserved in original order, fields round-trip exactly.
Old Old Unchanged (rows with no Id/SpanId are still emitted byte-for-byte identical to pre-PR).

The new tests simulate the real legacy split (Split('\n', 2, RemoveEmptyEntries)) and assert (1) parts[0] has no embedded newlines and no vendor noise and (2) parts[1] is a clean single-line tracestate. Theory cases also cover the vendor key appearing in any position within the user tracestate.

Sub-orchestration rows had TraceContext = NULL on main and the legacy reader's switch statement never invokes GetTraceContext for them, so the wire format there is a single-line durabletask-mssql=client:<spanId> that is only ever parsed by the new GetSubOrchestrationClientSpanId helper. Confirmed by reading the pre-PR GetHistoryEvent switch in upstream main.

11/11 unit tests pass locally. Thanks for the catch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants