Add OpenTelemetry tracing across the backbeat pipeline#2733
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
Codecov Report❌ Patch coverage is Additional details and impacted files
... and 5 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.4 #2733 +/- ##
===================================================
- Coverage 74.73% 74.48% -0.25%
===================================================
Files 199 200 +1
Lines 13650 13725 +75
===================================================
+ Hits 10201 10223 +22
- Misses 3439 3492 +53
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
9d08f7b to
2f7afb0
Compare
2f7afb0 to
d562a0a
Compare
d562a0a to
51a9f61
Compare
51a9f61 to
b9d3528
Compare
970a811 to
849d6b0
Compare
54399e3 to
3189d3a
Compare
3189d3a to
cedc91c
Compare
DarkIsDude
left a comment
There was a problem hiding this comment.
No more for me. Code is clean and factorised. No issue found. Congrats
Pin arsenal at the 8.4.6 release tag (shared tracing module + W3C trace-context stamping on MongoDB metadata writes). Drop the SDK-core packages now that arsenal carries them as optionalDependencies, and keep the four instrumentation packages (http, ioredis, mongodb, aws-sdk) here — the consumer owns and configures them. Issue: BB-764
lib/tracing/index.js becomes a thin shim over arsenal's shared module: it carries backbeat's config (serviceName, the http/ioredis/mongodb/ aws-sdk instrumentations, outbound-only HTTP via makeHttpInstrumentationConfig + disableIncomingRequestInstrumentation) so the 8 entry points keep calling init() with no args. kafkaTraceContext.js re-exports arsenal's kafka helpers so the existing require sites are unchanged. The trust-boundary filter and SDK bootstrap now live in arsenal. Issue: BB-764
Wire arsenal's tracing into the replication, lifecycle, GC, notification, and oplog-populator pods: init() at each entry point, per-pod spans, and trace-context propagation across the Kafka pipeline (producers stamp traceparent via the kafka helpers; consumers start linked spans from it). Out-of-process Kafka hops use span links, not parent/child, so traces stay bounded. Issue: BB-764
| const KafkaBacklogMetrics = require('../../../lib/KafkaBacklogMetrics'); | ||
|
|
||
| describe('BackbeatConsumer._processTask', () => { | ||
| const savedEnv = process.env.ENABLE_OTEL; |
There was a problem hiding this comment.
The savedEnv snapshot is captured at module-parse time (line 10), before any beforeEach has run. If another test file loaded earlier sets ENABLE_OTEL, this snapshot preserves that value and the afterEach restores it — but the real issue is that savedEnv is a const captured once, so if one test within this file changes it, the restore logic in afterEach always restores to the original module-load value, which is correct. However, if tests run in parallel or another file's after hook deletes ENABLE_OTEL between the const capture and the first afterEach, the restore is stale. A safer pattern is to snapshot inside beforeEach:js<br>let savedEnv;<br>beforeEach(() => {<br> savedEnv = process.env.ENABLE_OTEL;<br> sinon.stub(KafkaBacklogMetrics, 'onTaskStarted');<br>});<br>
Minor — not a blocker.
— Claude Code
| // in-process scan work (Mongo bucket listing) nests under it. | ||
| const tracer = trace.getTracer('backbeat'); | ||
| const span = tracer.startSpan('lifecycle.conductor.scan', { | ||
| kind: SpanKind.INTERNAL, |
There was a problem hiding this comment.
In the OTEL-disabled branch, cb is wrapped in (err, res) => { if (cb) {cb(err, res);} } — but the original processBuckets before this PR passed cb straight through to the body (now _processBucketsInternal), which itself calls cb(...) in multiple places without guarding. This new guard makes the disabled path safer than it was before, but the enabled path (line 433) does the same guard while _processBucketsInternal itself does not guard. Consider either always guarding inside _processBucketsInternal, or not guarding here (keeping the pre-PR contract that cb is required).
Not a bug today since callers always pass cb, but the inconsistency may confuse future readers.
— Claude Code
| if (registered) { | ||
| return; | ||
| } | ||
| api.context.setGlobalContextManager(new AsyncLocalStorageContextManager().enable()); |
There was a problem hiding this comment.
setGlobalContextManager and setGlobalPropagator are write-once in the OTEL API — calling them again (from another test file that also uses withActiveSpan) is a silent no-op but returns false. The registered guard handles this within one file, but if any other test file independently calls api.context.setGlobalContextManager(...), the two registrations will race and one silently loses.
This is fine for now since withActiveSpan is the single registering entry point, but worth noting in case a future test sets up its own context manager.
— Claude Code
|
Well-structured PR that adds OpenTelemetry tracing across the backbeat pipeline with a clean feature-flag gate ( |
Summary
Add OpenTelemetry tracing across the backbeat pipeline, gated behind
ENABLE_OTEL=true. When the flag is unset, no@opentelemetry/*package is loaded — zero overhead off the OTEL path.The SDK bootstrap, trust-boundary host filter, and Kafka trace-context helpers now live in arsenal's shared
lib/tracingmodule (scality/Arsenal#2632, ARSN-586); backbeat consumes it through a thin shim instead of carrying its own copy. Companion to the cloudserver (#6140, CLDSRV-884) and vault (#203, VAULT-708) PRs, so all four services share one implementation.Commits
chore: depend on arsenal OTEL tracing module— pin arsenal at the8.4.6release tag (shared tracing module + the W3C trace-context stamping on MongoDB metadata writes, ARSN-572, which the Kafka pipeline relies on to continue traces across the oplog boundary). Drop the SDK-core packages now that arsenal carries them asoptionalDependencies, and keep the four instrumentation packages backbeat configures itself:instrumentation-http/-ioredis/-mongodb/-aws-sdk.feat: replace in-tree tracing with arsenal shim—lib/tracing/index.jsbecomes a thin shim overrequire('arsenal/build/lib/tracing')carrying backbeat's config in one place:serviceName: 'backbeat', the four instrumentations, and outbound-only HTTP (...makeHttpInstrumentationConfig()for the trust-boundaryrequestHook, plusdisableIncomingRequestInstrumentation: truesince backbeat pods serve no application HTTP).lib/tracing/kafkaTraceContext.jsre-exports arsenal's kafka helpers so the existing require sites are unchanged. The trust-boundary filter and SDK bootstrap that used to live here are now arsenal's.feat: instrument backbeat pods and the Kafka pipeline— wire tracing into the replication, lifecycle, GC, notification, and oplog-populator pods:init()at each of the 8 entry points, per-pod spans, and trace-context propagation across the Kafka pipeline. Producers stamptraceparentonto message headers via the kafka helpers; consumers start a span linked to (not a child of) the upstream span — out-of-process Kafka hops can fire long after the original request, so links keep traces bounded.Why a shim (vs cloudserver/vault's direct calls)
backbeat has 8
init()entry points and 6 Kafka-helper require sites. The shim keeps all 14 call sites untouched and the backbeat-specific config in one file; cloudserver and vault have a single entry point each, so they deep-require arsenal directly.Configuration
OpenTelemetry environment variables are documented in the arsenal module.
Out of scope (follow-ups)
The trust-boundary enforcement (read
OTEL_TRUSTED_HOSTS, striptraceparenton untrusted outbound) lives entirely in arsenal and is wired automatically. Populating that env var per deployment is routine operator config — no backbeat-side work.Related tickets
Issue: BB-764