Add OpenTelemetry tracing across the backbeat pipeline #2733

Open
delthas wants to merge 3 commits into development/9.4 from improvement/BB-764/otel-replication-tracing

@delthas delthas commented Apr 14, 2026

Contributor

Summary

Add OpenTelemetry tracing across the backbeat pipeline, gated behind ENABLE_OTEL=true. When the flag is unset, no @opentelemetry/* package is loaded — zero overhead off the OTEL path.
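The gating described above can be sketched as follows. This is a minimal illustration rather than backbeat's actual code: `makeTracing` and `loadTracingSdk` are hypothetical names, and the loader is injected so the sketch runs without any `@opentelemetry/*` package installed. The only detail taken from the description is that the `ENABLE_OTEL=true` check happens before anything is loaded.

```javascript
'use strict';

// Hypothetical factory. In a real service, loadTracingSdk is where the
// @opentelemetry/* packages would be required; here it is injected.
function makeTracing(loadTracingSdk, env = process.env) {
    let sdk = null;
    return {
        init() {
            // Flag unset or not exactly 'true': return before loading
            // anything, so the disabled path never touches the SDK.
            if (env.ENABLE_OTEL !== 'true') {
                return false;
            }
            if (!sdk) {
                sdk = loadTracingSdk();
                sdk.start();
            }
            return true;
        },
        async close() {
            if (sdk) {
                await sdk.shutdown();
                sdk = null;
            }
        },
    };
}

// Usage: count how many times the loader runs under each setting.
let loads = 0;
const fakeLoader = () => {
    loads += 1;
    return { start() {}, shutdown: async () => {} };
};

const disabled = makeTracing(fakeLoader, {});
const enabled = makeTracing(fakeLoader, { ENABLE_OTEL: 'true' });

const disabledResult = disabled.init();
const loadsAfterDisabled = loads;
const enabledResult = enabled.init();
enabled.init(); // a second call reuses the already-started SDK
```

Deferring the load into `init()` is what keeps the disabled path free of SDK cost: a top-level `require` would run regardless of the flag.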

The SDK bootstrap, trust-boundary host filter, and Kafka trace-context helpers now live in arsenal's shared lib/tracing module (scality/Arsenal#2632, ARSN-586); backbeat consumes it through a thin shim instead of carrying its own copy. Companion to the cloudserver (#6140, CLDSRV-884) and vault (#203, VAULT-708) PRs, so all four services share one implementation.

Commits

  1. chore: depend on arsenal OTEL tracing module — pin arsenal at the 8.4.6 release tag (shared tracing module + the W3C trace-context stamping on MongoDB metadata writes, ARSN-572, which the Kafka pipeline relies on to continue traces across the oplog boundary). Drop the SDK-core packages now that arsenal carries them as optionalDependencies, and keep the four instrumentation packages backbeat configures itself: instrumentation-http / -ioredis / -mongodb / -aws-sdk.

  2. feat: replace in-tree tracing with arsenal shim — lib/tracing/index.js becomes a thin shim over require('arsenal/build/lib/tracing') carrying backbeat's config in one place: serviceName: 'backbeat', the four instrumentations, and outbound-only HTTP (...makeHttpInstrumentationConfig() for the trust-boundary requestHook, plus disableIncomingRequestInstrumentation: true since backbeat pods serve no application HTTP). lib/tracing/kafkaTraceContext.js re-exports arsenal's kafka helpers so the existing require sites are unchanged. The trust-boundary filter and SDK bootstrap that used to live here are now arsenal's.

  3. feat: instrument backbeat pods and the Kafka pipeline — wire tracing into the replication, lifecycle, GC, notification, and oplog-populator pods: init() at each of the 8 entry points, per-pod spans, and trace-context propagation across the Kafka pipeline. Producers stamp traceparent onto message headers via the kafka helpers; consumers start a span linked to (not a child of) the upstream span — out-of-process Kafka hops can fire long after the original request, so links keep traces bounded.
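The producer/consumer handoff in commit 3 can be illustrated with a dependency-free sketch. It does not use arsenal's kafka helpers (their signatures are not shown in this PR); `stampTraceparent`, `readTraceparent`, and `startLinkedSpan` are hypothetical names, the header list shape (`[{ key: Buffer }]`) is an assumption, and the consumer span is assumed to be a root span in a fresh trace. The `traceparent` layout itself follows the W3C Trace Context format: `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`.

```javascript
'use strict';

const { randomBytes } = require('crypto');

// Only version 00 of the W3C traceparent header is accepted in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: serialise the current span context into a message header.
function stampTraceparent(headers, spanContext) {
    const flags = spanContext.sampled ? '01' : '00';
    const value = `00-${spanContext.traceId}-${spanContext.spanId}-${flags}`;
    return [...headers, { traceparent: Buffer.from(value) }];
}

// Consumer side: parse the header, rejecting malformed or all-zero ids.
function readTraceparent(headers) {
    for (const header of headers || []) {
        if (header.traceparent === undefined) {
            continue;
        }
        const match = TRACEPARENT_RE.exec(header.traceparent.toString());
        if (!match || /^0+$/.test(match[1]) || /^0+$/.test(match[2])) {
            return null;
        }
        return {
            traceId: match[1],
            spanId: match[2],
            sampled: (parseInt(match[3], 16) & 1) === 1,
        };
    }
    return null;
}

// Start a span in a NEW trace and record the upstream context as a link,
// instead of making the new span a child of the upstream span.
function startLinkedSpan(name, headers) {
    const upstream = readTraceparent(headers);
    return {
        name,
        traceId: randomBytes(16).toString('hex'),
        spanId: randomBytes(8).toString('hex'),
        parentSpanId: null,
        links: upstream ? [upstream] : [],
    };
}

// Usage
const upstream = {
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
    spanId: '00f067aa0ba902b7',
    sampled: true,
};
const headers = stampTraceparent([], upstream);
const span = startLinkedSpan('consumer.process', headers);
```

Because the consumer span has no parent, a message processed hours later does not extend the original request's trace; the link still lets a tracing backend navigate from one trace to the other.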

Incidental, intentional: the replicationStatusProcessor SIGTERM handler previously had no process.exit(0) on the success path — an inconsistency with the other 7 entry points. Since this PR adds tracing.close() to that handler, it also adds the missing process.exit(0) so the pod exits cleanly on SIGTERM like the rest (rather than potentially hanging on a non-empty event loop).
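A minimal sketch of the shutdown ordering described above, assuming the handler stops the worker, flushes tracing, and only then exits. `makeSigtermHandler`, `worker.stop()`, and the injected `exit` are hypothetical; only `tracing.close()` and `process.exit(0)` come from the description.

```javascript
'use strict';

// Build a SIGTERM handler. Dependencies are injected so the sketch can run
// (and be tested) without a real pod or a real process exit.
function makeSigtermHandler({ worker, tracing, exit = process.exit }) {
    let shuttingDown = false;
    return async function onSigterm() {
        if (shuttingDown) {
            return; // ignore repeated signals
        }
        shuttingDown = true;
        try {
            await worker.stop();    // stop consuming first
            await tracing.close();  // then flush buffered spans
            exit(0);                // explicit exit on the success path
        } catch (err) {
            exit(1);
        }
    };
}

// Usage with stand-in dependencies that record the call order.
const calls = [];
const handler = makeSigtermHandler({
    worker: { stop: async () => { calls.push('stop'); } },
    tracing: { close: async () => { calls.push('close'); } },
    exit: code => { calls.push(`exit ${code}`); },
});
// In a real entry point: process.on('SIGTERM', handler);
const done = handler();
```

Without the explicit exit, the process stays alive for as long as any timer or socket keeps the event loop busy, which is the hang the description refers to.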

Why a shim (vs cloudserver/vault's direct calls)

backbeat has 8 init() entry points and 6 Kafka-helper require sites. The shim keeps all 14 call sites untouched and the backbeat-specific config in one file; cloudserver and vault have a single entry point each, so they deep-require arsenal directly.
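The shim idea can be sketched like this. The shared module is replaced by a stand-in object because arsenal's exact exports are not shown in this PR: the `init(config)` signature and the config keys `instrumentations` and `http` are assumptions, while `serviceName: 'backbeat'`, `makeHttpInstrumentationConfig()`, and `disableIncomingRequestInstrumentation: true` are taken from the description.

```javascript
'use strict';

// Stand-in for the shared tracing module. In backbeat it is loaded with
// require('arsenal/build/lib/tracing'); the shape below is assumed.
const sharedTracing = {
    lastConfig: null,
    makeHttpInstrumentationConfig: () => ({ requestHook: () => {} }),
    init(config) { this.lastConfig = config; },
    close: async () => {},
};

// The shim: the one place that holds the service-specific configuration.
function makeShim(shared) {
    return {
        init() {
            shared.init({
                serviceName: 'backbeat',
                instrumentations: ['http', 'ioredis', 'mongodb', 'aws-sdk'],
                http: {
                    // Trust-boundary requestHook from the shared module.
                    ...shared.makeHttpInstrumentationConfig(),
                    // Outbound only: no incoming-request spans.
                    disableIncomingRequestInstrumentation: true,
                },
            });
        },
        close: () => shared.close(),
    };
}

// Usage: entry points keep calling init() with no arguments.
const tracing = makeShim(sharedTracing);
tracing.init();
```

With this layout a config change touches one file, and none of the entry points need to know which instrumentations are enabled.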

Configuration

OpenTelemetry environment variables are documented in the arsenal module.

Out of scope (follow-ups)

The trust-boundary enforcement (read OTEL_TRUSTED_HOSTS, strip traceparent on untrusted outbound) lives entirely in arsenal and is wired automatically. Populating that env var per deployment is routine operator config — no backbeat-side work.
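For readers unfamiliar with the mechanism, here is a minimal sketch of what such a filter does. It is not arsenal's implementation: `parseTrustedHosts` and `filterOutboundHeaders` are hypothetical names, and the comma-separated format of `OTEL_TRUSTED_HOSTS` is an assumption (the authoritative format is whatever the arsenal module documents).

```javascript
'use strict';

// Parse the allow-list. Comma-separated hostnames are ASSUMED here.
function parseTrustedHosts(raw) {
    return new Set(
        (raw || '')
            .split(',')
            .map(host => host.trim().toLowerCase())
            .filter(Boolean)
    );
}

// Return a copy of the outbound headers. Trace context is kept for trusted
// hosts and dropped for everything else.
function filterOutboundHeaders(host, headers, trusted) {
    if (trusted.has(host.toLowerCase())) {
        return { ...headers };
    }
    const { traceparent, ...rest } = headers;
    return rest;
}

// Usage (hostnames are made up for the example)
const trusted = parseTrustedHosts('vault.internal, cloudserver.internal');
const outbound = {
    traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
    'content-type': 'application/json',
};
const toTrusted = filterOutboundHeaders('Vault.internal', outbound, trusted);
const toUntrusted = filterOutboundHeaders('s3.example.com', outbound, trusted);
```

An empty or unset variable yields an empty set in this sketch, so trace context is stripped from every outbound request until an operator lists the trusted hosts.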

Related tickets

Issue: BB-764

bert-e commented Apr 14, 2026

Contributor

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e commented Apr 14, 2026

Contributor

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 54.54545% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.48%. Comparing base (c52fbcc) to head (e161d1e).
⚠️ Report is 18 commits behind head on development/9.4.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tensions/lifecycle/conductor/LifecycleConductor.js | 44.44% | 10 Missing ⚠️ |
| lib/tracing/index.js | 33.33% | 6 Missing ⚠️ |
| bin/queuePopulator.js | 0.00% | 3 Missing ⚠️ |
| extensions/gc/service.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/bucketProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/conductor/service.js | 0.00% | 3 Missing ⚠️ |
| extensions/lifecycle/objectProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/notification/queueProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| extensions/replication/queueProcessor/task.js | 0.00% | 3 Missing ⚠️ |
| ...ons/replication/replicationStatusProcessor/task.js | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| extensions/lifecycle/tasks/LifecycleTask.js | 91.56% <100.00%> (+0.01%) ⬆️ |
| ...ensions/notification/NotificationQueuePopulator.js | 98.21% <100.00%> (+0.03%) ⬆️ |
| ...cation/destination/KafkaNotificationDestination.js | 81.63% <100.00%> (+2.56%) ⬆️ |
| extensions/replication/ReplicationAPI.js | 86.95% <100.00%> (+0.28%) ⬆️ |
| ...xtensions/replication/ReplicationQueuePopulator.js | 91.93% <100.00%> (-1.40%) ⬇️ |
| lib/BackbeatConsumer.js | 95.35% <100.00%> (+0.46%) ⬆️ |
| lib/BackbeatProducer.js | 90.17% <ø> (ø) |
| lib/queuePopulator/QueuePopulatorExtension.js | 86.11% <100.00%> (-5.07%) ⬇️ |
| bin/queuePopulator.js | 0.00% <0.00%> (ø) |
| extensions/gc/service.js | 0.00% <0.00%> (ø) |

... and 8 more

... and 5 files with indirect coverage changes

| Components | Coverage Δ |
| --- | --- |
| Bucket Notification | 80.22% <75.00%> (-0.01%) ⬇️ |
| Core Library | 80.78% <80.00%> (-0.21%) ⬇️ |
| Ingestion | 70.63% <ø> (-0.61%) ⬇️ |
| Lifecycle | 78.63% <33.33%> (-0.43%) ⬇️ |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 59.64% <40.00%> (-0.15%) ⬇️ |
| Bucket Scanner | 85.76% <ø> (ø) |

@@                 Coverage Diff                 @@
##           development/9.4    #2733      +/-   ##
===================================================
- Coverage            74.73%   74.48%   -0.25%     
===================================================
  Files                  199      200       +1     
  Lines                13650    13725      +75     
===================================================
+ Hits                 10201    10223      +22     
- Misses                3439     3492      +53     
  Partials                10       10              

| Flag | Coverage Δ |
| --- | --- |
| api:retry | 9.07% <0.00%> (-0.06%) ⬇️ |
| api:routes | 8.89% <0.00%> (-0.06%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 9.08% <7.95%> (-1.05%) ⬇️ |
| ingestion | 12.50% <9.09%> (-0.08%) ⬇️ |
| lib | 7.77% <11.36%> (-0.02%) ⬇️ |
| lifecycle | 18.92% <26.13%> (-0.08%) ⬇️ |
| notification | 1.02% <0.00%> (-0.01%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| replication | 18.70% <12.50%> (-0.03%) ⬇️ |
| unit | 51.43% <53.40%> (+0.19%) ⬆️ |

@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 9d08f7b to 2f7afb0 Compare April 14, 2026 10:44
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 2f7afb0 to d562a0a Compare April 14, 2026 10:47
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from d562a0a to 51a9f61 Compare April 14, 2026 11:00
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 51a9f61 to b9d3528 Compare April 14, 2026 16:07
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 970a811 to 849d6b0 Compare April 15, 2026 15:22
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot Jun 3, 2026
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 54399e3 to 3189d3a Compare June 3, 2026 17:08
@scality scality deleted a comment from claude Bot Jun 3, 2026
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 3189d3a to cedc91c Compare June 3, 2026 17:47
@scality scality deleted a comment from claude Bot Jun 3, 2026

@DarkIsDude DarkIsDude left a comment

No more for me. Code is clean and factorised. No issue found. Congrats

delthas added 2 commits June 10, 2026 21:44
Pin arsenal at the 8.4.6 release tag (shared tracing module + W3C
trace-context stamping on MongoDB metadata writes). Drop the SDK-core
packages now that arsenal carries them as optionalDependencies, and
keep the four instrumentation packages (http, ioredis, mongodb,
aws-sdk) here — the consumer owns and configures them.

Issue: BB-764

lib/tracing/index.js becomes a thin shim over arsenal's shared module:
it carries backbeat's config (serviceName, the http/ioredis/mongodb/
aws-sdk instrumentations, outbound-only HTTP via
makeHttpInstrumentationConfig + disableIncomingRequestInstrumentation)
so the 8 entry points keep calling init() with no args.
kafkaTraceContext.js re-exports arsenal's kafka helpers so the existing
require sites are unchanged. The trust-boundary filter and SDK
bootstrap now live in arsenal.

Issue: BB-764

Wire arsenal's tracing into the replication, lifecycle, GC,
notification, and oplog-populator pods: init() at each entry point,
per-pod spans, and trace-context propagation across the Kafka pipeline
(producers stamp traceparent via the kafka helpers; consumers start
linked spans from it). Out-of-process Kafka hops use span links, not
parent/child, so traces stay bounded.

Issue: BB-764

const KafkaBacklogMetrics = require('../../../lib/KafkaBacklogMetrics');

describe('BackbeatConsumer._processTask', () => {
const savedEnv = process.env.ENABLE_OTEL;

The savedEnv snapshot is captured once, at module-load time (line 10), before any beforeEach has run. Within this file that is correct: afterEach always restores the value ENABLE_OTEL had at load time, whatever individual tests set. The snapshot can go stale, though, if tests run in parallel or if another file's hook sets or deletes ENABLE_OTEL between the capture and this file's first afterEach. A safer pattern is to snapshot inside beforeEach:

```js
let savedEnv;
beforeEach(() => {
    savedEnv = process.env.ENABLE_OTEL;
    sinon.stub(KafkaBacklogMetrics, 'onTaskStarted');
});
```

Minor — not a blocker.

— Claude Code

// in-process scan work (Mongo bucket listing) nests under it.
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan('lifecycle.conductor.scan', {
kind: SpanKind.INTERNAL,

In the OTEL-disabled branch, cb is wrapped in (err, res) => { if (cb) {cb(err, res);} } — but the original processBuckets before this PR passed cb straight through to the body (now _processBucketsInternal), which itself calls cb(...) in multiple places without guarding. This new guard makes the disabled path safer than it was before, but the enabled path (line 433) does the same guard while _processBucketsInternal itself does not guard. Consider either always guarding inside _processBucketsInternal, or not guarding here (keeping the pre-PR contract that cb is required).

Not a bug today since callers always pass cb, but the inconsistency may confuse future readers.

— Claude Code

if (registered) {
return;
}
api.context.setGlobalContextManager(new AsyncLocalStorageContextManager().enable());

setGlobalContextManager and setGlobalPropagator are write-once in the OTEL API — calling them again (from another test file that also uses withActiveSpan) is a silent no-op but returns false. The registered guard handles this within one file, but if any other test file independently calls api.context.setGlobalContextManager(...), the two registrations will race and one silently loses.

This is fine for now since withActiveSpan is the single registering entry point, but worth noting in case a future test sets up its own context manager.

— Claude Code

claude Bot commented Jun 11, 2026

Well-structured PR that adds OpenTelemetry tracing across the backbeat pipeline with a clean feature-flag gate (ENABLE_OTEL). The shim pattern, trust-boundary header stripping on the customer Kafka path, and linked (not child) spans across Kafka hops are all sound design choices. Arsenal is correctly pinned to a release tag (8.4.6). No correctness bugs found.

Minor observations posted as inline comments:
- tests/unit/lib/BackbeatConsumer.spec.js:10 — savedEnv snapshot captured at module-parse time; safer to snapshot inside beforeEach
- extensions/lifecycle/conductor/LifecycleConductor.js:419 — if (cb) guard added in the wrapper but not inside _processBucketsInternal — minor inconsistency
- tests/utils/withActiveSpan.js:17 — OTEL global context manager is write-once; the guard handles it today but worth noting for future tests

None of these are blockers.

Review by Claude Code

Labels

claude-review-retro PRs with a Claude Code review that could be improved
