Skip to content

feat(sdk): add otel support#177

Open
namrataghadi-galileo wants to merge 2 commits intofeature/61576-add-config-driven-sink-selectionfrom
feature/61578-add-otel-support
Open

feat(sdk): add otel support#177
namrataghadi-galileo wants to merge 2 commits intofeature/61576-add-config-driven-sink-selectionfrom
feature/61578-add-otel-support

Conversation

@namrataghadi-galileo
Copy link
Copy Markdown
Contributor

Summary

Added configurable observability sink selection for the Python SDK and server so control events can be routed to the default backend, registered custom sinks, or named sink factories.

Added a built-in OpenTelemetry sink for the SDK, including OTLP configuration via settings/environment variables.

Updated docs and exports so custom sink registration and OTEL usage are available as public integration points.

Scope

User-facing/API changes:

  • Added observability_sink_name and observability_sink_config to SDK initialization/configuration
  • Exposed sink registration helpers and OTEL conversion utilities from the Python SDK public API
  • Added documented support for the SDK otel extra and OTEL-related environment variables
  • Added server observability sink selection/config support for named backends

Internal changes:

  • Introduced shared sink-selection models/registry helpers in telemetry/
  • Refactored SDK observability to resolve active sinks dynamically and support named/custom sinks
  • Refactored server startup/shutdown to resolve observability backends through sink selection
  • Added test coverage for sink selection, lifecycle handling, OTEL sink behavior, and re-init/policy refresh interactions

Out of scope:

  • No changes to core control evaluation semantics
  • No UI changes
  • No new non-OTEL external sink implementation included beyond registration hooks

Risk and Rollout

Risk level: medium

Rollback plan: Revert the sink-selection/OTEL changeset to restore the previous default SDK-to-server observability path only. If needed, disable custom routing by keeping observability_sink_name=default and not installing/configuring the otel extra.

Testing

  • Added or updated automated tests
  • Ran make check — typecheck was attempted but full local make check was not run because uv was unavailable in this environment
  • Manually verified behavior

Checklist

  • Linked issue/spec (if applicable)
  • Updated docs/examples for user-facing changes
  • Included any required follow-up tasks

@namrataghadi-galileo namrataghadi-galileo changed the base branch from main to feature/61576-add-config-driven-sink-selection April 16, 2026 22:36
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 89.26554% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
sdks/python/src/agent_control/otel_sink.py 88.69% 19 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having a built-in OTEL path, but I don't think this version is safe to merge yet. Right now the SDK can tell callers observability is working while the OTEL sink is effectively inert, and the caching logic doesn't compose cleanly with the new otel_* settings. I'd also mark failed control executions as OTEL errors rather than only attaching an exception object.

"""Sink that accepts events but intentionally emits nothing."""

def write_events(self, events: Sequence[ControlExecutionEvent]) -> SinkResult:
return SinkResult(accepted=len(events), dropped=0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is acknowledging delivery even though nothing is emitted. That means add_event() returns True and is_observability_enabled() stays truthy while observability is effectively off. I think this should fail closed like the missing named-sink path in the previous PR, not silently ack drops.


def _resolve_otel_sink_config(config: JSONObject) -> OTELSinkConfig:
"""Resolve OTEL sink config from settings with per-sink overrides."""
settings = get_settings()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This factory is reading hidden global otel_* settings, but the named-sink cache still keys only on (name, config). After the first resolution, changing otel_endpoint, headers, service name, or the enabled flag with the same sink selection won't rebuild the provider. Either the fully resolved OTEL config needs to live in observability_sink_config, or the cache key needs to include these settings.

start_time=span_data.start_time_unix_nano,
)
span.set_attributes(span_data.attributes)
if span_data.error_message:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we're emitting spans for failed control executions, I think we should set OTEL span status to ERROR here rather than only recording an exception. A lot of backends key alerting and red/error views off status, not exception payloads alone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants