Summary
Add a public Hedging Detection API to the Cosmos Python SDK so customers can post-hoc determine whether a successful or failed Cosmos point/feed operation went through cross-region hedging, which regions were dispatched against, and which regions responded.
Python has no first-class CosmosDiagnostics object today: only response wrappers (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged) carrying get_response_headers(), plus opt-in CosmosHttpLoggingPolicy log strings. After cross-SDK gate review, the chosen Python shape is three new accessor methods added directly to each wrapper / exception type, backed by a shared private _HedgingDetectionState instance. This matches the get_response_headers() precedent on those wrappers and avoids a response-wrapper refactor.
This is part of a cross-SDK feature being implemented in parallel across .NET, Java, and Python (with a spec-only deliverable for Rust).
Public API additions
    # azure/cosmos/__init__.py: new exports
    from azure.cosmos._diagnostics_types import RequestedRegion, RequestedRegionReason

    # azure/cosmos/_diagnostics_types.py (NEW)
    from dataclasses import dataclass
    from enum import Enum

    @dataclass(frozen=True, slots=True)
    class RequestedRegion:
        region_name: str
        reason: "RequestedRegionReason"

    class RequestedRegionReason(Enum):
        INITIAL = "initial"
        OPERATION_RETRY = "operation_retry"
        TRANSPORT_RETRY = "transport_retry"  # reserved, not populated today
        HEDGING = "hedging"
        REGION_FAILOVER = "region_failover"
        CIRCUIT_BREAKER_PROBE = "circuit_breaker_probe"  # reserved, not populated in v1
        UNKNOWN = "unknown"

        @classmethod
        def _missing_(cls, value):  # forward-compat for reasons added in future versions
            return cls.UNKNOWN
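The forward-compatibility behaviour of `_missing_` and the immutability of the frozen dataclass can be checked in isolation. The following is a minimal sketch, assuming a trimmed copy of the enum (only three members) and omitting `slots=True` so that it also runs on Python versions older than 3.10; it is not the SDK module itself.

```python
from dataclasses import dataclass, FrozenInstanceError
from enum import Enum


class RequestedRegionReason(Enum):
    """Trimmed stand-in for the proposed enum (assumption: three members only)."""
    INITIAL = "initial"
    HEDGING = "hedging"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Enum calls this hook when no member matches the given value.
        return cls.UNKNOWN


@dataclass(frozen=True)  # slots=True left out so the sketch runs before Python 3.10
class RequestedRegion:
    region_name: str
    reason: RequestedRegionReason


known = RequestedRegionReason("hedging")             # resolves to a real member
future = RequestedRegionReason("some_future_reason")  # falls back through _missing_
entry = RequestedRegion("centralus", RequestedRegionReason.INITIAL)

try:
    entry.region_name = "eastus"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the dataclass is frozen and defines equality, entries compare by value and are hashable, which is what allows assertions such as comparing against a tuple of expected `RequestedRegion` values.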
Three new methods (is_hedging_started(), get_requested_regions(), get_responded_regions()) are added directly to each of the following wrapper / exception types: CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError.
Backing state lives on a private _HedgingDetectionState class (in a new azure/cosmos/_diagnostics.py module) and is shared via a private ._hedging_state attribute on each type. The state object holds a threading.Lock, a list[RequestedRegion], a list[str], and a bool. Methods on each type forward to the shared state.
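One possible shape for the shared state and the forwarding methods is sketched below. Only `_record_request`, `_hedging_state`, and the three public method names come from this issue. Everything else is an assumption for illustration: `_record_response`, the `ExampleWrapper` class, the tuple return types (inferred from the acceptance criteria), and the choice to flip the boolean when a HEDGING entry is recorded.

```python
import threading
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class RequestedRegionReason(Enum):  # trimmed copy for the sketch
    INITIAL = "initial"
    HEDGING = "hedging"


@dataclass(frozen=True)
class RequestedRegion:
    region_name: str
    reason: RequestedRegionReason


class _HedgingDetectionState:
    """Holds a lock, a list of requested regions, a list of responded regions, and a bool."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requested: List[RequestedRegion] = []
        self._responded: List[str] = []
        self._hedging_started = False

    def _record_request(self, region: str, reason: RequestedRegionReason) -> None:
        with self._lock:
            self._requested.append(RequestedRegion(region, reason))
            if reason is RequestedRegionReason.HEDGING:  # assumption: flag set on first hedge
                self._hedging_started = True

    def _record_response(self, region: str) -> None:  # hypothetical helper name
        with self._lock:
            self._responded.append(region)

    def is_hedging_started(self) -> bool:
        with self._lock:
            return self._hedging_started

    def get_requested_regions(self) -> Tuple[RequestedRegion, ...]:
        with self._lock:
            return tuple(self._requested)  # snapshot, so callers cannot mutate state

    def get_responded_regions(self) -> Tuple[str, ...]:
        with self._lock:
            return tuple(self._responded)


class ExampleWrapper(dict):
    """Hypothetical stand-in for a response wrapper such as CosmosDict."""

    def __init__(self, data, hedging_state: _HedgingDetectionState) -> None:
        super().__init__(data)
        self._hedging_state = hedging_state

    def is_hedging_started(self) -> bool:
        return self._hedging_state.is_hedging_started()

    def get_requested_regions(self) -> Tuple[RequestedRegion, ...]:
        return self._hedging_state.get_requested_regions()

    def get_responded_regions(self) -> Tuple[str, ...]:
        return self._hedging_state.get_responded_regions()


# Single-region read: one INITIAL dispatch, one response, no hedging.
state = _HedgingDetectionState()
state._record_request("centralus", RequestedRegionReason.INITIAL)
state._record_response("centralus")
response = ExampleWrapper({"id": "1"}, state)
```

Returning tuples rather than the internal lists keeps the public view read-only while the dispatch path continues to append under the lock.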
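Critical: closure-argument passing (NOT request_params)
Diagnostics state flows through execute_with_hedging as a separate closure argument, not on request_params. Rationale: copy.deepcopy(request_params) at _availability_strategy_handler.py:96 would otherwise silently swallow child appends, because each hedge arm would append to its own copy rather than to the object the caller holds (SE-002; an explicit deepcopy-hazard regression test is required).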
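The hazard can be reproduced without any SDK code. This is a minimal sketch in which `RequestParams` and both dispatch functions are hypothetical stand-ins; only the use of `copy.deepcopy` on the request parameters mirrors the description above.

```python
import copy


class RequestParams:
    """Hypothetical stand-in for the SDK's request parameters object."""

    def __init__(self) -> None:
        self.recorded = []  # state attached to the params object (the hazardous design)


def dispatch_with_state_on_params(request_params: RequestParams) -> None:
    child = copy.deepcopy(request_params)  # the hedge arm works on a deep copy
    child.recorded.append("eastus")        # append lands on the copy's list only


def dispatch_with_closure_argument(request_params: RequestParams, recorded: list) -> None:
    child = copy.deepcopy(request_params)  # params are still copied
    del child                              # unused here; kept to mirror the dispatch path
    recorded.append("eastus")              # same list object the caller holds


params = RequestParams()
dispatch_with_state_on_params(params)
lost = list(params.recorded)   # the caller never sees the append

shared = []
dispatch_with_closure_argument(params, shared)
kept = list(shared)            # the append survives the deepcopy
```

A regression test along these lines would fail loudly if the state were ever moved back onto the copied object.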
Critical: sync↔async parity
Every code path added in azure/cosmos/ has a matching change in azure/cosmos/aio/. A CI script enforces parity: every sync test has an async twin file with the same name plus an _async suffix. Sync/async drift is the #1 historical Python Cosmos bug pattern (SE-004).
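The twin-file rule is simple enough to sketch as a standalone check. The function name, the `test_*.py` glob, and the example file names below are assumptions for illustration; the actual CI script is not shown in this issue and may work differently.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_async_twins(tests_dir: Path) -> List[str]:
    """Return sync test files that have no '<name>_async.py' twin in the same directory."""
    missing = []
    for sync_file in sorted(tests_dir.glob("test_*.py")):
        if sync_file.stem.endswith("_async"):
            continue  # async twins are not themselves required to have twins
        twin = sync_file.with_name(sync_file.stem + "_async.py")
        if not twin.exists():
            missing.append(sync_file.name)
    return missing


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    tests = Path(tmp)
    for name in ("test_example.py", "test_example_async.py", "test_orphan.py"):
        (tests / name).touch()
    missing = find_missing_async_twins(tests)
```

In CI, a non-empty result would be turned into a failing exit code.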
HEDGING append fires inside the hedge-arm coroutine body, after the threshold await:
    # Correct: append on actual dispatch, not at task-creation time
    async def hedge_arm(region, threshold, diagnostics):
        await asyncio.sleep(threshold)  # primary may complete first; coroutine never resumes here
        diagnostics._record_request(region, RequestedRegionReason.HEDGING)  # only fires post-delay, post-non-cancellation
        return await issue_request(region)
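To see why the placement of the append matters, the race can be simulated with plain asyncio. This is a rough sketch under stated assumptions: `issue_request` is faked with a sleep, regions and reasons are plain strings, and the `run` driver is hypothetical. It is not the SDK's availability strategy handler.

```python
import asyncio


async def issue_request(region, latency):
    """Fake network call: waits for the given latency, then returns the region name."""
    await asyncio.sleep(latency)
    return region


async def hedge_arm(region, threshold, latency, requested):
    await asyncio.sleep(threshold)         # if cancelled here, nothing below runs
    requested.append((region, "hedging"))  # recorded only on actual dispatch
    return await issue_request(region, latency)


async def run(primary_latency, hedge_latency, threshold):
    requested = [("centralus", "initial")]
    primary = asyncio.create_task(issue_request("centralus", primary_latency))
    hedge = asyncio.create_task(hedge_arm("eastus", threshold, hedge_latency, requested))
    done, pending = await asyncio.wait({primary, hedge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                      # the loser is cancelled
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result(), requested


# Primary answers well under the threshold: the hedge arm is cancelled mid-sleep.
fast = asyncio.run(run(primary_latency=0.01, hedge_latency=0.01, threshold=0.2))
# Primary is slow: the hedge arm wakes, records HEDGING, and wins.
slow = asyncio.run(run(primary_latency=0.5, hedge_latency=0.01, threshold=0.05))
```

Had the append been made when the hedge task was created, the fast case would report a phantom hedging entry even though no second request was ever sent.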
Acceptance criteria (testable)
AC1 … read_item success → response.is_hedging_started() == False; response.get_requested_regions() == (RequestedRegion("centralus", INITIAL),); response.get_responded_regions() == ("centralus",).
AC2 Multi-region client, hedging enabled, primary responds under threshold → is_hedging_started() == False; get_requested_regions() has exactly one INITIAL entry; no phantom HEDGING for the cancelled hedge task.
AC3 Multi-region client, hedging enabled, primary slow, hedge arm wins → is_hedging_started() == True; get_requested_regions() has ≥2 entries including (hedge_region, HEDGING); get_responded_regions() has ≥1 entry.
AC4 410 Gone retry on same region → get_requested_regions() includes consecutive entries (region, INITIAL) then (region, OPERATION_RETRY).
AC5 Region failover via _TimeoutFailoverRetryPolicy / _endpoint_discovery_retry_policy → get_requested_regions() includes (originalRegion, INITIAL) then (secondaryRegion, REGION_FAILOVER).
AC6 Unknown reason from a future-version SDK → RequestedRegionReason("future_value") returns RequestedRegionReason.UNKNOWN via _missing_.
AC7 All-regions-down error → CosmosHttpResponseError.get_requested_regions() non-empty; get_responded_regions() may be empty.
AC8 Deepcopy regression: explicit test that walks the dispatch path with copy.deepcopy(request_params) in place and confirms appends still reach the final state (SE-002).
AC9 Sync↔async parity: the test file tests/test_hedging_detection.py has a twin tests/test_hedging_detection_async.py exercising the same scenarios via azure.cosmos.aio.
AC10 Existing CosmosHttpLoggingPolicy log format and client.last_response_headers behavior unchanged.
mypy --strict azure.cosmos (or equivalent existing infrastructure) passes.
AC13 Live multi-region smoke test (≥1): runs against the team's multi-region test account with hedging enabled, injects primary-slow latency, asserts on a wrapper (e.g., CosmosDict) and an exception (e.g., CosmosHttpResponseError) call site: is_hedging_started() == True, get_requested_regions() includes both regions, get_responded_regions() includes the secondary region. Both sync + async twin.
Files in scope
sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics_types.py (dataclass + enum), sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics.py (private _HedgingDetectionState + helpers)
sdk/cosmos/azure-cosmos/azure/cosmos/__init__.py (exports), the wrapper / exception types (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError), _availability_strategy_handler.py:116, aio/_asynchronous_availability_strategy_handler.py:126, _retry_utility.Execute:59, aio/_retry_utility_async.ExecuteAsync:63, __init__.pyi type stubs, sdk/cosmos/azure-cosmos/CHANGELOG.md
tests/test_hedging_detection.py + tests/test_hedging_detection_async.py + tests/test_diagnostics_types.py, live-account multi-region test
sdk/cosmos/azure-cosmos/samples/: usage examples for if response.is_hedging_started(): ...
Out of scope
azure/cosmos/diagnostics.py (_RecordDiagnostics): stays deprecated.
CosmosHttpLoggingPolicy log format.
CosmosDiagnostics class (explicitly rejected at the cross-SDK gate in favor of three methods per type).
Cross-SDK companion issues
Azure/azure-cosmos-dotnet-v3.
Azure/azure-sdk-for-java (sdk/cosmos).
Azure/azure-sdk-for-rust (sequenced after PR #4330).
Notes for the implementer
Full internal spec, landscape research, plan, risk register (side-effects.json), and questions+answers are available from the workflow author (@NaluTripician) on request; they are team-only and not linked here.
Phase 1 review gate completed on 2026-05-14. The chosen Python shape is "three methods per type backed by shared private _HedgingDetectionState", with no new public CosmosDiagnostics class. Earlier closed PR Cosmos Diagnostics #25678 is historical context only; it proposed an Option A shape that the gate rejected.
This issue is being dispatched to the Coding Agent Harness for end-to-end implementation; reviewers may receive a draft PR shortly.