[cosmos] Hedging Detection API — public accessors on response wrappers and exception types #46899

@NaluTripician

Description

Summary

Add a public Hedging Detection API to the Cosmos Python SDK so customers can post-hoc determine whether a successful or failed Cosmos point/feed operation went through cross-region hedging, which regions were dispatched against, and which regions responded.

Python has no first-class CosmosDiagnostics object today: only response wrappers (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged) carrying get_response_headers(), plus opt-in CosmosHttpLoggingPolicy log strings. After cross-SDK gate review, the chosen Python shape is three new accessor methods added directly to each of the seven wrapper / exception types, backed by a shared private _HedgingDetectionState instance. This matches the get_response_headers() precedent on the wrappers and avoids a response-wrapper refactor.

This is part of a cross-SDK feature being implemented in parallel across .NET, Java, and Python (with a spec-only deliverable for Rust).

Public API additions

# azure/cosmos/__init__.py (new exports)
from azure.cosmos._diagnostics_types import RequestedRegion, RequestedRegionReason

# azure/cosmos/_diagnostics_types.py (NEW)
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True, slots=True)
class RequestedRegion:
    region_name: str
    reason: "RequestedRegionReason"

class RequestedRegionReason(Enum):
    INITIAL              = "initial"
    OPERATION_RETRY      = "operation_retry"
    TRANSPORT_RETRY      = "transport_retry"          # reserved — not populated today
    HEDGING              = "hedging"
    REGION_FAILOVER      = "region_failover"
    CIRCUIT_BREAKER_PROBE = "circuit_breaker_probe"   # reserved — not populated in v1
    UNKNOWN              = "unknown"

    @classmethod
    def _missing_(cls, value):                        # forward-compat for reasons added in future versions
        return cls.UNKNOWN
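
The forward-compatibility behaviour of `_missing_` can be checked in isolation. The sketch below is illustrative only: it redeclares a trimmed copy of the proposed enum and dataclass instead of importing them from `azure.cosmos`, and it drops `slots=True` so that it also runs on Python versions older than 3.10.

```python
from dataclasses import dataclass
from enum import Enum


class RequestedRegionReason(Enum):
    # Trimmed copy of the proposed enum, for illustration only.
    INITIAL = "initial"
    HEDGING = "hedging"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Any value this SDK version does not recognise maps to UNKNOWN
        # instead of raising ValueError.
        return cls.UNKNOWN


@dataclass(frozen=True)  # slots=True omitted here; see the note above
class RequestedRegion:
    region_name: str
    reason: RequestedRegionReason


known = RequestedRegionReason("hedging")        # normal value lookup
future = RequestedRegionReason("future_value")  # falls back via _missing_
entry = RequestedRegion("centralus", RequestedRegionReason.INITIAL)
```

Because the dataclass is frozen, entries compare by value and cannot be mutated after construction, which is what lets the accessors return them in an immutable tuple.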

Three new methods are added directly to each of the seven wrapper / exception types (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError):

def is_hedging_started(self) -> bool: ...
def get_requested_regions(self) -> tuple[RequestedRegion, ...]: ...
def get_responded_regions(self) -> tuple[str, ...]: ...

Backing state lives on a private _HedgingDetectionState class (in a new azure/cosmos/_diagnostics.py module) and is shared via a private ._hedging_state attribute on each type. The state object holds a threading.Lock, a list[RequestedRegion], a list[str], and a bool. Methods on each type forward to the shared state.

Critical: closure-argument passing (NOT request_params)

Diagnostics flows through execute_with_hedging as a separate closure argument, not on request_params. Rationale: copy.deepcopy(request_params) at _availability_strategy_handler.py:96 would otherwise silently swallow child appends (SE-002 — explicit deepcopy hazard regression test required).
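
The hazard can be reproduced in a few lines. In this sketch `RequestParams` and both dispatch functions are hypothetical stand-ins, not the SDK's real types. The only behaviour it relies on is that `copy.deepcopy` duplicates nested lists, so appends made on the copy never reach the original.

```python
import copy


class RequestParams:
    """Hypothetical stand-in for the SDK's request_params object."""

    def __init__(self):
        self.requested_regions = []  # the hazard: diagnostics stored on the params


def dispatch_on_params(request_params, region):
    # Each arm works on a deep copy, so this append lands on the copy
    # and is lost when the copy goes out of scope.
    child = copy.deepcopy(request_params)
    child.requested_regions.append(region)


def dispatch_with_closure_arg(request_params, region, diagnostics):
    # The diagnostics list travels as its own argument and is never copied.
    child = copy.deepcopy(request_params)
    diagnostics.append(region)
    return child


params = RequestParams()
dispatch_on_params(params, "eastus")
lost = list(params.requested_regions)   # append was swallowed by the deepcopy

diagnostics = []
dispatch_with_closure_arg(params, "eastus", diagnostics)
kept = list(diagnostics)                # append survived
```

A regression test along these lines would fail if the state were ever moved back onto the copied object.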

Critical: sync↔async parity

Every code path added in azure/cosmos/ has a matching change in azure/cosmos/aio/. A CI script enforces parity (every sync test file has an async twin with the same name plus an _async suffix). Sync/async drift is the #1 historical Python Cosmos bug pattern (SE-004).
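
One possible shape for such a parity check is sketched below. The function name and the `test_*.py` glob are assumptions for illustration. The actual CI script is not shown in this issue.

```python
from pathlib import Path
from typing import List


def missing_async_twins(tests_dir: Path) -> List[str]:
    """Return names of sync test files that lack a `<name>_async.py` twin."""
    missing = []
    for sync_file in sorted(tests_dir.glob("test_*.py")):
        if sync_file.stem.endswith("_async"):
            continue  # this file is itself an async twin
        twin = sync_file.with_name(sync_file.stem + "_async.py")
        if not twin.exists():
            missing.append(sync_file.name)
    return missing
```

A CI step could call this on the tests directory and fail the build when the returned list is non-empty.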

The HEDGING append fires inside the hedge-arm coroutine body, after the threshold await:

import asyncio

# Correct: append on actual dispatch, not at task-creation time
async def hedge_arm(region, threshold, diagnostics):
    # If the primary completes during this delay, the task is cancelled and never resumes past the await.
    await asyncio.sleep(threshold)
    # Reached only after the delay has elapsed without cancellation.
    diagnostics._record_request(region, RequestedRegionReason.HEDGING)
    return await issue_request(region)
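
The effect of this ordering can be demonstrated with a small, self-contained simulation. Everything here is a stand-in: `issue_request` just sleeps, region names and latencies are made up, and a plain list replaces the diagnostics state. It is a sketch of the pattern, not the SDK's dispatch code.

```python
import asyncio


async def issue_request(region, latency):
    # Hypothetical stand-in for the real network call.
    await asyncio.sleep(latency)
    return region


async def primary_arm(region, latency, requested):
    requested.append((region, "initial"))
    return await issue_request(region, latency)


async def hedge_arm(region, threshold, latency, requested):
    await asyncio.sleep(threshold)
    # Runs only if the task was not cancelled during the delay.
    requested.append((region, "hedging"))
    return await issue_request(region, latency)


async def run(primary_latency, threshold=0.05):
    requested = []
    tasks = [
        asyncio.create_task(primary_arm("centralus", primary_latency, requested)),
        asyncio.create_task(hedge_arm("eastus", threshold, 0.0, requested)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the losing arm is cancelled
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(iter(done)).result()
    return winner, requested


# Primary answers before the threshold: no phantom HEDGING entry (compare AC2).
fast_winner, fast_requested = asyncio.run(run(primary_latency=0.0))
# Primary is slow: the hedge arm dispatches and wins (compare AC3).
slow_winner, slow_requested = asyncio.run(run(primary_latency=0.5))
```

Had the append been placed before the `await asyncio.sleep(threshold)` line, the fast case would record a hedging entry for a request that was never sent.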

Acceptance criteria (testable)

  • AC1 Single-region client, read_item success → response.is_hedging_started() == False; response.get_requested_regions() == (RequestedRegion("centralus", INITIAL),); response.get_responded_regions() == ("centralus",).
  • AC2 Multi-region client, hedging enabled, primary responds under threshold → is_hedging_started() == False; get_requested_regions() has exactly one INITIAL entry; no phantom HEDGING for the cancelled hedge task.
  • AC3 Multi-region client, hedging enabled, primary slow, hedge arm wins → is_hedging_started() == True; get_requested_regions() has ≥2 entries including (hedge_region, HEDGING); get_responded_regions() has ≥1 entry.
  • AC4 410 Gone retry on same region → get_requested_regions() includes consecutive entries (region, INITIAL) then (region, OPERATION_RETRY).
  • AC5 Region failover via _TimeoutFailoverRetryPolicy / _endpoint_discovery_retry_policy → get_requested_regions() includes (originalRegion, INITIAL) then (secondaryRegion, REGION_FAILOVER).
  • AC6 Unknown reason from a future-version SDK → RequestedRegionReason("future_value") returns RequestedRegionReason.UNKNOWN via _missing_.
  • AC7 All-regions-down error → CosmosHttpResponseError.get_requested_regions() non-empty; get_responded_regions() may be empty.
  • AC8 Deepcopy regression — explicit test that walks the dispatch path with copy.deepcopy(request_params) in place and confirms appends still reach the final state (SE-002).
  • AC9 Sync↔async parity — every test file tests/test_hedging_detection.py has a twin tests/test_hedging_detection_async.py exercising the same scenario via azure.cosmos.aio.
  • AC10 Existing CosmosHttpLoggingPolicy log format and client.last_response_headers behavior unchanged.
  • AC11 Type-stub test — mypy --strict azure.cosmos (or equivalent existing infrastructure) passes.
  • AC12 APIView snapshot regenerated; reviewers consulted.
  • AC13 Live multi-region smoke test (≥1) — runs against the team's multi-region test account with hedging enabled, injects primary-slow latency, asserts on a wrapper (e.g., CosmosDict) and an exception (e.g., CosmosHttpResponseError) call site: is_hedging_started() == True, get_requested_regions() includes both regions, get_responded_regions() includes the secondary region. Both sync + async twin.
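
AC4 asks for consecutive entries while AC5 only asks for ordering, so tests may want two small assertion helpers. The sketch below is a suggestion only: the helper names are hypothetical, and `(region, reason)` string tuples stand in for `RequestedRegion` instances.

```python
from typing import Sequence, Tuple

Entry = Tuple[str, str]  # (region_name, reason value); stand-in for RequestedRegion


def has_consecutive(entries: Sequence[Entry], first: Entry, second: Entry) -> bool:
    """True if `second` immediately follows `first` somewhere in `entries`."""
    return any(a == first and b == second for a, b in zip(entries, entries[1:]))


def has_in_order(entries: Sequence[Entry], first: Entry, second: Entry) -> bool:
    """True if `second` appears anywhere after the first occurrence of `first`."""
    items = list(entries)
    try:
        start = items.index(first)
    except ValueError:
        return False
    return second in items[start + 1:]
```

For example, a same-region retry followed by a failover satisfies the ordering check for the failover entry but not the adjacency check, because the retry entry sits between the two.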

Files in scope

  • New: sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics_types.py (dataclass + enum), sdk/cosmos/azure-cosmos/azure/cosmos/_diagnostics.py (private _HedgingDetectionState + helpers)
  • Modify: sdk/cosmos/azure-cosmos/azure/cosmos/__init__.py (exports), the seven wrapper / exception types (CosmosDict, CosmosList, CosmosItemPaged, CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError, CosmosClientTimeoutError), _availability_strategy_handler.py:116, aio/_asynchronous_availability_strategy_handler.py:126, _retry_utility.Execute:59, aio/_retry_utility_async.ExecuteAsync:63, __init__.pyi type stubs, sdk/cosmos/azure-cosmos/CHANGELOG.md
  • Tests: new tests/test_hedging_detection.py + tests/test_hedging_detection_async.py + tests/test_diagnostics_types.py, live-account multi-region test
  • Samples: sdk/cosmos/azure-cosmos/samples/ — usage examples for if response.is_hedging_started(): ...
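
A sample might look roughly like the following. `describe_hedging` is a hypothetical helper, and the `SimpleNamespace` object only imitates the three proposed accessors so the snippet runs without a Cosmos account. With the real SDK the argument would be a response wrapper or a caught exception.

```python
from types import SimpleNamespace


def describe_hedging(result) -> str:
    """Summarise hedging for any object exposing the three proposed accessors."""
    requested = ", ".join(
        f"{r.region_name} ({r.reason.value})" for r in result.get_requested_regions()
    )
    responded = ", ".join(result.get_responded_regions()) or "none"
    prefix = "hedged" if result.is_hedging_started() else "not hedged"
    return f"{prefix}; requested: {requested}; responded: {responded}"


# Hypothetical stand-in for a CosmosDict returned by read_item.
fake_response = SimpleNamespace(
    is_hedging_started=lambda: True,
    get_requested_regions=lambda: (
        SimpleNamespace(region_name="centralus", reason=SimpleNamespace(value="initial")),
        SimpleNamespace(region_name="eastus", reason=SimpleNamespace(value="hedging")),
    ),
    get_responded_regions=lambda: ("eastus",),
)

summary = describe_hedging(fake_response)
```

Since the exception types are proposed to carry the same three methods, the same helper could be called on the caught exception inside an `except CosmosHttpResponseError as exc:` block.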

Out of scope

  • Restoring the deprecated azure/cosmos/diagnostics.py (_RecordDiagnostics) — stays deprecated.
  • Changing CosmosHttpLoggingPolicy log format.
  • Adding a new public CosmosDiagnostics class (explicitly rejected at the cross-SDK gate in favor of three methods per type).
  • Wiring into OpenTelemetry — separate work item.

Cross-SDK companion issues

Notes for the implementer

  • Full internal spec, landscape research, plan, risk register (side-effects.json), and questions+answers are available from the workflow author (@NaluTripician) on request — they are team-only and not linked here.
  • Phase 1 review gate completed on 2026-05-14. The chosen Python shape is "three methods per type backed by shared private _HedgingDetectionState" — no new public CosmosDiagnostics class. Earlier closed PR Cosmos Diagnostics #25678 is historical context only; it proposed an Option A shape that the gate rejected.
  • This issue is being dispatched to the Coding Agent Harness for end-to-end implementation; reviewers may receive a draft PR shortly.
