Skip to content

feat: add evaluate_flags() API for single-call flag evaluation#539

Open
dmarticus wants to merge 7 commits intomainfrom
posthog-code/python-evaluate-flags-api
Open

feat: add evaluate_flags() API for single-call flag evaluation#539
dmarticus wants to merge 7 commits intomainfrom
posthog-code/python-evaluate-flags-api

Conversation

@dmarticus
Copy link
Copy Markdown
Contributor

@dmarticus dmarticus commented Apr 27, 2026

Problem

Phase 1 + Phase 2 of the Server SDK Feature Flag Evaluations RFC for posthog-python. Companion to the Node SDK PR (PostHog/posthog-js#3476).

Today every flag check fires its own /flags request, and capture(send_feature_flags=True) silently fires yet another on every captured event. The flag values on a captured event can diverge from the ones the code actually branched on when person/group properties differ between calls. send_feature_flags also attaches every evaluated flag to every event, which bloats properties on high-volume events.

Changes

New API (Phase 1)

posthog.evaluate_flags(distinct_id, ...) returns a FeatureFlagEvaluations snapshot:

flags = posthog.evaluate_flags(distinct_id, person_properties={"plan": "enterprise"})
if flags.is_enabled("new-dashboard"):
    render_new_dashboard()
posthog.capture("page_viewed", distinct_id=distinct_id, flags=flags)

A single /flags request powers both branching and event enrichment. is_enabled() and get_flag() fire $feature_flag_called events (deduped through the existing cache) with the full metadata — $feature_flag_id, $feature_flag_version, $feature_flag_reason, $feature_flag_request_id — so experiment exposure tracking keeps working.

Two layers of scoping

  • Network-level (flag_keys option): scopes the underlying /flags request itself.

    flags = posthog.evaluate_flags(distinct_id, flag_keys=["new-dashboard", "checkout-flow"])
  • Event-level (filter helpers): narrow which flags get attached to a captured event without re-fetching.

    posthog.capture(..., flags=flags.only_accessed())           # only flags the developer checked
    posthog.capture(..., flags=flags.only(["new-dashboard"]))   # specific keys

Deprecation warnings (Phase 2)

The legacy single-flag surface keeps working but now emits DeprecationWarnings pointing at evaluate_flags():

  • feature_enabled()
  • get_feature_flag()
  • get_feature_flag_payload()
  • capture(send_feature_flags=...) (only when truthy)

feature_enabled and get_feature_flag are restructured to call _get_feature_flag_result directly instead of routing through each other, so a single user-level call emits exactly one warning instead of cascading.

Phase 3 (removal in next major) ships separately.

Local evaluation

Transparent. When the poller resolves a flag, the snapshot carries locally_evaluated=True and reason "Evaluated locally", matching what get_feature_flag() emits today.

Backwards compatibility

No breaking changes. All existing call paths return the same values they did before — the only behavior change is the new DeprecationWarning emissions, which can be silenced via Python's standard warnings filter.

Internals

_capture_feature_flag_called was refactored: the dedup + capture portion is extracted into _capture_feature_flag_called_if_needed, which is shared between the single-flag path and the new FeatureFlagEvaluations object. Both paths now dedupe identically.

Response-level errors (errors_while_computing_flags, quota_limited) are propagated into $feature_flag_called events from the snapshot, matching the granularity of the single-flag path.

Tests

posthog/test/test_evaluate_flags.py — 27 tests covering remote evaluation, local evaluation, filtering helpers, capture integration, flag_keys round-trip, empty-distinct_id safety, error-granularity propagation, and deprecation warning emission (with no-cascade verification).

Full suite: 489 passed. ruff format and ruff check clean.


Created with PostHog Code

Introduce posthog.evaluate_flags(distinct_id, ...) returning a
FeatureFlagEvaluations snapshot. Branch on .is_enabled() / .get_flag()
and pass the snapshot to capture() via a new flags option so events
carry the exact values the code branched on, with no extra /flags
request per capture.

Filtering helpers .only_accessed() and .only([keys]) narrow the flag
set attached to events. Pass flag_keys=[...] to evaluate_flags() to
scope the underlying /flags request. Local evaluation is transparent.

Deprecation (Phase 2 of the RFC):
- feature_enabled, get_feature_flag, get_feature_flag_payload, and
  capture(send_feature_flags=...) now emit DeprecationWarning.
- The deprecated surface remains functional and will be removed in
  the next major version.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
`mock` is not in the project's test dependencies on CI.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 27, 2026

posthog-python Compliance Report

Date: 2026-04-28 18:50:54 UTC
Duration: 159908ms

✅ All Tests Passed!

30/30 tests passed


Capture Tests

29/29 tests passed

View Details
Test Status Duration
Format Validation.Event Has Required Fields 514ms
Format Validation.Event Has Uuid 1506ms
Format Validation.Event Has Lib Properties 1506ms
Format Validation.Distinct Id Is String 1506ms
Format Validation.Token Is Present 1506ms
Format Validation.Custom Properties Preserved 1506ms
Format Validation.Event Has Timestamp 1506ms
Retry Behavior.Retries On 503 9517ms
Retry Behavior.Does Not Retry On 400 3504ms
Retry Behavior.Does Not Retry On 401 3507ms
Retry Behavior.Respects Retry After Header 9508ms
Retry Behavior.Implements Backoff 23531ms
Retry Behavior.Retries On 500 7503ms
Retry Behavior.Retries On 502 7510ms
Retry Behavior.Retries On 504 7511ms
Retry Behavior.Max Retries Respected 23526ms
Deduplication.Generates Unique Uuids 1495ms
Deduplication.Preserves Uuid On Retry 7514ms
Deduplication.Preserves Uuid And Timestamp On Retry 14520ms
Deduplication.Preserves Uuid And Timestamp On Batch Retry 7505ms
Deduplication.No Duplicate Events In Batch 1502ms
Deduplication.Different Events Have Different Uuids 1506ms
Compression.Sends Gzip When Enabled 1506ms
Batch Format.Uses Proper Batch Structure 1506ms
Batch Format.Flush With No Events Sends Nothing 1004ms
Batch Format.Multiple Events Batched Together 1505ms
Error Handling.Does Not Retry On 403 3508ms
Error Handling.Does Not Retry On 413 3507ms
Error Handling.Retries On 408 7512ms

Feature_Flags Tests

1/1 tests passed

View Details
Test Status Duration
Request Payload.Request With Person Properties Device Id 513ms

Phase 2 deprecation warnings on `feature_enabled`, `get_feature_flag`,
`get_feature_flag_payload`, and `capture(send_feature_flags=...)` are
moved to a separate PR so this minor ships only the new API. Gives
users one minor to migrate before runtime warnings start.

The deprecated methods are restored to their original implementations
(no longer need to bypass each other to avoid cascading warnings).

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 27, 2026

Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/test/test_evaluate_flags.py
Line: 42-44

Comment:
**Missing `tearDown` leaks Client background threads**

`TestEvaluateFlagsRemote` creates a `Client` in `setUp` but never calls `self.client.shutdown()`. The other two test classes in this file (`TestEvaluateFlagsFiltering`, `TestCaptureWithFlagsSnapshot`) both have correct `tearDown` implementations. Without shutdown, the client's background consumer thread is never joined, which can cause test-suite noise, delayed process exit, or flaky behaviour when mocks from one test bleed into the next.

```suggestion
class TestEvaluateFlagsRemote(unittest.TestCase):
    def setUp(self):
        self.client = Client(FAKE_TEST_API_KEY)

    def tearDown(self):
        self.client.shutdown()
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: posthog/feature_flag_evaluations.py
Line: 69-80

Comment:
**`flag_definitions_loaded_at` is accepted but never supplied**

`_record_access` (line 242) emits `$feature_flag_definitions_loaded_at` only when `self._flag_definitions_loaded_at is not None`, but `Client.evaluate_flags()` never passes this argument to the `FeatureFlagEvaluations` constructor — so the property is always `None` and the event property is never emitted for locally-evaluated flags. Either wire up the value from the poller or remove the dead parameter and the guarded block in `_record_access` until it's ready.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: posthog/test/test_evaluate_flags.py
Line: 42-321

Comment:
**Prefer parameterised tests**

The test suite exercises multiple flag types (boolean, variant, disabled, missing) through per-assertion branches inside single test methods — e.g. `test_is_enabled_returns_correct_values_and_fires_events` and `test_get_flag_returns_variant_or_bool_with_full_metadata`. The project's review standard is to prefer parameterised tests. Using `subTest` (or a library like `ddt`) would give each flag-type scenario its own named, independently-failing case and reduce the assertion density per method.

**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "revert: drop deprecation warnings (Phase..." | Re-trigger Greptile

Comment thread posthog/test/test_evaluate_flags.py
Comment thread posthog/feature_flag_evaluations.py Outdated
Comment thread posthog/test/test_evaluate_flags.py
- Add `tearDown` to `TestEvaluateFlagsRemote` so the Client's background
  consumer thread is joined between tests, matching the pattern in the
  other test classes in this file.
- Remove the dead `flag_definitions_loaded_at` constructor parameter
  from `FeatureFlagEvaluations`. The Python poller doesn't currently
  expose a definitions-loaded timestamp, so the parameter was always
  None and the gated branch in `_record_access` never fired. Trim it
  rather than leaving a confusing no-op; can be re-added with a real
  data source later.
- Convert the two flag-type-variation tests to parameterized form.
  `test_is_enabled` and `test_get_flag_known_keys` now run as
  independently-named cases per flag type, and the missing-key behavior
  is split into its own focused test.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
Copy link
Copy Markdown
Contributor

@dustinbyrne dustinbyrne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have a few questions/considerations

Comment thread posthog/feature_flag_evaluations.py Outdated
Comment on lines +125 to +131
if not self._accessed:
self._host.log_warning(
"FeatureFlagEvaluations.only_accessed() was called before any flags were accessed — "
"attaching all evaluated flags as a fallback. See "
"https://posthog.com/docs/feature-flags/server-sdks for details."
)
return self._clone_with(self._flags)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a legitimate case of nothing being accessed, there's no way to avoid getting all flags. This could be more confusing than getting nothing?

E.g., I'm imaging a case where flags are first retrieved and consumed as needed. Should an early call to capture(flags=flags.only_accessed()) include all flags? I think it's contradictory with the name only_accessed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 95eb1e9. only_accessed() now honors its name and returns an empty snapshot when nothing has been accessed (no fallback, no warning). The previous behavior was a misguided safety net that ended up being more surprising than helpful for exactly the scenario you describe (early capture() before any branching). The Node-side equivalent has the same change queued.

Updated test: test_only_accessed_returns_empty_when_no_flags_accessed.

Comment thread posthog/client.py
)
return to_flags_and_payloads(resp)

def get_flags_decision(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we deprecate the other methods?

Copy link
Copy Markdown
Contributor Author

@dmarticus dmarticus Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reversed course on this — shipped Phase 2 in eda573d alongside Phase 1 in this PR (rather than splitting it into a follow-up).

feature_enabled, get_feature_flag, get_feature_flag_payload, and capture(send_feature_flags=...) now emit DeprecationWarnings pointing at evaluate_flags(). The methods continue to return the same values; users who pin warnings to errors will get a heads-up on first use, and the rest see them surface via pytest / python -W / IDE inspectors.

feature_enabled and get_feature_flag were restructured to call _get_feature_flag_result directly so a single user-level call emits exactly one warning instead of cascading three.

Phase 3 (removal in next major) is separate.

Comment thread posthog/client.py
Comment on lines +2309 to +2315
local_result, fallback_to_server = self._get_all_flags_and_payloads_locally(
distinct_id,
groups=dict(groups),
person_properties=person_properties,
group_properties=group_properties,
flag_keys_to_evaluate=flag_keys,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should pass device_id here, as it may be present in the context (via tracing headers).

I don't think it's necessary to add it as a parameter of evaluate_flags as well, but I'll leave it up to you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — wired up in 95eb1e9. evaluate_flags() now resolves device_id from context (get_context_device_id()) at the top of the method, then forwards it to the get_flags_decision(...) call so it lands in the /flags request body. I went with your suggestion and didn't add it as a method parameter — context-only is cleaner and matches the existing distinct_id resolution pattern.

Comment thread posthog/feature_flag_evaluations.py Outdated
if self._evaluated_at and not (flag and flag.locally_evaluated):
properties["$feature_flag_evaluated_at"] = self._evaluated_at
if flag is None:
properties["$feature_flag_error"] = "flag_missing"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that we're losing granularity of $feature_flag_error values here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — fixed in 95eb1e9. The snapshot now tracks response-level errors (errors_while_computing_flags, quota_limited) at construction and _record_access builds a comma-joined $feature_flag_error matching the single-flag path's granularity. So a missing flag during a quota-limited response now reports quota_limited,flag_missing instead of just flag_missing.

New test: test_errors_while_computing_flags_propagates_to_event covers both the standalone (errors_while_computing_flags) and combined (errors_while_computing_flags,flag_missing) cases.

Not covered yet: TIMEOUT/CONNECTION_ERROR/api_error_NNN. Those manifest as exceptions on the request and currently just log + leave the snapshot empty (so flags would surface as flag_missing). Could fold those into the snapshot in a follow-up if useful, but the most common cases (errors_while_computing and quota_limited) are now covered.

Comment thread .sampo/changesets/evaluate-flags-api.md Outdated
Comment thread posthog/__init__.py Outdated
@marandaneto
Copy link
Copy Markdown
Member

how does auto captured events and errors set flags to those events if you need the instance from evaluate_flags

Comment thread posthog/client.py
@marandaneto
Copy link
Copy Markdown
Member

No breaking changes. Existing feature_enabled, get_feature_flag, get_feature_flag_payload, and capture(send_feature_flags=...) paths all continue to work unchanged. Phase 2 (runtime DeprecationWarnings pointing at evaluate_flags) and Phase 3 (removal in next major) ship in follow-up PRs — this minor is API-additive only so users have a quiet release to migrate before warnings start.

i'd say we should deprecate and add a warning now otherwise theres multiple ways of doing the same thing and users and agents will be very confused
theres an RFC about that already

Resolves a typing import conflict in client.py from main and bundles
the review-feedback changes:

- Fix changeset format per RELEASING.md (`pypi/posthog: minor`).
- `only_accessed()` now returns empty when nothing was accessed, instead
  of falling back to all flags. The fallback contradicted the method
  name and surprised reviewers.
- Propagate response-level errors (`errors_while_computing_flags`,
  `quota_limited`) into `\$feature_flag_called` events so each access
  carries the granular error code(s) the single-flag path emits.
- Resolve `device_id` from context in `evaluate_flags()` and pass it
  through to the `/flags` request. Important for experience-continuity
  flag matching when the device id flows in via tracing headers.
- Make the precedence between `flags` and `send_feature_flags` explicit:
  `flags` always wins, and we log a warning when both are passed.
- Clarify the `flag_keys` doc on the module-level `evaluate_flags`.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
Per reviewer feedback, ship Phase 2 in this PR alongside Phase 1 instead
of splitting it into a follow-up. The deprecated methods continue to
work — they just emit a `DeprecationWarning` pointing at `evaluate_flags()`:

- `feature_enabled()`
- `get_feature_flag()`
- `get_feature_flag_payload()`
- `capture(send_feature_flags=...)` (only when truthy)

`feature_enabled` and `get_feature_flag` are restructured to call
`_get_feature_flag_result` directly instead of routing through each
other, so a single user-level call emits exactly one warning instead of
cascading.

Tests cover each warning's emission and the no-cascade behavior.
Existing tests that use the legacy methods will now generate
DeprecationWarnings but otherwise pass unchanged.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
Per PR review feedback (manoel): manual exception captures should be
able to attach a `FeatureFlagEvaluations` snapshot the same way
`capture()` can, so `\$exception` events carry the same flag context as
the rest of the request's events.

`capture_exception` already accepted `**kwargs` and forwarded select
ones to `capture()` — `flags` was just missing from the forwarded set.

This doesn't yet solve the wider question of how *auto*-captured
exceptions (sys.excepthook, context-block exception handler) attach
flags — that requires a separate mechanism (likely context-stashed
flags) and is a follow-up.

Generated-By: PostHog Code
Task-Id: b8a45b11-b41c-4995-8622-acea525e7703
@dmarticus
Copy link
Copy Markdown
Contributor Author

dmarticus commented Apr 28, 2026

how does auto captured events and errors set flags to those events if you need the instance from evaluate_flags

Good question — the answer has two layers.

Manual error captures: capture_exception(exception, ..., flags=flags) now forwards the snapshot through to the inner capture() call so the $exception event carries the same flag context as your other events. This was a small gap I missed — fixed in db6d3c0:

flags = posthog.evaluate_flags(distinct_id)
try:
    risky_thing()
except Exception as e:
    posthog.capture_exception(e, distinct_id=distinct_id, flags=flags)

The same applies to anywhere the developer is the one calling capture() — they can thread flags= through.

Auto-captured exceptions (the case you're really asking about — sys.excepthook in ExceptionCapture, plus the new_context() block-level exception handler): you're right that there's no surface today for the developer to attach a flag snapshot, because they're not the one calling capture. Two paths I see for a follow-up PR:

  1. Stash flags on the context. new_context() already carries distinct_id, session_id, properties — adding flags would mean any exception captured within the block (block-level handler, or even sys.excepthook if the context is active) would automatically attach the snapshot. Most ergonomic for the request-scoped pattern.

  2. A "default flags provider" callback on the client that auto-evaluates at capture time. More magic, more risk of latency surprises. Probably not the right answer.

Happy to do (1) as a separate PR — felt out of scope for this one since it touches the context module and needs its own design pass. Worth tracking as a follow-up issue?

@dmarticus
Copy link
Copy Markdown
Contributor Author

dmarticus commented Apr 28, 2026

No breaking changes. Existing feature_enabled, get_feature_flag, get_feature_flag_payload, and capture(send_feature_flags=...) paths all continue to work unchanged. Phase 2 (runtime DeprecationWarnings pointing at evaluate_flags) and Phase 3 (removal in next major) ship in follow-up PRs — this minor is API-additive only so users have a quiet release to migrate before warnings start.

i'd say we should deprecate and add a warning now otherwise theres multiple ways of doing the same thing and users and agents will be very confused theres an RFC about that already

You're right — reversed course on this and shipped Phase 2 in eda573d alongside Phase 1 in this PR. The PR description has been updated.

feature_enabled, get_feature_flag, get_feature_flag_payload, and capture(send_feature_flags=...) now emit DeprecationWarnings pointing at evaluate_flags(). They keep working unchanged in this minor — Phase 3 (removal) lands in the next major.

feature_enabled and get_feature_flag were restructured to call _get_feature_flag_result directly so a single user-level call emits exactly one warning instead of cascading three.


The returned `FeatureFlagEvaluations` snapshot exposes `is_enabled()`, `get_flag()`, `get_flag_payload()` for branching and `only_accessed()` / `only([keys])` filter helpers. Pass `flag_keys=[...]` to `evaluate_flags()` to scope the underlying `/flags` request itself.

Deprecates `feature_enabled()`, `get_feature_flag()`, `get_feature_flag_payload()`, and `capture(send_feature_flags=...)`. They continue to work but now emit a `DeprecationWarning` pointing at `evaluate_flags()`. Removal is planned for the next major version.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's deprecate get_feature_flag_result as well. It lived a short life, but better to clean it up now.

@marandaneto
Copy link
Copy Markdown
Member

@dustinbyrne you'll need to coordinate this with the @PostHog/team-docs-wizard as well otherwise the wizard will be lost or instrumenting new apps with deprecated APIs
we should get the docs here https://posthog.com/docs/references/posthog-python?filter=feature-flags and here https://posthog.com/docs/libraries/python updated and most likely all the other code snippets spread around the codebase
can we also write a skill that we can add to https://github.com/PostHog/skills/tree/main/skills and point out in our docs so users can just run that skill in that codebase for the migration?
this is not a blocker for this PR but ideally somehow synced otherwise we'll just get support tickets I guess

@marandaneto
Copy link
Copy Markdown
Member

I think 'Stash flags on the context' makes sense
if we do this on a follow up PR, does that mean that new exceptions wont have flags context as they did before until we address this issue? or they currently dont have anyway so its not an issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants