
Revert "fix: always set ReportFullState flag in OpAMP responses (#6831)"#6878

Open
ycombinator wants to merge 2 commits intoelastic:mainfrom
ycombinator:revert-6831-report-full-state

Conversation


@ycombinator ycombinator commented Apr 22, 2026

What is the problem this PR solves?

Since April 17, the daily 10k OpAMP-on-serverless scale test has failed every run (details). The failures correlate with the deployment of PR #6831, which set ServerToAgentFlags_ReportFullState on every ServerToAgent response to every agent.
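For context, the change amounted to unconditionally OR-ing the flag into every response, roughly like the sketch below (a simplified illustration using the open-telemetry/opamp-go protobufs; the helper name is hypothetical and this is not the actual fleet-server code):

```go
package opamp

import "github.com/open-telemetry/opamp-go/protobufs"

// addReportFullState mirrors what #6831 did: force ReportFullState onto
// every outgoing ServerToAgent message, so every agent re-sends its full
// state (config, health, etc.) on every round trip.
func addReportFullState(resp *protobufs.ServerToAgent) {
	resp.Flags |= uint64(protobufs.ServerToAgentFlags_ServerToAgentFlags_ReportFullState)
}
```

At 10k+ connected agents this turns every poll into a full-state report, which is the suspected source of the scale test failures.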

How does this PR solve the problem?

Reverts #6831 to remove the unconditional ReportFullState flag from every OpAMP response.

This is, for now, a temporary revert to confirm that #6831 is the root cause of the scale test failures. Once confirmed, the ReportFullState logic can be reintroduced in a more targeted way (e.g. only on enrollment or on drift detection, not on every message); a sketch of that approach follows.
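A more targeted reintroduction could look something like this sketch (hypothetical helpers, assuming the server tracks the last sequence_num it recorded for each agent, in line with the OpAMP spec's drift detection):

```go
package opamp

import "github.com/open-telemetry/opamp-go/protobufs"

// shouldRequestFullState asks for full state only when it is actually
// needed: the agent is unknown to the server (enrollment, or lost server
// state), or its sequence_num is not the successor of the last one we
// recorded (drift).
func shouldRequestFullState(msg *protobufs.AgentToServer, lastSeq uint64, known bool) bool {
	if !known {
		return true // first contact: no cached state for this agent yet
	}
	return msg.SequenceNum != lastSeq+1 // gap or reset: cached state may be stale
}

// responseFlags sets ReportFullState per agent instead of on every message.
func responseFlags(msg *protobufs.AgentToServer, lastSeq uint64, known bool) uint64 {
	if shouldRequestFullState(msg, lastSeq, known) {
		return uint64(protobufs.ServerToAgentFlags_ServerToAgentFlags_ReportFullState)
	}
	return 0
}
```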

How to test this PR locally

No special local testing is needed; the fix will be validated by the daily 10k OpAMP-on-serverless scale test returning to a passing state after deployment.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have added tests that prove my fix is effective or that my feature works

Related issues

@ycombinator ycombinator requested a review from a team as a code owner April 22, 2026 22:09
@ycombinator ycombinator added the bug Something isn't working label Apr 22, 2026

mergify Bot commented Apr 22, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch, where /d is a digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.


github-actions Bot commented Apr 22, 2026

🔍 Preview links for changed docs


github-actions Bot commented Apr 22, 2026

✅ Vale Linting Results

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@michel-laterman michel-laterman left a comment


As mentioned in our standup, even if we reverted this change and implemented the spec requirement to collect full state on seq_no drift, our scale testing should still verify that a server can gather full status reports from all agents at once, since we need to stay reliable if an event causes all agents to drift at the same time (restore from snapshot, network connectivity loss, etc.).
I would find out why the test is failing and address the root cause.


cmacknz commented Apr 24, 2026

our scale testing should still verify that a server can gather full status reports from all agents at once, since we need to stay reliable if an event causes all agents to drift at the same time (restore from snapshot, network connectivity loss, etc.).

Yes, this was my motivation for doing it this way: converting the edge cases into always-cases so that we didn't have to discover them through incidents or support cases.

I think reverting this to confirm it is the problem is fine, but if it is, you still have to fix it.


Labels

  • backport-9.4
  • bug (Something isn't working)
  • Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
