Skip to content

Add tiebreaker to stats sort-on-measure IT queries#5435

Open
ahkcs wants to merge 1 commit into
opensearch-project:mainfrom
ahkcs:fix/stats-tie-cutoff-tolerance
Open

Add tiebreaker to stats sort-on-measure IT queries#5435
ahkcs wants to merge 1 commit into
opensearch-project:mainfrom
ahkcs:fix/stats-tie-cutoff-tolerance

Conversation

@ahkcs
Copy link
Copy Markdown
Collaborator

@ahkcs ahkcs commented May 12, 2026

Summary

Six tests in StatsCommandIT assert a specific top-K cutoff for queries like | stats … by X | sort - count() | head 5. When count() has ties at the cutoff, PPL spec doesn't define a tiebreaker, so different execution engines (legacy DSL terms-agg, Calcite enumerable sort, analytics-engine hash sort) return different tied rows. The strict containsInAnyOrder assertion only happened to match whichever member the original engine picked.

Address by adding the group-by column(s) as an explicit secondary sort key. The query is now deterministic regardless of engine, and the existing strict assertion can stay. Per @penghuo's review feedback on an earlier revision, this fixes the root cause rather than masking it with a tie-tolerant matcher.

Tests changed (in StatsCommandIT)

Test Query change
testStatsSortOnMeasure Q1 + Q2 sort - count(), state | head 5 / sort count(), state | head 5
testStatsSortOnMeasureWithScript Q1 + Q2 sort - count(), new_state | head 5 / sort count(), new_state | head 5
testStatsSpanSortOnMeasure Q1 + Q2 sort - cnt, \span(birthdate,1month)` | head 5/sort cnt, `span(birthdate,1month)` | head 5`
testStatsSpanSortOnMeasureMultiTerms Q1 + Q2 sort - count(), gender, state | head 5 / sort count(), gender, state | head 5
testStatsSpanSortOnMeasureMultiTermsWithScript Q1 + Q2 sort - count(), new_gender, new_state | head 5 / sort count(), new_gender, new_state | head 5
testStatsSortOnMeasureComplex Q1 + Q2 sort - c, state | head 5 / sort - d, gender, new_state | head 5

Where the original test had separate pushdown / no-pushdown / non-Calcite branches asserting different tied alternatives, all branches collapse into a single expected ordering once the tiebreaker is added.

Verification

CalciteStatsCommandIT (default Calcite-DSL path) against the local cluster:

./gradlew :integ-test:integTestRemote \
  -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9300 -Dtests.clustername=runTask \
  --tests "org.opensearch.sql.calcite.remote.CalciteStatsCommandIT"
→ 63 / 63 pass

Test plan

  • ./gradlew :integ-test:compileTestJava clean.
  • Spotless clean.
  • CalciteStatsCommandIT against a default-routed (non-analytics) cluster: 63 / 63 pass.
  • CI on this PR.

// count=25 is tied across multiple states (AL/ME/TN/WY); accept any tied pair at the
// cutoff — legacy DSL terms-agg, Calcite enumerable sort, and analytics-engine each
// break the tie deterministically but with different orderings.
verifyDataRowsAllowingTiesAt(
Copy link
Copy Markdown
Collaborator

@penghuo penghuo May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just modify the query and add another sort column as a tiebreaker? PPL spec does not define tiebreaker.

When the IT was first written, the author likely didn't realize the PPL query produces non-deterministic results when measure values tie. Adding a deterministic tiebreaker (e.g. sort - count(), state) fixes the root cause

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — applied. Rewrote the PR to drop the verifyDataRowsAllowingTiesAt matcher and instead added an explicit tiebreaker (the group-by column) to each of the six queries' sort clauses, restoring the strict verifyDataRows assertions with the deterministic expected rows.

For the queries that had separate pushdown / no-pushdown / non-Calcite branches asserting different tied alternatives, all branches now collapse to a single expected ordering. CalciteStatsCommandIT against the default Calcite-DSL path: 63 / 63 pass.

When a top-K query sorts by a measure column that has ties at the
cutoff, PPL spec doesn't define a tiebreaker, so different execution
engines (legacy DSL terms-agg, Calcite enumerable sort, analytics-engine
hash sort) return different tied rows. The tests were over-specified —
they asserted whichever member happened to come from the engine used
when they were authored.

Address by adding the group-by column(s) as an explicit secondary sort
key. The query is now deterministic regardless of engine, and the
strict containsInAnyOrder assertion can stay.

Six tests in StatsCommandIT touched:
- testStatsSortOnMeasure                          (Q1, Q2)
- testStatsSortOnMeasureWithScript                (Q1, Q2)
- testStatsSpanSortOnMeasure                      (Q1, Q2 — tiebreaker
  references the span column via its auto-generated name)
- testStatsSpanSortOnMeasureMultiTerms            (Q1, Q2)
- testStatsSpanSortOnMeasureMultiTermsWithScript  (Q1, Q2)
- testStatsSortOnMeasureComplex                   (Q1, Q2)

Verified against the default Calcite-DSL path: 63 / 63 pass.

Suggested by @penghuo on review of an earlier revision that added a
tie-tolerant matcher to MatcherUtils — the tiebreaker approach fixes
the root cause and keeps the existing strict assertion semantics.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the fix/stats-tie-cutoff-tolerance branch from 0243194 to d02e005 Compare May 13, 2026 17:10
@ahkcs ahkcs changed the title Tolerate ties at top-K cutoff in stats sort-on-measure ITs Add tiebreaker to stats sort-on-measure IT queries May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants