Add tiebreaker to stats sort-on-measure IT queries#5435
Conversation
| // count=25 is tied across multiple states (AL/ME/TN/WY); accept any tied pair at the | ||
| // cutoff — legacy DSL terms-agg, Calcite enumerable sort, and analytics-engine each | ||
| // break the tie deterministically but with different orderings. | ||
| verifyDataRowsAllowingTiesAt( |
There was a problem hiding this comment.
Why not just modify the query and add another sort column as a tiebreaker? PPL spec does not define tiebreaker.
When the IT was first written, the author likely didn't realize the PPL query produces non-deterministic results when measure values tie. Adding a deterministic tiebreaker (e.g. sort - count(), state) fixes the root cause
There was a problem hiding this comment.
Good call — applied. Rewrote the PR to drop the verifyDataRowsAllowingTiesAt matcher and instead added an explicit tiebreaker (the group-by column) to each of the six queries' sort clauses, restoring the strict verifyDataRows assertions with the deterministic expected rows.
For the queries that had separate pushdown / no-pushdown / non-Calcite branches asserting different tied alternatives, all branches now collapse to a single expected ordering. CalciteStatsCommandIT against the default Calcite-DSL path: 63 / 63 pass.
When a top-K query sorts by a measure column that has ties at the cutoff, PPL spec doesn't define a tiebreaker, so different execution engines (legacy DSL terms-agg, Calcite enumerable sort, analytics-engine hash sort) return different tied rows. The tests were over-specified — they asserted whichever member happened to come from the engine used when they were authored. Address by adding the group-by column(s) as an explicit secondary sort key. The query is now deterministic regardless of engine, and the strict containsInAnyOrder assertion can stay. Six tests in StatsCommandIT touched: - testStatsSortOnMeasure (Q1, Q2) - testStatsSortOnMeasureWithScript (Q1, Q2) - testStatsSpanSortOnMeasure (Q1, Q2 — tiebreaker references the span column via its auto-generated name) - testStatsSpanSortOnMeasureMultiTerms (Q1, Q2) - testStatsSpanSortOnMeasureMultiTermsWithScript (Q1, Q2) - testStatsSortOnMeasureComplex (Q1, Q2) Verified against the default Calcite-DSL path: 63 / 63 pass. Suggested by @penghuo on review of an earlier revision that added a tie-tolerant matcher to MatcherUtils — the tiebreaker approach fixes the root cause and keeps the existing strict assertion semantics. Signed-off-by: Kai Huang <ahkcs@amazon.com>
0243194 to
d02e005
Compare
Summary
Six tests in
StatsCommandITassert a specific top-K cutoff for queries like| stats … by X | sort - count() | head 5. Whencount()has ties at the cutoff, PPL spec doesn't define a tiebreaker, so different execution engines (legacy DSL terms-agg, Calcite enumerable sort, analytics-engine hash sort) return different tied rows. The strictcontainsInAnyOrderassertion only happened to match whichever member the original engine picked.Address by adding the group-by column(s) as an explicit secondary sort key. The query is now deterministic regardless of engine, and the existing strict assertion can stay. Per @penghuo's review feedback on an earlier revision, this fixes the root cause rather than masking it with a tie-tolerant matcher.
Tests changed (in
StatsCommandIT)testStatsSortOnMeasureQ1 + Q2sort - count(), state | head 5/sort count(), state | head 5testStatsSortOnMeasureWithScriptQ1 + Q2sort - count(), new_state | head 5/sort count(), new_state | head 5testStatsSpanSortOnMeasureQ1 + Q2sort - cnt, \span(birthdate,1month)` | head 5/sort cnt, `span(birthdate,1month)` | head 5`testStatsSpanSortOnMeasureMultiTermsQ1 + Q2sort - count(), gender, state | head 5/sort count(), gender, state | head 5testStatsSpanSortOnMeasureMultiTermsWithScriptQ1 + Q2sort - count(), new_gender, new_state | head 5/sort count(), new_gender, new_state | head 5testStatsSortOnMeasureComplexQ1 + Q2sort - c, state | head 5/sort - d, gender, new_state | head 5Where the original test had separate pushdown / no-pushdown / non-Calcite branches asserting different tied alternatives, all branches collapse into a single expected ordering once the tiebreaker is added.
Verification
CalciteStatsCommandIT(default Calcite-DSL path) against the local cluster:Test plan
./gradlew :integ-test:compileTestJavaclean.CalciteStatsCommandITagainst a default-routed (non-analytics) cluster: 63 / 63 pass.