Remove batch_coalescing_below_network_boundaries rule#407
Draft
EdsonPetry wants to merge 2 commits intodatafusion-contrib:mainfrom
Draft
Remove batch_coalescing_below_network_boundaries rule#407EdsonPetry wants to merge 2 commits intodatafusion-contrib:mainfrom
EdsonPetry wants to merge 2 commits intodatafusion-contrib:mainfrom
Conversation
…trib#386) DataFusion 53 deprecated CoalesceBatchesExec. The optimizer rule that inserted it below network boundaries is dropped, along with the distributed.shuffle_batch_size config and its DistributedExt setters, the CDK benchmark flag that fed it, and the doc/example plan snippets that referenced it.
523eccf to
3ea11ef
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Removes the
batch_coalescing_below_network_boundariesoptimizer rule that inserted the DataFusion-53-deprecatedCoalesceBatchesExecbelow every network boundary. Closes #386.Also removes the
distributed.shuffle_batch_sizeconfig field and itsDistributedExtsetters, they existed only to configure this rule.Why
CoalesceBatchesExec. The rule was carrying#[expect(deprecated)]to keep using it.CoalesceBatchesExecfrom plans #386 hypothesized that the rule's data-copy overhead outweighed the wire-efficiency gains. Confirmed empirically, see benchmarks below.shuffle_batch_size=8192≤execution.batch_size=32768-> rule early-returned). Real plans never gotCoalesceBatchesExecinserted in production callers.Benchmark findings
Ran on a 4-worker EC2 cluster in us-east-1. Three configurations exercised:
batch_sizeshuffle_batch_sizeCoalesceBatchesExecin TPC-H plansCoalesceBatchesExecin TPC-DS plansbatch_size=65,536Fair-comparison: matched effective wire batch size
Both sides produce ~65K-row batches over the wire. Main via the rule's post-hoc coalescing (
batch_size=32K, shuffle_batch_size=65K); this PR via native batching (batch_size=65K).batch_size=65K)Top TPC-H wins: q18 −23%, q7 −19%, q17 −18%, q9 −16%, q22 −20%, q5 −13%.
Top TPC-DS wins: q34 −48%, q15 −43%, q32 −40%, q14 −24%, q72 −13% (absolute: −1s on a 7.7s query).
Confirms the hypothesis from #386: native-large-batch path beats coalesce-after-the-fact by ~12% across both suites. The rule was costing more in data copies than it saved in wire efficiency.
Migration note
The only known caller of
distributed.shuffle_batch_sizeis our ownbenchmarks/cdk/bin/datafusion-bench.ts— updated in this PR. External callers wanting larger wire batches should setdatafusion.execution.batch_sizedirectly. As a bonus, they'll gain ~12% throughput.