Skip to content

fix: SortMergeJoin full outer join incorrectly matches rows when filter evaluates to NULL#21660

Merged
mbutrovich merged 5 commits intoapache:mainfrom
mbutrovich:full_smj_filter
Apr 16, 2026
Merged

fix: SortMergeJoin full outer join incorrectly matches rows when filter evaluates to NULL#21660
mbutrovich merged 5 commits intoapache:mainfrom
mbutrovich:full_smj_filter

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich commented Apr 16, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

While enabling SortMergeJoinExec with filters in DataFusion Comet, I hit a correctness bug in DataFusion's SortMergeJoinExec for full outer joins with nullable filter columns. The bug was originally surfaced via the SPARK-43113 reproducer.

When a join filter expression evaluates to NULL (e.g., l.b < (r.b + 1) where r.b is NULL), the full outer join treats the row pair as matched instead of unmatched. Per SQL semantics, NULL in a boolean filter context is false (not satisfied), so the rows should be emitted as separate unmatched rows.

The bug has been present since filtered full outer join support was added to SMJ in #12764 / #13369. It was never caught because:

  1. The join fuzz tests generate filter column data with Int32Array::from_iter_values(), which never produces NULL values.
  2. No existing unit test or sqllogictest exercised a full outer join filter that evaluates to NULL.

What changes are included in this PR?

Root cause: The full outer join code path had a special case that preserved raw NULL values from the filter expression result (pre_mask) instead of converting them to false via prep_null_mask_filter like left/right outer joins do. This caused two problems:

  1. Streamed (left) side: In get_corrected_filter_mask(), NULL entries in filter_mask are treated as "pass through" (for pre-null-joined rows from append_nulls()). But when the filter expression itself returns NULL, those entries also appear as NULL in the mask — and get incorrectly treated as matched. This produced wrong join output (matched rows instead of unmatched).

  2. Buffered (right) side: BooleanArray::value() was called on NULL positions in pre_mask to update FilterState. At NULL positions, the values buffer contains a deterministic but semantically meaningless result (computed from the default zero-storage of NULL inputs). For some rows this value happens to be true, which incorrectly marks unmatched buffered rows as SomePassed and silently drops them from the output.

Fix: Remove the full outer join special case in materializing_stream.rs. All outer join types now uniformly use the null-corrected mask (where NULLfalse via prep_null_mask_filter) for both deferred filtering metadata and FilterState tracking. Semi/anti/mark joins are unaffected — they use BitwiseSortMergeJoinStream which already converts NULLs to false.

Tests:

  • New unit test join_full_null_filter_result reproducing the SPARK-43113 scenario with a nullable right-side column.
  • Modified make_staggered_batches_i32 in join_fuzz.rs to inject ~10% NULL values into the filter column (x), so the fuzz tests exercise NULL filter handling across all join types.

Are these changes tested?

Yes.

  • New unit test (join_full_null_filter_result) directly reproduces the bug.
  • Existing 57 SMJ unit tests all pass.
  • All 41 join fuzz tests pass with the new nullable filter column data, including test_full_join_1k_filtered which compares HashJoinExec vs SortMergeJoinExec and would have caught this bug if the fuzz data had included NULLs.
  • Will run 100 iterations of the fuzz tests overnight to shake out any remaining nondeterministic issues.
  • Testing in Comet CI (all Spark SQL tests) deps: [DO NOT MERGE] Start testing what will be DataFusion 54.0 datafusion-comet#3916

Are there any user-facing changes?

Full outer sort-merge joins with filters involving nullable columns now produce correct results. Previously, rows where the filter evaluated to NULL were incorrectly included as matched; they are now correctly emitted as unmatched (null-joined) rows.

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Apr 16, 2026
@mbutrovich mbutrovich requested a review from comphead April 16, 2026 00:47
@mbutrovich
Copy link
Copy Markdown
Contributor Author

CI was green in Comet for all Spark SQL tests with SMJ with filters enabled: https://github.com/apache/datafusion-comet/actions/runs/24487781374

The only failure was this Miri issue which is now fixed in DataFusion main: #21663

@mbutrovich mbutrovich added this pull request to the merge queue Apr 16, 2026
Merged via the queue into apache:main with commit 2ef0217 Apr 16, 2026
35 checks passed
@mbutrovich mbutrovich deleted the full_smj_filter branch April 16, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants