Skip to content

Extend V2 SQL parser with JOIN/UNION/MINUS for unified query path#5444

Draft
dai-chen wants to merge 3 commits into
opensearch-project:mainfrom
dai-chen:feature/sql-v2-join-union-minus
Draft

Extend V2 SQL parser with JOIN/UNION/MINUS for unified query path#5444
dai-chen wants to merge 3 commits into
opensearch-project:mainfrom
dai-chen:feature/sql-v2-join-union-minus

Conversation

@dai-chen
Copy link
Copy Markdown
Collaborator

@dai-chen dai-chen commented May 14, 2026

Description

Extend the V2 SQL ANTLR grammar and add ExtendedAstBuilder to support JOIN (INNER/LEFT/RIGHT/CROSS), UNION (ALL/DISTINCT), and MINUS in the unified Calcite-based query path. The base AstBuilder continues to throw SyntaxCheckException for these statements, preserving the legacy engine fallback in production.

Notes

  • Subsearch limits defaulted to 0 in unified query context to avoid LogicalSystemLimit noise in logical plans
  • Fixed Alias in CalciteRelNodeVisitor because SQL V2 AstBuilder always wraps select items which breaks UNION/MINUS with explicit select lists

Planned PRs

Related Issues

Part of #5248

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Extend V2 ANTLR grammar with JOIN, UNION (ALL/DISTINCT), and MINUS
rules. Add ExtendedAstBuilder in SqlV2QueryParser to produce AST nodes
for the Calcite-based unified query path. The base AstBuilder throws
SyntaxCheckException to preserve legacy engine fallback.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen self-assigned this May 14, 2026
@dai-chen dai-chen added enhancement New feature or request SQL labels May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 45a0e3b.

PathLineSeverityDescription
api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java150lowPPL_SUBSEARCH_MAXOUT and PPL_JOIN_SUBSEARCH_MAXOUT are changed from SysLimit.DEFAULT to SysLimit.UNLIMITED_SUBSEARCH (value 0 = unlimited). While documented as intentional for the Calcite-based unified query path, removing subsearch limits could allow unbounded result sets that consume excessive memory or CPU in adversarially crafted nested queries. The change is coherent with the PR purpose and explained in comments, but warrants a review of whether callers of this API enforce their own resource bounds.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit 1fc2dac)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

In visitJoinClause, the Join node is created but never attached to the left relation. The method returns a standalone Join node with no left child. The caller in AstBuilder.visitFromClause attempts ((Join) joinPlan).attach(result), but Join.attach() is not overridden, so it uses the default implementation which may not correctly set the left child. This will likely cause the join to have a missing left operand when analyzed, breaking query execution.

public UnresolvedPlan visitJoinClause(JoinClauseContext ctx) {
  JoinType joinType = toJoinType(ctx);
  UnresolvedPlan right = visit(ctx.relation());
  Optional<UnresolvedExpression> condition =
      Optional.ofNullable(ctx.expression()).map(this::visitAstExpression);
  return join(right, joinType, condition);
}
Possible Issue

The new Alias case in expandProjectFields unconditionally adds the alias to expandedFields without checking if it was already added via addedFields.add(fieldName). If the same alias appears multiple times in the select list, it will be added multiple times, causing duplicate columns in the projection. This breaks queries like SELECT a AS x, a AS x FROM t.

case Alias alias -> {
  expandedFields.add(rexVisitor.analyze(alias, context));
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Code Suggestions ✨

Latest suggestions up to 1fc2dac

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Validate union dataset requirements

The constructors don't validate that datasets is non-null and contains at least two
elements, which is required for a valid union operation. This could lead to runtime
errors when the union is processed. Add validation to fail fast with a clear error
message.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [35-43]

 /** SQL: UNION ALL or UNION DISTINCT (no schema unification). */
 public Union(List<UnresolvedPlan> datasets, boolean distinct) {
+  if (datasets == null || datasets.size() < 2) {
+    throw new IllegalArgumentException("Union requires at least two datasets");
+  }
   this(datasets, distinct, false, null);
 }
 
 /** PPL: UNION ALL with maxout limit and schema unification. */
 public Union(List<UnresolvedPlan> datasets, Integer maxout) {
+  if (datasets == null || datasets.size() < 2) {
+    throw new IllegalArgumentException("Union requires at least two datasets");
+  }
   this(datasets, false, true, maxout);
 }
Suggestion importance[1-10]: 7

__

Why: Adding validation to the constructors prevents runtime errors by failing fast with clear error messages. However, the validation logic is already present in CalciteRelNodeVisitor.visitUnion() (line 2943-2946 in the diff), so this is defensive programming rather than fixing a critical bug.

Medium
Validate cross join has no condition

For CROSS JOIN, the grammar allows an optional ON condition, but semantically cross
joins should not have join conditions. If ctx.expression() is non-null for a CROSS
JOIN, this creates an invalid plan. Validate that cross joins don't have conditions.

api/src/main/java/org/opensearch/sql/api/parser/SqlV2QueryParser.java [71-77]

 @Override
 public UnresolvedPlan visitJoinClause(JoinClauseContext ctx) {
   JoinType joinType = toJoinType(ctx);
   UnresolvedPlan right = visit(ctx.relation());
   Optional<UnresolvedExpression> condition =
       Optional.ofNullable(ctx.expression()).map(this::visitAstExpression);
+  
+  if (joinType == JoinType.CROSS && condition.isPresent()) {
+    throw new IllegalArgumentException("CROSS JOIN cannot have an ON condition");
+  }
+  
   return join(right, joinType, condition);
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that CROSS JOIN should not have an ON condition semantically. However, the grammar in OpenSearchSQLParser.g4 (line 127-128) shows that CROSS JOIN is defined as (INNER | CROSS)? JOIN relation (ON expression)?, meaning the ON clause is optional for all join types. The validation would catch invalid SQL, but this is more of a semantic validation improvement rather than a critical bug fix.

Low

Previous suggestions

Suggestions up to commit 45a0e3b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Discarded attach() return value breaks join

The code calls attach() on the Join node but discards its return value, then assigns
the unmodified joinPlan to result. The attach() method returns a new Join instance
with the child attached, so the current code leaves result pointing to a join
without its left child properly connected.

sql/src/main/java/org/opensearch/sql/sql/parser/AstBuilder.java [143-148]

 // Dispatch each joinClause to visitJoinClause (throws in base, overridden in subclass)
 for (var joinCtx : ctx.joinClause()) {
   UnresolvedPlan joinPlan = visit(joinCtx);
-  ((Join) joinPlan).attach(result);
-  result = joinPlan;
+  result = joinPlan.attach(result);
 }
Suggestion importance[1-10]: 10

__

Why: Critical bug. The code calls attach() but ignores its return value, then assigns the unmodified joinPlan to result. Since attach() returns a new instance with the child attached, this leaves the join without its left child properly connected, breaking the query plan.

High
General
Potential duplicate alias field expansion

Adding Alias handling in expandProjectFields may cause duplicate field expansion
when aliases are already processed elsewhere in the projection logic. Verify that
this doesn't conflict with existing alias resolution in visitProject or cause fields
to appear twice in the final projection list.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [539-541]

 case Alias alias -> {
-  expandedFields.add(rexVisitor.analyze(alias, context));
+  // Only add if not already handled by outer projection logic
+  if (!addedFields.contains(alias.getName())) {
+    expandedFields.add(rexVisitor.analyze(alias, context));
+    addedFields.add(alias.getName());
+  }
 }
Suggestion importance[1-10]: 3

__

Why: The concern about duplicate field expansion is valid but speculative. The addedFields set already tracks added fields in the surrounding code, so adding a redundant check may not be necessary. The suggestion would be more valuable with evidence of actual duplication.

Low
Suggestions up to commit ebbee64
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix join attachment and remove unsafe cast

The code casts joinPlan to Join without verification, which will throw
ClassCastException if the visitor returns a different node type. Additionally,
Join.attach() returns a new node but the result is discarded. The returned node
should be assigned to result.

sql/src/main/java/org/opensearch/sql/sql/parser/AstBuilder.java [143-148]

 @Override
 public UnresolvedPlan visitFromClause(FromClauseContext ctx) {
   UnresolvedPlan result = visit(ctx.relation());
 
   // Dispatch each joinClause to visitJoinClause (throws in base, overridden in subclass)
   for (var joinCtx : ctx.joinClause()) {
     UnresolvedPlan joinPlan = visit(joinCtx);
-    ((Join) joinPlan).attach(result);
-    result = joinPlan;
+    result = joinPlan.attach(result);
   }
Suggestion importance[1-10]: 9

__

Why: Critical bug fix. The code discards the return value of attach() and performs an unsafe cast to Join. The attach() method returns a new node that must be captured, and the cast could fail if visitJoinClause returns a different type.

High
General
Prevent duplicate alias field expansion

Adding Alias handling in expandProjectFields may cause duplicate field expansion
when aliases are already processed elsewhere in the projection logic. Verify that
this doesn't create redundant columns or interfere with existing alias resolution in
visitProject.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [539-541]

 case Alias alias -> {
-  expandedFields.add(rexVisitor.analyze(alias, context));
+  // Only expand if not already handled by outer projection
+  if (!addedFields.contains(alias.getName())) {
+    expandedFields.add(rexVisitor.analyze(alias, context));
+    addedFields.add(alias.getName());
+  }
 }
Suggestion importance[1-10]: 5

__

Why: Valid concern about potential duplicate field expansion when handling Alias nodes. However, the suggestion may be overly cautious since the addedFields set is already used for tracking in the surrounding code. The check would add safety but needs verification of actual duplication issues.

Low

Add grammar rules (inSubqueryPredicate, existsSubqueryExpressionAtom)
and wire them through ExtendedAstExpressionBuilder to produce InSubquery
and ExistsSubquery AST nodes. Base AstExpressionBuilder throws
SyntaxCheckException to preserve legacy engine fallback.

Also simplify joinClause by inlining join types (no separate joinType
rule) and make AstBuilder.expressionBuilder overridable via factory
method.

Signed-off-by: Chen Dai <daichen@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 45a0e3b

- Add inSubqueryPredicate and existsSubqueryExpressionAtom grammar rules
- Add ExtendedAstExpressionBuilder for subquery AST construction
- Base AstExpressionBuilder throws SyntaxCheckException for fallback
- Add unifySchema flag to Union: false for SQL (positional), true for
  PPL (schema merge via SchemaUnifier)
- Remove SchemaUnifier from visitMinus (SQL-only, always positional)
- Inline joinType into joinClause, use switch in toJoinType
- Add AstDSL.inSubquery() and existsSubquery()
- Add LegacyFallbackIT for JOIN and IN subquery fallback

Signed-off-by: Chen Dai <daichen@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 1fc2dac

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request SQL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant