Skip to content

feat: add Firebird SQL dialect support#362

Closed
simeonbodurov wants to merge 6 commits into
DerekStride:mainfrom
simeonbodurov:main
Closed

feat: add Firebird SQL dialect support#362
simeonbodurov wants to merge 6 commits into
DerekStride:mainfrom
simeonbodurov:main

Conversation

@simeonbodurov

Copy link
Copy Markdown

Summary

  • program rule now accepts ^ as an alternate statement terminator and handles Firebird's caret-mode procedures/triggers as self-terminating top-level tokens
  • New regex terminals: set_term, declare_external_function, fb_proc_or_trigger, fb_var_declaration — absorb Firebird-specific constructs atomically so complex procedure bodies
    (including triple-quote strings and EXECUTE STATEMENT) cannot corrupt the GLR parse state
  • Removed ^ as a binary operator (binary_exp) since Firebird uses it exclusively as a statement terminator
  • New keywords: term, domain, exception, generator, active, inactive, position, suspend, exit, variable, entry_point, module_name
  • New DDL rules: create_domain, create_exception, create_generator
  • create_trigger extended to support Firebird's FOR tablename ACTIVE BEFORE/AFTER … POSITION n AS syntax
  • create_procedure / create-function: added RETURNS (params) and fb_var_declaration repeat before BEGIN

Test plan

  • All 518 existing corpus tests pass (tree-sitter test)
  • New test/corpus/firebird.txt covers set_term, create_domain, create_exception, create_generator, and Firebird trigger syntax
  • test_speedy.py integration test parses a 56,000-line real-world Firebird 5 schema and finds all 218 create_table nodes (214 regular + 4 GLOBAL TEMPORARY TABLE)

Adds first-class support for the Firebird SQL dialect so the grammar can
parse real-world Firebird database schemas without losing CREATE TABLE nodes
to parser error recovery.

Key changes:
- program rule: accept '^' as an alternate statement terminator; add
  set_term, declare_external_function, and fb_proc_or_trigger as
  self-terminating top-level choices (Firebird caret mode)
- set.js: add fb_proc_or_trigger regex terminal that atomically absorbs
  CREATE PROCEDURE/TRIGGER/FUNCTION...END^ blocks, preventing complex
  Firebird bodies (triple-quote strings, EXECUTE STATEMENT) from
  corrupting the GLR parse state
- set.js: add set_term, declare_external_function, fb_var_declaration
  regex terminals
- expressions.js: remove '^' binary operator (Firebird uses ^ as
  statement terminator, not as power/XOR)
- keywords.js: add 12 Firebird keywords (term, domain, exception,
  generator, active, inactive, position, suspend, exit, variable,
  entry_point, module_name)
- create.js: add create_domain, create_exception, create_generator;
  extend create_trigger to support Firebird FOR tablename syntax
- create-procedure.js / create-function.js: add RETURNS params and
  fb_var_declaration repeat before BEGIN
- test/corpus/firebird.txt: corpus tests for all new Firebird nodes
- test_speedy.py: integration test — parses Speedy.sql (56k lines,
  Firebird 5) and asserts 218 create_table nodes are found

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@DerekStride DerekStride left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few definite blockers that need to be addressed before I'll do a more in-depth review.

Just my gut reaction here is that there are a lot of differences, and I'm wondering if trying to support this dialect might be more difficult than others, which might make this a challenging thing to land.

Not to say that we won't, but I think we want to clean it up a little bit to make it more isolated.

One thing that I've considered is taking a lot of these function‑body rules and moving them into a separate parser, because these procedures, functions, etc., are quite different from a SELECT statement—and an INSERT statement.

Maybe splitting it at a DML versus DDL boundary might make sense here, and then relying on Tree‑Sitter injections to support this. That way you can load the parsers for the particular SQL dialect that you're using. We could have a generalized DML and then provide injection points for the DDL or function bodies based on a more specific parser.

But that's a long‑term play, and I don't expect that to be done here. It's just something I wanted to call out as a potential way to unblock this if it gets more complicated.

Comment thread grammar/expressions.js Outdated
$._function_return,
),

_tsql_function_body_statement: $ => seq(

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered implementing a separate _fb_function_body_statement & branching further up the call stack?

Comment thread build_and_test.py Outdated
Comment thread test_speedy.py Outdated
Simeon Bodurov and others added 2 commits May 14, 2026 20:21
build_and_test.py and test_speedy.py are local-only helpers that depend
on a specific Speedy.sql file path. Move them to local_test/ and add
that folder to .gitignore so they don't clutter the repository.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address PR review feedback from @DerekStride:

1. Restore ['^', 'binary_exp'] in expressions.js — removing it broke
   other dialects (PostgreSQL, MySQL) that use ^ as XOR/power. The
   fb_proc_or_trigger regex absorbs CREATE...END^ atomically at the
   lexer level, so restoring ^ as a binary operator creates no ambiguity
   with Firebird's SET TERM caret mode. The bare ^ statement terminator
   is removed from the program rule (all Firebird caret terminators are
   already handled by fb_proc_or_trigger and set_term regexes).

2. Add _fb_function_body_statement and _fb_procedure_body_statement —
   Firebird-specific body variants that require at least one DECLARE
   VARIABLE declaration (repeat1) before BEGIN. This cleanly isolates
   Firebird syntax from _tsql_function/procedure_body_statement and
   branches further up the call stack as requested. Bodies without
   DECLARE VARIABLE declarations parse as the T-SQL variant (identical
   syntax, no ambiguity).

All 518 corpus tests pass. Speedy.sql integration test: 218 tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@simeonbodurov

simeonbodurov commented May 14, 2026 via email

Copy link
Copy Markdown
Author

Simeon Bodurov and others added 3 commits May 14, 2026 22:06
…ndency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, FK USING INDEX)

Add comprehensive Firebird 5 SQL dialect support to eliminate ERROR nodes
when parsing Firebird database scripts (tested on a 56k-line production schema):

**Column definitions**
- COMPUTED [BY] (expr) columns without explicit type (prec 5 alternative in
  column_definition) and as optional constraint after explicit type
- CHARACTER SET <charset> and COLLATE <collation> column constraints

**Table constraints**
- USING [ASC|DESC|ASCENDING|DESCENDING] INDEX <name> on PRIMARY KEY,
  UNIQUE, and FOREIGN KEY constraints (_fb_using_index rule)
- CHECK constraints accept any _expression (not just binary_expression),
  enabling BETWEEN, IN, IS NULL, etc.

**CREATE TABLE / CREATE INDEX**
- GLOBAL TEMPORARY tables (optional keyword_global before _temporary)
- ON COMMIT PRESERVE|DELETE ROWS clause in _table_settings
- ASC/DESC direction tokens accepted alongside ASCENDING/DESCENDING in
  create_index

**Self-terminating Firebird tokens** (absorb statement + trailing ;/^)
- fb_grant_revoke: GRANT/REVOKE privilege statements
- fb_update_or_insert: UPDATE OR INSERT (upsert) statements
- fb_connect_statement: CONNECT script-level statement
- fb_proc_or_trigger: CREATE [OR REPLACE|ALTER] PROCEDURE/TRIGGER/FUNCTION
  bodies in caret-terminator mode (absorbs entire body through END^)
- declare_external_function: DECLARE EXTERNAL FUNCTION declarations
- Program rule lists all self-terminating tokens before seq(statement, ;)

**Expressions and parameters**
- :varname Firebird named parameters added to parameter rule
- Trailing comma allowed in list rule (Firebird IN-list quirk)
- SUBSTRING(expr FROM pos FOR len) invocation form (FOR required to
  distinguish from EXTRACT(unit FROM expr))
- DATEADD(quantity unit TO date) invocation form

**ALTER / SET statements**
- ALTER ROLE name SET AUTO ADMIN MAPPING [ON|OFF]
- SET GENERATOR name TO value
- SET AUTO ADMIN MAPPING [ON|OFF]

**Transactions**
- COMMIT WORK / ROLLBACK WORK (keyword_work added alongside keyword_transaction)

**Keywords added**
- keyword_global, keyword_preserve, keyword_rows, keyword_work

**Conflicts**
- [$.term] added to GLR conflict array for DATEADD/alias ambiguity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ion bodies

Previously, when both DOLLAR_QUOTED_STRING and DOLLAR_QUOTED_STRING_END_TAG
were valid at the same parse position (i.e. inside a dollar-quoted function
body where an inline literal is also grammatically possible), the END_TAG
branch ran first and always returned — either true (matching tag) or false
(non-matching tag).  In the false case the DOLLAR_QUOTED_STRING branch was
never reached, so a literal with a different tag, e.g.

    $b$text$b$   inside   as $a$ … $a$

was silently rejected instead of being parsed as a string literal.

Fix: restructure the scanner so DOLLAR_QUOTED_STRING is tried first.

* Add scan_dollar_tag_body() — reads the tag name and closing '$' after the
  caller has already consumed the opening '$'.  This lets callers call
  mark_end() between the two steps.

* DOLLAR_QUOTED_STRING now:
    1. Consumes the opening '$' and calls mark_end() — so tree-sitter resets
       to just after '$' on failure rather than before it.
    2. Uses scan_dollar_tag_body() to read the rest of the tag.
    3. If the tag matches state->start_tag AND END_TAG is also valid, emits
       DOLLAR_QUOTED_STRING_END_TAG directly (the tag is already consumed).
    4. Otherwise scans for the matching closing tag and emits
       DOLLAR_QUOTED_STRING as before.

* A standalone END_TAG path remains for parse states where only END_TAG is
  valid (e.g. immediately after keyword_end in a function body).  It uses the
  same mark_end-after-'$' convention for consistent failure behaviour.

Update the three previously-failing corpus tests in test/corpus/errors.txt
to reflect the actual (correct) parse trees produced by the fixed scanner.
The core behavioural change — $b$text$b$ now parses as (literal) — is
captured in the "Function body with a unbalanced $$" test case.  The other
two tests document the error-recovery output for pathological inputs
(same-name embedded tag, whitespace in tag name).

All 518 corpus tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@simeonbodurov simeonbodurov requested a review from DerekStride May 27, 2026 14:00
@DerekStride

Copy link
Copy Markdown
Owner

Hi Simeon, I have no problems with using agents to contribute, however, I don't think this PR is in a good state yet so you may want to guide it more. There's a lot of changes together making it hard to review. Could you break this up into easier to review chunks? If you're familiar with stacking PRs / using --update-refs it would be a good strategy for breaking this up.

@simeonbodurov

simeonbodurov commented May 29, 2026 via email

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants