Skip to content

Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68

Open
WindSoilder wants to merge 38 commits intonushell:mainfrom
WindSoilder:ast_nodes
Open

Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68
WindSoilder wants to merge 38 commits intonushell:mainfrom
WindSoilder:ast_nodes

Conversation

@WindSoilder
Copy link
Copy Markdown
Contributor

@WindSoilder WindSoilder commented Mar 16, 2026

As title, this pr is a huge refactor of the codebase, it splits a general AstNode to different nodes.

- A BlockNode, it contains
- A list of StatementNode or ExpressionNode

We can look into enum StatementNode and enum ExpressionNode to see what they can be.

It's a little different to another pr #54 , this pr saves all these nodes into Compiler, so we have the following fields:

  • ast_nodes (It contains some smaller set of original AstNode)
  • expression_nodes
  • block_nodes
  • pipeline_nodes
  • name_nodes (explained in next paragraph)
  • string_nodes (explained in next paragraph)
  • variable_nodes (explained in next paragraph)

For Expression Node, I dupliated the storage of NameNode, StringNode and VariableNode to an individual place, because they can be re-used in many places, after that, we can define StatementNode::Let, StatementNode::Def easier.

After we pushing these node, we also put an Indexer into Compiler.indexer, we can still get all nodes sequentially from this field.

For every type of node, they have a new id type, for example:

  • NameNode: NameNodeId
  • StringNode: StringNodeId
  • VariableNode: VariableNodeId
    And here we define NodeIdGetter trait for NodeId and NodePush trait for Node. So we can push node to compiler and get node information from compiler easily.

What can we actheves after this pr:

  • A better runtime performance, because we don't need to check AstNode every time after we get a node, we are more likely to get right node when we get from a dedicated typed id.
  • Less possibility of bugs, because we can make use of Rust's type system to make sure we are using the right node in the right place.

Something might goes bad after this pr:

  • Some spaces overhead because we need to store more vectors to store these nodes.
  • More complicated output in display_state method, especially in Compiler, because it outputs the indirect NameNode, StringNode, VariableNode, Statement etc as well.

I had done tests for files under tests manually to update snapshot, but something might still broken

@WindSoilder WindSoilder marked this pull request as ready for review April 14, 2026 05:52
@stormasm
Copy link
Copy Markdown

@WindSoilder Let us wait for a couple of weeks to see if @kubouch or @ysthakur is able and has the time to review this PR and give feed back as they are the most knowledgeable about ideas on how things should work.

@stormasm
Copy link
Copy Markdown

In the meantime there is no risk of code rot etc... because we will not land any other PR's here until your PR is resolved...

@stormasm
Copy link
Copy Markdown

stormasm commented Apr 24, 2026

On April 23 @kubouch mentioned on the core team channel that when he has some free time he will take a look at this PR. So lets wait on his feedback 😄

Copy link
Copy Markdown
Contributor

@kubouch kubouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first made the inline comments as I was understanding the code before writing this, so this message is the primary feedback. Some of the inline feedback may not be relevant.

I think the PR targets the right issue of having a more programmer-friendly interface to the AST nodes, but I'm not convinced about this approach because it adds a lot of complexity of its own. We have now four "top-level" vectors of Nodes (plus some extra for strings, variables etc.), there are extra hash maps and several new traits. Furthermore, one of the arguments was performance, so I ran the benchmarks and it seems the parser stage can be 2x slower and all stages except lexing take a performance hit with this PR. (The benchmarks seem a bit broken, but you can still run them with cargo export target/benchmarks/ast_nodes -- bench, then target/benchmarks/ast_nodes/benchmarks solo -s 20 --warmup true, requires the cargo-export plugin.)

I checked the code for some offending pattern that we'd like to improve and found this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let AstNode::Block(block_id) = self.compiler.ast_nodes[node_id.0] else {
            panic!(
                "Expected block to typecheck, got '{:?}'",
                self.compiler.ast_nodes[node_id.0]
            );
        };
        let block = &self.compiler.blocks[block_id.0];
        (...)

I think this snippet well illustrates the problem: We get a generic NodeId and then assume it's a block (and panic if it's not). What we could do is to refactor it like this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let block = &self.compiler.get_block(node_id);
        (...)

where

impl Compiler {
    pub fn get_block(&self, node_id: NodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0] else {
            panic!(
                "Expected block, got '{:?}'",
                self.ast_nodes[node_id.0]
            );
        };
        &self.compiler.blocks[block_id.0]
    }
}

Just using helpers like this would clean up the codebase without making things too complex/slow. If we were concerned about the performance impact of the runtime check (how much it actually impacts would need to be checked, I'd suspect that modern branch predictors would be quite good at predicting such a static check), I can think of two options:

  1. Use cold_path hint which I just learned about.
  2. Reinterpret self.ast_nodes[node_id.0] as BlockId and assume it will always be called correctly. That's unsafe, but once parsing is complete without errors, we can assume it's been parsed correctly. I'd do this only if we prove significant performance benefits.

There are also some things I noted in the inline comments that I didn't quite understand, like making the return types Option<> in the parser.

This is clearly a lot of work, but I'm unsure if it goes the right direction. We could iterate on some concrete code samples to see if we can design a bit lighter solution, like I did above using the typecheck_block(), it's easier when you have some concrete example to improve.

Comment thread src/ast_nodes.rs
compiler
.string_to_expression
.get(&self)
.expect("internal error: name node should have a corresponding expression node")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and similar errors repeat "name node" also for non-name nodes

Comment thread src/compiler.rs
spans: Vec<Span>,
}

impl<T> NodeSpans<T> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like grouping nodes and spans like this. If needed, it could have .get(&self, i: usize) -> (&T, Span) as well.

And maybe one more small thing: The .get methods could take i: NodeId instead of i: usize. It reduces the need for node_id.0 in the rest of the code and makes it explicit that it's addressed by NodeId.

Comment thread src/compiler.rs
.spans
.get(node_id.0)
.expect("internal error: missing span of node")
/// TODO: no need this.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some TODO comments which are a bit unclear. Should these be removed?

Comment thread src/compiler.rs
/// Get the source contents of a node
pub fn node_as_str(&self, node_id: NodeId) -> &str {
std::str::from_utf8(self.get_span_contents(node_id))
/// TODO: use generic rather than NodeIndexer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is generic? NodeId?

Comment thread src/ast_nodes.rs
Garbage,
}

pub trait NodeIdGetter {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments about these getter/pusher traits:

  • The traits add a layer of complexity
  • It would feel more natural to do compiler.get_node(node_id) instead of node.get_node(&mut compiler).

I think both could be solved by implementing these as methods of the Compiler, for example Compiler::push_string(&mut self, span: Span, string_node: StringNode) -> StringNodeId { ... } or Compiler::get_string_mut(&mut self, id: StringNodeId) -> &'a mut StringNode { ... }. You'd still get the type safety and IMO it feels more natural.

Comment thread src/compiler.rs
pub ast_nodes: Vec<AstNode>,
// different types of nodes.
pub name_nodes: NodeSpans<NameNode>,
pub string_nodes: NodeSpans<StringNode>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a separate Vec for these trivial type adds additional indirection (instead of addressing expression_nodes[i] directly, it needs to go string_nodes[expression_nodes[i]]. I'm wondering if we could achieve type safety without having these separate Vecs.

Comment thread src/ast_nodes.rs
Xor,
Or,

// Assignments
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assignments are statements, could be moved there.

Comment thread src/ast_nodes.rs
FlagShortGroup,

// ??? should statement belongs to AstNode?
Statement(StatementNodeId),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a separate storage for statements (statement_nodes), is this necessary anymore?

Comment thread src/ast_nodes.rs
pub struct NodeId(pub usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub enum NodeIndexer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a naming issue: The NodeIndexer becomes the "top-level" node type now, so I'd name it AstNode or just Node. And the former AstNode now becomes more like a "misc" node, could be named GeneralNode or similar.

Also, If I understood it correctly, now there is no single storage anymore for the AST nodes. What previously was compiler.ast_nodes is now broken down to four Vecs.

Comment thread src/compiler.rs
pub errors: Vec<SourceError>,

// cache mapping
pub name_to_expression: HashMap<NameNodeId, ExpressionNodeId>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these HashMaps are needed to be able to determine which ExpressionNodeId corresponds to each of the expression nodes? Because now when they are different types, they don't know about each other. This adds a lot of complexity, the hash map lookups are also more expensive than just index to a Vec. I'm wondering if we really need those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants