Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68
Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68WindSoilder wants to merge 38 commits intonushell:mainfrom
Conversation
…lineId into Expression indexer via pipeline_to_expression mapping
…exer::Pipeline variant is removed
|
@WindSoilder Let us wait for a couple of weeks to see if @kubouch or @ysthakur is able and has the time to review this PR and give feed back as they are the most knowledgeable about ideas on how things should work. |
|
In the meantime there is no risk of code rot etc... because we will not land any other PR's here until your PR is resolved... |
|
On April 23 @kubouch mentioned on the core team channel that when he has some free time he will take a look at this PR. So lets wait on his feedback 😄 |
There was a problem hiding this comment.
I first made the inline comments as I was understanding the code before writing this, so this message is the primary feedback. Some of the inline feedback may not be relevant.
I think the PR targets the right issue of having a more programmer-friendly interface to the AST nodes, but I'm not convinced about this approach because it adds a lot of complexity of its own. We have now four "top-level" vectors of Nodes (plus some extra for strings, variables etc.), there are extra hash maps and several new traits. Furthermore, one of the arguments was performance, so I ran the benchmarks and it seems the parser stage can be 2x slower and all stages except lexing take a performance hit with this PR. (The benchmarks seem a bit broken, but you can still run them with cargo export target/benchmarks/ast_nodes -- bench, then target/benchmarks/ast_nodes/benchmarks solo -s 20 --warmup true, requires the cargo-export plugin.)
I checked the code for some offending pattern that we'd like to improve and found this:
fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
let AstNode::Block(block_id) = self.compiler.ast_nodes[node_id.0] else {
panic!(
"Expected block to typecheck, got '{:?}'",
self.compiler.ast_nodes[node_id.0]
);
};
let block = &self.compiler.blocks[block_id.0];
(...)I think this snippet well illustrates the problem: We get a generic NodeId and then assume it's a block (and panic if it's not). What we could do is to refactor it like this:
fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
let block = &self.compiler.get_block(node_id);
(...)where
impl Compiler {
pub fn get_block(&self, node_id: NodeId) -> &Block {
let AstNode::Block(block_id) = self.ast_nodes[node_id.0] else {
panic!(
"Expected block, got '{:?}'",
self.ast_nodes[node_id.0]
);
};
&self.compiler.blocks[block_id.0]
}
}Just using helpers like this would clean up the codebase without making things too complex/slow. If we were concerned about the performance impact of the runtime check (how much it actually impacts would need to be checked, I'd suspect that modern branch predictors would be quite good at predicting such a static check), I can think of two options:
- Use cold_path hint which I just learned about.
- Reinterpret
self.ast_nodes[node_id.0]asBlockIdand assume it will always be called correctly. That's unsafe, but once parsing is complete without errors, we can assume it's been parsed correctly. I'd do this only if we prove significant performance benefits.
There are also some things I noted in the inline comments that I didn't quite understand, like making the return types Option<> in the parser.
This is clearly a lot of work, but I'm unsure if it goes the right direction. We could iterate on some concrete code samples to see if we can design a bit lighter solution, like I did above using the typecheck_block(), it's easier when you have some concrete example to improve.
| compiler | ||
| .string_to_expression | ||
| .get(&self) | ||
| .expect("internal error: name node should have a corresponding expression node") |
There was a problem hiding this comment.
This and similar errors repeat "name node" also for non-name nodes
| spans: Vec<Span>, | ||
| } | ||
|
|
||
| impl<T> NodeSpans<T> { |
There was a problem hiding this comment.
I like grouping nodes and spans like this. If needed, it could have .get(&self, i: usize) -> (&T, Span) as well.
And maybe one more small thing: The .get methods could take i: NodeId instead of i: usize. It reduces the need for node_id.0 in the rest of the code and makes it explicit that it's addressed by NodeId.
| .spans | ||
| .get(node_id.0) | ||
| .expect("internal error: missing span of node") | ||
| /// TODO: no need this. |
There was a problem hiding this comment.
There are some TODO comments which are a bit unclear. Should these be removed?
| /// Get the source contents of a node | ||
| pub fn node_as_str(&self, node_id: NodeId) -> &str { | ||
| std::str::from_utf8(self.get_span_contents(node_id)) | ||
| /// TODO: use generic rather than NodeIndexer |
| Garbage, | ||
| } | ||
|
|
||
| pub trait NodeIdGetter { |
There was a problem hiding this comment.
I have two comments about these getter/pusher traits:
- The traits add a layer of complexity
- It would feel more natural to do
compiler.get_node(node_id)instead ofnode.get_node(&mut compiler).
I think both could be solved by implementing these as methods of the Compiler, for example Compiler::push_string(&mut self, span: Span, string_node: StringNode) -> StringNodeId { ... } or Compiler::get_string_mut(&mut self, id: StringNodeId) -> &'a mut StringNode { ... }. You'd still get the type safety and IMO it feels more natural.
| pub ast_nodes: Vec<AstNode>, | ||
| // different types of nodes. | ||
| pub name_nodes: NodeSpans<NameNode>, | ||
| pub string_nodes: NodeSpans<StringNode>, |
There was a problem hiding this comment.
Adding a separate Vec for these trivial type adds additional indirection (instead of addressing expression_nodes[i] directly, it needs to go string_nodes[expression_nodes[i]]. I'm wondering if we could achieve type safety without having these separate Vecs.
| Xor, | ||
| Or, | ||
|
|
||
| // Assignments |
There was a problem hiding this comment.
Assignments are statements, could be moved there.
| FlagShortGroup, | ||
|
|
||
| // ??? should statement belongs to AstNode? | ||
| Statement(StatementNodeId), |
There was a problem hiding this comment.
Since there is a separate storage for statements (statement_nodes), is this necessary anymore?
| pub struct NodeId(pub usize); | ||
|
|
||
| #[derive(Clone, Copy, PartialEq, Eq, Hash)] | ||
| pub enum NodeIndexer { |
There was a problem hiding this comment.
Just a naming issue: The NodeIndexer becomes the "top-level" node type now, so I'd name it AstNode or just Node. And the former AstNode now becomes more like a "misc" node, could be named GeneralNode or similar.
Also, If I understood it correctly, now there is no single storage anymore for the AST nodes. What previously was compiler.ast_nodes is now broken down to four Vecs.
| pub errors: Vec<SourceError>, | ||
|
|
||
| // cache mapping | ||
| pub name_to_expression: HashMap<NameNodeId, ExpressionNodeId>, |
There was a problem hiding this comment.
I guess these HashMaps are needed to be able to determine which ExpressionNodeId corresponds to each of the expression nodes? Because now when they are different types, they don't know about each other. This adds a lot of complexity, the hash map lookups are also more expensive than just index to a Vec. I'm wondering if we really need those.
As title, this pr is a huge refactor of the codebase, it splits a general
AstNodeto different nodes.We can look into
enum StatementNodeandenum ExpressionNodeto see what they can be.It's a little different to another pr #54 , this pr saves all these nodes into
Compiler, so we have the following fields:AstNode)For Expression Node, I dupliated the storage of
NameNode,StringNodeandVariableNodeto an individual place, because they can be re-used in many places, after that, we can defineStatementNode::Let,StatementNode::Defeasier.After we pushing these node, we also put an Indexer into
Compiler.indexer, we can still get all nodes sequentially from this field.For every type of node, they have a new id type, for example:
NameNode:NameNodeIdStringNode:StringNodeIdVariableNode:VariableNodeIdAnd here we define
NodeIdGetter traitforNodeIdandNodePush traitforNode. So we can push node to compiler and get node information from compiler easily.What can we actheves after this pr:
AstNodeevery time after we get a node, we are more likely to get right node when we get from a dedicated typed id.Something might goes bad after this pr:
display_statemethod, especially inCompiler, because it outputs the indirect NameNode, StringNode, VariableNode, Statement etc as well.I had done tests for files under
testsmanually to update snapshot, but something might still broken