You might well ask - can't we generate the IL code directly from AST tree? Irony parser produces AST tree, so can we just go one step forward and generate IL code by iterating the tree and invoking some "GenerateIL" method for each node? It is possible, and might seem like a viable solution. This is the approach that DLR team had chosen. With DLR, you build a parser for your language that translates the source code into AST tree consisting of DLR-provided AST nodes. Then DLR directly compiles the AST tree into IL instructions. Sounds simple and straightforward, and that's what I originally planned to do as well. This approach is a shortcut of what is described in compiler textbooks and papers (including the Dragon book):
- The parser produces an AST tree.
- The AST tree is then translated into intermediate code representation - not MSIL, don't be confused; it is a custom internal representation of linear sequences of some virtual commands enriched with extra custom meta-data information
- The intermediate code is analyzed, optimized, restructed for execution in VM or physical processor.
- The machine instructions (or IL in our case) are generated from the intermediate representation.
Basicallly, when moving to the executable code, we have to "linearize" the high-level tree-like structures. It is better to do it with some intermediate linear pseudo-code that can preserve important meta-data information from the AST tree. We will be able to do some analysis and transformations with this intermediate representation. Note that most of the important code analysis algorithms work with linear code representations, not with AST trees. They cannot be applied to IL/assembly level code, because it is too primitive and does not contain enough meta-data that is used and produced by the algorithms. All these "lattices" and propagation algorithms of code analysis are defined over linear sequences of pseudo-instructions.
Why we need code analysis at all, you may ask. For several reasons:
- First of all, certain code analysis is unavoidable in any case. Even if we try to generate the IL code directly from AST tree, before we start producing IL, we need to perform several passes over the AST tree doing things like variable allocations, bindings, aggregating meta information - and this is in fact code analysis. Most of these operations are actually easier to do over linear intermediate code than over AST tree.
- Secondly, if we ever hope to do any optimization of the generated code, then we need it, as well as code representation expected by most analysis/optimization algorithms.
- Finally, it would be nice to have some static code analysis to detect things like use of uninitialized local variables, computed but never used values, etc. This is done by well-known code analysis algorithms which again expect a linear intermediate representation.
What do you think?