Sep 23, 2009

About code generation

As I've written in my previous post, I decided not to pursue IL code generation in Irony for the initial release. The reason is that to do it properly I need to build more infrastructure than I originally thought.

You might well ask - can't we generate the IL code directly from the AST? The Irony parser produces an AST, so can we just go one step further and generate IL by walking the tree and invoking some "GenerateIL" method on each node? It is possible, and it might seem like a viable solution. This is the approach the DLR team chose. With the DLR, you build a parser for your language that translates the source code into an AST consisting of DLR-provided node classes. The DLR then compiles this tree directly into IL instructions. It sounds simple and straightforward, and that's what I originally planned to do as well. However, this approach is a shortcut of the process described in compiler textbooks and papers (including the Dragon book): 
  1. The parser produces an AST.
  2. The AST is then translated into an intermediate code representation - not MSIL, don't be confused; it is a custom internal representation: linear sequences of virtual commands enriched with extra custom meta-data.
  3. The intermediate code is analyzed, optimized and restructured for execution on a VM or a physical processor.
  4. The machine instructions (or IL in our case) are generated from the intermediate representation.
One might think that steps 2 and 3 are unnecessary, or needed only for really advanced compilers. But this middle phase is there for a reason. There is a huge structural mismatch between the AST (a tree) and executable code (a plain linear sequence of primitive commands), and it is quite difficult to cross this gap in one step. The structure of the AST reflects the logical structure of a program using the high-level abstractions of the programming language. IL code, on the other hand, reflects how the program is actually executed - linearly, as a stream of instructions.
Basically, when moving to executable code, we have to "linearize" the high-level tree-like structures. It is better to do this through an intermediate linear pseudo-code that preserves the important meta-data from the AST. We can then perform analysis and transformations on this intermediate representation. Note that most of the important code analysis algorithms work with linear code representations, not with AST trees. They cannot be applied at the IL/assembly level either, because that code is too primitive and does not carry enough of the meta-data that the algorithms use and produce. All those "lattices" and propagation algorithms of code analysis are defined over linear sequences of pseudo-instructions.
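To make the "linearization" idea more concrete, here is a minimal sketch in C#. All the names are made up for the example - this is not Irony code - but it shows how a high-level "if" node might flatten into a list of pseudo-instructions with explicit labels and jumps, leaving room for per-instruction meta-data:

    using System.Collections.Generic;

    // Illustrative only: a pseudo-instruction with a slot for analysis meta-data.
    enum OpCode { LoadVar, LoadConst, BranchIfFalse, Jump, Label, Assign }

    class Instruction {
        public OpCode Op;
        public object Operand;   // variable name, constant value, or label id
        public object Meta;      // placeholder for liveness/type info added by analysis passes
    }

    abstract class AstNode {
        // Each node appends its own pseudo-instructions to the flat list.
        public abstract void Linearize(List<Instruction> code);
    }

    class VarNode : AstNode {
        public string Name;
        public override void Linearize(List<Instruction> code) {
            code.Add(new Instruction { Op = OpCode.LoadVar, Operand = Name });
        }
    }

    class IfNode : AstNode {
        public AstNode Condition, ThenBranch;
        static int _nextLabel;
        public override void Linearize(List<Instruction> code) {
            int endLabel = _nextLabel++;
            Condition.Linearize(code);                      // evaluate the condition
            code.Add(new Instruction { Op = OpCode.BranchIfFalse, Operand = endLabel });
            ThenBranch.Linearize(code);                     // the "then" body
            code.Add(new Instruction { Op = OpCode.Label, Operand = endLabel });
        }
    }

The result is exactly the kind of flat instruction stream that analysis passes can walk, and that later maps almost one-to-one onto IL.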
Why do we need code analysis at all, you may ask? For several reasons:
  • First of all, certain code analysis is unavoidable in any case. Even if we try to generate IL directly from the AST, before we start producing IL we need to perform several passes over the tree doing things like variable allocation, binding, and aggregating meta-information - and this is in fact code analysis. Most of these operations are actually easier to do over linear intermediate code than over a tree.
  • Secondly, if we ever hope to optimize the generated code, we need code analysis, and we need the linear code representation that most analysis/optimization algorithms expect.
  • Finally, it would be nice to have some static code analysis to detect things like the use of uninitialized local variables, values that are computed but never used, etc. This is done by well-known code analysis algorithms which, again, expect a linear intermediate representation (a tiny example of such a check follows this list).
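As a small illustration of the last point, here is what a "use before assignment" check can look like over straight-line pseudo-instructions (no branches). The types and opcode strings are made up for the example; the point is that the check is a trivial single pass over a linear list, which is much more awkward to express over a tree:

    using System.Collections.Generic;

    class Instr { public string Op; public string Var; }   // toy pseudo-instruction

    static class UseBeforeAssignCheck {
        // Reports every variable that is loaded before any assignment to it.
        public static IEnumerable<string> Run(IEnumerable<Instr> code) {
            var assigned = new HashSet<string>();
            foreach (Instr i in code) {
                if (i.Op == "LoadVar" && !assigned.Contains(i.Var))
                    yield return i.Var;                     // possible use of an uninitialized variable
                if (i.Op == "Assign")
                    assigned.Add(i.Var);
            }
        }
    }
    // For the sequence: Assign x; LoadVar y; LoadVar x - the check reports "y".

Real analyses handle branches with the usual dataflow machinery (merging facts at join points), but the shape of the problem stays the same.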
In conclusion, there are a lot of valuable results and algorithms in the code analysis arena, and Irony should definitely take advantage of them. However, after doing some research and prototyping, I see that this should be a project of its own, and it will probably be put off to the next release.
What do you think?

Sep 20, 2009

Time to talk about the future

I've been quite busy for the last few weeks, both at work and at home. Progress on the Irony project has been slow lately, but I'm trying to catch up. I want to share some thoughts on future plans for Irony, list some near- and longer-term goals, and invite feedback on these plans.

  1. Final 1.0 Irony release by the end of the year. I think it's time to start wrapping it up. My initial plan was to cover it all in the very first release: compiler front-end, some code analysis and IL code generation. Now I see that it's better to focus on the front-end only - scanner + parser + AST generation, a kind of Lex/Yacc equivalent - and to leave the rest to later versions. No IL code generation in the initial release; sorry, folks, if you were waiting for it. More on code generation in a separate post. What I hope to implement instead is a simple interpreting engine that directly "executes" the AST (a rough sketch of such an engine appears right after this list). The AST node set would be minimal, just enough to support a simple expression language. You can see a sketch of this engine in the source code; for now the major part of it is a dynamic operator evaluator.

  2. Short-term plan - to post an updated version on the downloads page. The latest Irony sources are quite different from (and much better than) the code in the old release, and this has apparently become more and more confusing. As I've noticed, most people download the Release version and thus get quite a wrong impression of the current state of the project. Back in April I posted a big update to the sources with a completely refactored code base, a lot of improvements and new functionality. However, I did not update the download version, because some features supported in the old download (like the interpreter) were not yet supported in the refactored code. Now the current source version has finally caught up with the feature set of the old release. It contains an AST interpreter, although quite a limited one - only for the expression evaluator. The only thing still lacking compared to the old download is the sample Scheme interpreter. I don't think this sample is important enough to hold back the download upgrade, so it's time to create a new, intermediate release - let's call it Alpha2. Expect it in the next 2-3 weeks. I will remove unused and research-stage classes, and brush up and clean up the code in general. The next pre-release drop after Alpha2 should already be a BETA, with the complete feature set for the final Irony 1.0 release.
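To make item 1 a bit more concrete, here is a very rough sketch of what "directly executing the AST" means. The node classes and method names below are illustrative only - they are not the actual Irony interfaces, and the real engine dispatches operators dynamically on operand types instead of assuming ints:

    using System;
    using System.Collections.Generic;

    abstract class ExprNode {
        public abstract object Evaluate(Dictionary<string, object> env);
    }

    class ConstantNode : ExprNode {
        public object Value;
        public override object Evaluate(Dictionary<string, object> env) { return Value; }
    }

    class VariableNode : ExprNode {
        public string Name;
        public override object Evaluate(Dictionary<string, object> env) { return env[Name]; }
    }

    class BinaryOpNode : ExprNode {
        public string Op;                 // "+", "-", "*"
        public ExprNode Left, Right;
        public override object Evaluate(Dictionary<string, object> env) {
            int l = (int)Left.Evaluate(env), r = (int)Right.Evaluate(env);   // ints only, to keep the sketch short
            switch (Op) {
                case "+": return l + r;
                case "-": return l - r;
                case "*": return l * r;
                default: throw new NotSupportedException("operator " + Op);
            }
        }
    }

    // Evaluating "(x + 3) * 4" with x = 2 yields 20:
    //   var expr = new BinaryOpNode { Op = "*",
    //       Left = new BinaryOpNode { Op = "+",
    //           Left = new VariableNode { Name = "x" },
    //           Right = new ConstantNode { Value = 3 } },
    //       Right = new ConstantNode { Value = 4 } };
    //   expr.Evaluate(new Dictionary<string, object> { { "x", 2 } });   // -> 20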
Feature set planned for 1.0 release
Here's a list of features I'm planning to implement before the 1.0 release.

Parser/Scanner
The first goal here is to complete Irony's terminal set, covering most of the common (and maybe some unusual) lexical constructs of popular languages like Python, Ruby, C#, Basic, JavaScript and others. I'm looking closely at different languages, trying to identify all the special cases of syntax not yet supported by standard Irony terminals. If you can recommend such constructs - please let me know. I might add sample grammars for more languages and maybe revive the Python and Ruby samples. The purpose is not to have complete grammars for these languages, but rather to play with some fancy lexical/syntactic constructs from them and to check that Irony terminals can support these cases. Here are some of the planned terminals, token types and more general facilities:
  • Date literal in Visual Basic (example: #01/05/2005#). A simple but important enough case. I'm thinking about creating a generic delimited Literal terminal (delimited by a start/end symbol, like # for VB dates) that can support the VB date and maybe other literals in other languages - any suggestions of such literals? A rough sketch of such a terminal follows this list.
  • Support for the exponent symbol (D vs E) that identifies the float number type. For example, in Scheme and Fortran "1.5D01" denotes a double-precision value, as opposed to "1.5E01", which is a single-precision number.
  • Implicit operator/precedence hint - something that Yacc has, and it might be important for some cases.
  • Create a facility for processing documentation strings (Python) or XML comments (C#). Maybe implement a Doc grammar hint? It might be useful for Python doc strings, which are essentially regular string literals but become documentation when they appear in certain places.
  • CompilerDirective terminal for C#, with the ability to analyze boolean expressions over defined symbols in the "#if" directive. This would require the ability to define sub-grammars inside the main grammar to parse and evaluate these expressions.
  • Templated string literal with embedded expressions, like Ruby's "Name: #{customer.Name}" - this also requires the sub-grammar facility.
  • Other exotic terminals: Ruby's HereDoc.
  • I'm also thinking about implementing a Wiki grammar/parser - it might be an interesting case.
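For the delimited-literal idea mentioned in the first bullet, here is a standalone sketch of the matching logic for a VB-style date such as #01/05/2005#. It deliberately avoids the real Irony Terminal API - the names and signature here are placeholders, just to show how small the core of such a terminal is:

    using System;
    using System.Globalization;

    static class DelimitedDateLiteral {
        // Tries to match a literal delimited by startSymbol/endSymbol at 'position'.
        // Returns the parsed date, or null if the input does not match.
        public static DateTime? TryMatch(string source, int position,
                                         char startSymbol, char endSymbol) {
            if (position >= source.Length || source[position] != startSymbol)
                return null;                                   // no opening delimiter here
            int end = source.IndexOf(endSymbol, position + 1);
            if (end < 0)
                return null;                                   // unterminated literal
            string body = source.Substring(position + 1, end - position - 1);
            DateTime value;
            if (DateTime.TryParse(body, CultureInfo.InvariantCulture,
                                  DateTimeStyles.None, out value))
                return value;
            return null;                                       // delimited text, but not a date
        }
    }
    // DelimitedDateLiteral.TryMatch("#01/05/2005#", 0, '#', '#') returns Jan 5, 2005.

A generalized version would take the value parser as a parameter, so the same terminal could handle other start/end-delimited literals.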

Other features

  • Localization support - put all error messages into localizable resources.
  • Symbol table implementation
  • Interpreter for direct interpretation of the AST; no code analysis or IL code generation. Implement a basic runtime and object model for the interpreter - basic generic infrastructure, not necessarily implementations for specific languages. A basic AST node set, maybe only for the expression evaluator.
  • Template processor to support processing of files like Ruby's .rhtml templates.
  • Move to VS 2010 when it's out; support for the DynamicObject facility from .NET 4.0.
  • Finish the VS integration support implementation.
  • Write XML documentation for core classes - this is a huge effort; I will also try to create an introductory quick-start guide.

Asking for feedback

Please let me know what you think about all this. If you have an idea or suggestion about features that you think should be included in the release - please post a comment or shoot me an email. I'm especially looking for suggestions about exotic lexical constructs found in existing languages that are not yet supported by Irony terminals; if you know of such a construct - please let me know.