Oct 14, 2009

Updated Alpha release version, Oct 2009

I finally upgraded the download (Alpha-release) version and brought it in sync with latest sources. I removed some dead and not-used-yet code, did some code cleanup and a bit of refactoring. Aside this, there are a few changes and improvements that I would like to comment on.
The first important change is renaming SymbolTerminal to KeyTerm. I think that the name SymbolTerminal was misleading and a bit confusing, and it became more so as I faced a need for a terminal for scanning elements known as "symbols" in many programming languages; for example, in Ruby it is an identifier preceeded by a colon, like :name. The former SymbolTerminal, on the other hand, was created to recognize the language-defined keywords and special characters (operator, assignment, braces, etc). So the name change makes sense, and "key" prefix in KeyTerm comes from the "keyword". This name change caused obvious name changes to corresponding dictionaries, lists and fields. It also made sense to rename the  Symbol() method in the Grammar class - it is used to explicitly convert the first string in string-only BNF expressions and allow proper method overload to kick in. For example:

BinOp.Rule = Symbol("+") | "-" | "*" | "/";

I changed the name of the method to ToTerm(string), but I kept the old version Symbol() and decorated it with Obsolete attribute. The Symbol() method is heavily used in grammars, and leaving an old method for a while would allow to compile the old grammars without errors. The compiler will only give a warning that the method was deprecated and you should use ToTerm instead. Now the rule for BinOp should look like this:

BinOp.Rule = ToTerm("+") | "-" | "*" | "/";

Please make a change in your existing grammars.
Continuing with this theme, I'm thinking about better ways to handle keywords internally. It looks like this Grammar.CurrentGrammar thread-static property is more trouble than it's worth. The main problem is that keyword terminals can be created in static operator overloads (BnfTerm.Op_Pipe, Op_Plus etc). At the same time the terminal for a particular keyword should be instantiated once in a grammar, and with proper case sensitivity arrangement. Therefore, the grammar (an instance) should have a list/dictionary of keyword terminals which should be accessible from static operator overload methods. Currently it is done through this Grammar.CurrentGrammar property, but I'm wondering if there's a better way. Will see.
Another change is a new Grammar.RunSample virtual method. The purpose of this method is to allow custom implementations of running a sample code in the Grammar Explorer. The base implementation in Grammar class executes the code using Irony's AST interpreter - this works well for the expression evaluator grammar, as an example. The SearchGrammar (Google-to-SQL search syntax converter) overrides this method and provides it's own implementation - it converts the input string into SQL server syntax using its own custom method. Using the RunSample override allowed to get rid of a separate sample project I had before for testing the search grammar - now you can do it directly in the Grammar Explorer.
One more thing: the language flag SupportsInterpreter is renamed to CanRunSample. If this flag is set, the Grammar Explorer enables the Run button and calls the Grammar.RunSample method on click.
That's it for now - more stuff is coming. We are on track for release by the end of the year.
Enjoy it!

Sep 23, 2009

About code generation

As I've written in my previous post, I decided not to pursue IL code generation in Irony for initial release. The reason for this is that to do it properly I need to build more stuff than I thought originally.

You might well ask - can't we generate the IL code directly from AST tree? Irony parser produces AST tree, so can we just go one step forward and generate IL code by iterating the tree and invoking some "GenerateIL" method for each node? It is possible, and might seem like a viable solution. This is the approach that DLR team had chosen. With DLR, you build a parser for your language that translates the source code into AST tree consisting of DLR-provided AST nodes. Then DLR directly compiles the AST tree into IL instructions. Sounds simple and straightforward, and that's what I originally planned to do as well. This approach is a shortcut of what is described in compiler textbooks and papers (including the Dragon book): 
  1. The parser produces an AST tree.
  2. The AST tree is then translated into intermediate code representation - not MSIL, don't be confused; it is a custom internal representation of linear sequences of some virtual commands enriched with extra custom meta-data information
  3. The intermediate code is analyzed, optimized, restructed for execution in VM or physical processor.
  4. The machine instructions (or IL in our case) are generated from the intermediate representation.
One might think that steps 2 and 3 are unnecessary or needed only for really advanced compilers. But this middle phase is there for a reason. There is a huge structural mismatch/gap between AST (a tree), and executable code - a plain linear sequence of primitive commands. It is quite difficult to cross this gap directly in one step. The structure of the AST reflects the logical structure of a program using high-level abstractions of the programming languages. IL code on the other hand reflects how the program is actually executed - linearly, as a stream of instructions.
Basicallly, when moving to the executable code, we have to "linearize" the high-level tree-like structures. It is better to do it with some intermediate linear pseudo-code that can preserve important meta-data information from the AST tree. We will be able to do some analysis and transformations with this intermediate representation. Note that most of the important code analysis algorithms work with linear code representations, not with AST trees. They cannot be applied to IL/assembly level code, because it is too primitive and does not contain enough meta-data that is used and produced by the algorithms. All these "lattices" and propagation algorithms of code analysis are defined over linear sequences of pseudo-instructions.
Why we need code analysis at all, you may ask. For several reasons:
  • First of all, certain code analysis is unavoidable in any case. Even if we try to generate the IL code directly from AST tree, before we start producing IL, we need to perform several passes over the AST tree doing things like variable allocations, bindings, aggregating meta information - and this is in fact code analysis. Most of these operations are actually easier to do over linear intermediate code than over AST tree.
  • Secondly, if we ever hope to do any optimization of the generated code, then we need it, as well as code representation expected by most analysis/optimization algorithms.
  • Finally, it would be nice to have some static code analysis to detect things like use of uninitialized local variables, computed but never used values, etc. This is done by well-known code analysis algorithms which again expect a linear intermediate representation.
In conclusion, there is a lot of valuable results and algorithms in code analysis arena, and Irony should definitely take advantage of all this stuff. However, after doing some research and prototyping I see that this should be a project of its own, and probably put off to the next release.
What do you think?

Sep 20, 2009

Time to talk about the future

Had been quite busy for last few weeks, both at work and at home. The Irony project progress had been slow lately, but I'm trying to catch up. I want to share some thoughts on the future plans for Irony, list some near and longer term goals, and invite some feedback on these plans.

  1. Final 1.0 Irony release by the end of the year. I think it's time to start wrapping it up. My initial plan was to cover it all in the very first release: compiler front-end, some code analysis and IL code generation. Now I see that it's better to focus on front-end only - scanner+parser+AST generation, kind of Lex/Yacc equivalent, and to leave the rest to later versions. No IL code generation in initial release, sorry folks if you were waiting for it. More on code generation in a separate post. What I hope to implement instead is a simple interpreting engine that directly "executes" the AST tree. The AST node set would be minimal, just to support a simple expression language. You can see some sketch of this engine in the source code; for now the major part of it is a dynamic operator evaluator.

  2. Short-term plan - to post the updated version on downloads page. Latest Irony sources are quite different (and much better) than the code in the old release, and this apparently becomes more and more confusing. As I noticed, most people download the Release version, thus getting quite wrong impression of the current state of the project. Back in April I posted a big update to the sources with completely refactored code base, with a lot of improvements and new functionality. However, I did not update download version, for the reason that some features supported in old download version (like interpreter) were not supported yet in refactored code. Now the current source version is finally up to the feature set of the old release. It contains AST interpreter, although quite limited - only for expression evaluator. But the only thing lacking compared to old download is a sample Scheme interpreter. I don't think this sample is important enough to hold back the upgrade of download, so it's time to create a new, intermediate release - let's call it Alpha2. Expect it in the next 2-3 weeks. I will remove unused and research-stage classes, and brush-up and clean-up the code in general. The next pre-release drop, after Alpha2, should be already BETA, with complete feature set for final release of Irony 1.0.
Feature set planned for 1.0 release
Here's a list of features I'm planning to implement before the 1.0 release.

The first goal here is to complete the Irony's terminal set covering most of the common and maybe unusual lexical constructs for popular languages like Python, Ruby, c#, Basic, JavaScript and others. I'm looking closely at different languages trying to identify all the special cases of syntax not supported yet by standard Irony terminals. If you can recommend me such constructs - please let me know. I might add sample grammars for more languages and maybe revive Python and Ruby samples. The purpose of this is not to have complete grammars for these languages but rather to play with some fancy lexical/syntactic constructs from these languages and to check that Irony terminals can support these cases. Some of these terminals, token types and more general facilities:
  • Date literal in Visual Basic (example: #01/05/2005") . A simple but important enough case. I'm thinking about creating a generic delimited Literal terminal (delimited by start/end symbol like # for VB dates) that can support this VB date and maybe some other literals in other languages - any suggestions of such literals?
  • Support for exponent symbol (D vs E) that identifies float number type. For example, in Scheme and Fortran "1.5D01" identifies "double" float value, while opposed to "1.5E01" is a single-precision number.
  • Implicit operator/precedence hint - something that Yacc has, and it might be important for some cases.
  • Create facility for processing documentation strings (Python) or XML comments (c#). Maybe implement Doc grammar hint? - might be useful for python doc strings which are essentially regular string literals but become doc pieces when they appear in certain places.
  • CompilerDirective terminal for c#, with ability to analyze boolean expressions over defined symbols in "#if" directive. This would require ability to define sub-grammars inside main grammar to parse and evaluate these expressions.
  • Templated string literal with embedded expressions, like Ruby's "Name: #{customer.Name}" - this also requires sub-grammar facility.
  • Other exotic terminals: Ruby HereDoc;
  • Thinking also about implementing Wiki grammar/parser - might be an interesting case.

Other features

  • Localization support - put all error messages into localizable resources.
  • Symbol table implementation
  • Interpreter for direct interpretation of AST tree; no code analysis or IL code generation. Implementing basic runtime and object model for interpreter; basic generic infrastructure, not necessarily implementations for specific languages. Basic AST node set, maybe only for expression evaluator.
  • Template processor to support processsing of files like Ruby's .rhtml templates.
  • Moving to VS 2010 when it's out, support for DynamicObject facility from .NET 4.0.
  • Finish VS Integration support implementation.
  • Write Xml documentation for core classes - this is a huge effort; also will try to create some introductory quick-start guide

Asking for feedback

Please let me know what you think about all this. If you have an idea or suggestion about features that you think should be included into release - please post a comment or shoot me an email. Especially looking for suggestions about exotic lexical constructs found in existing languages that are not yet supported by Irony terminals; if you know such construct - please let me know.

May 4, 2009

About Keywords and Reserved words
In my previous post I forgot to describe what happened to keywords and reserved words in new version, and how their treatment had changed. These terms are mentioned in the caption of one of the sections but I forgot to actually describe the new functionality. So here it is.
In the previous versions the Grammar class had a Keywords field which was a string list. The programmer had to explicitly add the language keywords to the list, so that they could be highlighted in the code editor, for ex. It was a bit superflous as each of the keywords was already used in the grammar (and therefore Irony could discover them), so what's the point of listing them again? In addition, there was some confusion about what this list should contain - keywords or reserved words only. Now the issue had been clarified. Irony now maintains a clear distinction between the two terms.
A keyword is any word (symbol starting with a letter) used in language/grammar definition. A reserved word is a keyword that is reserved by the language as special symbol and cannot be used as identifier or anything else. New version of Irony does not hold separate lists for these special words: keywords and reserved words are symbol terminals that are marked by special flags: TermOptions.IsKeyword and Term.Options.IsReservedWord. Any token produced by scanner has its Terminal object attached to it, so tokens containing special words would have SymbolTerminal in Terminal field with appropriate flag set.
Now, how these flags are set. Irony sets the IsKeyword flag automatically for any word (symbol starting with a letter) that it finds in grammar rules. The grammar author does not need to add keywords explicitly to any lists - Irony finds them automatically, and highlighter would color them appropriately. On the other hand, the reserved words in the language should be marked explicitly using MarkReservedWords(...) method. This call is important for languages with reserved words, as it tells scanner and parser to always recongize the word as keyword, and never as indentifier or something else. Without this call the grammar may become ambiguous and parser can make wrong decisions.
In conclusion - a little less code to write for you, fellow language creators. Enjoy it!

Apr 27, 2009

Irony, April 2009 update - what's new

Lot's of things have changed: namespaces and classes renamed, some parts are rewritten from scratch; there are breaking changes in public API. Breaking public API is a bad thing, I know, but I think it was unavoidable result of code refactoring aimed at fixing existing problems in Irony functionality. I hope you will agree after reading this blog that fixes and improvements in new version were worth it, and Irony has become a much better tool as a result.
Here is a quick list of changes:

  • New namespaces, folders, classes
  • New LALR construction implementation
  • NLALR parsing algorithm
  • Concrete and abstract syntax trees
  • Parser/Scanner/Token-Filters pipeline changes
  • Scanner conflicts resolution, Parser-Scanner link
  • Grammar Symbols, Keywords, Reserved Words and Case Sensitivity
  • Scripting namespace, AST classes, interpreter
  • Diagnostics, Parser Tracing, Grammar Explorer changes
New namespaces, folders, classes
The code had been reorganized. The Irony.Compiler namespace had been renamed to Irony.CompilerServices - this allowed using Compiler for class name, so former LanguageCompiler is now simply Compiler. All code and entities involved in construction of parsing automaton - computing states and transitions - is moved into a new namespace Irony.CompilerServices.Construction, with dedicated classes for building GrammarData, ScannerData, ParserData.
The parsing automaton data is also separated from transient data used during automaton construction. The ParserState class has a special field BuilderData that holds the temporary construction data. This field may be cleared when construction is completed to release the memory.
Former Irony.Runtime namespace was merged into new namespace Irony.Scripting, which now has AST sub-namespace with all AST node classes. All AST nodes defined in Irony so far were implementations of executable nodes for scripting/dynamic languages, so it makes sense to move all these stuff to the new Scripting namespace. Irony.Diagnostics is a new namespace containing debugging and tracing classes and utilities.

New LALR Construction Implementation
I have to admit that the implementation of LALR algorithm in previous version was not quite correct. Initially I followed the LALR description in Dragon book, and there is one subtle moment in lookahead propagation piece that is not clearly emphasized there. It was only after reading through LALR description in "Parsing Techniques" (2nd edition, 2007 by Grune, Jacobs) that I understood my misenterpretation. If you are into parsing and compilers - get this excellent book!
As for the fault in original LALR implementation in Irony, it is explained on page 308. The last paragraph on this page starts with the sentence "Now it's tempting to say that ... ... ...but that is not true". It is this tempting but wrong conjecture that was assumed true in old implementation. It turns out that the wrong Irony's implementation is a legitimate algorithm known as NQLALR method (Not-Quite LALR). It works well, except it potentially might have more conflicts in parsing automaton in complex grammars compared to pure LALR.
The new LALR construction follows the outline in "Parsing techniques", except two details.
Firstly, the algorithm known as DeRemer and Pennello algorithm, based on notions of transitions, lookbacks and "includes relation". In its second part involving "includes relation propagation" it suggests using efficient transitive closure computation based on SCC (Strongly-Connected Components algorithm). This optimization part is not implemented, and still awaits its hero coder. That explains why processing complex grammars like c# takes so long (about 500 ms) - should be much faster when optimization is in place.
Secondly, the original algorithm computes lookaheads, types of expected tokens, strings in fact. Irony's computation is a bit more complex - it collects not lookaheads themselves but lookahead sources - LR items that are actually produce the lookaheads. The reason for this is that Irony now implements NLALR algorithm (non-canonical variation of LALR) which is built as an extension of LALR. NLALR uses LALR automaton as a base. Skipping some technical details, let's just say that NLALR uses the lookahead sources (LR items) to construct new non-canonical states that are used as base points for exploration of the further input and for building new non-terminal lookaheads. With these two exceptions the Irony LALR implementation follows quite closely the implementation described in Grune,Jacobs.
Note: the term "transition" used in the book maps to Irony's "ShiftTransition" class.

NLALR parsing algorithm
One big addition in Irony is NLALR parsing method (Non-canonical LALR), with optional automatic grammar transformation. The Irony.Samples project includes several demo grammars in NonCanonTestGrammars folder that clearly illustrate the concept. There is also a text file _about_noncanon_grammars.txt in the same folder with comments and explanations of how it works and what is so good about non-canonical parsing. The ultimate goal of NLALR implementation is to build a parsing algorithm that is free of usual limitations of LR methods with limited number of lookahead symbols. It is now working for simple grammars like the ones provided as demo, but is not yet ready to handle complex grammars like c# sample.
The current NLALR implemenation needs more work: automaton construction is quite slow, and number of extra non-canonical states generated seems excessive, etc.As for the sample c# grammar in particular, it is not exactly the grammar from c# spec, I've changed it quite a lot previously to fit into LALR method. I'm not sure what is the trouble for NLALR with this grammar - ambiguities in original grammar or the tweaks I added. More experimentation and investigation is needed.

Concrete and abstract syntax trees
In the previous version Irony parser produced the abstract syntax tree (AST) directly. The AST nodes were created based on types specified in NodeType property of non-terminals. The base idea was that language implementors were free to build the AST node hierarchy, but in fact Irony was forcing them to use the Irony-defined AstNode as a base class.
This is fixed now. The AST tree creation process follows a standard two-step pattern described in compiler literature. The first step is creating the so-called concrete syntax tree that directly maps the elements in the source stream into the tree. It is called concrete because it reflects the specific syntactic elements of concrete programming language. The alternative name is "parse tree". Irony parser produces the ParseTree instance, which holds the root node of the tree and all nodes underneath as child nodes of the root and so on. The nodes are instances of class ParseTreeNode - a container for all information about the parsed syntax element: grammar term.
The AST nodes now can be instances of any type. There are two ways to instantiate and initialize nodes.
The first method is to implement IAstNodeInit interface in all your custom node classes. Your custom nodes also must have default (parameterless) constructor. You then specify node types for each non-terminal in extra parameter of non-terminal's constructor. Irony then would create all nodes and initialize them while it builds the parse tree.
The second method is to use NodeCreator methods that can be attached to non-terminals (through constructor parameter) and create AST nodes in these methods. Note: this functionality might be broken still, as all AST and interpreter infrastructure had not been brought back to life yet. I will fix it ASAP.

Parser/Scanner/Token filters pipeline changes
The parser and scanner classes and the way they communicate (pipeline) had been refactored. The main purpose is to simplify the public API for consumers like code colorizers.
The token filters pipeline had been moved inside Scanner, so now the scanner returns the token stream passed through all filters specified in grammar. Previously, the token stream produced by core scanner was in fact incomplete if there were filters specified.This presented a complication for consumers that had to either use raw, incomplete stream, or build and manage the entier pipeline from scanner and filters.

The other change is that there are two parser classes now. The former Parser class implementing LALR algorithm was renamed to CoreParser - it is "LALR parser" without scanner. A new Parser class combines both core parser and scanner, so it is a full parser transforming text into parse tree. Again, this is for simplifying the API for external consumers that need a combined scanner+parser object and don't care much about details.

Scanner conflicts resolution, scanner-parser link
Scanner often runs into situation when input stream is ambigous - the input word can be associated with more than one terminal from the list of terminals.
One case of ambiguity that often comes up is ambiguity between identifier and keyword. Often keywords in the grammar can also be used as identifiers, unless a keyword is explicitly marked as "reserved" word in grammar. In previous version the Irony scanner/parser used some resolution mechanism that allowed the input stream to be correctly parsed, even when scanner labeled input token "incorrectly". In new version the resolution algorithm was improved.
As before, the scanner initially recognizes keyword/identifier tokens as identifiers, but also sets an alternative terminal for token in AsSymbol field, when it finds that there is exactly matching keyword in grammar, and it is not Reserved word (more about Reserved words later). The parser first checks this AsSymbol field and if it is not null it tries to find an action for this symbol. If an action is defined in the current state, it means that "the keyword wins", and parser does some extra back-patching - this is new in Irony. Parser changes the "main" Terminal assigned to the token from Identifier to this AsSymbol terminal. The token in fact changes its identity, and might be correctly interpreted later (by code colorizer for example). If on the other hand, the parser does not find an action for AsSymbol terminal, it continues with assumption that is in fact Identifier and retrieves the action for the Identifier terminal.
The other case of scanning ambiguity is a bit more complicated. Sometimes grammar defines several non-terminals (non symbols) that can be associated with the same words/tokens. For example, GwBasic sample grammar defines 3 number terminals: regular number (int or float) used in arithmetic expressions, lineNumber - for line numbers, and fileNumber - integers identifying file# in input/output statements. Scanner by itself cannot possibly identify what is "123" in the input stream - any of the three terminals would match it.
In new version in situation like this the scanner checks with parser - which terminals are actually expected in current parser state. Each ParserState instance has a field ExpectedTerms which contains grammar terms (terminals and non-terminals) which are expected in input in this state. This set is precalculated at parser construction time. So when the new Irony scanner finds that more than one terminal can be used for matching the input (based on current input char), it checks the list of candidates against the ExpectedTerms set in the current parser state. That should leave only one terminal (unless the grammar is really ambiguous) which is then used for producing the token.

Grammar Symbols, Keywords, Reserved Words and Case Sensitivity
The Symbol terms dictionary and surrounding functionality had been refactored. The biggest trouble with old architecture was that Symbols dictionary was a static singleton field in SymbolTerminals class. This was obviously bad because this dictionary would be shared between several grammars in one application, with all negative consequences. But having this dictionary in static field was necessary as it was accessed by operator overrides (for + and operations) which are static methods.
Even a bigger problem was the fact that the dictionary was always initially case sensitive, it treated differently-cased words as different symbols, and this caused all kinds of problems for case-insensitive grammars. Grammar-processing code tried to reconcile different cAsE instances of the same symbol, but it didn't always work properly: for example the operator precedence in SQL grammar for logical operators (AND, OR) didn't work because of this mess.
The new solution in Irony is the following.
First, the SymbolTerms dictionary is an instance field in Grammar class. Static operator overloads access this dictionary through an extra copy in so called thread-static field of Grammar class. Thread-static fields are identified by [ThreadStatic] attribute, and are just like normal static fields except there is a separate instance for each accessing thread, so there is no danger of concurrent access to this dictionary from different grammars constructed on different threads.
Another change is that the symbol dictionary is created as case-insensitive for grammar-insensitive grammars. It is instantiated in base grammar constructor, and case sensitivity flag is now provided as a parameter to the constructor. The CaseSensitive Grammar's field is now read-only. This is one breaking change for existing case insensitive grammars, but fortunately it's an easy one to fix in existing grammars.
These changes simplified and straightened out the symbol terms support in Irony. Operator precedence now works correctly in sample SQL grammar.

Scripting namespace, AST classes, interpreter
This version introduces new namespace Irony.Scripting that contains all runtime and interpreter-related classes, including AST nodes that are in fact nodes suitable for interpreting the code. The interpreter infrastructure is not back to life yet, most of the code is commented out. Will be fixing it and probably rewriting some code - most likely "code analysis" methods will be gone, and functionality merged into evaluation methods.

Diagnostics, Parser Tracing, Grammar Explorer changes
New Diagnostics namespace contains several classes and utilities for parser debugging and information printout. Grammar Explorer now shows parser trace in the bottom area, and trace entries are active links - double-clicking in grid cell brings to you place in source code or parser state printout.

Enjoy it!

Apr 26, 2009

Welcome to my blog about Irony project!

Hi and welcome to my blog about Irony - .NET Language Implementation Kit - an open-source project hosted on CodePlex!

I've decided to start this blog to communicate what I'm doing in Irony and why, to explain and discuss design decisions and all things that come up as the project goes ahead.

Certainly, the blog is not a replacement for a real documentation and user or programmer guides - this will come eventually. The main trouble with documentation is that it does not make much sense to document things that continue to change - and at least until now - April 2009 - the Irony code and public API-s were quite unstable and changing all the time. Now it looks like things get settled and pretty soon I will start thinking about documenting it. This future documentation will eventually explain how to use it and how it works to some extent. This blog on the other hand will be providing some background explanations why things were done in a certain way.

So, let's get started - my first post will be about latest code drop (April 2009), and big improvements that came with it.