Jan 24, 2010

January 2010 Update

Hi everybody!
We are in 2010 now. Back in September, I talked about plans to make an official release of Irony before the end of 2009. Obviously, that did not happen. Still too many issues to fix and features to add before Irony can be declared "complete" for the first release. My apologies to everybody. It looks like it always takes longer than it takes. I was a bit busy with other things lately (crunch at day-time job), so I could not spend enough time on Irony. I apologize for the failed promise, and will try to make it happen this year, hopefully in a few months.
Now back to the current business. I'm making a sizable update to Irony today and would like to go over what is there. We'll start with small things and then move to bigger and more important changes.

Minor Fixes
I have fixed several bugs and inconsistencies reported previously.
  1. FreeTextLiteral is fixed - now correctly grabs the text when it ends at EOF.
  2. SourceStream.TabWidth property - now writable, so it can be set before parsing. That should fix an issue reported previously that this setting was useless in current implementation - there was no way to set it properly.
  3. Fixed Scanner.vsScanToken method, adjusted to latest changes in token scanning mechanism. Did not test it with VS integration - don't have a working test package, but hopefully the reported problems are fixed. Let me know if you find any problems.
  4. Changed the scanner error message to show the expected list of terminals, similar to what parser reports in case of error. This message is shown by the Scanner when it fails to produce a token in the current position.
  5. Added validation of ErrorRule property of non-terminals, more precisely - productions that contain SyntaxError term. These rules/productions are instructions to Parser telling it how to recover in case of error to continue parsing and reporting other possible errors. I have noticed that these productions might be set incorrectly (it happened in one of the grammars I have seen recently), so I added an automatic validation at grammar construction time. The rule is that SyntaxError term inside the rule must always be followed by some terminal, so SyntaxError cannot be the last term in production. The meaning of error production is to be an instruction to the parser: "If you encounter error here, skip everything until this terminal X", where X is the terminal after the SyntaxError term. The added validation code checks there is in fact such a terminal X.

Handling transient and list non-terminals
Refactored the parser code that handles the transient and list nodes. Now there are some explicit restrictions on non-terminals that are declared transient: first, list nodes cannot be declared transient; secondly, the rule for transient non-terminal must include 0 or 1 non-punctuation nodes, no more than one.
A transient term is an "abstract" non-terminal that is introduced for better grammar structuring, but has no concrete implementation in the language. For example, a "statement" non-terminal is usually defined as an OR of concrete statement types:
    Stmt.Rule = AssingnmentStmt | ForStmt |WhileStmt...;

so there is no such thing as "statement", it is an aggregating term for different kind of statements. By declaring it transient we tell parser to replace the Stmt node (when it is about to produce it) with its direct child node - a specific implementation.
In light of this explanation, the added restrictions make sense - we make sure we have a single child for transient nodes. In case of 0 children, I decided to allow this for now - these nodes will be eliminated if they are members of the list, so list constructor would skip them. Banning optional 0-child nodes seem to make it harder to go around in some common cases.

MakePlusRule, MakeStarRule enhancement
An important improvement is an addition to MakePlusRule and MakeStartRule method. Now they have an exra optional boolean parameter allowTrailingDelimiter. Setting this parameter to True allows you to create lists that have an optional extra delimiter after the list. This is quite common situation. For example, in c# the enum declaration allows a comma after the last element:
  public enum MyEnum {
- notice a trailing comma after the Third - this is valid. With this extra parameter Irony will take care of the lists like this for you.

New property NodeCaptionTemplate in NonTerminal and GetParseNodeCaption method
The purpose of this property is to allow customization of the node caption in the parse tree in Grammar Explorer. For example, if you parse a GwBasic program (in the old version of Irony), the parse tree on the right would show a root PROGRAM node with a bunch of LINE nodes as its children. Each line's caption is just LINE, so you have to expand the node to see its contents and actually understand which particular line it is. In the new version of GwBasic grammar, the LINE non-terminal is defined as follows (I skip some details):
  LINE.Rule = lineNumber + LINE_CONTENT_OPT + COMMENT_OPT + NewLine;
  LINE.NodeCaptionTemplate = "Line #{0}";
NodeCaptionTemplate specifies that node caption should be a word "Line" followed by the contents of the child node at 0 index - which is line number. So line nodes in the tree now show up as "Line 10", "Line 20", etc. For other examples, see miniPython grammar - I assigned this property to non-terminals for function definition and function call, so now the show up as "def myFun(...)" and "call myFun(...)", showing the function name instead of simply node type.
Additionally, the process of constructing the caption now goes through the virtual Grammar method GetParseNodeCaption, which you can override in your grammar and provide custom captions for specific nodes in your grammar.

Dispatching the implementation method call in AstNode base class
One important improvement to AstNode base class. Previously, the AstNode.Evaluate method was calling virtual EvaluateNode method, which is supposed to be overridden in subclassed nodes and provide a specific implementation of node evaluation. It is still the case, but with a twist. There is an additonal protected field EvaluateRef in AstNode class - it is a pointer to the implementation of evaluation method, so Evaluate dispatches the call to specific implementation through this reference. This field is initially set to pointer to EvaluateNode method, so everything works as before. But now you can provide several implementations of "EvaluateNode" in specific node, each for some specific situation, and choose which to use by setting the pointer EvaluateRef during the initialization of the node. This provides a way for language implementor to combine several specific implementations in one AstNode subclass without creating several more specific subclasses.

UnExprNode class is refactored into several more specific classes
The previous implementation of UnExpNode was handling all kinds of unary expressions including inc/dec operators (++/--). It was quite messy, so I decided to break it into several more specific classes. See the expression evaluator grammar for a usage example of these new node classes.

New Features
New terminal: ImpliedSymbolTerminal
Some grammars and notation conventions have a concept of an implied (invisible) operator in expressions when two operands are sitting together without any delimiter between them. For example, in mathematical formulas the multiplication operator is usually skipped, so 'X Y' stands for 'X * Y'. The other example is SearchGrammar in samples. The input syntax allows two consequitive terms in the input stream, which implies "AND" operation, so 'T1 T2' stands for 'T1 AND T2'.
The problem with implementing such grammars is use of operator precedence. Because there is not physical terminal for the implicit operator, you cannot assign a precedence or associativity value to it, so it becomes impossible to use operator precedence facility for automatic conflict resolution. You might think that we could create an empty non-terminal with the rule
    ImpliedAnd.Rule = Empty;
and use it in the rule and assign a precedence value to it - but this does not work. The node for this empty non-terminal would not exist when parser is looking at the preview token - non-terminals are instantiated on top of the stack, so the parser cannot use the precedence to decide on conflict resolution. This was the case with the previous version of SearchGrammar - we could not use precedence there, so we had to fallback to an old ugly way of defining each type of binary expressions - AndExpression, OrExpression, etc, and express precedence rules through careful definition of each expression in terms of each other.
Now we have a solution for cases like this: ImpiedSymbolTerminal. It is created by the scanner from the 'air' - it does not consume any characters in the input stream. How does the scanner know that in some particular place it must inject this impied symbol? Through scanner-parser link. The scanner looks at the current state of the parser and asks which terminals are expected. Let's say the state indicates that the terminals expected are explicit operators (like '*', '+' etc) or an implicit operator. First the scanner tries explicit operators, and if it cannot create any of them, it creates an implicit symbol and passes it to the parser.
It should be clear from this explanation that using ImpliedSymbolTerminal requires that Parser-Scanner link is enabled for the grammar, so there is a check of this flag in the ImpliedSymbolTerminal init method. Look at SearchGrammar to see the use of this facility - the grammar had been simplified quite a bit I think, and is now more concise and clear.

BnfTerm.AstNodeConfig property
This property allows you to pass some extra setup information to the AstNode initialization method.
One usage example of this facility is StringLiteral class. AstNodeConfig property holds the instance of StringTemplateSettings (subclass of AstNodeConfig base class) in cases when embedded expressions are allowed in strings. See next section for more information about string templates and embedded expressions.
I am planning to use this facility extensively in the near future, when implementing general-purpose AST nodes for interpreter. The idea is to provide a way for language writer to customize the behavior of general AST nodes for particular language through settings in AstNodeConfig properties of non-terminals (in addition to AstNodeType) assigned in grammar constructor.

Multiple Grammar Roots and strings templates with embedded expressions
Have a look at new version of ExpressionEvaluator - it now support string operands with embedded expressions, Ruby-style syntax:
  name = "Joe"
  hello = "Hello, #{name}"
I added this facility to ExpressionEvaluator, just to use it as a test case in the absense of Ruby sample, will probably remove it in the future. You can use full-blown expressions inside curly braces, not only single variables.
What is important here is not the fancy Ruby strings, but what is involved in supporting this functionality. What is required in this case is that the parser can parse and evaluate a stand-alone expression, a snippet, which is not necessarily an entire program module. (In Ruby case, it is easier - a simple expression is a valid Ruby program, so we can treat embedded expressions as little programs, but in general this is not the case). The other example of applications where such facility is needed is a code colorizer, which can convert a code snippet in some language into an HTML fragment that we can embed into a web page. We want the colorizer to be able to accept small snippets - classes, single methods, or even standalone statements as an input. But Irony parser (as it existed before) was able to accept only full-module fragments, identified by the Root non-terminal.
Now you can have multiple roots. In addition to Grammar.Root property, there is a new Grammar.SnippetRoots set. These are additional roots for the grammar. You can add non-terminals representing "snippets" to this set in grammar constructor. The parser construction algorithm creates additional initial states in the grammar automaton for each of the snippet roots. To create a parser for a particular snippet, use constructor overload with snippet non-terminal as a parameter:
    var parser = new Parser(myLanguage, mySnippetRoot)
Look at StringTemplateNode.cs for an example of use of this facility. I plan to use this feature in the near future when I get to code colorizers for Wiki - if I ever get there...

Symbol Table Implementation
This is quite important new feature. All strings in the langauge/grammar and in particular program are now registered in the global symbol table and represented as instances of the Symbol class. This is a standard mechanism found in most implementations of compilers and interpreters. I would say that the need for this extra layer of symbol representation is pretty obvious in unmanaged implementations - to manage effectively the names at compile- and run- time. For implementations like Irony, built on top of .NET, the benefits are not so obvious - .NET runtime manages the strings quite well behind the scene. Irony so far was doing quite well without it, and for some time I even thought it might not be needed at all. But at the end, I decided it is good to have it, it can bring some performance benefits at runtime, and it can also open possibilities for some elaborate optimizations in the future.
One of the tricky things was to accomodate the need to work with both case sensitive and insensitive languages. I think I figured it out. Symbols are always treated as case-sensitive in internal tables. Each symbol has internal reference to all-lower symbol, which can be used for fast symbol comparison in case-insensitive languages.
Now all key terms (keywords in grammar) and identifiers do not use plain strings to hold the textual content. Instead, they have a reference to a Symbol that represents this textual string. The implementation is not complete yet, it is more like a draft, but all public API and changes to high-visibility classes (like terminals) are implemented and put in place. I will work on this further, mainly completing internal workings of the parser and interpreter, without making breaking changes to public API.

What's Next
For the next few weeks I would like to turn to interpreter functionality and finally expand it to more realistic feature set. I see some of you guys starting playing with runtime and implementing interpreters. Nothing wrong with that, but because of incompleteness of this functionality in Irony, every time you have to reinvent the wheel. I think it's time to invent the wheel once and forever, and let you guys decorate it with specific artifacts of your languages.
Another note on Wiki terminal and Wiki grammars in samples. Currently this implementation is quite messy and incomplete. I'm planning to get back to this functionality and rework it. So don't use it for anything but as samples - quite ugly samples in fact. It will be done better, I promise.
Until next time - have fun with Irony!

Oct 14, 2009

Updated Alpha release version, Oct 2009

I finally upgraded the download (Alpha-release) version and brought it in sync with latest sources. I removed some dead and not-used-yet code, did some code cleanup and a bit of refactoring. Aside this, there are a few changes and improvements that I would like to comment on.
The first important change is renaming SymbolTerminal to KeyTerm. I think that the name SymbolTerminal was misleading and a bit confusing, and it became more so as I faced a need for a terminal for scanning elements known as "symbols" in many programming languages; for example, in Ruby it is an identifier preceeded by a colon, like :name. The former SymbolTerminal, on the other hand, was created to recognize the language-defined keywords and special characters (operator, assignment, braces, etc). So the name change makes sense, and "key" prefix in KeyTerm comes from the "keyword". This name change caused obvious name changes to corresponding dictionaries, lists and fields. It also made sense to rename the  Symbol() method in the Grammar class - it is used to explicitly convert the first string in string-only BNF expressions and allow proper method overload to kick in. For example:

BinOp.Rule = Symbol("+") | "-" | "*" | "/";

I changed the name of the method to ToTerm(string), but I kept the old version Symbol() and decorated it with Obsolete attribute. The Symbol() method is heavily used in grammars, and leaving an old method for a while would allow to compile the old grammars without errors. The compiler will only give a warning that the method was deprecated and you should use ToTerm instead. Now the rule for BinOp should look like this:

BinOp.Rule = ToTerm("+") | "-" | "*" | "/";

Please make a change in your existing grammars.
Continuing with this theme, I'm thinking about better ways to handle keywords internally. It looks like this Grammar.CurrentGrammar thread-static property is more trouble than it's worth. The main problem is that keyword terminals can be created in static operator overloads (BnfTerm.Op_Pipe, Op_Plus etc). At the same time the terminal for a particular keyword should be instantiated once in a grammar, and with proper case sensitivity arrangement. Therefore, the grammar (an instance) should have a list/dictionary of keyword terminals which should be accessible from static operator overload methods. Currently it is done through this Grammar.CurrentGrammar property, but I'm wondering if there's a better way. Will see.
Another change is a new Grammar.RunSample virtual method. The purpose of this method is to allow custom implementations of running a sample code in the Grammar Explorer. The base implementation in Grammar class executes the code using Irony's AST interpreter - this works well for the expression evaluator grammar, as an example. The SearchGrammar (Google-to-SQL search syntax converter) overrides this method and provides it's own implementation - it converts the input string into SQL server syntax using its own custom method. Using the RunSample override allowed to get rid of a separate sample project I had before for testing the search grammar - now you can do it directly in the Grammar Explorer.
One more thing: the language flag SupportsInterpreter is renamed to CanRunSample. If this flag is set, the Grammar Explorer enables the Run button and calls the Grammar.RunSample method on click.
That's it for now - more stuff is coming. We are on track for release by the end of the year.
Enjoy it!