Jan 24, 2010

January 2010 Update

Hi everybody!
We are in 2010 now. Back in September, I talked about plans to make an official release of Irony before the end of 2009. Obviously, that did not happen. Still too many issues to fix and features to add before Irony can be declared "complete" for the first release. My apologies to everybody. It looks like it always takes longer than it takes. I was a bit busy with other things lately (crunch at day-time job), so I could not spend enough time on Irony. I apologize for the failed promise, and will try to make it happen this year, hopefully in a few months.
Now back to the current business. I'm making a sizable update to Irony today and would like to go over what is there. We'll start with small things and then move to bigger and more important changes.

Minor Fixes
I have fixed several bugs and inconsistencies reported previously.
  1. FreeTextLiteral is fixed - now correctly grabs the text when it ends at EOF.
  2. SourceStream.TabWidth property - now writable, so it can be set before parsing. That should fix an issue reported previously that this setting was useless in current implementation - there was no way to set it properly.
  3. Fixed Scanner.vsScanToken method, adjusted to latest changes in token scanning mechanism. Did not test it with VS integration - don't have a working test package, but hopefully the reported problems are fixed. Let me know if you find any problems.
  4. Changed the scanner error message to show the expected list of terminals, similar to what parser reports in case of error. This message is shown by the Scanner when it fails to produce a token in the current position.
  5. Added validation of ErrorRule property of non-terminals, more precisely - productions that contain SyntaxError term. These rules/productions are instructions to Parser telling it how to recover in case of error to continue parsing and reporting other possible errors. I have noticed that these productions might be set incorrectly (it happened in one of the grammars I have seen recently), so I added an automatic validation at grammar construction time. The rule is that SyntaxError term inside the rule must always be followed by some terminal, so SyntaxError cannot be the last term in production. The meaning of error production is to be an instruction to the parser: "If you encounter error here, skip everything until this terminal X", where X is the terminal after the SyntaxError term. The added validation code checks there is in fact such a terminal X.

Enhancements
Handling transient and list non-terminals
Refactored the parser code that handles the transient and list nodes. Now there are some explicit restrictions on non-terminals that are declared transient: first, list nodes cannot be declared transient; secondly, the rule for transient non-terminal must include 0 or 1 non-punctuation nodes, no more than one.
A transient term is an "abstract" non-terminal that is introduced for better grammar structuring, but has no concrete implementation in the language. For example, a "statement" non-terminal is usually defined as an OR of concrete statement types:
    Stmt.Rule = AssingnmentStmt | ForStmt |WhileStmt...;

so there is no such thing as "statement", it is an aggregating term for different kind of statements. By declaring it transient we tell parser to replace the Stmt node (when it is about to produce it) with its direct child node - a specific implementation.
In light of this explanation, the added restrictions make sense - we make sure we have a single child for transient nodes. In case of 0 children, I decided to allow this for now - these nodes will be eliminated if they are members of the list, so list constructor would skip them. Banning optional 0-child nodes seem to make it harder to go around in some common cases.

MakePlusRule, MakeStarRule enhancement
An important improvement is an addition to MakePlusRule and MakeStartRule method. Now they have an exra optional boolean parameter allowTrailingDelimiter. Setting this parameter to True allows you to create lists that have an optional extra delimiter after the list. This is quite common situation. For example, in c# the enum declaration allows a comma after the last element:
  public enum MyEnum {
    First,
    Second,
    Third,
  }
- notice a trailing comma after the Third - this is valid. With this extra parameter Irony will take care of the lists like this for you.

New property NodeCaptionTemplate in NonTerminal and GetParseNodeCaption method
The purpose of this property is to allow customization of the node caption in the parse tree in Grammar Explorer. For example, if you parse a GwBasic program (in the old version of Irony), the parse tree on the right would show a root PROGRAM node with a bunch of LINE nodes as its children. Each line's caption is just LINE, so you have to expand the node to see its contents and actually understand which particular line it is. In the new version of GwBasic grammar, the LINE non-terminal is defined as follows (I skip some details):
  LINE.Rule = lineNumber + LINE_CONTENT_OPT + COMMENT_OPT + NewLine;
  LINE.NodeCaptionTemplate = "Line #{0}";
NodeCaptionTemplate specifies that node caption should be a word "Line" followed by the contents of the child node at 0 index - which is line number. So line nodes in the tree now show up as "Line 10", "Line 20", etc. For other examples, see miniPython grammar - I assigned this property to non-terminals for function definition and function call, so now the show up as "def myFun(...)" and "call myFun(...)", showing the function name instead of simply node type.
Additionally, the process of constructing the caption now goes through the virtual Grammar method GetParseNodeCaption, which you can override in your grammar and provide custom captions for specific nodes in your grammar.

Dispatching the implementation method call in AstNode base class
One important improvement to AstNode base class. Previously, the AstNode.Evaluate method was calling virtual EvaluateNode method, which is supposed to be overridden in subclassed nodes and provide a specific implementation of node evaluation. It is still the case, but with a twist. There is an additonal protected field EvaluateRef in AstNode class - it is a pointer to the implementation of evaluation method, so Evaluate dispatches the call to specific implementation through this reference. This field is initially set to pointer to EvaluateNode method, so everything works as before. But now you can provide several implementations of "EvaluateNode" in specific node, each for some specific situation, and choose which to use by setting the pointer EvaluateRef during the initialization of the node. This provides a way for language implementor to combine several specific implementations in one AstNode subclass without creating several more specific subclasses.

UnExprNode class is refactored into several more specific classes
The previous implementation of UnExpNode was handling all kinds of unary expressions including inc/dec operators (++/--). It was quite messy, so I decided to break it into several more specific classes. See the expression evaluator grammar for a usage example of these new node classes.

New Features
New terminal: ImpliedSymbolTerminal
Some grammars and notation conventions have a concept of an implied (invisible) operator in expressions when two operands are sitting together without any delimiter between them. For example, in mathematical formulas the multiplication operator is usually skipped, so 'X Y' stands for 'X * Y'. The other example is SearchGrammar in samples. The input syntax allows two consequitive terms in the input stream, which implies "AND" operation, so 'T1 T2' stands for 'T1 AND T2'.
The problem with implementing such grammars is use of operator precedence. Because there is not physical terminal for the implicit operator, you cannot assign a precedence or associativity value to it, so it becomes impossible to use operator precedence facility for automatic conflict resolution. You might think that we could create an empty non-terminal with the rule
    ImpliedAnd.Rule = Empty;
and use it in the rule and assign a precedence value to it - but this does not work. The node for this empty non-terminal would not exist when parser is looking at the preview token - non-terminals are instantiated on top of the stack, so the parser cannot use the precedence to decide on conflict resolution. This was the case with the previous version of SearchGrammar - we could not use precedence there, so we had to fallback to an old ugly way of defining each type of binary expressions - AndExpression, OrExpression, etc, and express precedence rules through careful definition of each expression in terms of each other.
Now we have a solution for cases like this: ImpiedSymbolTerminal. It is created by the scanner from the 'air' - it does not consume any characters in the input stream. How does the scanner know that in some particular place it must inject this impied symbol? Through scanner-parser link. The scanner looks at the current state of the parser and asks which terminals are expected. Let's say the state indicates that the terminals expected are explicit operators (like '*', '+' etc) or an implicit operator. First the scanner tries explicit operators, and if it cannot create any of them, it creates an implicit symbol and passes it to the parser.
It should be clear from this explanation that using ImpliedSymbolTerminal requires that Parser-Scanner link is enabled for the grammar, so there is a check of this flag in the ImpliedSymbolTerminal init method. Look at SearchGrammar to see the use of this facility - the grammar had been simplified quite a bit I think, and is now more concise and clear.

BnfTerm.AstNodeConfig property
This property allows you to pass some extra setup information to the AstNode initialization method.
One usage example of this facility is StringLiteral class. AstNodeConfig property holds the instance of StringTemplateSettings (subclass of AstNodeConfig base class) in cases when embedded expressions are allowed in strings. See next section for more information about string templates and embedded expressions.
I am planning to use this facility extensively in the near future, when implementing general-purpose AST nodes for interpreter. The idea is to provide a way for language writer to customize the behavior of general AST nodes for particular language through settings in AstNodeConfig properties of non-terminals (in addition to AstNodeType) assigned in grammar constructor.

Multiple Grammar Roots and strings templates with embedded expressions
Have a look at new version of ExpressionEvaluator - it now support string operands with embedded expressions, Ruby-style syntax:
  name = "Joe"
  hello = "Hello, #{name}"
I added this facility to ExpressionEvaluator, just to use it as a test case in the absense of Ruby sample, will probably remove it in the future. You can use full-blown expressions inside curly braces, not only single variables.
What is important here is not the fancy Ruby strings, but what is involved in supporting this functionality. What is required in this case is that the parser can parse and evaluate a stand-alone expression, a snippet, which is not necessarily an entire program module. (In Ruby case, it is easier - a simple expression is a valid Ruby program, so we can treat embedded expressions as little programs, but in general this is not the case). The other example of applications where such facility is needed is a code colorizer, which can convert a code snippet in some language into an HTML fragment that we can embed into a web page. We want the colorizer to be able to accept small snippets - classes, single methods, or even standalone statements as an input. But Irony parser (as it existed before) was able to accept only full-module fragments, identified by the Root non-terminal.
Now you can have multiple roots. In addition to Grammar.Root property, there is a new Grammar.SnippetRoots set. These are additional roots for the grammar. You can add non-terminals representing "snippets" to this set in grammar constructor. The parser construction algorithm creates additional initial states in the grammar automaton for each of the snippet roots. To create a parser for a particular snippet, use constructor overload with snippet non-terminal as a parameter:
    var parser = new Parser(myLanguage, mySnippetRoot)
Look at StringTemplateNode.cs for an example of use of this facility. I plan to use this feature in the near future when I get to code colorizers for Wiki - if I ever get there...

Symbol Table Implementation
This is quite important new feature. All strings in the langauge/grammar and in particular program are now registered in the global symbol table and represented as instances of the Symbol class. This is a standard mechanism found in most implementations of compilers and interpreters. I would say that the need for this extra layer of symbol representation is pretty obvious in unmanaged implementations - to manage effectively the names at compile- and run- time. For implementations like Irony, built on top of .NET, the benefits are not so obvious - .NET runtime manages the strings quite well behind the scene. Irony so far was doing quite well without it, and for some time I even thought it might not be needed at all. But at the end, I decided it is good to have it, it can bring some performance benefits at runtime, and it can also open possibilities for some elaborate optimizations in the future.
One of the tricky things was to accomodate the need to work with both case sensitive and insensitive languages. I think I figured it out. Symbols are always treated as case-sensitive in internal tables. Each symbol has internal reference to all-lower symbol, which can be used for fast symbol comparison in case-insensitive languages.
Now all key terms (keywords in grammar) and identifiers do not use plain strings to hold the textual content. Instead, they have a reference to a Symbol that represents this textual string. The implementation is not complete yet, it is more like a draft, but all public API and changes to high-visibility classes (like terminals) are implemented and put in place. I will work on this further, mainly completing internal workings of the parser and interpreter, without making breaking changes to public API.

What's Next
For the next few weeks I would like to turn to interpreter functionality and finally expand it to more realistic feature set. I see some of you guys starting playing with runtime and implementing interpreters. Nothing wrong with that, but because of incompleteness of this functionality in Irony, every time you have to reinvent the wheel. I think it's time to invent the wheel once and forever, and let you guys decorate it with specific artifacts of your languages.
Another note on Wiki terminal and Wiki grammars in samples. Currently this implementation is quite messy and incomplete. I'm planning to get back to this functionality and rework it. So don't use it for anything but as samples - quite ugly samples in fact. It will be done better, I promise.
Until next time - have fun with Irony!

14 comments:

  1. Wow, I really like new features, especially SnippetRoots! :)

    But what is the purpose of SymbolTable?
    Is it only for quicker string lookup/comparison?
    I.e. comparing two pointers is faster than comparing two strings char-by-char. Am I right?

    ReplyDelete
  2. Very interest, and great work! Thanks!

    ReplyDelete
  3. To yallie:
    yes, one reason is comparison efficiency - it should be faster. But also there may be some extra information about symbol that we can put there, and use it for some tricks (like Symbol.LastTimeModified - to enable efficient values cashing inside nodes)

    ReplyDelete
  4. Roman,

    Other than the samples, what will be the best way to get familiar with various termial and non-terminal classes? Thanks

    Duncan

    ReplyDelete
  5. The only other thing I can suggest - look at TerminalFactory class, and compare the Terminals' setup code to formal definitions of lexical elements in corresponding languages. Then inspect the code of terminal classes. I think it is clear enough what it is doing.

    ReplyDelete
  6. This is ground-breaking work, but it needs documentation. Why not start a public Irony wiki that we could all contribute to? I'm currently using it to implement a language service, and I'm sure there are many facets of the language I just haven't figured out yet.

    ReplyDelete
  7. To John: Started Irony wiki book:
    http://en.wikibooks.org/wiki/Irony_-_Language_Implementation_Kit

    go ahead, you're more than welcome to contribute!
    Thanks in advance

    ReplyDelete
  8. This is excellent, but I have a problem with linking the interpreter to the outside world.

    There doesn't seem to be a clean way to make the interpreter aware of the context from which it has been executed, so that the AstNodes it generates can then interact with that context.

    For example, I'm adding scripting to a UI where the script language will control various UI features. So the EvaluateNode override on each of my action nodes needs to perform the relevant action on the specific instance of the UI from which the interpreter is being executed.

    But I can't see an elegant way to do this. The only way I can see is to add the UI instance to the globals where it doesn't really belong. There's nowhere in the runtime where the context can be extended to include a parent.

    Am I missing something?

    ReplyDelete
  9. I should add - I'm achieving this now by using the AST node delegate creation to populate the Ast nodes with a scripting context, but that is a bit heavy handed especially with lots of actions to code!

    It would be nicer if it could be applied into the runtime context passed to EvaluateNode.

    ReplyDelete
  10. I think Globals is the right place for it. In fact, the UI/form instance is a global in the context of script execution. So script writer can do smth like "Form.Caption = "Hello!", where Form is a global reference to actual UI form inside which it is running

    ReplyDelete
  11. Irony is a masterpiece. Please take my compliments.

    ReplyDelete
  12. This comment has been removed by a blog administrator.

    ReplyDelete
  13. This comment has been removed by a blog administrator.

    ReplyDelete