Here is a quick list of changes:
- New namespaces, folders, classes
- New LALR construction implementation
- NLALR parsing algorithm
- Concrete and abstract syntax trees
- Parser/Scanner/Token-Filters pipeline changes
- Scanner conflicts resolution, Parser-Scanner link
- Grammar Symbols, Keywords, Reserved Words and Case Sensitivity
- Scripting namespace, AST classes, interpreter
- Diagnostics, Parser Tracing, Grammar Explorer changes
The code had been reorganized. The Irony.Compiler namespace had been renamed to Irony.CompilerServices - this allowed using Compiler for class name, so former LanguageCompiler is now simply Compiler. All code and entities involved in construction of parsing automaton - computing states and transitions - is moved into a new namespace Irony.CompilerServices.Construction, with dedicated classes for building GrammarData, ScannerData, ParserData.
The parsing automaton data is also separated from transient data used during automaton construction. The ParserState class has a special field BuilderData that holds the temporary construction data. This field may be cleared when construction is completed to release the memory.
Former Irony.Runtime namespace was merged into new namespace Irony.Scripting, which now has AST sub-namespace with all AST node classes. All AST nodes defined in Irony so far were implementations of executable nodes for scripting/dynamic languages, so it makes sense to move all these stuff to the new Scripting namespace. Irony.Diagnostics is a new namespace containing debugging and tracing classes and utilities.
New LALR Construction Implementation
I have to admit that the implementation of LALR algorithm in previous version was not quite correct. Initially I followed the LALR description in Dragon book, and there is one subtle moment in lookahead propagation piece that is not clearly emphasized there. It was only after reading through LALR description in "Parsing Techniques" (2nd edition, 2007 by Grune, Jacobs) that I understood my misenterpretation. If you are into parsing and compilers - get this excellent book!
As for the fault in original LALR implementation in Irony, it is explained on page 308. The last paragraph on this page starts with the sentence "Now it's tempting to say that ... ... ...but that is not true". It is this tempting but wrong conjecture that was assumed true in old implementation. It turns out that the wrong Irony's implementation is a legitimate algorithm known as NQLALR method (Not-Quite LALR). It works well, except it potentially might have more conflicts in parsing automaton in complex grammars compared to pure LALR.
The new LALR construction follows the outline in "Parsing techniques", except two details.
Firstly, the algorithm known as DeRemer and Pennello algorithm, based on notions of transitions, lookbacks and "includes relation". In its second part involving "includes relation propagation" it suggests using efficient transitive closure computation based on SCC (Strongly-Connected Components algorithm). This optimization part is not implemented, and still awaits its hero coder. That explains why processing complex grammars like c# takes so long (about 500 ms) - should be much faster when optimization is in place.
Secondly, the original algorithm computes lookaheads, types of expected tokens, strings in fact. Irony's computation is a bit more complex - it collects not lookaheads themselves but lookahead sources - LR items that are actually produce the lookaheads. The reason for this is that Irony now implements NLALR algorithm (non-canonical variation of LALR) which is built as an extension of LALR. NLALR uses LALR automaton as a base. Skipping some technical details, let's just say that NLALR uses the lookahead sources (LR items) to construct new non-canonical states that are used as base points for exploration of the further input and for building new non-terminal lookaheads. With these two exceptions the Irony LALR implementation follows quite closely the implementation described in Grune,Jacobs.
Note: the term "transition" used in the book maps to Irony's "ShiftTransition" class.
NLALR parsing algorithm
One big addition in Irony is NLALR parsing method (Non-canonical LALR), with optional automatic grammar transformation. The Irony.Samples project includes several demo grammars in NonCanonTestGrammars folder that clearly illustrate the concept. There is also a text file _about_noncanon_grammars.txt in the same folder with comments and explanations of how it works and what is so good about non-canonical parsing. The ultimate goal of NLALR implementation is to build a parsing algorithm that is free of usual limitations of LR methods with limited number of lookahead symbols. It is now working for simple grammars like the ones provided as demo, but is not yet ready to handle complex grammars like c# sample.
The current NLALR implemenation needs more work: automaton construction is quite slow, and number of extra non-canonical states generated seems excessive, etc.As for the sample c# grammar in particular, it is not exactly the grammar from c# spec, I've changed it quite a lot previously to fit into LALR method. I'm not sure what is the trouble for NLALR with this grammar - ambiguities in original grammar or the tweaks I added. More experimentation and investigation is needed.
Concrete and abstract syntax trees
In the previous version Irony parser produced the abstract syntax tree (AST) directly. The AST nodes were created based on types specified in NodeType property of non-terminals. The base idea was that language implementors were free to build the AST node hierarchy, but in fact Irony was forcing them to use the Irony-defined AstNode as a base class.
This is fixed now. The AST tree creation process follows a standard two-step pattern described in compiler literature. The first step is creating the so-called concrete syntax tree that directly maps the elements in the source stream into the tree. It is called concrete because it reflects the specific syntactic elements of concrete programming language. The alternative name is "parse tree". Irony parser produces the ParseTree instance, which holds the root node of the tree and all nodes underneath as child nodes of the root and so on. The nodes are instances of class ParseTreeNode - a container for all information about the parsed syntax element: grammar term.
The AST nodes now can be instances of any type. There are two ways to instantiate and initialize nodes.
The first method is to implement IAstNodeInit interface in all your custom node classes. Your custom nodes also must have default (parameterless) constructor. You then specify node types for each non-terminal in extra parameter of non-terminal's constructor. Irony then would create all nodes and initialize them while it builds the parse tree.
The second method is to use NodeCreator methods that can be attached to non-terminals (through constructor parameter) and create AST nodes in these methods. Note: this functionality might be broken still, as all AST and interpreter infrastructure had not been brought back to life yet. I will fix it ASAP.
Parser/Scanner/Token filters pipeline changes
The parser and scanner classes and the way they communicate (pipeline) had been refactored. The main purpose is to simplify the public API for consumers like code colorizers.
The token filters pipeline had been moved inside Scanner, so now the scanner returns the token stream passed through all filters specified in grammar. Previously, the token stream produced by core scanner was in fact incomplete if there were filters specified.This presented a complication for consumers that had to either use raw, incomplete stream, or build and manage the entier pipeline from scanner and filters.
The other change is that there are two parser classes now. The former Parser class implementing LALR algorithm was renamed to CoreParser - it is "LALR parser" without scanner. A new Parser class combines both core parser and scanner, so it is a full parser transforming text into parse tree. Again, this is for simplifying the API for external consumers that need a combined scanner+parser object and don't care much about details.
Scanner conflicts resolution, scanner-parser link
Scanner often runs into situation when input stream is ambigous - the input word can be associated with more than one terminal from the list of terminals.
One case of ambiguity that often comes up is ambiguity between identifier and keyword. Often keywords in the grammar can also be used as identifiers, unless a keyword is explicitly marked as "reserved" word in grammar. In previous version the Irony scanner/parser used some resolution mechanism that allowed the input stream to be correctly parsed, even when scanner labeled input token "incorrectly". In new version the resolution algorithm was improved.
As before, the scanner initially recognizes keyword/identifier tokens as identifiers, but also sets an alternative terminal for token in AsSymbol field, when it finds that there is exactly matching keyword in grammar, and it is not Reserved word (more about Reserved words later). The parser first checks this AsSymbol field and if it is not null it tries to find an action for this symbol. If an action is defined in the current state, it means that "the keyword wins", and parser does some extra back-patching - this is new in Irony. Parser changes the "main" Terminal assigned to the token from Identifier to this AsSymbol terminal. The token in fact changes its identity, and might be correctly interpreted later (by code colorizer for example). If on the other hand, the parser does not find an action for AsSymbol terminal, it continues with assumption that is in fact Identifier and retrieves the action for the Identifier terminal.
The other case of scanning ambiguity is a bit more complicated. Sometimes grammar defines several non-terminals (non symbols) that can be associated with the same words/tokens. For example, GwBasic sample grammar defines 3 number terminals: regular number (int or float) used in arithmetic expressions, lineNumber - for line numbers, and fileNumber - integers identifying file# in input/output statements. Scanner by itself cannot possibly identify what is "123" in the input stream - any of the three terminals would match it.
In new version in situation like this the scanner checks with parser - which terminals are actually expected in current parser state. Each ParserState instance has a field ExpectedTerms which contains grammar terms (terminals and non-terminals) which are expected in input in this state. This set is precalculated at parser construction time. So when the new Irony scanner finds that more than one terminal can be used for matching the input (based on current input char), it checks the list of candidates against the ExpectedTerms set in the current parser state. That should leave only one terminal (unless the grammar is really ambiguous) which is then used for producing the token.
Grammar Symbols, Keywords, Reserved Words and Case Sensitivity
The Symbol terms dictionary and surrounding functionality had been refactored. The biggest trouble with old architecture was that Symbols dictionary was a static singleton field in SymbolTerminals class. This was obviously bad because this dictionary would be shared between several grammars in one application, with all negative consequences. But having this dictionary in static field was necessary as it was accessed by operator overrides (for + and operations) which are static methods.
Even a bigger problem was the fact that the dictionary was always initially case sensitive, it treated differently-cased words as different symbols, and this caused all kinds of problems for case-insensitive grammars. Grammar-processing code tried to reconcile different cAsE instances of the same symbol, but it didn't always work properly: for example the operator precedence in SQL grammar for logical operators (AND, OR) didn't work because of this mess.
The new solution in Irony is the following.
First, the SymbolTerms dictionary is an instance field in Grammar class. Static operator overloads access this dictionary through an extra copy in so called thread-static field of Grammar class. Thread-static fields are identified by [ThreadStatic] attribute, and are just like normal static fields except there is a separate instance for each accessing thread, so there is no danger of concurrent access to this dictionary from different grammars constructed on different threads.
Another change is that the symbol dictionary is created as case-insensitive for grammar-insensitive grammars. It is instantiated in base grammar constructor, and case sensitivity flag is now provided as a parameter to the constructor. The CaseSensitive Grammar's field is now read-only. This is one breaking change for existing case insensitive grammars, but fortunately it's an easy one to fix in existing grammars.
These changes simplified and straightened out the symbol terms support in Irony. Operator precedence now works correctly in sample SQL grammar.
Scripting namespace, AST classes, interpreter
This version introduces new namespace Irony.Scripting that contains all runtime and interpreter-related classes, including AST nodes that are in fact nodes suitable for interpreting the code. The interpreter infrastructure is not back to life yet, most of the code is commented out. Will be fixing it and probably rewriting some code - most likely "code analysis" methods will be gone, and functionality merged into evaluation methods.
Diagnostics, Parser Tracing, Grammar Explorer changes
New Diagnostics namespace contains several classes and utilities for parser debugging and information printout. Grammar Explorer now shows parser trace in the bottom area, and trace entries are active links - double-clicking in grid cell brings to you place in source code or parser state printout.