Wednesday 11 November 2015

ANTLR 4

I've been working with ANTLR version 4, which is a refreshing change as it adds listeners and visitors.

This allows you to separate the parser from the actions, making the code much more readable. It is also intended to allow the grammars to be easily re-used with a different target language.

For the listener, ANTLR generates a Java interface and a base class that implements each of the methods as an empty stub. You can then extend the base class and override only the methods which are useful to you. This has the added benefit that your code still compiles when you add rules to your parser grammar and new methods appear in the interface.

Maven ANTLR v4 Setup

To start, with Maven, add the ANTLR version 4 runtime as a dependency and the plugin that compiles the grammars:

<project>
  [...]
  <dependencies>
    [...]

    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.3</version>
    </dependency>
  </dependencies>


  <build>
    [...]


    <plugins>
      <plugin>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-maven-plugin</artifactId>
        <version>4.3</version>
        <executions>
          <execution>
            <goals>
              <goal>antlr4</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>


Create a src/main/antlr4 directory with sub-directories matching the Java package you would like the generated code to be in. ANTLR v4 lexer and parser grammar files should use the extension .g4.

Lexer and Grammar - Combined or Apart

You can make a combined lexer & parser grammar in one file, or keep them in separate files. While the combined grammar may keep things simpler to begin with, I would suggest keeping them separate to avoid confusing their roles.

As an example of where the confusion creeps in with the combined grammar, you can place string literals in a parser rule.

e.g: myRule: 'Hello' WHITESPACE 'World!';

The result is that the string literals become implicit lexer rules, whereas WHITESPACE is an explicit lexer rule. The implicit lexer rules have priority over the explicit lexer rules.

I'm not a fan of implicit conversions and rules (within reason) that make code less readable. 

There are also some nice features such as lexer modes which aren't available in a combined grammar.

Example Lexer Grammar (filename MyLexer.g4)

 

lexer grammar MyLexer;

fragment DIGIT 
    : [0-9] 
    ;

DATE
    : DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT DIGIT DIGIT
    ;   

INT
    : DIGIT+
    ;

NEWLINE
    : '\r'? '\n'
    ;

WHITESPACE
    : (' ' | '\t') -> skip
    ;

Lexer rules always start with an uppercase letter, and the grammar here defines the tokens DATE, INT, NEWLINE and WHITESPACE.

The fragment prefix to DIGIT indicates that it will only be used in other lexer rules and will not become a token itself. DIGIT will therefore not be available for use in the parser grammar.

The WHITESPACE rule has a lexer command indicated by the arrow. The skip command tells the lexer to discard the token and not pass it on to the parser. In this case it would discard all spaces and tabs.

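The lexer always prefers the rule that matches the most input. When two rules match the same amount of text, the order the rules appear in the file sets their precedence, with those appearing first having the higher precedence.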
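To make the matching order a bit more concrete, here is a rough emulation of it in plain Java using java.util.regex. It is only a sketch and not how ANTLR is implemented: the class name LexerPrecedenceSketch and the tokenize method are made up for illustration, and the regular expressions are my hand translation of the rules in MyLexer.g4. At each position it takes the longest match, keeps the earlier rule on a tie, and drops WHITESPACE to mimic the skip command.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; ANTLR generates a real lexer from MyLexer.g4.
class LexerPrecedenceSketch {

    // Rules in the same order as in MyLexer.g4; LinkedHashMap keeps that order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("DATE", Pattern.compile("[0-9]{2}-[0-9]{2}-[0-9]{4}"));
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
        RULES.put("WHITESPACE", Pattern.compile("[ \\t]"));
    }

    // Returns the token names found in the input, in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer wins; on a tie the earlier rule keeps the match.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestName.equals("WHITESPACE")) { // emulates "-> skip"
                tokens.add(bestName);
            }
            pos = bestEnd;
        }
        return tokens;
    }
}
```

With these four rules the longest match alone decides between DATE and INT: for "11-11-2015" the INT rule could only consume "11", so DATE wins. The ordering only matters once two rules can match exactly the same text.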

Example Parser Grammar (filename MyParser.g4)


parser grammar MyParser;

options { tokenVocab=MyLexer; }

file
    : (line NEWLINE)*
    EOF
    ;

line
    : DATE
    | INT
    ;

Parser rules always begin with a lowercase letter. The tokenVocab option ties the lexer and the parser together. If the file method in the generated parser is executed, it will look for zero or more lines, each consisting of either a DATE or an INT and followed by a NEWLINE, and then the end of the input.

EOF is a built-in token which matches the end of the file.
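As a way of reading the file rule, here is a hand-written check in plain Java that accepts the same token sequences. This is just a sketch: the class FileRuleSketch and the method matchesFile are hypothetical names, tokens are represented as plain strings, and it is not the code ANTLR generates.

```java
import java.util.List;

// Hypothetical illustration of what the file rule accepts.
class FileRuleSketch {

    // Mirrors: file : (line NEWLINE)* EOF ;   line : DATE | INT ;
    static boolean matchesFile(List<String> tokens) {
        int i = 0;
        // Consume as many "line NEWLINE" pairs as possible.
        while (i + 1 < tokens.size()
                && (tokens.get(i).equals("DATE") || tokens.get(i).equals("INT"))
                && tokens.get(i + 1).equals("NEWLINE")) {
            i += 2;
        }
        // Exactly one token must remain, and it must be EOF.
        return i == tokens.size() - 1 && tokens.get(i).equals("EOF");
    }
}
```

One thing this makes visible is that the last line needs a trailing NEWLINE too: a DATE followed directly by EOF does not match the rule as written.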

Maven structure and generated code

The generated Java source should appear under the target/generated-sources directory (target/generated-sources/antlr4 with the plugin's default settings).

The file rule from the parser grammar will generate a method like:

public final FileContext file() throws RecognitionException {

in the MyParser.java file, which is the method you would call to parse the file. (The method name would be whatever you called the rule.)


The following listener interface would be generated, along with a base class (MyParserBaseListener) implementing all its methods as empty stubs:

public interface MyParserListener extends ParseTreeListener {
    void enterFile(@NotNull MyParser.FileContext ctx);
    void exitFile(@NotNull MyParser.FileContext ctx);
    void enterLine(@NotNull MyParser.LineContext ctx);
    void exitLine(@NotNull MyParser.LineContext ctx);
}
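To show how the interface, the stub class and your own listener fit together, here is a sketch that compiles without ANTLR on the classpath. Everything in it is a stand-in I made up for illustration: DemoListener plays the role of MyParserListener, DemoBaseListener the role of the generated stub class, and the callbacks take a plain String where the real ones take a context object (from which you would typically call ctx.getText()). The collect method is a hypothetical driver standing in for the parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the generated types; not ANTLR's real classes.
class ListenerPatternSketch {

    // Stand-in for the generated interface: an enter/exit pair per parser rule.
    interface DemoListener {
        void enterFile();
        void exitFile();
        void enterLine(String text);
        void exitLine(String text);
    }

    // Stand-in for the generated stub class: every method does nothing.
    static class DemoBaseListener implements DemoListener {
        @Override public void enterFile() { }
        @Override public void exitFile() { }
        @Override public void enterLine(String text) { }
        @Override public void exitLine(String text) { }
    }

    // Your own listener overrides only the callback it cares about, so new
    // methods added to the interface later do not break it.
    static class LineCollector extends DemoBaseListener {
        final List<String> lines = new ArrayList<>();

        @Override
        public void exitLine(String text) {
            lines.add(text);
        }
    }

    // Fires the callbacks in the order a parse of the given lines would.
    static List<String> collect(String... inputLines) {
        LineCollector collector = new LineCollector();
        DemoListener listener = collector;
        listener.enterFile();
        for (String line : inputLines) {
            listener.enterLine(line);
            listener.exitLine(line);
        }
        listener.exitFile();
        return collector.lines;
    }
}
```

In a real project the equivalent of LineCollector would extend MyParserBaseListener and override exitLine(MyParser.LineContext ctx).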


Usage


The usage with a listener implementation called MyParserListenerImpl would be something like this:


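import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

// reader is a java.io.Reader over the input (assumed to exist here);
// the ANTLRInputStream constructor can throw an IOException
MyLexer lexer = new MyLexer(new ANTLRInputStream(reader));
CommonTokenStream tokens = new CommonTokenStream(lexer);

MyParser parser = new MyParser(tokens);
// Fail on the first syntax error rather than letting the parser recover and carry on
parser.addErrorListener(new BaseErrorListener() {

  @Override
  public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
    throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
  }
});

// The listener is called back while the parse is in progress
parser.addParseListener(new MyParserListenerImpl());
// parser.setTrace(true);
MyParser.FileContext f = parser.file();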