I've been working with ANTLR version 4, which is a refreshing change because it adds listeners and visitors. These let you separate the parser from the actions, which makes the code much more readable. They are also intended to make it easy to re-use a grammar with a different target language.
For the listener, ANTLR generates a Java interface and a base class that implements each of the methods as an empty stub. You can then extend the base class and override only the methods that are useful to you. This has the added benefit that your code still compiles when you add rules to your parser grammar and new listener methods appear.
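To make the stub idea concrete, here is a minimal plain-Java sketch of the pattern. SketchListener and SketchBaseListener are hand-written stand-ins invented for illustration. They are not the classes ANTLR generates, and the callbacks take a String rather than a parse-tree context.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerStubSketch {

    /** Hypothetical stand-in for a generated listener interface. */
    public interface SketchListener {
        void enterFile(String text);
        void exitFile(String text);
        void enterLine(String text);
        void exitLine(String text);
    }

    /** Hypothetical stand-in for the generated base class: every method is an empty stub. */
    public static class SketchBaseListener implements SketchListener {
        @Override public void enterFile(String text) { }
        @Override public void exitFile(String text) { }
        @Override public void enterLine(String text) { }
        @Override public void exitLine(String text) { }
    }

    /** Overrides only the one callback it needs; the rest fall through to the stubs. */
    public static class LineCollector extends SketchBaseListener {
        public final List<String> lines = new ArrayList<>();

        @Override
        public void exitLine(String text) {
            lines.add(text);
        }
    }

    /** Fires the callbacks by hand, in the order a walk over a two-line file would. */
    public static List<String> demo() {
        LineCollector collector = new LineCollector();
        collector.enterFile("whole file");
        collector.enterLine("01-02-2014");
        collector.exitLine("01-02-2014");
        collector.enterLine("42");
        collector.exitLine("42");
        collector.exitFile("whole file");
        return collector.lines;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If a new rule added another enter/exit pair to the interface, only the base class would need the extra stubs. LineCollector would compile unchanged.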
Maven ANTLR v4 Setup
To start, add the ANTLR version 4 runtime as a dependency in your Maven pom.xml, along with the plugin that compiles the grammars:
<project>
[...]
<dependencies>
[...]
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>4.3</version>
</dependency>
</dependencies>
<build>
[...]
<plugins>
<plugin>
<groupId>org.antlr</groupId>
<artifactId>antlr4-maven-plugin</artifactId>
<version>4.3</version>
<executions>
<execution>
<goals>
<goal>antlr4</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Create a src/main/antlr4 directory with sub-directories for the Java package you would like the resulting code to be in. ANTLR v4 lexer and parser grammar files should use the extension .g4.
Lexer and Grammar - Combined or Apart
You can write the lexer and parser as a combined grammar in one file, or keep them in separate files. While a combined grammar may be simpler to begin with, I would suggest keeping them separate to avoid confusing their roles.
As an example of where the confusion creeps in: with a combined grammar you can place string literals in a parser rule.
e.g.: myRule: 'Hello' WHITESPACE 'World!';
The result is that the string literals become implicit lexer rules, whereas WHITESPACE is an explicit lexer rule. The implicit lexer rules have priority over the explicit lexer rules.
I'm not a fan of implicit conversions and rules (within reason) that make code less readable.
There are also some nice features, such as lexer modes, which aren't available in a combined grammar.
Example Lexer Grammar (filename MyLexer.g4)
lexer grammar MyLexer;
fragment DIGIT
: [0-9]
;
DATE
: DIGIT DIGIT '-' DIGIT DIGIT '-' DIGIT DIGIT DIGIT DIGIT
;
INT
: DIGIT+
;
NEWLINE
: '\r'? '\n'
;
WHITESPACE
: (' ' | '\t') -> skip
;
Lexer rules always start with an uppercase letter, and the grammar here defines the tokens DATE, INT, NEWLINE and WHITESPACE.
The fragment prefix on DIGIT indicates that it will only be used within other lexer rules and will not become a token itself. DIGIT will therefore not be available for use in the parser grammar.
The WHITESPACE rule has a lexer command, indicated by the arrow. The skip command tells the lexer to discard the token and not pass it on to the parser. In this case it would discard all spaces and tabs.
The lexer prefers the rule that matches the longest run of input. When two rules match the same length, the order in which the rules appear in the file sets their precedence, with the rule appearing first winning.
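To see how longest match and rule order interact, here is a small plain-Java simulation using java.util.regex. It is a sketch of the selection rule only, not ANTLR's implementation. The YEAR rule is hypothetical (it is not in MyLexer.g4) and is only there to force a tie with INT.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerPrecedenceSketch {

    /** Builds an ordered rule table from name/regex pairs; insertion order stands in for file order. */
    public static Map<String, Pattern> rules(String... nameRegexPairs) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameRegexPairs.length; i += 2) {
            rules.put(nameRegexPairs[i], Pattern.compile(nameRegexPairs[i + 1]));
        }
        return rules;
    }

    /** Returns the name of the rule that wins at the start of the input, or null if nothing matches. */
    public static String firstToken(Map<String, Pattern> rules, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // Only a strictly longer match replaces the winner, so earlier rules keep ties.
            if (m.lookingAt() && m.end() > longest) {
                winner = rule.getKey();
                longest = m.end();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        String date = "[0-9]{2}-[0-9]{2}-[0-9]{4}";
        String year = "[0-9]{4}"; // hypothetical rule, not in MyLexer.g4
        String integer = "[0-9]+";

        // Longest match wins even though INT is listed first.
        System.out.println(firstToken(rules("INT", integer, "DATE", date), "12-03-2014"));
        // YEAR and INT both match four characters, so the earlier rule wins.
        System.out.println(firstToken(rules("YEAR", year, "INT", integer), "2014"));
        System.out.println(firstToken(rules("INT", integer, "YEAR", year), "2014"));
    }
}
```

The first call picks DATE regardless of where INT sits in the table, while the last two calls flip between YEAR and INT purely on declaration order.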
Example Parser Grammar (filename MyParser.g4)
parser grammar MyParser;
options { tokenVocab=MyLexer; }
file
: (line NEWLINE)*
EOF
;
line
: DATE
| INT
;
Parser rules always begin with a lowercase letter. The tokenVocab option ties the lexer and the parser together. If the file method in the generated parser is executed, it will look for zero or more lines, each consisting of either a DATE or an INT and followed by a NEWLINE.
EOF is a built-in token which matches the end of the file.
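To get a feel for what the file rule accepts, here is a hand-written recognizer over token names. It only sketches the shape of the rule and has nothing to do with the parser ANTLR generates. Representing tokens as plain strings is an assumption made for brevity.

```java
import java.util.Arrays;
import java.util.List;

public class FileRuleSketch {

    /** line : DATE | INT ; */
    static boolean isLine(String token) {
        return "DATE".equals(token) || "INT".equals(token);
    }

    /** file : (line NEWLINE)* EOF ; */
    public static boolean matchesFile(List<String> tokens) {
        int i = 0;
        // (line NEWLINE)* : consume pairs for as long as they keep matching
        while (i + 1 < tokens.size()
                && isLine(tokens.get(i))
                && "NEWLINE".equals(tokens.get(i + 1))) {
            i += 2;
        }
        // EOF must be the one remaining token
        return i == tokens.size() - 1 && "EOF".equals(tokens.get(i));
    }

    public static void main(String[] args) {
        System.out.println(matchesFile(Arrays.asList("DATE", "NEWLINE", "INT", "NEWLINE", "EOF")));
        System.out.println(matchesFile(Arrays.asList("EOF")));
        System.out.println(matchesFile(Arrays.asList("INT", "EOF")));
    }
}
```

The last case is rejected: as written, the rule requires a NEWLINE after every line, so a final line without a trailing newline does not match.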
Maven structure and generated code
The generated Java source should appear under the target/generated-sources directory. The file rule from the parser grammar will generate a method like:
public final FileContext file() throws RecognitionException {
in the MyParser.java file, which is the method you would call to parse the file. (The method name matches whatever you named the rule.)
This listener interface would be generated, along with a base class (MyParserBaseListener) implementing all of its methods as empty stubs:
public interface MyParserListener extends ParseTreeListener {
void enterFile(@NotNull MyParser.FileContext ctx);
void exitFile(@NotNull MyParser.FileContext ctx);
void enterLine(@NotNull MyParser.LineContext ctx);
void exitLine(@NotNull MyParser.LineContext ctx);
}
Usage
The usage with a listener implementation called MyParserListenerImpl would be something like this:
MyLexer lexer = new MyLexer(new ANTLRInputStream(reader));
CommonTokenStream tokens = new CommonTokenStream(lexer);
MyParser parser = new MyParser(tokens);
// Fail fast: turn the first syntax error into an exception
parser.addErrorListener(new BaseErrorListener() {
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
            int charPositionInLine, String msg, RecognitionException e) {
        throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
    }
});
// The listener is called back while the parse is running
parser.addParseListener(new MyParserListenerImpl());
// parser.setTrace(true); // uncomment to trace rule entry and exit while debugging
FileContext f = parser.file();