In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?
.+?
at the end of a lexer rule will always match a single character. But .+
will consume as much as possible, which was illegal at the end of a rule in ANTLR v3 (v4 probably as well).
What you can do is just match a single char, and "glue" these together in the parser:
unknowns : Unknown+ ;
...
Unknown : . ;
EDIT
... but I only have a lexer, no parsers ...
Ah, I see. Then you could override the nextToken()
method:
lexer grammar Lex;
@members {
public static void main(String[] args) {
Lex lex = new Lex(new ANTLRInputStream("foo, bar...\n"));
for(Token t : lex.getAllTokens()) {
System.out.printf("%-15s '%s'\n", tokenNames[t.getType()], t.getText());
}
}
private java.util.Queue<Token> queue = new java.util.LinkedList<Token>();
@Override
public Token nextToken() {
if(!queue.isEmpty()) {
return queue.poll();
}
Token next = super.nextToken();
if(next.getType() != Unknown) {
return next;
}
StringBuilder builder = new StringBuilder();
while(next.getType() == Unknown) {
builder.append(next.getText());
next = super.nextToken();
}
// The `next` will _not_ be an Unknown-token, store it in
// the queue to return the next time!
queue.offer(next);
return new CommonToken(Unknown, builder.toString());
}
}
Whitespace : [ \t\n\r]+ -> skip ;
Punctuation : [.,:;?!] ;
Unknown : . ;
Running it:
java -cp antlr-4.0-complete.jar org.antlr.v4.Tool Lex.g4 javac -cp antlr-4.0-complete.jar *.java java -cp .:antlr-4.0-complete.jar Lex
will print:
Unknown 'foo' Punctuation ',' Unknown 'bar' Punctuation '.' Punctuation '.' Punctuation '.'
The accepted answer works, but it only works for Java.
I converted the provided Java code for use with the C# ANTLR runtime. If anyone else needs it... here ya go!
@members {
private IToken _NextToken = null;
public override IToken NextToken()
{
if(_NextToken != null)
{
var token = _NextToken;
_NextToken = null;
return token;
}
var next = base.NextToken();
if(next.Type != UNKNOWN)
{
return next;
}
var originalToken = next;
var lastToken = next;
var builder = new StringBuilder();
while(next.Type == UNKNOWN)
{
lastToken = next;
builder.Append(next.Text);
next = base.NextToken();
}
_NextToken = next;
return new CommonToken(
originalToken
)
{
Text = builder.ToString(),
StopIndex = lastToken.Column
};
}
}