Are ">>"s in type parameters tokenized using a special rule?
The Java Language Specification, Java SE 10 Edition (§3.2, Lexical Translations) states:
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >.
The input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
Without the rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
Earlier versions of C++ had the same problem: before C++11, two adjacent closing angle brackets (>) in nested template arguments had to be separated by at least one space, as in vector<vector<int> >. Since C++11 the space is no longer required.
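To see the effect of the quoted rule from the programmer's side, here is a small sketch (the class and method names are made up for illustration). The same >> and >>> character sequences close nested type arguments in a type context and act as shift operators in an expression context:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not taken from the question or the specification.
class ShiftVsGenerics {
    // Type context: ">>" and ">>>" close nested type arguments, no spaces needed.
    static List<List<List<String>>> nested() {
        List<List<List<String>>> outer = new ArrayList<>();
        List<List<String>> middle = new ArrayList<>();
        List<String> inner = new ArrayList<>();
        inner.add("x");
        middle.add(inner);
        outer.add(middle);
        return outer;
    }

    // Expression context: ">>" is the signed right shift operator.
    static int signedShift(int v) {
        return v >> 2;
    }

    // Expression context: ">>>" is the unsigned right shift operator.
    static int unsignedShift(int v) {
        return v >>> 28;
    }

    public static void main(String[] args) {
        System.out.println(nested());           // [[[x]]]
        System.out.println(signedShift(-16));   // -4
        System.out.println(unsignedShift(-16)); // 15
    }
}
```

Both uses compile in the same file, which is only possible because the characters are interpreted differently depending on where they appear.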
Based on reading the code linked by @sm4, it looks like the strategy is:

1. Tokenize the input normally. So
       A<B<C>> i;
   would be tokenized as
       A, <, B, <, C, >>, i, ;
   which is 8 tokens, not 9.

2. During hierarchical parsing, when working on parsing generics and a > is needed, if the next token starts with > (that is, >>, >>>, >=, >>=, or >>>=), just knock the > off and push a shortened token back onto the token stream. Example: when the parser gets to
       >>, i, ;
   while working on the typeArguments rule, it successfully parses typeArguments, and the remaining token stream is now the slightly different
       >, i, ;
   since the first > of >> was pulled off to match typeArguments.
So although tokenization does happen normally, some re-tokenization occurs in the hierarchical parsing phase, if necessary.
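The two-step strategy described above can be sketched in a few dozen lines. This is only an illustration of the idea, not the code of the actual compiler: the class SplitAngleDemo, its methods, and the tiny grammar (identifiers, <, >, commas, and the > family of operators) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of "tokenize normally, split '>' tokens while parsing".
class SplitAngleDemo {
    // Ordered longest first, so the first operator that matches is the longest match.
    private static final String[] OPERATORS =
            {">>>=", ">>>", ">>=", ">>", ">=", ">", "<", ";", ","};

    // Step 1: ordinary longest-match tokenization that knows nothing about generics.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        outer:
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                tokens.add(src.substring(start, i));
                continue;
            }
            for (String op : OPERATORS) {
                if (src.startsWith(op, i)) {
                    tokens.add(op);
                    i += op.length();
                    continue outer;
                }
            }
            throw new IllegalArgumentException("Unexpected character: " + c);
        }
        return tokens;
    }

    // Step 2: the parser needs one '>'; if the token is longer, push the rest back.
    static void expectCloseAngle(Deque<String> stream) {
        String tok = stream.pop();
        if (!tok.startsWith(">")) {
            throw new IllegalStateException("Expected '>' but found " + tok);
        }
        if (tok.length() > 1) {
            stream.push(tok.substring(1)); // e.g. ">>" leaves ">" at the front
        }
    }

    // type := Identifier [ '<' type { ',' type } '>' ]
    static void parseType(Deque<String> stream) {
        stream.pop(); // the identifier; not validated in this sketch
        if ("<".equals(stream.peek())) {
            stream.pop();
            parseType(stream);
            while (",".equals(stream.peek())) {
                stream.pop();
                parseType(stream);
            }
            expectCloseAngle(stream);
        }
    }

    // Parses one type from the front of src and returns the tokens left over.
    static List<String> remainingAfterType(String src) {
        Deque<String> stream = new ArrayDeque<>(tokenize(src));
        parseType(stream);
        return new ArrayList<>(stream);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A<B<C>> i;"));         // [A, <, B, <, C, >>, i, ;]
        System.out.println(remainingAfterType("B<C>> i;")); // [>, i, ;]
    }
}
```

The second call mirrors the example in the answer: after the inner type arguments are parsed, the >> token has been replaced by a single > at the front of the stream. Parsing the full A<B<C>> i; consumes that > as well and leaves only i and ;.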