What is a token?
The following is taken from the TeXbook:
When TeX reads a line of text from a file, or a line of text that you entered directly on your terminal, it converts that text into a list of "tokens." A token is either (a) a single character with an attached category code, or (b) a control sequence. For example, if the normal conventions of plain TeX are in force, the text
{\hskip 36 pt}
is converted into a list of eight tokens:
{₁ \hskip 3₁₂ 6₁₂ ␣₁₀ p₁₁ t₁₁ }₂
The subscripts here are the category codes, as listed earlier: 1 for "beginning of group," 12 for "other character," and so on. The \hskip doesn't get a subscript, because it represents a control sequence token instead of a character token. Notice that the space after \hskip does not get into the token list, because it follows a control word.
It is important to understand the idea of token lists, if you want to gain a thorough understanding of TeX, and it is convenient to learn the concept by thinking of TeX as if it were a living organism. The individual lines of input in your files are seen only by TeX's "eyes" and "mouth"; but after that text has been gobbled up, it is sent to TeX's "stomach" in the form of a token list, and the digestive processes that do the actual typesetting are based entirely on tokens. As far as the stomach is concerned, the input flows in as a stream of tokens, somewhat as if your TeX manuscript had been typed all on one extremely long line.
You should remember two chief things about TeX's tokens: (1) A control sequence is considered to be a single object that is no longer composed of a sequence of symbols. Therefore long control sequence names are no harder for TeX to deal with than short ones, after they have been replaced by tokens. Furthermore, spaces are not ignored after control sequences inside a token list; the ignore-space rule applies only in an input file, during the time that strings of characters are being tokenized. (2) Once a category code has been attached to a character token, the attachment is permanent. For example, if character { were suddenly declared to be of category 12 instead of category 1, the characters {₁ already inside token lists of TeX would still remain of category 1; only newly made lists would contain {₁₂ tokens. In other words, individual characters receive a fixed interpretation as soon as they have been read from a file, based on the category they have at the time of reading. Control sequences are different, since they can change their interpretation at any time. TeX's digestive processes always know exactly what a character token signifies, because the category code appears in the token itself; but when the digestive processes encounter a control sequence token, they must look up the current definition of that control sequence in order to figure out what it means.
For a list of these category codes, as reference, see What are category codes?
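Point (2) of the quoted passage can be checked with a few lines of plain TeX. What follows is only a minimal sketch, assuming the plain TeX conventions are in force; the macro name \frozen is made up for this illustration.

```tex
% Plain TeX sketch. \frozen is a hypothetical macro name.
\def\frozen{{\it stored}}       % the inner braces are stored as {_1 and }_2
\catcode`\{=12 \catcode`\}=12   % braces read from now on get category 12
\frozen\ roman again            % stored braces still group: only "stored" is italic
\catcode`\{=1 \catcode`\}=2     % restore the usual category codes
\bye
```

The two \catcode lines need no braces themselves, which is why the input remains readable to TeX while braces are temporarily "other characters".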
The description given in the Wikipedia page is good for languages such as C, but not for TeX. The principle, however, is similar: TeX inputs sequences of bytes and transforms them into objects that it can operate on, which are called tokens.
The precise rules are described in chapter 7 of the TeXbook and chapter 2 of TeX by Topic. Let me try and give a short description of the process.
1. TeX reads one record at a time (a line in the input file, more or less) and discards the end-of-record terminator along with all spaces (characters with ASCII code 32) that immediately precede it; then it appends its internally defined end-of-record byte (the byte corresponding to \endlinechar, usually 13, the code of "carriage return"). For the sake of simplicity, I'll assume that \endlinechar=13 and that the category code of byte 13 is 5.
2. After this normalization step it resumes working on the line byte by byte. To each byte (character, if you prefer), it associates a category code, that is, a number from 0 to 15, according to the rules explained at the end.
3. First of all, bytes with category code 10 which are at the start of a line are ignored. If only the end-of-record remains, TeX inserts in its main token list the token \par, otherwise (if it's not reading the first input line) it inserts a space token (a pair (32,10), see later). TeX is in "skipping blanks state".
4. When it finds a byte with category code different from 10 and 5, TeX enters a new state, its normal one: every byte with category code different from 0, 9, 14 or 15 becomes a character token, that is, a pair (charcode, catcode), for subsequent processing. Bytes with catcode 9 or 15 are ignored (in the latter case an error message is issued), while a byte with category code 14 causes TeX to ignore it along with everything that remains on the line, including the end-of-record byte.
5. When it finds a byte with category code 0, usually \, TeX enters a new state, where it forms a symbolic token. If the byte following \ has category code different from 11, a token is formed having that byte as its name (examples: \1, \% and so on) and TeX resumes the previous normal state.
6. If the byte following \ has category code 11 (a letter, normally), TeX starts forming a control word: it goes along until a byte with category code different from 11 is found. The bytes thus stored form the name of this control word, which becomes a symbolic token, and TeX returns to the skipping blanks state.
Thus tokens can be either a (charcode, catcode) pair or a symbolic token. Note that this process doesn't look at the meaning of these tokens, which is examined only at a later stage.
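The effect of these states can be observed directly. The following plain TeX input is a small sketch, assuming the usual plain TeX category codes; it uses only the \TeX macro that plain TeX provides, and the comments say what each line contributes to the output.

```tex
% Plain TeX sketch, assuming the usual category codes.
\TeX    nical   % blanks after a control word are skipped: "TeXnical"
\TeX\ nical     % the control symbol "\ " yields an explicit space: "TeX nical"
a% a comment swallows the rest of the line, end-of-record byte included
b               % so this "b" joins the "a" above: "ab"

c               % the empty line above has become \par, so "c" starts a new paragraph
\bye
```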
What's very important to know is when this happens. Actually TeX's processing of the input is a combination of tokenization and macro expansion. TeX looks at the input whenever it needs some token to operate on. For instance, if TeX is supplied a symbolic token that's expandable (macro or primitive), it reads and tokenizes the input until the arguments of this expandable token are found.
This is why \verb cannot be in the argument of another command. Suppose we have \mbox{\verb|{ab}|} in our input. TeX finds the symbolic token \mbox and recognizes it's expandable (because it's a macro defined with \def) and that it requires an argument:
\def\mbox#1{\leavevmode\hbox{#1}}
Therefore TeX reads the input to get another token, which happens to be { (precisely the ({,1) pair), and so it knows it must read up to the matching } to determine the argument. During this process, tokens are formed as described above, so the argument is the following sequence of tokens
\verb |₁₂ {₁ a₁₁ b₁₁ }₂ |₁₂
and this disrupts the working of \verb, which has to temporarily change the category codes of bytes; that is now impossible because they have already been stored as tokens.
Conversely, when \verb|{ab}| is not in the argument to a command, \verb operates in a delayed fashion.
\verb has one argument, the delimiter character; its first operation is opening a group with \bgroup and changing the category code of | (in this example) to 2, so that it will match the inserted \bgroup and undo all assignments of category codes (including changing the catcode of | to 2).
It changes all category codes of non-letters to 12 (except for |), so they become printable (well, this is not the complete truth, but a good approximation to it) and lets TeX go along tokenising: no symbolic token will be formed, because no byte has category code 0 in this setting. When the final | is found, it closes the group, undoing the category code assignments, and TeX returns to its normal operations.
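The same timing issue can be reproduced without \verb. The sketch below is plain TeX, and \id is a hypothetical one-argument macro introduced only for this illustration: a category code change works when the affected character is still unread, and it comes too late once the character has been absorbed as part of an argument.

```tex
% Plain TeX sketch. \id is a hypothetical macro that just returns its argument.
\def\id#1{#1}

% Here $ is still unread when \catcode is executed, so it becomes ($,12)
% and is typeset as an ordinary character.
{\catcode`\$=12 Price: $5}

% Here the whole argument is tokenized before \catcode is executed, so $ has
% already become ($,3), a math shift token; TeX enters math mode and an
% error follows.
\id{{\catcode`\$=12 Price: $5}}
\bye
```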
The rules for assigning category codes
TeX maintains a catcode vector, that is, an array with 256 slots numbered from 0 to 255; each slot contains an integer in the range 0–15. When TeX reads a character from the input file (we're at the second step in the first list above), it knows its ASCII code so it looks in the catcode vector and assigns this character the corresponding category code. An entry in this vector can be modified by
\catcode <integer> = <4 bit integer>
This vector is suitably extended when XeTeX or LuaTeX, which understand the full Unicode range, are used; however, the possible category codes are still 16:
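As a small illustration of such an assignment, the following plain TeX sketch makes @ a letter for the duration of two definitions. The macro names \my@helper and \usehelper are hypothetical; the only assumption about the environment is that @ has category code 12 beforehand, as it does in a plain TeX document.

```tex
% Plain TeX sketch. \my@helper and \usehelper are hypothetical names.
\catcode`\@=11                 % @ now counts as a letter
\def\my@helper{internal text}  % one control word: \my@helper
\def\usehelper{\my@helper}     % \my@helper is stored here as a single token
\catcode`\@=12                 % back to "other character"
\usehelper                     % still works: the stored token is unaffected
\message{\the\catcode`\@}      % writes 12 on the terminal
\bye
```

Typing \my@helper directly after the last \catcode assignment would no longer name the same macro, because the bytes would be tokenized as \my followed by @ and the letters of "helper".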
0 means “escape”; usually only one character has this code, the backslash \, but in some circumstances it's necessary to temporarily use another one;
1 means “beginning of group”, usually {; such a character starts a group or delimits an argument;
2 means “end of group”, usually };
3 means “math shift”, usually $;
4 means “tabulation”, usually &;
5 means “end-of-line”; usually we have \catcode\endlinechar=5, see above for the role of this category code, which is decisive for the well-known feature that an empty line is equivalent to ending a paragraph;
6 means “parameter”, usually #; it's obviously the one used in macro definitions;
7 means “superscript”, usually ^; it also has another usage in combinations such as ^^M, but it would take too long to explain (it can be found in TeX by Topic or the TeXbook);
8 means “subscript”, usually _;
9 means “ignored”; usually byte 0 has this category code;
10 means “space”; usually bytes 32 and 9 (space and tab) share this category code;
11 means “letter”; printable letters have this code;
12 means “other printable character”; punctuation symbols, for instance;
13 means “active”; another feature that takes long to explain; the ~ is usually active; such a character behaves like a macro and needs to be suitably defined;
14 means “comment”; usually %; look at step 4 above for its role;
15 means “invalid”; usually byte 127 has this code.
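The "usually" in this list can be checked against a given format by asking TeX for the current entries of the catcode vector. A minimal sketch for plain TeX follows; the values in the comments are the ones plain TeX sets up, and other formats may differ (the labels inside \message are arbitrary text).

```tex
% Plain TeX sketch: print some entries of the catcode vector on the terminal.
\message{backslash=\the\catcode`\\}   % 0, escape
\message{percent=\the\catcode`\%}     % 14, comment
\message{tilde=\the\catcode`\~}       % 13, active
\message{tab=\the\catcode`\^^I}       % 10, space
\message{letter a=\the\catcode`\a}    % 11, letter
\message{byte 127=\the\catcode127}    % 15, invalid
\bye
```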
Why categories 9 and 15? When TeX was being developed, more than three decades ago, some operating systems used fixed-length records; this usage came from punch cards. These records were padded using byte 0. Byte 127 could mean “don't count this, it was set by error”, but it was also used in communications with teletypes for denoting a backspace.
Until 2019, tabs were also discarded at the end of a line in most TeX implementations, but this is no longer the case, which makes TeX behave as in the original implementations.