Character bytes and character tokens: If newlines are converted to spaces, then where does catcode 5 come into the picture?
There are never any tokens with catcode 5.
Initex sets up
\catcode`\^^M=5
But that acts in a similar way to
\catcode`\%=14
which makes % catcode 14, but there are no tokens with that catcode. If a character with catcode 14 is scanned, then that character and the rest of the line are discarded.
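To see the catcode 14 behaviour in isolation, here is a minimal plain TeX sketch. The choice of ! as a second comment character is only an assumption for illustration; it is not something IniTeX or plain TeX sets up.

```tex
% Plain TeX sketch: inside a group, give ! the comment catcode (14).
\begingroup
\catcode`\!=14
abc! everything from the ! to the end of this line is discarded
def
\endgroup
\bye
```

Because the discarded part includes the end-of-line character, no space is produced between the two lines, so this should typeset abcdef as one word.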
A character of catcode 5 generates a space character and puts TeX's scanner in a special mode that causes immediately following characters of catcode 10 to be discarded ("white space at the beginning of a line") and a following character of catcode 5 to be tokenised as \par, not as a space token ("a blank line is equivalent to \par").
So note that the first newline makes a space and subsequent ones make \par, so a blank line is usually equivalent to space\par.
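One way to watch the blank line turn into \par is to redefine \par locally so that it reports itself. This is only a sketch for plain TeX; the message text is made up, and \endgraf is plain TeX's saved copy of the primitive \par.

```tex
% Plain TeX sketch: make every \par token announce itself on the terminal.
\begingroup
\def\par{\message{[par token seen]}\endgraf}
one

two
\endgroup
\bye
```

The message should appear once, triggered by the blank line between the two words: the first end-of-line (after "one") only contributes a space, and the second one becomes the \par token.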
Characters have a category code; they can generate a character token during the tokenization phase, but they need not.
Category codes serve two purposes: they are looked at during tokenization (when TeX absorbs text from an input file or the terminal) and also during token list processing.
Only characters with category code 1, 2, 3, 4, 6, 7, 8, 10, 11, 12 or 13 can generate character tokens (with the same category code):
- 1: begin group
- 2: end group
- 3: math shift
- 4: alignment
- 6: parameter
- 7: superscript
- 8: subscript
- 10: space
- 11: letter
- 12: other character
- 13: active character
Characters with category code 0, 5, 9, 14 or 15 will never generate a character token with the same category code; there is no way a character token with one of those category codes can get through to TeX's internal token processor:
- a character with category code 0 triggers the formation of a control sequence;
- a character with category code 9 is ignored;
- a character with category code 15 raises an error and is then ignored;
- a character with category code 14 tells the tokenization processor to ignore it together with all other characters on the line.
More interesting is category code 5, which is the object of your question. When TeX finds one, it discards whatever remains on the input line, generates a space character with character code 32 and category code 10, as if it had been on the line to begin with, and puts the scanner in the special state of ignoring blank spaces (category code 10) until it comes to something different. If that is another character of category code 5, TeX generates a \par token; otherwise it enters the normal state.
Note the emphasis on space character above: this space character is tokenized according to the normal rules, so it is ignored if it follows a control word (like \foo) but not after a control symbol (like \~).
A consequence of this is that the following three inputs (the third one spans two lines)
\foo\baz
\foo \baz
\foo
\baz
are completely equivalent. Note that if the end-of-line in the last input generated a space token, there would be a difference. Instead, the end-of-line yields a space character that is not yet tokenized, and that character is skipped after a control word.
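The difference between a control word and a control symbol at the end of a line can be checked with a small plain TeX sketch. The definition of \foo is an assumption made up for the example; \% is plain TeX's control symbol for a percent sign.

```tex
% Plain TeX sketch; \foo is a hypothetical macro defined only for this test.
\def\foo{A}
% Control word at the end of a line: the end-of-line contributes no space.
\foo
B

% Control symbol at the end of a line: the end-of-line contributes a space.
\%
B
\bye
```

The first paragraph should come out as AB with no space, the second as a percent sign, a space and B.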
Note. What was said above about ignored characters might be misleading when it comes to control word formation. The formation of a control word starts with a category code 0 character followed by one of category code 11. Any character with a category code different from 11 stops the scanning and causes the control word formed so far to be tokenized; the character is then examined anew (and, for instance, ignored if it has category code 9).
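A short plain TeX sketch of this interaction, assuming (purely for illustration) that * is given category code 9 and that \foo is a macro defined for the test:

```tex
% Plain TeX sketch: * gets catcode 9 (ignored) inside a group.
\begingroup
\catcode`\*=9
\def\foo{A}
% The * ends the name \foo because its catcode is not 11;
% it is then examined again and ignored.
\foo*B
\endgroup
\bye
```

This should typeset AB: the ignored character still terminates the control word, so TeX does not go looking for a macro whose name continues past \foo.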
Addendum about XeTeX and LuaTeX. When a UTF-8 encoded file is fed to the Unicode-aware engines, it's immaterial whether a character is one, two, three or four bytes long in its UTF-8 representation. These two engines perform a preliminary step that transforms UTF-8 byte sequences into Unicode characters, so what the tokenization processor sees is just one character (with its category code as assigned in the initialization table). The two engines are also able to cope with UTF-16 or UTF-32, little or big endian.
TeX does read input line by line: A line of input will be read and processed. Then another line of input will be read and processed. ...
One of the first things that TeX does after reading a line of input is to convert the characters from the computer platform's character encoding scheme to the TeX engine's internal character encoding scheme. With traditional TeX engines the internal character encoding scheme is ASCII, the American Standard Code for Information Interchange. With TeX engines based on LuaTeX or XeTeX the internal character encoding scheme is Unicode, of which ASCII is a strict subset.
After that TeX deletes any space character at the right end of the line. More precisely: TeX deletes any character at the right end of the line whose character code is 32. (32 is the number of the code point of the space character both in ASCII and in Unicode.)
Then TeX inserts a character at the right end of the line whose character code equals the value of the integer parameter \endlinechar.
Usually the value of \endlinechar is 13 and the category code of character 13 (return character) is 5 (end of line).
This implies that, when tokenizing a line, TeX usually encounters a character whose category code is 5 (end of line) on reaching the end of that line.
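The role of \endlinechar can be observed with a small plain TeX sketch. It assumes the usual plain TeX defaults; the second part relies on the fact that TeX appends no character at all when \endlinechar is negative.

```tex
% Plain TeX sketch: report the current settings on the terminal.
\message{endlinechar=\the\endlinechar, catcode=\the\catcode\endlinechar}
\begingroup
% The new value applies from the next line read, since this line already got its return.
\endlinechar=-1
abc
def
\endgroup
\bye
```

With the defaults described above, the message should report 13 and 5. Inside the group the lines abc and def are read without any end-of-line character, so they should be typeset as abcdef with no space in between.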
Then TeX starts tokenizing the line. I.e., TeX "looks" at the characters which the line contains and produces control sequence tokens and character tokens according to the category code table and according to the state of the reading apparatus.
At the time of reading and tokenizing input, the reading apparatus of TeX can be in one of three states:
State S: Skipping blanks. The reading apparatus will be in state S
- after processing a character from the input whose category code is 10 (space), unless the reading apparatus is in state N at that time.
- after processing a sequence of two equal characters of category code 7 (superscript) followed by two characters that form, in lowercase hexadecimal notation, the character code of a character whose category code is 10 (space).
  [Example: Usually the category code of ^ is 7 (superscript) while the category code of character 32 (space character; hex 20) usually is 10 (space). Therefore the notation ^^20 usually is treated like processing the character 32 (space character) from the input, whose category code usually is 10 (space).]
- after processing a sequence of two equal characters of category code 7 (superscript) followed by a character whose character code is in the range from 64 to 127, provided that the category code of the character whose character code is obtained by subtracting 64 is 10 (space).
  [Example: As the category code of ^ usually is 7 (superscript) and the character code of ` is 96, while 96-64=32 and the category code of character 32 (space character) usually is 10 (space), the notation ^^` usually is treated like processing the character 32 (space character) from the input, whose category code usually is 10 (space).]
- after processing a sequence of two equal characters of category code 7 (superscript) followed by a character whose character code is in the range from 0 to 63, provided that the category code of the character whose character code is obtained by adding 64 is 10 (space).
- after producing a control word token.
- after producing a control symbol token whose name is formed by a character of category code 10 (space), e.g., after producing the control symbol token \␣ (control space).
While in state S, processing a character whose category code is 10 (space), or a ^^-sequence (<superscript-char><superscript-char>..) considered equivalent to such a character, produces no token and does not change the state of the reading apparatus.
Usually the space character (character code 32) and the horizontal tab character (character code 9) are the only characters whose category code is 10 (space).
That's why several consecutive space characters or horizontal tab characters in the input usually yield only one space token, which in turn yields horizontal glue for only one horizontal space if TeX is in one of the modes where space tokens yield horizontal glue (i.e., in horizontal mode or restricted horizontal mode, but not in vertical mode, internal vertical mode, math mode or display math mode).
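The ^^-notations for the space character and the skipping of blanks in state S can be tried out with the following plain TeX sketch, which assumes the usual category codes (^ has 7, character 32 has 10):

```tex
% Plain TeX sketch: three ways of writing "a, one space, b".
% Hexadecimal notation for character 32:
a^^20b

% Backquote is character 96, and 96-64=32:
a^^`b

% After the first space token the scanner is in state S, so the rest is skipped:
a^^20   ^^20 b
\bye
```

Each of the three paragraphs should be typeset as a and b separated by a single interword space.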
State M: Middle of line. The reading apparatus will be in state M
- after producing a non-space character token.
- after producing a control symbol token whose name is formed by a character which is not of category code 10 (space).
While in state M, processing a character whose category code is 10 (space), or a ^^-sequence (<superscript-char><superscript-char>..) considered equivalent to such a character, produces a space token, i.e., a character token whose charcode is 32 (space character) and whose catcode is 10 (space), and switches the reading apparatus to state S.
State N: New line. The reading apparatus is in state N when about to start reading another line of input. While in state N, processing a character whose category code is 10 (space), or a ^^-sequence (<superscript-char><superscript-char>..) considered equivalent to such a character, produces no token and does not change the state of the reading apparatus.
If TeX encounters a character of category code 5(end of line) while the reading apparatus is in state S, TeX will not produce any token at all.
If TeX encounters a character of category code 5 (end of line) while the reading apparatus is in state M, TeX will produce a space token, i.e., a character token whose charcode is 32 (space character) and whose catcode is 10 (space).
If TeX encounters a character of category code 5 (end of line) while the reading apparatus is in state N, TeX will produce the control word token \par.
After encountering a character of category code 5 (end of line), TeX will, no matter what state the reading apparatus is in, drop any further information on the current line and start reading another line of input. In doing so, the reading apparatus of TeX is switched to state N.
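The three cases can be probed from within TeX by peeking at the token that the end of line turns into. The following is only a plain TeX sketch; \report, \peek and \peekafter are hypothetical helper names introduced for this test, not standard macros.

```tex
% Plain TeX sketch; the three helper macros are made up for this test.
\def\report{\message{[\meaning\next]}}
\def\peek{\futurelet\next\report}
\def\peekafter#1{\futurelet\next\report}
% State S: the end of line after a control word produces no token,
% so the peeked token is the x on the following line.
\peek
x

% State M: the end of line after the argument x produces a space token.
\peekafter x
y

% State N: the end of line of an empty line produces \par.
\peek

\bye
```

The terminal should show, in this order, the letter x, a blank space and \par as the meanings of the peeked tokens.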
Due to the \endlinechar mechanism described above, an empty line after a non-empty line usually makes TeX process two consecutive return characters (character code 13) whose category code is 5 (end of line).
When TeX encounters the first of these return characters, which is in the non-empty line, the reading apparatus may be in state S or in state M, so the first one may yield no token at all or a space token.
In any case, after encountering the first of these return characters, the reading apparatus is switched to state N.
Therefore, when TeX encounters the second of these return characters, the reading apparatus is in state N and the second return character yields the control word token \par.
That's why an empty line is usually treated like a "paragraph break", i.e., like the control word token \par, which usually (if not redefined) is a directive for breaking into lines and typesetting as a paragraph of text the material gathered so far.
This means that the last thing within a paragraph can be a space token yielding horizontal glue. Be aware that such horizontal glue at the end of a paragraph usually gets discarded by TeX, and that glue according to the value of the glue parameter \parfillskip gets attached at the end of the paragraph.