TeX rules for determining a \par after a comment
As TeX reads your file, it maintains a state
which can be one of these three:
State N (for new_line): This is the state in which TeX starts at the beginning of each line in the input.
State M (for mid_line): This is the most common state.
State S (for skip_blanks): This is like State M, except that blanks are ignored.
Among other things (see more details in Section 2.5 of TeX by Topic and pages 46–47 of The TeXbook, or procedure get_next from §343 onwards in the TeX program), some relevant interactions are as follows:
When in State N,
spaces (catcode 10) are ignored [§345] (thus: leading spaces on each line are ignored),
end-of-line character (catcode 5) results in a \par token [§347→§351],
comment character (catcode 14) "finishes the line", i.e. results in the rest of the line (including the end-of-line character) being ignored, thus (unless file ends) TeX will start on the next line in state N again [§347→§350],
most other characters (letters, etc.) result in going into State M [§347]
When in State M,
spaces (catcode 10) result in a space token and going into state S [§347→§349]
end-of-line character (catcode 5) results in a space token, and (unless file ends) TeX will start on the next line in state N again [§347→§348]
comment character (catcode 14) "finishes the line" as above, i.e. results in the rest of the line (including the end-of-line character) being ignored, thus (unless file ends) TeX will start on the next line in state N again [§347→§350]
When in State S,
spaces (catcode 10) are ignored [§345]
end-of-line character (catcode 5) "finishes the line" as above, thus (unless file ends) TeX will start on the next line in state N again [§347→§350]
comment character (catcode 14) "finishes the line" as above, thus (unless file ends) TeX will start on the next line in state N again [§347→§350]
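The rules above can be modelled as a small state machine. The following Python sketch is only an illustration, not TeX's actual get_next: the function name tokenize is made up, the default catcodes are assumed (space = 10, % = 14, end-of-line = 5), and control sequences, braces and the ^^ notation are left out entirely.

```python
def tokenize(lines):
    """Model states N, M and S; return a list of character, ' ' and '\\par' tokens."""
    tokens = []
    for line in lines:
        state = "N"                      # every line starts in state N
        for ch in line + "\r":           # "\r" stands for the end-of-line character
            if ch == "%":                # comment: finish the line, emit nothing
                break
            if ch == "\r":               # end-of-line character
                if state == "N":
                    tokens.append("\\par")
                elif state == "M":
                    tokens.append(" ")
                break                    # in state S nothing is emitted
            if ch == " ":                # space character
                if state == "M":
                    tokens.append(" ")
                    state = "S"
                continue                 # ignored in states N and S
            tokens.append(ch)            # letters and other characters
            state = "M"
    return tokens


for sample in (["a", "b"], ["a%", "b"], ["a", "", "b"], ["a%", "", "b"]):
    print(sample, "->", tokenize(sample))
```

Under this model, the lines a%, an empty line, and b give the tokens a, \par, b, followed by one space token that comes from the end-of-line after b.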
(Added later): If that was too long, a possible summary:
First imagine that all lines in your input file have all leading blanks removed, so that each line is either empty or begins with a non-space character.
Encountering a comment character "teleports" you to the beginning of the next line (so that the end-of-line is never encountered).
When TeX is at the start of the line and the end-of-line is encountered (in other words, if the line, after removing leading blanks, is empty), this results in a \par token.
Otherwise, an end-of-line is equivalent to a space.
Consecutive spaces are equivalent to one space.
So with that, let's look at all 7 of your examples:
ab
— The above produces “ab” in the output; nothing interesting going on.
a b
— The above produces “a b” in the output: after seeing the space TeX goes into state S, so a    b (with several spaces between the two letters)
would also produce the same output.
a
b
— The above produces “a b” in the output: when the end-of-line is seen, a space is emitted (the end-of-line rule for State M above). Note that
a
    b
(with leading spaces on the second line) would also produce the same “a b”.
a%
b
— The above produces “ab” in the output: when the %
is seen, TeX finishes the line and goes into state N (and then M when b
is seen), exactly as if ab
had been seen on the same line. Again, leading spaces on the second line would still result in “ab”.
a
%
b
— The above produces “a b” in the output: when the end-of-line (after a
) is seen, a space is generated and TeX goes into state N, then when %
is seen on the second line, TeX ignores the rest of the line (including the end-of-line), and is in state N again on the third line, when b
is seen.
a

b
— The above produces two paragraphs in the output: when the end-of-line is seen (after a
) TeX goes into state N, then when it sees another end-of-line in state N it generates a \par
token and starts on the third line in state N, then the b
is seen.
a%

b
— The above also produces two paragraphs in the output: when the %
is seen, TeX discards the rest of the line and starts on the second line in state N. Now it sees an end-of-line (in state N) which generates a \par
token, after which TeX is on the third line and b
is seen.
Note that spaces on an otherwise blank line do not interfere with its interpretation as \par, so the rule isn't as simple as consuming consecutive newline characters. That's because TeX ignores leading spaces on lines.
The actual rule is that TeX generates the \par
token when it encounters the end of a line while it is still ignoring leading spaces on the line. Two consequences of this are that \par
is produced even if the first end-of-line is buried in a comment; and that n consecutive newlines (perhaps with spaces mixed in) produce n-1 \par
tokens, not n/2.
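The n-1 count can be checked with a short sketch. It is a simplified model rather than TeX itself: count_par_tokens is a hypothetical helper, only spaces and tabs are treated as blanks, and default catcodes are assumed.

```python
def count_par_tokens(text):
    """Count the \\par tokens this model predicts for `text`."""
    pars = 0
    for line in text.splitlines():
        # A line yields \par when its end is reached while leading blanks
        # are still being skipped, i.e. when it holds nothing but blanks.
        if line.strip(" \t") == "":
            pars += 1
    return pars


for n in range(1, 6):
    text = "a" + "\n" * n + "b"
    print(n, "newlines ->", count_par_tokens(text), "\\par tokens")
```

A comment on the first line changes nothing here: the text a%, an empty line, then b still contains exactly one line made only of blanks, so the model still reports one \par.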
As a complement to the other technical answers, I'll add my thoughts. You seem to be under the impression that %
“consumes the next newline character”.
This is not the best way to look at the issue. In TeX there are end-line characters. Yes, the ASCII name for character 10 is “newline” or “line feed”, for character 13 it is “carriage return”.
However, TeX uses a different approach. When it was written, operating systems had very different ideas about what constitutes “end of record” in a text file.
Some used “newline”, some “carriage return”, some a combination of the two in either order, some nothing at all (they had fixed-length records, filling the blanks with character 0, “null”).
The last type is the reason for category code 9 (ignored): here's an excerpt from plain.tex, with line numbers added for clarity:
24 % We had to define the \catcodes right away, before the message line,
25 % since \message uses the { and } characters.
26 % When INITEX (the TeX initializer) starts up,
27 % it has defined the following \catcode values:
28 % \catcode`\^^@=9 % ascii null is ignored
29 % \catcode`\^^M=5 % ascii return is end-line
30 % \catcode`\\=0 % backslash is TeX escape character
31 % \catcode`\%=14 % percent sign is comment character
32 % \catcode`\ =10 % ascii space is blank space
33 % \catcode`\^^?=15 % ascii delete is invalid
34 % \catcode`\A=11 ... \catcode`\Z=11 % uppercase letters
35 % \catcode`\a=11 ... \catcode`\z=11 % lowercase letters
36 % all others are type 12 (other)
As you see, line 29 says end-line. The notation `\^^M means “character number 13”, because M is ASCII 77 and 77 − 64 = 13.
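The arithmetic behind the ^^ notation can be written out as a tiny sketch. The helper name caret_code is made up. The rule it encodes is the one TeX uses for ^^ followed by a single 7-bit character: add 64 to codes below 64 and subtract 64 otherwise. The two-hex-digit form of the notation is not covered.

```python
def caret_code(ch):
    """Character code denoted by ^^ch (single-character form, 7-bit codes only)."""
    c = ord(ch)
    if c >= 128:
        raise ValueError("only 7-bit characters are handled in this sketch")
    return c + 64 if c < 64 else c - 64


for ch in "@M?":
    print("^^" + ch, "=", caret_code(ch))
```

The three characters tried here are the ones in the excerpt: ^^@ is the null character, ^^M is the return character, and ^^? is the delete character.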
Since operating systems had (and still have) those different ideas, TeX leaves it to the implementor for each specific system to tell the program what the end-of-record signal is (see footnote 1).
When TeX reads a record (a line, in other terminology), it throws away the end-of-record signal (if the OS uses it) together with spaces preceding it and whatever is on the line past it. Then it puts in its place the character whose code is the current value of \endlinechar (default value 13).
Note that up to this point no process of conversion to tokens has taken place. This happens after the complete line has been read in. If, during tokenization, TeX finds a character with category code 14 (comment), it throws away whatever remains on the line and switches to the next one.
The states described in the other answers have to do with the tokenization phase.
It's sufficient to change the way you think about %: it consumes the rest of the current line, end-of-line character included. A blank line generates a \par token irrespective of what precedes it. A blank line is one that contains only characters of category code 10 (spaces or tabs) or 9 before a category code 5 character is found.
For instance, the following code will produce just one paragraph:
\endlinechar`a
bc

ef
\end%
The output will be a single line containing bcaaefa, because there is no blank line according to the definition above.
By the way, the final % is needed; otherwise TeX will stop with an error saying that \end is undefined. Actually the undefined control sequence is \enda, but TeX never shows the current \endlinechar.
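The line-reading step and the \endlinechar example can be imitated in a few lines. This is a rough sketch under stated assumptions: prepare_line and visible_text are hypothetical names, the end-of-record signal is taken to be \n or \r\n, and % is the only comment character. The first line of the example, which sets \endlinechar, is left out because it has already been read under the old setting. The second record is the empty line between bc and ef.

```python
def prepare_line(record, endlinechar=13):
    """Imitate how TeX turns one input record into a line before tokenization."""
    # Drop the end-of-record signal, then any spaces just before it.
    line = record.rstrip("\r\n").rstrip(" ")
    # Append the character whose code is \endlinechar; a value outside
    # 0..255 means that nothing is appended.
    if 0 <= endlinechar <= 255:
        line += chr(endlinechar)
    return line


def visible_text(records, endlinechar):
    """Join the prepared lines, dropping everything from a % onwards."""
    return "".join(prepare_line(r, endlinechar).split("%")[0] for r in records)


records = ["bc\n", "\n", "ef\n", "\\end%\n"]
print([prepare_line(r, ord("a")) for r in records])
print(visible_text(records, ord("a")))
```

In this model the characters in front of \end are bcaaefa, and the empty record turns into a line holding the single letter a, which is why no \par appears. Calling prepare_line on the last record without its % gives \enda.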
Footnote.
1 TeX Live implementations recognize the most common end-of-record signals, be they newline (Unix), carriage return (legacy macOS) or the combination carriage return/newline (legacy DOS), based on what they find at the beginning of an input file.