Python lexical analysis - logical line & compound statements
Python's grammar
Fortunately there is a Full Grammar specification in the Python documentation.
A statement is defined in that specification as:
stmt: simple_stmt | compound_stmt
And a logical line is delimited by NEWLINE (that's not stated in the grammar specification, but it follows from your question).
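To make the NEWLINE delimiter a bit more tangible, here is a minimal sketch using the standard tokenize module. The helper name line_breaks and the two sample sources are my own, purely for illustration. It lists only the line-break tokens: NEWLINE ends a logical line, while NL marks a physical line break that does not end one.

```python
import io
import tokenize

def line_breaks(source):
    """Return the names of the line-break tokens (NEWLINE or NL) in source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tokenize.tok_name[tok.type]
        for tok in tokens
        if tok.type in (tokenize.NEWLINE, tokenize.NL)
    ]

# Two small statements joined by ';' share a single logical line.
one_line = "a = 1; b = 2\n"
# Parentheses join three physical lines into one logical line.
joined = "total = (1 +\n         2 +\n         3)\n"

print(line_breaks(one_line))
print(line_breaks(joined))
```

The first source should show a single NEWLINE, and the second should show two NL tokens followed by one NEWLINE, so both are exactly one logical line.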
Step-by-step
Okay, let's go through this. What's the specification for a simple_stmt?
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
Now it branches into several different paths, and it probably doesn't make sense to go through all of them separately. Based on the specification, though, a simple_stmt could cross logical line boundaries if any of the small_stmts contained a NEWLINE (currently none of them do, but they could).
Apart from that purely theoretical possibility, there is the compound_stmt:
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
[...]
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
[...]
suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT
I picked only the if statement and suite because that already suffices. The if statement, including elif and else and all of the content in these, is one statement (a compound statement). And because it may contain NEWLINEs (if the suite isn't just a simple_stmt), it already fulfills the requirement of "a statement that crosses logical line boundaries".
An example if (schematic):

    if 1:
        100
        200

would be:
if_stmt
|---> test --> 1
|---> NEWLINE
|---> INDENT
|---> expr_stmt --> 100
|---> NEWLINE
|---> expr_stmt --> 200
|---> NEWLINE
|---> DEDENT
And all of this belongs to the if statement (it's not just a block "controlled" by the if or while, ...).
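One way to check that claim without reading the grammar is to ask the standard ast module how many top-level statements it sees. This is only a sketch: the names a, b and x are placeholders of mine, and end_lineno needs Python 3.8 or later.

```python
import ast

# A hypothetical if/elif/else; a, b and x are placeholder names.
source = (
    "if a < b:\n"
    "    x = 1\n"
    "elif a == b:\n"
    "    x = 2\n"
    "else:\n"
    "    x = 3\n"
)

tree = ast.parse(source)   # parsing only, nothing is executed
node = tree.body[0]        # the first (and only) top-level statement

print(len(tree.body))
print(type(node).__name__, node.lineno, node.end_lineno)
```

The module body should contain exactly one statement, an If node whose line span covers the elif and else branches as well.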
The same if with parser, symbol and token
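A way to visualize that would be to use the built-in parser, token and symbol modules (really, I hadn't known about these modules before I wrote the answer). Note that parser and symbol were deprecated in Python 3.9 and removed in Python 3.10, so the code below needs Python 3.9 or earlier: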
    import symbol
    import parser
    import token

    s = """
    if 1:
        100
        200
    """
    st = parser.suite(s)

    def recursive_print(inp, level=0):
        # Walk the nested lists of the parse tree; one dot per nesting level.
        for item in inp:
            if isinstance(item, int):
                # Map numeric ids to grammar symbol names or token names.
                print('.'*level, symbol.sym_name.get(item, token.tok_name.get(item, item)), sep="")
            elif isinstance(item, list):
                recursive_print(item, level+1)
            else:
                # Anything else is the literal source text of a token.
                print('.'*level, repr(item), sep="")

    recursive_print(st.tolist())
Actually, I cannot explain most of the parser result, but it shows (if you remove a lot of unnecessary lines) that the suite, including its newlines, really belongs to the if_stmt. The indentation represents the "depth" of the parser at a specific point.
file_input
.stmt
..compound_stmt
...if_stmt
....NAME
....'if'
....test
.........expr
...................NUMBER
...................'1'
....COLON
....suite
.....NEWLINE
.....INDENT
.....stmt
...............expr
.........................NUMBER
.........................'100'
.......NEWLINE
.....stmt
...............expr
.........................NUMBER
.........................'200'
.......NEWLINE
.....DEDENT
.NEWLINE
.ENDMARKER
That could probably be made much prettier, but I hope it serves as an illustration even in its current form.
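Because parser and symbol no longer exist in Python 3.10 and later, here is a rough substitute that still runs there. It only uses the tokenize module, so it shows the flat token stream rather than the parse tree, but the NEWLINE, INDENT and DEDENT tokens that sit inside the if statement are still visible. The helper name token_pairs is my own.

```python
import io
import tokenize

s = """
if 1:
    100
    200
"""

def token_pairs(source):
    """Return (token type name, token text) pairs for source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]

pairs = token_pairs(s)
for name, text in pairs:
    print(name, repr(text))
```

Everything between the 'if' NAME token and the DEDENT token, including the three NEWLINE tokens, is what the grammar assigns to the single if_stmt.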
It's simpler than you think. A compound statement is considered a single statement, even though it may have other statements inside. Quoting the docs:
Compound statements contain (groups of) other statements; they affect or control the execution of those other statements in some way. In general, compound statements span multiple lines, although in simple incarnations a whole compound statement may be contained in one line.
For example,

    if a < b:
        do_thing()
        do_other_thing()

is a single if statement occupying 3 logical lines. That's how a statement can cross logical line boundaries.
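As a quick check of both halves of the quoted paragraph, the sketch below counts top-level statements (with ast) and logical lines (NEWLINE tokens, with tokenize) for the example above and for a one-line variant. The helper name describe is hypothetical, and do_thing / do_other_thing are never called because the source is only parsed.

```python
import ast
import io
import tokenize

def describe(source):
    """Return (top-level statement count, logical line count) for source."""
    statements = len(ast.parse(source).body)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    logical_lines = sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)
    return statements, logical_lines

multi_line = "if a < b:\n    do_thing()\n    do_other_thing()\n"
single_line = "if a < b: do_thing()\n"

print(describe(multi_line))
print(describe(single_line))
```

Both variants should report one statement; the first spans three logical lines and the second just one.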