Valid characters in a Python class name
What makes this interesting is that the first character of an identifier is special. The digits '0' through '9' are valid anywhere after the first character, but a digit must not be the first character.
Here's a function that will return a valid identifier given any random string of characters. Here's how it works:
First, we use itr = iter(seq) to get an explicit iterator on the input. Then there is a first loop, which uses the iterator itr to look at characters until it finds a valid first character for an identifier. Then it breaks out of that loop and runs the second loop, using the same iterator (which we named itr) for the second loop. The iterator itr keeps our place for us; the characters the first loop pulled out of the iterator are still gone when the second loop runs.
def gen_valid_identifier(seq):
    # get an iterator
    itr = iter(seq)
    # pull characters until we get a legal one for first in identifier
    for ch in itr:
        if ch == '_' or ch.isalpha():
            yield ch
            break
    # pull remaining characters and yield legal ones for identifier
    for ch in itr:
        if ch == '_' or ch.isalpha() or ch.isdigit():
            yield ch

def sanitize_identifier(name):
    return ''.join(gen_valid_identifier(name))
This is a clean and Pythonic way to handle a sequence two different ways. For a problem this simple, we could just have a Boolean variable that indicates whether we have seen the first character yet or not:
def gen_valid_identifier(seq):
    saw_first_char = False
    for ch in seq:
        if not saw_first_char and (ch == '_' or ch.isalpha()):
            saw_first_char = True
            yield ch
        elif saw_first_char and (ch == '_' or ch.isalpha() or ch.isdigit()):
            yield ch
I don't like this version nearly as much as the first version. The special handling for one character is now tangled up in the whole flow of control, and this will be slower than the first version as it has to keep checking the value of saw_first_char constantly. But this is the way you would have to handle the flow of control in most languages! Python's explicit iterator is a nifty feature, and I think it makes this code a lot better.
Looping on an explicit iterator is just as fast as letting Python implicitly get an iterator for you, and the explicit iterator lets us split up the loops that handle the different rules for different parts of the identifier. So the explicit iterator gives us cleaner code that also runs faster. Win/win.
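The place-keeping behavior that the two-loop version relies on can be seen in isolation. The following is only a minimal sketch: consume_demo is a hypothetical helper name, not part of the answer's code, and it uses list() in place of a second for loop.

```python
def consume_demo(text):
    """Show that one explicit iterator is shared by two consecutive loops."""
    itr = iter(text)
    skipped = []
    first = None

    # First loop: pull characters until one is a letter, then stop.
    for ch in itr:
        if ch.isalpha():
            first = ch
            break
        skipped.append(ch)

    # Second pass: list() resumes exactly where the first loop stopped.
    remaining = list(itr)
    return skipped, first, remaining


print(consume_demo("12ab3"))  # (['1', '2'], 'a', ['b', '3'])
```

Nothing consumed by the first loop reappears in the second pass, which is why no flag variable is needed.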
As per the Python Language Reference, §2.3, "Identifiers and keywords", a valid Python identifier (in Python 2, and within the ASCII range in Python 3) is defined as:
(letter|"_") (letter | digit | "_")*
Or, in regex:
[a-zA-Z_][a-zA-Z0-9_]*
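As a rough sketch, the regex above could be used to test a candidate name like this. The helper name is_ascii_identifier is an assumption for illustration, and the check covers only the ASCII rule; it does not reject reserved keywords.

```python
import re

# The ASCII-only identifier pattern quoted above.
ASCII_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_ascii_identifier(name):
    """Return True if the whole string matches the ASCII identifier rule."""
    # fullmatch() requires the entire string to match, so trailing
    # characters such as a newline are rejected as well.
    return ASCII_IDENTIFIER.fullmatch(name) is not None


print(is_ascii_identifier("_private1"))  # True
print(is_ascii_identifier("1abc"))       # False
```

Using fullmatch() rather than match() avoids accepting strings that merely begin with a valid identifier, such as "foo-bar".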
Python 3
Python Language Reference, §2.3, "Identifiers and keywords"
The syntax of identifiers in Python is based on the Unicode standard annex UAX-31, with elaboration and changes as defined below; see also PEP 3131 for further details.
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Python 3.0 introduces additional characters from outside the ASCII range (see PEP 3131). For these characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.
Identifiers are unlimited in length. Case is significant.
identifier ::= xid_start xid_continue*
id_start ::= <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue ::= <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start ::= <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::= <all characters in id_continue whose NFKC normalization is in "id_continue*">
The Unicode category codes mentioned above stand for:
- Lu - uppercase letters
- Ll - lowercase letters
- Lt - titlecase letters
- Lm - modifier letters
- Lo - other letters
- Nl - letter numbers
- Mn - nonspacing marks
- Mc - spacing combining marks
- Nd - decimal number
- Pc - connector punctuations
- Other_ID_Start - explicit list of characters in PropList.txt to support backwards compatibility
- Other_ID_Continue - likewise
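In Python 3, the category of a character and its validity in an identifier can be inspected from the standard library. This is only a sketch: describe and is_usable_name are hypothetical helper names, and prefixing "a" to test a continuation character is an assumed shortcut that relies on str.isidentifier() rather than on the grammar directly.

```python
import keyword
import unicodedata


def describe(ch):
    """Return (general category, valid as first char, valid as later char)."""
    category = unicodedata.category(ch)
    can_start = ch.isidentifier()
    # Prefix a known-valid start character to test ch as a continuation.
    can_continue = ("a" + ch).isidentifier()
    return category, can_start, can_continue


def is_usable_name(name):
    """A valid identifier that is not also a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# "\u00e9" is 'é' and "\u00b2" is the superscript two.
for ch in ["A", "_", "7", "\u00e9", "\u00b2", "-"]:
    print(repr(ch), describe(ch))

print(is_usable_name("Klass"), is_usable_name("class"))  # True False
```

Note that the superscript two is in category No, so it is not allowed in an identifier even though str.isdigit() returns True for it; a filter built on isalpha() and isdigit() can therefore let through characters that the Python 3 grammar rejects.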
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
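The effect of NFKC normalization can be observed with a name containing a compatibility character. The sketch below assumes the ligature U+FB01 as the example and uses exec() purely for demonstration.

```python
import unicodedata

# "\ufb01le" is the ligature U+FB01 (LATIN SMALL LIGATURE FI) followed by "le".
ligature_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", ligature_name)
print(normalized == "file")  # True

# Assign to the ligature spelling in a scratch namespace; the parser
# stores the name in its NFKC-normalized form.
namespace = {}
exec(ligature_name + " = 1", namespace)
print("file" in namespace)         # True
print(ligature_name in namespace)  # False
```

Two spellings that normalize to the same NFKC form therefore refer to the same name.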
A non-normative HTML file listing all valid identifier characters for Unicode 4.1 can be found at https://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
Python 2
Python Language Reference, §2.3, "Identifiers and keywords"
Identifiers (also referred to as names) are described by the following lexical definitions:
identifier ::= (letter|"_") (letter | digit | "_")*
letter ::= lowercase | uppercase
lowercase ::= "a"..."z"
uppercase ::= "A"..."Z"
digit ::= "0"..."9"
Identifiers are unlimited in length. Case is significant.