Exotic names for methods, constants, variables and fields - Bug or Feature?

The question mentions class names in the title, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones.

Case-insensitive identifiers (class and function/method names)

The general guideline here is to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version, and this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

<?php
function func_á() { echo "worked"; }
func_Á();

Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:

$ LANG=en_US.iso88591 php a.php
worked
$ LANG=en_US.utf8 php a.php

Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3

Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales. See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.
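The collision risk can be sketched by simulating the per-byte lowercasing in plain PHP. The helper `lowercase_bytes_cp1252()` below is hypothetical, and it hard-codes only a small excerpt of Windows-1252 casing rules as an assumption; real results depend on what the C library's tolower() returns in the active locale, so treat this as an illustration rather than a description of the engine.

```php
<?php
// Hypothetical helper: lowercases a string byte by byte, the way a
// single-byte locale would. Only an excerpt of Windows-1252 rules is
// modelled here (an assumption made for illustration).
function lowercase_bytes_cp1252(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b >= 0x41 && $b <= 0x5A) {
            $b += 0x20;                     // ASCII A-Z
        } elseif ($b >= 0xC0 && $b <= 0xDE && $b !== 0xD7) {
            $b += 0x20;                     // accented capitals, skipping the multiplication sign
        } elseif ($b === 0x8A || $b === 0x8C || $b === 0x8E) {
            $b += 0x10;                     // S-caron, OE ligature, Z-caron
        }
        $out .= chr($b);
    }
    return $out;
}

$nameA = "func_\xC3\x8A"; // "Ê" (U+00CA) encoded as UTF-8
$nameB = "func_\xC3\x9A"; // "Ú" (U+00DA) encoded as UTF-8

$foldedA = lowercase_bytes_cp1252($nameA);
$foldedB = lowercase_bytes_cp1252($nameB);

var_dump($nameA === $nameB);     // bool(false): the names differ
var_dump($foldedA === $foldedB); // bool(true): the folded byte streams collide
```

Under these assumed rules, the trail byte 0x8a of the first name is "lowercased" to 0x9a, which is exactly the trail byte of the second name, so two distinct identifiers end up with the same normalized form.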

Case-sensitive identifiers (variables, constants, fields)

The problem is less serious here, since these identifiers are case sensitive. However, they are just interpreted as bytestreams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.

In fact, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4 (0xc0 to 0xfd in the original definition, before it was restricted to code points up to U+10FFFF) and the trail bytes are in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" (U+0120) in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the second byte will be interpreted as a space.
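The byte-level argument can be checked with a short script. This is a minimal sketch: `LABEL_PATTERN` and `is_label_bytes()` are names made up for this example, and the pattern is the byte range the manual has long documented for labels (an assumption; check the current manual, which may state a slightly different lower bound). No /u modifier is used, so the pattern works on raw bytes.

```php
<?php
// Assumed label pattern, applied to raw bytes (no /u modifier).
const LABEL_PATTERN = '/\A[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/';

// Hypothetical helper: true if every byte fits the label pattern.
function is_label_bytes(string $bytes): bool
{
    return preg_match(LABEL_PATTERN, $bytes) === 1;
}

$utf8    = "\xC4\xA0"; // "Ġ" (U+0120) as UTF-8: lead byte + trail byte
$utf16be = "\x01\x20"; // "Ġ" (U+0120) as UTF-16BE: 0x01 followed by a space

var_dump(is_label_bytes($utf8));    // bool(true)
var_dump(is_label_bytes($utf16be)); // bool(false)
```

The UTF-8 form passes because both of its bytes are above 0x7f, while the UTF-16BE form fails on the very first byte and contains 0x20, which the parser reads as whitespace.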

Reading multi-byte characters as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:

<?php
declare(encoding='ISO-8859-1');
// code here
?>

It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

  • Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (it also seems to keep the original version in memory), and it spends some CPU time converting the encoding.
  • Multi-byte support is usually not compiled in, so it's less tested (more bugs).
  • Portability issues between installations that have the support compiled in and those that don't.
  • Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

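Finally, there is the problem of lack of normalization – the same character may be represented with different sequences of Unicode code points (independently of the encoding). For example, "é" can be the single code point U+00E9 or the letter "e" followed by the combining acute accent U+0301. Since PHP compares identifiers byte by byte, the two forms name different things, which may lead to bugs that are very difficult to track down.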
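A minimal sketch of the normalization problem, using variable variables so that the exact bytes of each name are visible. The variable names `$precomposed` and `$decomposed` are chosen for this example only.

```php
<?php
// Two byte sequences that both render as "é".
$precomposed = "\xC3\xA9";  // U+00E9, one code point
$decomposed  = "e\xCC\x81"; // U+0065 followed by combining U+0301

// Create a variable whose name is the precomposed form.
${$precomposed} = 'assigned';

var_dump(isset(${$precomposed})); // bool(true)
var_dump(isset(${$decomposed}));  // bool(false): different bytes, different variable

// Show the underlying bytes of both names.
echo bin2hex($precomposed), ' vs ', bin2hex($decomposed), "\n";
```

If the intl extension happens to be available, Normalizer::normalize() could be used to bring strings to one form before comparing them, but that only helps for names handled as strings at run time, not for identifiers typed directly into source files.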


Your character is encoded as 0x80 0x90 0xe2 or something like that, so it matches your regexp when the input is processed as single bytes rather than interpreted as Unicode.
