What is POSIX awk's stance on null byte in variables/printf?
There are at least 4 relevant pieces of text in the POSIX.2018 specification of awk:
Emphasis (bold text) is mine in all the quoted text below:
Input files to the awk program from any of the following sources shall be text files
That means that if the input contains NUL characters (which would make it non-text as per the POSIX definition of text), then the behaviour is unspecified.
\ddd : A <backslash> character followed by the longest sequence of one, two, or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behavior is undefined.
So \000 results in undefined behaviour.
About regexp matching:
However, in all awk ERE matching, the use of one or more NUL characters in the pattern, input record, or text string produces undefined results
About printf/sprintf:
7. For the c conversion specifier character: if the argument has a numeric value, the character whose encoding is that value shall be output. If the value is zero or is not the encoding of any character in the character set, the behavior is undefined.
So, that's another way to get a NUL character that leads to undefined behaviour.
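To see what a particular implementation actually does with such a NUL, a small probe can help. This is only a sketch: probe_nul is a hypothetical helper name, and because the behaviour is undefined, the bytes you see depend entirely on the awk being tested. An implementation that preserves NUL would be expected to show 61 00 62, while others may truncate or drop the byte.

```shell
# Hypothetical helper: dump the bytes an awk emits for printf "%c" with 0.
# Usage: probe_nul [awk-command]   (defaults to "awk")
# The result is undefined per POSIX, so no particular output is guaranteed.
probe_nul() {
    "${1:-awk}" 'BEGIN { printf "a%cb", 0 }' | od -An -tx1
}

probe_nul
```

Passing a command name, as in probe_nul gawk or probe_nul mawk, lets you compare the implementations installed on your system.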
So, to sum up, in awk, POSIX tells us you can't use the NUL character portably, whether it's for input, output or to store in its variables.
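If all you need is NUL-delimited output, one way to stay within what POSIX specifies is to keep NUL out of awk altogether and let tr, which POSIX does require to handle NUL, add it afterwards. A minimal sketch, assuming the items themselves contain no newline characters; nul_join is a hypothetical helper name, not something from the specification:

```shell
# Hypothetical helper: awk prints one item per line (plain text only),
# then tr converts each newline into a NUL outside of awk.
# Assumption: no item contains an embedded newline.
nul_join() {
    awk '{ print $1 }' | tr '\n' '\000'
}

# Example: first field of each input line, NUL-terminated, shown as bytes.
printf 'a 1\nb 2\n' | nul_join | od -An -c
```

This does not help when awk has to read or store NUL itself; it only covers the case where NUL is needed as an output delimiter.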
gawk (since at least 2.10 in 1989, which is the earliest version I could find where NUL support is documented) and @ThomasDickey's mawk (since version 20140914) are two implementations that can deal with NUL.