'inf' in awk not working the way '-inf' does
The actual task is best solved by initializing your max/min values not with an imaginary "smallest" or "greatest" number (which may not be implemented in the framework you are using, in this case awk), but with actual data. That way, it is always guaranteed to provide a meaningful result.
In your case, you can use the very first value you encounter (i.e. the entry in the first line) to initialize max and min, respectively, by adding a rule
NR==1{min=$1}
to your awk script. Then, if the first value is already the minimum, the subsequent test will not overwrite it, and in the end the correct result will be produced. The same holds for searches of the maximum value, so in combined searches, you can state
NR==1{max=min=$1}
As for the reason why your approach with inf didn't work with awk whereas -inf seemed to, @steeldriver has provided a good explanation in a comment to your question, which I will also summarize for the sake of completeness:
- In awk, variables are "dynamically typed", i.e. everything can be a string or a number depending on use (but awk will "remember" what it was last used as and keep that information along for use in the next operation).
- Whenever arithmetic operations involving a variable are found in the code, awk will try to interpret the content of that variable as a number and perform the operation, from where on the variable is typed as numerical if successful.
- The default value for any variable that has not yet been assigned anything is the empty string, which is interpreted as 0 in arithmetic operations.
- The variable name(*) inf has no special meaning in awk, hence when used just so, it is an empty variable that will evaluate to 0 in an arithmetic expression such as -inf. Therefore, the "maximum search" with the max variable initialized to -inf works if your data is all positive, because -inf is simply 0 (and as such, the smallest non-negative number).
- In the "minimum search" problem, however, initializing min to inf will initialize the variable to the empty string, as no arithmetic operation is present that would warrant an automatic conversion of that empty string to a number. Therefore, in the later comparisons
if ($1<min) min=$1
the input, $1, is compared with a string value, which is why awk treats $1 as a string, too, and performs a lexicographical comparison rather than a numerical one. However, lexicographically, nothing is "smaller" than the empty string, and so min never gets assigned a new value. Therefore, in the END section, the statement print min prints the (still) empty string.
(*) see Stephen Kitt's answer on how a string with content "inf" can actually have a meaning in awk.
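The observable effect described in the list above can be reproduced with a few lines of shell. This is only a sketch: the sample values are invented for illustration, and the behaviour was reasoned through for common implementations such as gawk (default mode) and mawk, so other awks might differ.

```shell
# Invented sample data: three positive numbers, one per line.
data='3
1
2'

# max = -inf: inf is an unset variable, so -inf evaluates to 0,
# and every positive input value is larger than that.
max=$(printf '%s\n' "$data" |
    awk 'BEGIN { max = -inf } $1 > max { max = $1 } END { print max }')

# min = inf: min stays empty, none of the values compares smaller,
# so min is never assigned and an empty line is printed.
min=$(printf '%s\n' "$data" |
    awk 'BEGIN { min = inf } $1 < min { min = $1 } END { print min }')

echo "max=[$max] min=[$min]"
```

With positive data like this, the maximum comes out as 3 while the brackets for the minimum stay empty.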
Your approach doesn’t work because inf doesn’t have a special meaning in GNU AWK in its default non-POSIX mode. As a result, it’s interpreted as a variable name, and since the variable hasn’t been set to anything, its value is 0 in an arithmetic context, and the empty string in a string context. Thus your code will only find the maximum value if it’s positive (since max is initialised in an arithmetic context), and won’t find the minimum value (since min is initialised in a string context); see AdminBee’s answer for details.
To determine the minimal and/or maximal values in a file (or stream), you should follow the advice given in AdminBee’s answer.
However, if you’re using GNU AWK, you can calculate log(0), which yields negative infinity (so -log(0) is positive infinity), to initialise your variables, and use that in a manner similar to your approach:
BEGIN { max = log(0) }
$1 > max { max = $1 }
END { print max }

BEGIN { min = -log(0) }
$1 < min { min = $1 }
END { print min }
The only advantage of this approach over initialising the values from the first line is that it provides distinctive results when no values are processed: positive or negative infinity end up being reliable indicators that no value was seen. (There are other ways to determine this, including checking for an empty string as opposed to 0 when initialising from the first line.)
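One such alternative, sketched here under the assumption that the input carries one number per line in the first field: seed both variables from the first record and let the END block look at NR, which is still 0 if no record was ever read. The function name minmax and the "no data" message are made up for this illustration.

```shell
# Hypothetical helper: print "min max" for the first field of stdin,
# or "no data" when the input is empty.
minmax() {
    awk '
        NR == 1  { min = max = $1 }   # seed both from real data
        $1 < min { min = $1 }
        $1 > max { max = $1 }
        END {
            if (NR == 0) print "no data"   # no record was read at all
            else         print min, max
        }
    '
}

printf '3\n-7\n12\n' | minmax
printf '' | minmax
```

The first call should report -7 12, the second one no data, so the empty case is distinguishable without relying on infinity.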
With GNU AWK in POSIX mode (POSIXLY_CORRECT=1), or other POSIX-compliant AWK interpreters such as mawk, providing "inf" as a string in an arithmetic context produces infinity, thanks to strtod:
BEGIN { max = "-inf" + 0 }
$1 > max { max = $1 }
END { print max }

BEGIN { min = "+inf" + 0 }
$1 < min { min = $1 }
END { print min }
There are, in fact, three ways to write infinity: -inf, +inf and inf, and, to add more complexity to an issue that should be easy, in awk there are quoted and unquoted code constants.
To show what I mean, try this (shell code; awk is version 4.2.1, current in Debian 10):
for cmd in original-awk "busybox awk" mawk nawk awk; do
    printf '%-6.5s' "$cmd"
    $cmd 'BEGIN {
        a="-inf";b="+inf";c="inf";
        d= -inf ;e= +inf; f= inf;
        printf "-∞%4s%4s +∞%4s%4s ∞%4s%4s | -∞%4s%4s +∞%4s%4s ∞%4s%4s\n",a,a+0,b,b+0,c,c+0,d,d+0,e,e+0,f,f+0}
    ' file
done
To get:
bawk -∞-inf-inf +∞+inf inf ∞ inf inf | -∞ 0 0 +∞ 0 ∞ 0
busyb -∞-inf-inf +∞+inf inf ∞ inf inf | -∞ 0 0 +∞ 0 0 ∞ 0
mawk -∞-inf-inf +∞+inf inf ∞ inf inf | -∞ 0 0 +∞ 0 0 ∞ 0
nawk -∞-inf-inf +∞+inf inf ∞ inf 0 | -∞ 0 0 +∞ 0 0 ∞ 0
gawk -∞-inf-inf +∞+inf inf ∞ inf 0 | -∞ 0 0 +∞ 0 0 ∞ 0
The table presents quoted and unquoted assignment to variables (a to f). For each case, it shows the value as read by awk and as converted to a number (var+0).
That says that a "-inf" stays as it is even when numeric, a "+inf" gets converted to a numeric inf (without sign), and a quoted "inf" might become either inf or 0 depending on the implementation (it's 0 in nawk and gawk).
When unquoted, both -inf and +inf become 0 (except in bawk, where +∞ is understood as the empty string "" and converts to 0).
Oddly enough, when unquoted, all inf are interpreted as the empty string. But all unquoted -inf, +inf and inf become 0 when used as var+0.
So, for what you meant to do, you need quoted "-inf" and "+inf", never inf:
cat file | awk ' BEGIN { max = "-inf"+0; min = "+inf"+0 }
{ if ($1>max) max=$1
if ($1<min) min=$1
}
END { print min, max }
'
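Since implementations disagree on these conversions, it may be worth probing the awk at hand before relying on the quoted form. A minimal sketch, in which the function name has_inf and the two messages are made up for illustration:

```shell
# Hypothetical probe: succeeds only if this awk turns the string "+inf"
# into a value larger than any ordinary finite number (1e308 is used
# here as a large but still finite double).
has_inf() {
    awk 'BEGIN { x = "+inf" + 0; exit !(x > 1e308) }'
}

if has_inf; then
    echo "quoted +inf is usable"
else
    echo "fall back to initialising from the first line"
fi
```

If the conversion yields 0 instead of infinity, the comparison fails and the probe returns a non-zero exit status.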
Maybe an easier (not portable) way to understand it is to execute:
gawk 'BEGIN{
a="-inf";b="+inf";c="inf";
d= -inf ;e= +inf; f= inf;
print a, typeof(a), b, typeof(b), c, typeof(c)
print a+0, typeof(a+0), b+0, typeof(b+0), c+0, typeof(c+0)
print d,typeof(d),e,typeof(e),f,typeof(f)
print d+0,typeof(d+0),e+0,typeof(e+0),f+0,typeof(f+0)
}'
Which will print:
-inf string +inf string inf string
-inf number inf number 0 number
0 number 0 number unassigned
0 number 0 number 0 number
Of course, the correct and portable solution is to give a value to the max and min variables right from the start:
cat file | awk ' NR==1 { min = max = $1 }
{ if ($1>max) max=$1
if ($1<min) min=$1
}
END { print min, max }
'
---
The description from the gawk manual is:
- With the --posix command-line option, gawk becomes “hands off.” String values are passed directly to the system library’s strtod() function, and if it successfully returns a numeric value, that is what’s used. By definition, the results are not portable across different systems. They are also a little surprising:
  $ echo influence | gawk --posix '{ print $1 + 0 }'
  -| inf
  $ echo 0xDeadBeef | gawk --posix '{ print $1 + 0 }'
  -| 3735928559
- Without --posix, gawk interprets the four string values ‘+inf’, ‘-inf’, ‘+nan’, and ‘-nan’ specially, producing the corresponding special numeric values. The leading sign acts a signal to gawk (and the user) that the value is really numeric. Hexadecimal floating point is not supported (unless you also use --non-decimal-data, which is not recommended). For example:
  $ echo nanny | gawk '{ print $1 + 0 }'
  -| 0
  $ echo +nan | gawk '{ print $1 + 0 }'
  -| +nan
  $ echo 0xDeadBeef | gawk '{ print $1 + 0 }'
  -| 0
gawk ignores case in the four special values. Thus, ‘+nan’ and ‘+NaN’ are the same.
Besides handling input, gawk also needs to print “correct” values on output when a value is either NaN or infinity. Starting with version 4.2.2, for such values gawk prints one of the four strings just described: ‘+inf’, ‘-inf’, ‘+nan’, or ‘-nan’. Similarly, in POSIX mode, gawk prints the result of the system’s C printf() function using the %g format string for the value, whatever that may be.