Command-line utility to print statistics of numbers in Linux
Using "st" (https://github.com/nferraz/st)
$ st numbers.txt
N min max sum mean stddev
10 1 10 55 5.5 3.02765
Or:
$ st numbers.txt --transpose
N 10
min 1
max 10
sum 55
mean 5.5
stddev 3.02765
(DISCLAIMER: I wrote this tool :))
For the average, median & standard deviation you can use awk. This will generally be faster than R solutions. For instance, the following will print the average:
awk '{a+=$1} END{print a/NR}' myfile
(NR is an awk variable for the number of records, $1 means the first (whitespace-separated) field of the line ($0 would be the whole line, which would also work here but in principle would be less safe, although for the computation awk would probably just take the leading number anyway), and END means that the following commands will be executed after having processed the whole file (one could also have initialized a to 0 in a BEGIN{a=0} statement).)
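The median is mentioned above but not shown. A minimal sketch, assuming one number per line in the first field: sort the values first, then let awk pick the middle one (or average the two middle ones). The function name median is made up for this example; it is not an awk or coreutils command.

```shell
#!/bin/sh
# median FILE: print the median of the first whitespace-separated field.
# "median" is a hypothetical helper name used only for this sketch.
median() {
    sort -n "$1" | awk '
        { v[NR] = $1 }    # store every value; input arrives already sorted
        END {
            if (NR == 0) exit 1                                 # empty input: no median
            if (NR % 2) print v[(NR + 1) / 2]                   # odd count: middle value
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2   # even count: mean of the two middle values
        }'
}

# Example: the numbers 1..10 in a temporary file
tmp=$(mktemp)
seq 1 10 > "$tmp"
median "$tmp"    # prints 5.5
rm -f "$tmp"
```

Unlike the one-pass average, this needs the whole input sorted and held in memory, so it will be slower on very large files.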
Here is a simple awk script which provides more detailed statistics (takes a CSV file as input, otherwise change FS):
#!/usr/bin/awk -f
BEGIN {
    FS = ",";
}
{
    a += $1;        # running sum of the first field
    b[++i] = $1;    # keep every value for the second pass
}
END {
    m = a/NR;       # mean
    for (i in b)
    {
        d += (b[i]-m)^2;
        e += (b[i]-m)^3;
        f += (b[i]-m)^4;
    }
    va = d/NR;              # variance (population form, divides by N)
    sd = sqrt(va);          # standard deviation
    sk = (e/NR)/sd^3;       # skewness
    ku = (f/NR)/sd^4-3;     # excess kurtosis (0 for a normal distribution)
    print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
    print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
}
It is straightforward to add min/max to this script, but it is just as easy to pipe sort & head/tail:
sort -n myfile | head -n1
sort -n myfile | tail -n1
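If sorting the file twice feels wasteful, min and max can also be tracked in a single awk pass. This is only a sketch, assuming the first field of every line is numeric; the function name minmax is made up for illustration.

```shell
#!/bin/sh
# minmax [FILE...]: print "min max" of the first field in one pass.
# "minmax" is a hypothetical helper name used only for this sketch.
minmax() {
    awk '
        NR == 1  { min = max = $1 }    # seed both with the first value
        $1 < min { min = $1 }          # new smallest value seen so far
        $1 > max { max = $1 }          # new largest value seen so far
        END      { if (NR) print min, max }
    ' "$@"
}

# Example: read five numbers from standard input
printf '%s\n' 7 3 10 1 5 | minmax    # prints: 1 10
```

The same three pattern lines could be pasted into the main block of the statistics script above, with min and max added to its final print.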
This is a breeze with R. For a file that looks like this:
1
2
3
4
5
6
7
8
9
10
Use this:
R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"
To get this:
       V1
 Min.   : 1.00
 1st Qu.: 3.25
 Median : 5.50
 Mean   : 5.50
 3rd Qu.: 7.75
 Max.   :10.00
[1] 3.02765
- The -q flag squelches R's startup licensing and help output
- The -e flag tells R you'll be passing an expression from the terminal
- x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
- Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
- But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
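For comparison, if R isn't available, roughly the same per-column result can be sketched in awk. This assumes a comma-separated file with no header and only numeric fields. It uses the n - 1 (sample) denominator, which is what R's sd() uses, so it prints 3.02765 for the file above. The function name colsd is made up for this example.

```shell
#!/bin/sh
# colsd [FILE...]: print the sample standard deviation (n - 1 denominator,
# as R's sd() uses) of every column of a headerless CSV, comma-separated.
# "colsd" is a hypothetical helper name used only for this sketch.
colsd() {
    awk -F, '
        {
            if (NF > nf) nf = NF                 # widest row seen so far
            for (j = 1; j <= NF; j++) {
                s[j] += $j                       # per-column sum
                q[j] += $j * $j                  # per-column sum of squares
            }
        }
        END {
            if (NR < 2) exit 1                   # sample sd needs at least two rows
            for (j = 1; j <= nf; j++) {
                m = s[j] / NR                    # column mean
                v = (q[j] - NR * m * m) / (NR - 1)
                if (v < 0) v = 0                 # guard against rounding below zero
                printf "%s%.5f", (j > 1 ? "," : ""), sqrt(v)
            }
            print ""
        }' "$@"
}

# Example: a single column holding 1..10
seq 1 10 | colsd    # prints 3.02765
```

The sum-of-squares shortcut keeps this to one pass, but it can lose precision when the values are large and close together; the two-pass loop in the awk script earlier in the thread is safer in that case.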