regular awk - easily sort array indexes to output them in the chosen order
With GNU awk, you could do a two-way interaction with sort using its "coprocess" feature (info gawk coproc), where you'd send the data to sort with print |& "sort" and get the result with "sort" |& getline, but again that's gawk specific.
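As a rough sketch of that coprocess approach (the sorted_counts wrapper name and the seen array are just for illustration, and this assumes gawk is installed under that name):

```shell
# Hypothetical helper: count first fields and print them sorted by key.
# Requires GNU awk, since the |& coprocess operator is a gawk extension.
sorted_counts() {
    gawk '
    { seen[$1]++ }
    END {
        # Send every key to a single sort coprocess.
        for (key in seen)
            print key |& "sort"
        # Close only the write end so that sort sees EOF and starts emitting.
        close("sort", "to")
        # Read the keys back in sorted order.
        while (("sort" |& getline key) > 0)
            print key, seen[key]
        close("sort")
    }' "$@"
}
```

You'd call it as sorted_counts file, or feed it on stdin. The close("sort", "to") step matters: sort can't output anything until it has seen the end of its input.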
To loop over the array in the order it's been encountered, you could record that encounter sequence at the time you fill in the array:
awk '
  !seen[$1]++ {sequence[n++] = $1}
  END {
    for (i = 0; i < n; i++)
      print sequence[i], seen[sequence[i]]
  }'
You could also implement the sorting algorithm in awk itself. You could even borrow gawk's quicksort.awk, which you'll find in its manual (it's there to demonstrate indirect function calls, another GNU-specific feature; you'd replace those with a literal call to your comparison routine). Something like:
awk '
  function less_than(left, right) {
    return "" left <= "" right
  }
  function quicksort(data, left, right,   i, last)
  {
    if (left >= right)
      return
    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
      if (less_than(data[i], data[left]))
        quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1)
    quicksort(data, last + 1, right)
  }
  function quicksort_swap(data, i, j,   temp)
  {
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
  }
  {seen[$1]++}
  END {
    for (i in seen) keys[n++]=i
    quicksort(keys, 0, n-1)
    for (i = 0; i < n; i++)
      print keys[i], seen[keys[i]]
  }'
Personally, I'd just use perl instead of awk here.
$ cat tst.awk
{ cnt[$0]++ }
END {
    n = sort(cnt,idxs)
    for (i=1; i<=n; i++) {
        idx = idxs[i]
        printf "%s:%d%s", idx, cnt[idx], (i<n ? OFS : ORS)
    }
}
# idx is declared local here so that the function does not clobber the
# global idx used in the END block.
function sort(arr, idxs, args,      i, idx, str, cmd) {
    for (i in arr) {
        # escape any single quotes so the string is safe inside sh quotes
        gsub(/\047/, "\047\\\047\047", i)
        str = str i ORS
    }
    cmd = "printf \047%s\047 \047" str "\047 |sort " args
    i = 0
    while ( (cmd | getline idx) > 0 ) {
        idxs[++i] = idx
    }
    close(cmd)
    return i
}
# create the 2 basic files to be parsed by the awk:
printf 'a b a a a c c d e s s s s e f s a e r r f\ng f r e d e z z c s d r\n' >fileA
printf 's f g r e d f g e z s d v f e z a d d g r f e a\ns d f e r\n'>fileB
for f in fileA fileB ; do
    printf 'for file: %s: ' "$f"
    tr ' ' '\n' < "$f" |
        awk -f tst.awk
done
for file: fileA: a:5 b:1 c:3 d:3 e:5 f:3 g:1 r:4 s:6 z:2
for file: fileB: a:2 d:5 e:5 f:5 g:3 r:3 s:3 v:1 z:2
The above just builds a newline-separated string from the array indices (quoting it appropriately for sh), creates a shell command that pipes that string to sort, and then loops on the output. If you want to modify sort's behavior, just pass a string of Unix sort arguments as the third argument of the sort() call, e.g. sort(cnt,idxs,"-fu"). It could obviously be modified to print or do whatever else you want inside the sort() function instead of populating an array of indices for you to loop on when it returns, if that's what you prefer, but then the function isn't as cohesive.
Note however that it will be limited to the maximum command line length on your system.
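If that limit is a concern, one possible workaround is to stream the indices to sort through a pipe and read the result back from a scratch file. This is only a sketch under some assumptions: the counts_via_tmpfile wrapper, the tmp and sort_args variables, and the use of mktemp are all mine rather than part of tst.awk, and the scratch file path must not contain single quotes.

```shell
# Hypothetical wrapper: $1 is a string of sort options, the rest are input files.
counts_via_tmpfile() {
    sort_args=${1-}
    [ "$#" -gt 0 ] && shift
    tmp=$(mktemp) || return 1
    awk -v tmp="$tmp" -v sort_args="$sort_args" '
    function sort(arr, idxs, args,      i, idx, cmd) {
        # The indices travel over the pipe, not on the command line;
        # sort writes its result to the scratch file.
        cmd = "sort " args " > \047" tmp "\047"
        for (i in arr)
            print i | cmd
        close(cmd)
        # Read the sorted indices back from the scratch file.
        i = 0
        while ((getline idx < tmp) > 0)
            idxs[++i] = idx
        close(tmp)
        return i
    }
    { cnt[$0]++ }
    END {
        n = sort(cnt, idxs, sort_args)
        for (i = 1; i <= n; i++)
            printf "%s:%d%s", idxs[i], cnt[idxs[i]], (i < n ? OFS : ORS)
    }' "$@"
    rm -f "$tmp"
}
```

For example, tr ' ' '\n' < fileA | counts_via_tmpfile -r would print the counts in reverse key order. As with the original, indices that contain newlines would still confuse it.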
The \047s in the code represent 's, which the shell does not allow inside '-delimited strings or scripts. So while we could use ' directly in an awk script being read from a file, as I'm doing above, if you were to use that script on the command line as awk 'script' file you'd need to use something instead of ', and \047 works both when the script is interpreted from the command line and from a file, so it's the most portable choice of '-replacement.
The 's (\047s) are present to quote str in a way that ensures that the shell doesn't expand variables, trip over mismatched quotes, etc. when the string is being piped to sort, i.e. they do this:
$ echo 'foo'\''bar $(ls) $HOME' | awk '{
str=$0; gsub(/\047/, "\047\\\047\047", str); print "str="str
cmd="printf \047%s\047 \047" str "\047"; print "cmd="cmd
}'
str=foo'\''bar $(ls) $HOME
cmd=printf '%s' 'foo'\''bar $(ls) $HOME'
so we don't get something like this, which is vulnerable/buggy, instead:
$ echo 'foo'\''bar $(ls) $HOME' | awk '{
str=$0; print "str="str
cmd="printf \"%s\" \"" str "\""; print "cmd="cmd
}'
str=foo'bar $(ls) $HOME
cmd=printf "%s" "foo'bar $(ls) $HOME"
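To see the difference when the command is actually run rather than just printed, here is a small sketch that hands both forms to sh via system(). The run_quoted and run_unquoted names are made up for the demo.

```shell
# Hypothetical demo helpers: each reads lines on stdin and asks sh to print them.
run_quoted() {
    awk '{
        str = $0
        gsub(/\047/, "\047\\\047\047", str)
        # Single-quoted for sh: the shell passes the text through untouched.
        system("printf \047%s\\n\047 \047" str "\047")
    }'
}
run_unquoted() {
    awk '{
        # Double-quoted for sh: $(...) and $VAR get expanded by the shell.
        system("printf \"%s\\n\" \"" $0 "\"")
    }'
}
```

Feeding the line x$(echo INJECTED) to run_quoted prints it back verbatim, while run_unquoted lets the shell run the command substitution and prints xINJECTED instead.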