regular awk - easily sort array indexes to output them in the chosen order

With GNU awk, you could talk to sort in both directions using its "coprocess" feature (info gawk coproc): send the data to sort with print |& "sort", close the write end with close("sort", "to"), and read the result back with "sort" |& getline. But again, that's gawk-specific.
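As a rough sketch of that coprocess approach (gawk only; the function name count_sorted_coproc and the sample input are made up for illustration):

```sh
# Sketch only: needs gawk, since |& is a GNU extension.
count_sorted_coproc() {
  gawk '
    { seen[$1]++ }
    END {
      cmd = "sort"
      for (key in seen) print key |& cmd   # send the keys to sort
      close(cmd, "to")                     # close the write end so sort sees EOF
      while ((cmd |& getline key) > 0)     # read the sorted keys back
        print key, seen[key]
      close(cmd)
    }' "$@"
}

# Demo, skipped when gawk is not installed.
if command -v gawk >/dev/null 2>&1; then
  printf '%s\n' b a c a | count_sorted_coproc
fi
```

The close(cmd, "to") matters: sort cannot output anything until it has seen the end of its input, so reading before closing the write end would hang.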

To loop over the array in the order it's been encountered, you could record that encounter sequence at the time you fill in the array:

awk '
  !seen[$1]++ {sequence[n++] = $1}
  END {
    for (i = 0; i < n; i++)
      print sequence[i], seen[sequence[i]]
  }'

You could also implement the sorting algorithm in awk itself. You could even borrow gawk's quicksort.awk, which you'll find in its manual (it's there to demonstrate indirect function calls, another GNU-specific feature, so you'd replace the indirect call with a literal call to your comparison routine). Something like:

awk '
  function less_than(left, right) {
    # prepending "" forces a string (not numeric) comparison
    return "" left < "" right
  }
  function quicksort(data, left, right,   i, last)
  {
    if (left >= right)
      return

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
      if (less_than(data[i], data[left]))
        quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1)
    quicksort(data, last + 1, right)
  }
  function quicksort_swap(data, i, j,   temp)
  {
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
  }

  {seen[$1]++}
  END {
    for (i in seen) keys[n++]=i
    quicksort(keys, 0, n-1)
    for (i = 0; i < n; i++)
      print keys[i], seen[keys[i]]
  }'

Personally, I'd just use perl instead of awk here.


$ cat tst.awk
{ cnt[$0]++ }
END {
    n = sort(cnt,idxs)
    for (i=1; i<=n; i++) {
        idx = idxs[i]
        printf "%s:%d%s", idx, cnt[idx], (i<n ? OFS : ORS)
    }

}

function sort(arr, idxs, args,      i, idx, str, cmd) {
    for (i in arr) {
        gsub(/\047/, "\047\\\047\047", i)
        str = str i ORS
    }

    cmd = "printf \047%s\047 \047" str "\047 |sort " args

    i = 0
    while ( (cmd | getline idx) > 0 ) {
        idxs[++i] = idx
    }

    close(cmd)

    return i
}

# create the two input files to be parsed by the awk script:
printf 'a b a a a c c d e s s s s e f s a e r r f\ng f r e d e z z c s d r\n' >fileA
printf 's f g r e d f g e z s d v f e z a d d g r f e a\ns d f e r\n'>fileB

for f in fileA fileB ; do
    printf 'for file: %s: ' "$f"
    tr ' ' '\n' < "$f" |
    awk -f tst.awk
done
for file: fileA: a:5 b:1 c:3 d:3 e:5 f:3 g:1 r:4 s:6 z:2
for file: fileB: a:2 d:5 e:5 f:5 g:3 r:3 s:3 v:1 z:2

The above just builds a newline-separated string from the array indices (quoting it appropriately for sh), creates a shell command that pipes that string to sort, and then loops over the output. If you want to modify sort's behavior, just pass a string of Unix sort arguments as the third argument of the sort() function call, e.g. sort(cnt, idxs, "-fu"). It could obviously be modified to print, or do whatever else you want, inside the sort() function instead of populating an array of indices for you to loop over when it returns, if that's what you prefer, but then the function isn't as cohesive.

Note however that it will be limited to the maximum command line length on your system.
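If that limit is a concern, one possible workaround (a different technique from the function above, sketched here under assumptions) is to pipe the indices to sort with print | cmd, have sort write to a temporary file, and read that file back with getline. The function name count_sorted_tmpfile is made up, and the sketch assumes mktemp is available and returns a path containing no single quotes or backslashes:

```sh
# Sketch only: avoids the command-line length limit by going through a file.
count_sorted_tmpfile() {
  tmp=$(mktemp) || return 1
  awk -v tmp="$tmp" '
    { cnt[$0]++ }
    END {
      # single-quote the file name for the shell that runs the command
      cmd = "sort > \047" tmp "\047"
      for (idx in cnt) print idx | cmd     # the indices travel over a pipe
      close(cmd)                           # wait until sort has written the file
      while ((getline idx < tmp) > 0)      # read the sorted indices back
        printf "%s:%d\n", idx, cnt[idx]
      close(tmp)
    }' "$@"
  rm -f "$tmp"
}

printf '%s\n' b a c a | count_sorted_tmpfile
```

As with the function above, indices that contain newlines would still be split into separate lines.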

The \047s in the code represent 's. The shell does not allow a ' inside a '-delimited string or script, so while we could use ' directly in an awk script read from a file, as I'm doing above, you'd need something else in place of ' if you ran that script on the command line as awk 'script' file. \047 works both when the script is given on the command line and when it is read from a file, so it's the most portable choice of '-replacement.
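A tiny illustration of that, with a made-up message (the variable name greeting is only for the example):

```sh
# The awk program sits inside shell single quotes, so the apostrophe
# is written as the octal escape \047 in the awk string literal.
greeting=$(awk 'BEGIN { print "don\047t panic" }')
printf '%s\n' "$greeting"
```

The same BEGIN block would work unchanged if it were saved to a file and run with awk -f.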

The 's (\047s) are there to quote str so that the shell doesn't expand variables, run command substitutions, or trip over mismatched quotes when the string is piped to sort, i.e. they do this:

$ echo 'foo'\''bar $(ls) $HOME' | awk '{
    str=$0; gsub(/\047/, "\047\\\047\047", str); print "str="str
    cmd="printf \047%s\047 \047" str "\047"; print "cmd="cmd
}'
str=foo'\''bar $(ls) $HOME
cmd=printf '%s' 'foo'\''bar $(ls) $HOME'

so we don't get something like this instead, which is vulnerable/buggy:

$ echo 'foo'\''bar $(ls) $HOME' | awk '{
    str=$0; print "str="str
    cmd="printf \"%s\" \"" str "\""; print "cmd="cmd
}'
str=foo'bar $(ls) $HOME
cmd=printf "%s" "foo'bar $(ls) $HOME"
