Measuring disk usage of specific file types per directory (recursively, as a demo for 'du --include')
Simplifying the solution from @HaukeLaging by collecting all directory sums in one array and printing them all at the end (using GNU awk). Also, only one call to `numfmt` is needed (at the end).
```sh
#!/bin/sh
find . -type f -iname '*.py' -printf '%s %h\0' |
  awk 'BEGIN { RS = "\0" }
    {
      # Escape backslashes and newlines so each dir name prints on one line.
      gsub(/\\/, "&&"); gsub(/\n/, "\\n")
      # Strip the leading size field; the rest of the record is the dir name.
      size = $1; sub("[^ ]* ", ""); dirsize[$0] += size
    }
    END {
      PROCINFO["sorted_in"] = "@val_num_desc"
      i = 0
      for (dir in dirsize) {
        if (++i <= 50) print dirsize[dir], dir
        else exit
      }
    }' | numfmt --to=iec-i --suffix=B
```
This generates the cumulative apparent size of the `.py` files (not their disk usage), and avoids summing files in sub-directories of a directory.

To count the disk usage as opposed to the sum of the apparent sizes, you'd need to use `%b`¹ instead of `%s`, and make sure each file is counted only once, so something like:
```sh
LC_ALL=C find . -iname '*.py' -type f -printf '%D:%i\0%b\0%h\0' |
  gawk -v 'RS=\0' -v OFS='\t' -v max=50 '
    {
      # Records come in triplets: device:inode, block count, directory.
      inum = $0
      getline du
      getline dir
    }
    ! seen[inum]++ {
      # Count each device:inode pair once so hard links are not double-counted.
      gsub(/\\/, "&&", dir)
      gsub(/\n/, "\\n", dir)
      sum[dir] += du
    }
    END {
      n = 0
      PROCINFO["sorted_in"] = "@val_num_desc"
      for (dir in sum) {
        print sum[dir] * 512, dir
        if (++n >= max) break
      }
    }' | numfmt --to=iec-i --suffix=B --delimiter=$'\t'
```
Newlines in the dir names are rendered as `\n`, and backslashes (at least those decoded as such in the current locale²) as `\\`.
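Since the escaping uses only `\\` for backslash and `\n` for newline, it is reversible; a tiny illustration (not part of the pipeline above; `foo\nbar` is a made-up escaped name):

```sh
# printf %b decodes the \n and \\ escape sequences back into
# a literal newline and backslash.
printf '%b\n' 'foo\nbar'    # prints "foo", a newline, then "bar"
```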
If a file is found in more than one directory, it is counted against the first one it is found in (order is not deterministic).
It assumes there's no `POSIXLY_CORRECT` variable in the environment (if there is, setting `PROCINFO["sorted_in"]` has no effect in `gawk`, so the list would not be sorted). If you can't guarantee it³, you can always start `gawk` as `env -u POSIXLY_CORRECT gawk ...` (assuming GNU `env` or compatible; or `(unset -v POSIXLY_CORRECT; gawk ...)`).
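As a toy illustration of that gotcha (not from the pipeline above; any small gawk array will do):

```sh
# With POSIXLY_CORRECT in the environment, gawk behaves as if --posix
# were given and silently ignores the PROCINFO["sorted_in"] extension,
# so the for-in loop below is no longer sorted by value.
env POSIXLY_CORRECT=1 gawk 'BEGIN {
  PROCINFO["sorted_in"] = "@val_num_desc"
  a["x"] = 1; a["y"] = 3; a["z"] = 2
  for (k in a) print k, a[k]   # order is unspecified here
}'
```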
A few other problems with your approach:
- Without `LC_ALL=C`, GNU `find` wouldn't report the files whose name doesn't form valid characters in the locale, so you could miss some files.
- Embedding `{}` in the code of `sh` constituted an arbitrary code injection vulnerability. Think for instance of a file called `$(reboot).py`. You should never do that; the paths to the files should be passed as extra arguments and referenced within the code using positional parameters (see the sketch after this list).
- `echo` can't be used to display arbitrary data (especially with `-e`, which doesn't make sense here). Use `printf` instead.
- With `xargs -r0 du -sch`, `du` may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run.
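A minimal sketch of the safe pattern (the action here is just an illustrative `printf`; the point is how the file names are passed):

```sh
# Paths are handed to sh as positional parameters, never pasted into
# the code string, so a name like '$(reboot).py' is never evaluated.
find . -name '*.py' -type f -exec sh -c '
  for file do
    printf "%s\n" "$file"   # printf, not echo, for arbitrary data
  done' find-sh {} +
```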
¹ `%b` reports disk usage in number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also `%k`, which is `int(%b / 2)`, but that would give incorrect results on filesystems that have 512-byte blocks (file system blocks are generally a power of 2 and at least 512 bytes large).
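To see the difference between apparent size and disk usage in practice, here is a quick demo (assumes GNU `truncate` and `find`; `sparse.bin` is just a scratch file name):

```sh
# A 1 MiB sparse file: large apparent size (%s), but little or nothing
# actually allocated (%b, in 512-byte units; %k, in KiB).
truncate -s 1M sparse.bin
find sparse.bin -printf '%s bytes apparent, %b blocks of 512B, %k KiB\n'
rm sparse.bin
```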
² Using `LC_ALL=C` for `gawk` as well would make it a bit more efficient, but would possibly mangle the output in locales using BIG5 or GB18030 charsets (and the file names are also encoded in that charset), as the encoding of backslash is also found in the encoding of some other characters there.
³ Beware that if your `sh` is `bash`, `POSIXLY_CORRECT` is set to `y` in `sh` scripts, and it is exported to the environment if `sh` is started with `-a` or `-o allexport`, so that variable can also creep in unintentionally.
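A quick way to see this where `sh` is `bash` (illustrative; the output depends on what your `sh` actually is):

```sh
# bash in POSIX mode (e.g. when invoked as "sh") sets POSIXLY_CORRECT=y;
# with -a / -o allexport it would also be exported to child processes.
bash -c 'echo "${POSIXLY_CORRECT-unset}"'   # unset
sh   -c 'echo "${POSIXLY_CORRECT-unset}"'   # y, if your sh is bash
```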
I suspect you need to write your own `du`.

Currently, you are triple-recursing into the hierarchy, using two `find`s and a `du`.

I would suggest starting with perl's `File::Find` package.

Alternatively, your first `find` could output with something like `-printf '%k %h\n'`, and then you could sort by directory, use perl or awk (or even bash) to total the directories and convert to "human" readable, and finally sort & head.
Either way, you should A) walk the directory tree only once, and B) create as few processes as possible.
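As a rough sketch of the awk variant (GNU tools assumed; unlike the answers above, it makes no attempt to handle newlines in directory names):

```sh
# One pass over the tree: sum %k (1 KiB units) per directory in awk,
# then sort numerically and human-format the top 50.
find . -type f -iname '*.py' -printf '%k %h\n' |
  awk '{ k = $1; sub(/^[^ ]+ /, ""); total[$0] += k }
       END { for (d in total) print total[d], d }' |
  sort -rn | head -n 50 |
  numfmt --field=1 --from-unit=Ki --to=iec-i
```

The sample implementation below does the same job in bash instead of awk.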
Edit: A sample implementation
```bash
#!/bin/bash
find . -type f -iname '*.py' -printf '%k %h\n' | sort -k2 | (
    at=
    bt=
    output() {
        if [[ -n "$at" ]]
        then
            printf '%s\t%s\n' "$at" "$bt"
        fi
    }
    # read -r keeps backslashes in directory names intact
    while read -r a b
    do
        if [[ "$b" != "$bt" ]]
        then
            output
            bt="$b"
            at=0
        fi
        at=$(( at + a ))
    done
    output
    # -d' ' should contain a literal tab; see the note below
) | sort -hr | head -50 | numfmt -d' ' --field=1 --from-unit=Ki --to=iec-i
```
Note: `%k` is important. `%s` reports apparent size, while `%k` (and `du`) report disk size. They differ for sparse files and large files. (If you want `du --apparent-size`, so be it.)

Note: `numfmt` should go at the end, so it is run once. Using `%k`, the from-unit needs to be specified.

Note: `numfmt`'s `-d` parameter should contain a single tab. I can't type that here, and `numfmt` won't accept `-d'\t'`. If the separator isn't a tab, the spacing gets messed up. I thus used `printf` instead of `echo` in the main body. (An alternative would be to use `echo`, and a final `sed` to change the first space into a tab.)
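For what it's worth, a portable way to hand `numfmt` a real tab without typing one (a sketch, not from the original answer):

```sh
# Store a literal tab with printf, then pass it to -d; this sidesteps
# the fact that numfmt does not expand '\t' itself.
tab=$(printf '\t')
printf '2048\t./some/dir\n' | numfmt -d"$tab" --field=1 --from-unit=Ki --to=iec-i
```

In `bash`, `-d$'\t'` works too, as in the `gawk` pipeline earlier.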
Note: I initially missed the first `sort`, and got repeated entries for some directories in my re-testing.

Note: `numfmt` is fairly recent.