Measuring disk usage of specific file types per directory (recursively, as a demo for 'du --include')
Simplifying the solution from @HaukeLaging by collecting all directory sums in one array and printing them all at the end (using GNU awk). Also, only one call to `numfmt` is needed (at the end).
```sh
#!/bin/sh
find . -type f -iname '*.py' -printf '%s %h\0' |
  awk 'BEGIN { RS = "\0" }
    {
      # Escape backslashes and newlines so each dir name prints on one line.
      gsub(/\\/, "&&"); gsub(/\n/, "\\n")
      # Strip the leading size field; the rest of the record is the dir name.
      size = $1; sub("[^ ]* ", ""); dirsize[$0] += size
    }
    END {
      PROCINFO["sorted_in"] = "@val_num_desc"
      i = 0
      for (dir in dirsize) {
        if (++i <= 50) print dirsize[dir], dir
        else exit
      }
    }' | numfmt --to=iec-i --suffix=B
```
This generates the cumulative apparent size of the `.py` files (not their disk usage), and avoids summing files in sub-directories of a directory.

To count the disk usage as opposed to the sum of the apparent sizes, you'd need to use `%b`¹ instead of `%s`, and make sure each file is counted only once, so something like:
```sh
LC_ALL=C find . -iname '*.py' -type f -printf '%D:%i\0%b\0%h\0' |
  gawk -v 'RS=\0' -v OFS='\t' -v max=50 '
    {
      # Records come in triplets: device:inode, block count, directory.
      inum = $0
      getline du
      getline dir
    }
    ! seen[inum]++ {
      # Count each device:inode pair once so hard links are not double-counted.
      gsub(/\\/, "&&", dir)
      gsub(/\n/, "\\n", dir)
      sum[dir] += du
    }
    END {
      n = 0
      PROCINFO["sorted_in"] = "@val_num_desc"
      for (dir in sum) {
        print sum[dir] * 512, dir
        if (++n >= max) break
      }
    }' | numfmt --to=iec-i --suffix=B --delimiter=$'\t'
```
Newlines in the dir names are rendered as `\n`, and backslashes (at least those decoded as such in the current locale²) as `\\`.
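Since the escaping uses only `\\` for backslash and `\n` for newline, it is reversible; a tiny illustration (not part of the pipeline above; `foo\nbar` is a made-up escaped name):

```sh
# printf %b decodes the \n and \\ escape sequences back into
# a literal newline and backslash.
printf '%b\n' 'foo\nbar'    # prints "foo", a newline, then "bar"
```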
If a file is found in more than one directory, it is counted against the first one it is found in (order is not deterministic).
It assumes there's no `POSIXLY_CORRECT` variable in the environment (if there is, setting `PROCINFO["sorted_in"]` has no effect in `gawk`, so the list would not be sorted). If you can't guarantee it³, you can always start `gawk` as `env -u POSIXLY_CORRECT gawk ...` (assuming GNU `env` or compatible; or `(unset -v POSIXLY_CORRECT; gawk ...)`).
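As a toy illustration of that gotcha (not from the pipeline above; any small gawk array will do):

```sh
# With POSIXLY_CORRECT in the environment, gawk behaves as if --posix
# were given and silently ignores the PROCINFO["sorted_in"] extension,
# so the for-in loop below is no longer sorted by value.
env POSIXLY_CORRECT=1 gawk 'BEGIN {
  PROCINFO["sorted_in"] = "@val_num_desc"
  a["x"] = 1; a["y"] = 3; a["z"] = 2
  for (k in a) print k, a[k]   # order is unspecified here
}'
```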
A few other problems with your approach:
- Without `LC_ALL=C`, GNU `find` wouldn't report the files whose name doesn't form valid characters in the locale, so you could miss some files.
- Embedding `{}` in the code of `sh` constituted an arbitrary code injection vulnerability. Think for instance of a file called `$(reboot).py`. You should never do that; the paths to the files should be passed as extra arguments and referenced within the code using positional parameters (see the sketch after this list).
- `echo` can't be used to display arbitrary data (especially with `-e`, which doesn't make sense here). Use `printf` instead.
- With `xargs -r0 du -sch`, `du` may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run.
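A minimal sketch of the safe pattern (the action here is just an illustrative `printf`; the point is how the file names are passed):

```sh
# Paths are handed to sh as positional parameters, never pasted into
# the code string, so a name like '$(reboot).py' is never evaluated.
find . -name '*.py' -type f -exec sh -c '
  for file do
    printf "%s\n" "$file"   # printf, not echo, for arbitrary data
  done' find-sh {} +
```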
¹ `%b` reports disk usage in number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also `%k`, which is `int(%b / 2)`, but that would give incorrect results on filesystems that have 512-byte blocks (file system blocks are generally a power of 2 and at least 512 bytes large).
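To see the difference between apparent size and disk usage in practice, here is a quick demo (assumes GNU `truncate` and `find`; `sparse.bin` is just a scratch file name):

```sh
# A 1 MiB sparse file: large apparent size (%s), but little or nothing
# actually allocated (%b, in 512-byte units; %k, in KiB).
truncate -s 1M sparse.bin
find sparse.bin -printf '%s bytes apparent, %b blocks of 512B, %k KiB\n'
rm sparse.bin
```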
² Using `LC_ALL=C` for `gawk` as well would make it a bit more efficient, but would possibly mangle the output in locales using BIG5 or GB18030 charsets (and the file names are also encoded in that charset), as the encoding of backslash is also found in the encoding of some other characters there.
³ Beware that if your `sh` is `bash`, `POSIXLY_CORRECT` is set to `y` in `sh` scripts, and it is exported to the environment if `sh` is started with `-a` or `-o allexport`, so that variable can also creep in unintentionally.
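A quick way to see this where `sh` is `bash` (illustrative; the output depends on what your `sh` actually is):

```sh
# bash in POSIX mode (e.g. when invoked as "sh") sets POSIXLY_CORRECT=y;
# with -a / -o allexport it would also be exported to child processes.
bash -c 'echo "${POSIXLY_CORRECT-unset}"'   # unset
sh   -c 'echo "${POSIXLY_CORRECT-unset}"'   # y, if your sh is bash
```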
I suspect you need to write your own `du`.

Currently, you are triple-recursing into the hierarchy, using two `find`s and a `du`.

I would suggest starting with perl's `File::Find` package.

Alternatively, your first `find` could output with something like `-printf '%k %h\n'`, and then you could sort by directory, use perl or awk (or even bash) to total the directories and convert to "human" readable, and finally sort & head.
Either way, you should A) walk the directory tree only once, and B) create as few processes as possible.
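As a rough sketch of the awk variant (GNU tools assumed; unlike the answers above, it makes no attempt to handle newlines in directory names):

```sh
# One pass over the tree: sum %k (1 KiB units) per directory in awk,
# then sort numerically and human-format the top 50.
find . -type f -iname '*.py' -printf '%k %h\n' |
  awk '{ k = $1; sub(/^[^ ]+ /, ""); total[$0] += k }
       END { for (d in total) print total[d], d }' |
  sort -rn | head -n 50 |
  numfmt --field=1 --from-unit=Ki --to=iec-i
```

The sample implementation below does the same job in bash instead of awk.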
Edit: A sample implementation
```bash
#!/bin/bash
find . -type f -iname '*.py' -printf '%k %h\n' | sort -k2 | (
    at=
    bt=
    output() {
        if [[ -n "$at" ]]
        then
            printf '%s\t%s\n' "$at" "$bt"
        fi
    }
    # read -r keeps backslashes in directory names intact
    while read -r a b
    do
        if [[ "$b" != "$bt" ]]
        then
            output
            bt="$b"
            at=0
        fi
        at=$(( at + a ))
    done
    output
    # -d' ' should contain a literal tab; see the note below
) | sort -hr | head -50 | numfmt -d' ' --field=1 --from-unit=Ki --to=iec-i
```
Note: `%k` is important. `%s` reports apparent size, while `%k` (and `du`) report disk size. They differ for sparse files and large files. (If you want `du --apparent-size`, so be it.)

Note: `numfmt` should go at the end, so it is run once. Using `%k`, the from-unit needs to be specified.

Note: `numfmt`'s `-d` parameter should contain a single tab. I can't type that here, and `numfmt` won't accept `-d'\t'`. If the separator isn't a tab, the spacing gets messed up. I thus used `printf` instead of `echo` in the main body. (An alternative would be to use `echo`, and a final `sed` to change the first space into a tab.)
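For what it's worth, a portable way to hand `numfmt` a real tab without typing one (a sketch, not from the original answer):

```sh
# Store a literal tab with printf, then pass it to -d; this sidesteps
# the fact that numfmt does not expand '\t' itself.
tab=$(printf '\t')
printf '2048\t./some/dir\n' | numfmt -d"$tab" --field=1 --from-unit=Ki --to=iec-i
```

In `bash`, `-d$'\t'` works too, as in the `gawk` pipeline earlier.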
Note: I initially missed the first `sort`, and got repeated entries for some directories in my re-testing.

Note: `numfmt` is fairly recent.