I would like to find the largest file in each directory recursively
With GNU `find`, `sort` and `sed` (4.2.2 or above), sort once on the file sizes and again on the directory paths:
find /some/dir -type f -printf '%s %f%h\0' |
sort -zrn |
sort -zut/ -k2 |
sed -zre 's: ([^/]*)(/.*): \2/\1:'
Explanation:

- The file size, name and path are printed (the size separated by a space, the name and path separated by `/`), and each entry is terminated by the ASCII NUL character.
- Then we sort numerically on the size, treating the input as NUL-delimited (and in reverse order, so the largest files come first).
- Then we use `sort` again to print only the first entry for each unique key, where the key is everything from the second `/`-separated field onwards, i.e. the path of the directory containing the file. Since the entries are already sorted by decreasing size, that first entry is the largest file in each directory.
- Then we use `sed` to swap the directory and file names, so that we get a normal path.
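To see what the `sed` step does in isolation, here is a sketch with a single hand-made record (the size and path are made up):

```shell
# One NUL-terminated record as produced by find's '%s %f%h\0':
# "<size> <basename><dirpath>" -- the sed swaps basename and dirpath.
printf '3090885 syslog.1/var/log\0' |
  sed -zre 's: ([^/]*)(/.*): \2/\1:' |
  tr '\0' '\n'
# -> 3090885 /var/log/syslog.1
```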
For readable output, replace the ASCII NUL with newlines:
find /some/dir -type f -printf '%s %f%h\0' |
sort -zrn |
sort -zut/ -k2 |
sed -zre 's: ([^/]*)(/.*): \2/\1:' |
tr '\0' '\n'
Example output:
$ find /var/log -type f -printf '%s %f%h\0' | sort -zrn | sort -zt/ -uk2 | sed -zre 's: ([^/]*)(/.*): \2/\1:' | tr '\0' '\n'
3090885 /var/log/syslog.1
39789 /var/log/apt/term.log
3968 /var/log/cups/access_log.1
31 /var/log/fsck/checkroot
467020 /var/log/installer/initial-status.gz
44636 /var/log/lightdm/seat0-greeter.log
15149 /var/log/lxd/lxd.log
4932 /var/log/snort/snort.log
3232 /var/log/unattended-upgrades/unattended-upgrades-dpkg.log
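As a self-contained check, the pipeline can be run against a small throwaway tree (the file names and sizes below are made up for the demonstration):

```shell
# Build a tiny tree: two files at the top level, one in a subdirectory.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
head -c 100 /dev/zero > "$tmp/big"
head -c 10  /dev/zero > "$tmp/small"
head -c 50  /dev/zero > "$tmp/sub/only"

# Only the largest file per directory should survive the second sort:
# here, "big" (not "small") at the top level, and "only" in sub/.
find "$tmp" -type f -printf '%s %f%h\0' |
  sort -zrn |
  sort -zut/ -k2 |
  sed -zre 's: ([^/]*)(/.*): \2/\1:' |
  tr '\0' '\n'

rm -rf "$tmp"
```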
Combining `find` and `awk` allows the averages to be calculated too:
find . -type f -printf '%s %h/%f\0'|awk 'BEGIN { RS="\0" } { SIZE=$1; for (i = 1; i <= NF - 1; i++) $i = $(i + 1); NF = NF - 1; DIR=$0; gsub("/[^/]+$", "", DIR); FILE=substr($0, length(DIR) + 2); SUMSIZES[DIR] += SIZE; NBFILES[DIR]++; if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) { MAXSIZE[DIR] = SIZE; BIGGESTFILE[DIR] = FILE } }; END { for (DIR in SUMSIZES) { printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR] } }'
Laid out in a more readable manner, the AWK script is
BEGIN { RS = "\0" }
{
    SIZE = $1
    for (i = 1; i <= NF - 1; i++) $i = $(i + 1)
    NF = NF - 1
    DIR = $0
    gsub("/[^/]+$", "", DIR)
    FILE = substr($0, length(DIR) + 2)
    SUMSIZES[DIR] += SIZE
    NBFILES[DIR]++
    if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) {
        MAXSIZE[DIR] = SIZE
        BIGGESTFILE[DIR] = FILE
    }
}
END {
    for (DIR in SUMSIZES) {
        printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR]
    }
}
This expects NUL-separated input records (I stole this from muru’s answer); for each input record, it

- stores the size (for later use),
- removes the leading size field, so that `$0` contains only the path (this way we at least handle filenames containing spaces correctly),
- extracts the directory,
- extracts the filename,
- adds the size we stored earlier to the running total of sizes for the directory,
- increments the number of files in the directory (so we can calculate the average later),
- if the size is larger than the stored maximum for the directory, or if we haven’t seen a file in that directory yet, updates the information for the biggest file.
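As a quick sanity check, the script can be fed hand-made NUL-separated records directly, with the `size path` shape produced by `find ... -printf '%s %h/%f\0'` (the directory and file names here are invented; this assumes GNU awk, which accepts a NUL record separator):

```shell
# Three records: two files in ./d1, one in ./d2.
# The final sort gives a deterministic order ("for (DIR in ...)" does not).
printf '%s\0' '100 ./d1/a' '200 ./d1/b' '50 ./d2/c' |
awk 'BEGIN { RS = "\0" }
{
    SIZE = $1
    for (i = 1; i <= NF - 1; i++) $i = $(i + 1)
    NF = NF - 1
    DIR = $0
    gsub("/[^/]+$", "", DIR)
    FILE = substr($0, length(DIR) + 2)
    SUMSIZES[DIR] += SIZE
    NBFILES[DIR]++
    if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) {
        MAXSIZE[DIR] = SIZE
        BIGGESTFILE[DIR] = FILE
    }
}
END {
    for (DIR in SUMSIZES)
        printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR]
}' | sort
# -> ./d1: average 150.000000, biggest file b 200
#    ./d2: average 50.000000, biggest file c 50
```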
Once all that’s done, the script loops over the keys in `SUMSIZES` and outputs each directory along with its average file size and its largest file’s name and size.
You can pipe the output into `sort` to sort by directory name. If you additionally want to format the sizes in human-friendly form, you can change the `printf` line to

printf "%.2f %d %s: %s\n", SUMSIZES[DIR] / NBFILES[DIR], MAXSIZE[DIR], DIR, BIGGESTFILE[DIR]

and then pipe the output into `numfmt --field=1,2 --to=iec`. You can still sort the result by directory name; you just need to sort starting with the third field: `sort -k3`.
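For instance, with the modified `printf` line a record might look like `1536.00 1048576 ./d1: b` (the numbers and names here are invented), and `numfmt` then rewrites the first two fields:

```shell
# numfmt converts only the selected fields; the rest of the line is untouched.
# Requires a numfmt that accepts floating-point input (coreutils 8.24+).
printf '%s\n' '1536.00 1048576 ./d1: b' |
  numfmt --field=1,2 --to=iec
```

This turns `1536.00` into `1.5K` and `1048576` into `1.0M`.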
Zsh's wildcard patterns would be very useful for the sort of things you're doing. Specifically, zsh can match files by attributes such as type, size, etc. through glob qualifiers. Glob qualifiers also allow sorting the matches.
For example, in zsh, `*(.DOLN[1])` expands to the name of the largest file in the current directory. `*` is the pattern for the file name (match everything, except possibly dot files depending on shell options). The qualifier `.` restricts the matches to regular files, `D` causes `*` to include dot files, `OL` sorts by decreasing size (“length”), `N` causes the expansion to be empty if there is no matching file at all, and `[1]` selects only the first match.
You can enumerate directories recursively with `**/`. For example, the following loop iterates over all the subdirectories of the current directory and their subdirectories, recursively:
for d in **/*(/); do … done
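Putting the pieces together, here is a sketch of a per-directory loop that prints the largest regular file in each directory (wrapped in `zsh -c` so it can be pasted into any shell; the only assumption is that zsh is installed):

```shell
# For each directory (including . itself), print its largest plain file.
# The N qualifier makes directories with no plain files expand to nothing.
zsh -c '
  for d in . **/*(/N); do
    biggest=( $d/*(.DOLN[1]) )    # largest plain file in $d, if any
    (( $#biggest )) && print -r -- $biggest
  done
'
```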
You can use `zstat` to access a file’s size and other metadata directly, without having to parse the output of other tools:
zmodload -F zsh/stat b:zstat
files=(*(DNoL))
zstat -A sizes +size -- $files
total=0; for s in $sizes; do (( total += s )); done
if (( $#sizes > 0 )); then
  max=$sizes[-1]
  average=$(( total / $#sizes ))
  median=$sizes[$(( $#sizes / 2 ))]
fi
fi
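A quick way to exercise this on a throwaway directory (requires zsh; the two file sizes are arbitrary):

```shell
tmp=$(mktemp -d)
head -c 10 /dev/zero > "$tmp/a"
head -c 30 /dev/zero > "$tmp/b"
( cd "$tmp" && zsh -c '
  zmodload -F zsh/stat b:zstat
  files=(*(DNoL))                # all files, sorted by increasing size
  zstat -A sizes +size -- $files
  total=0; for s in $sizes; do (( total += s )); done
  print "max=$sizes[-1] average=$(( total / $#sizes ))"
' )
rm -rf "$tmp"
# -> max=30 average=20
```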