Can the "find" command work more efficiently to delete many files?
The reason why the find command is slow
That is a really interesting issue... or, honestly, a malicious one:
The command
find . -mindepth 2 -mtime +5 -print -delete
is very different from the usual tryout variant that leaves out the dangerous part, -delete:
find . -mindepth 2 -mtime +5 -print
The tricky part is that the action -delete implies the option -depth. The command including -delete is really
find . -depth -mindepth 2 -mtime +5 -print -delete
and should be tested with
find . -depth -mindepth 2 -mtime +5 -print
That is closely related to the symptoms you see: the option -depth changes the traversal of the filesystem tree from a pre-order depth-first search to a post-order depth-first search.
Before, each file or directory that was reached was used immediately and then forgotten about; find was using the tree itself to find its way. Now find needs to collect all directories that could still contain files or directories to be found, before deleting the files in the deepest directories first. For this, it has to do the work of planning and remembering the traversal steps itself, and - that is the point - in a different order than the filesystem tree naturally supports. So, indeed, it needs to collect data over many files before the first step of output work.
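If you want to see the difference in traversal order for yourself, a throwaway test tree makes it visible (the /tmp/findtest path is just an example):

# build a tiny tree
mkdir -p /tmp/findtest/a/b
touch /tmp/findtest/a/file1 /tmp/findtest/a/b/file2

find /tmp/findtest            # pre-order: each directory is printed before its contents
find /tmp/findtest -depth     # post-order: contents are printed before their directory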
find has to keep track of some directories to visit later, which is not a problem for a few directories, but it can become one with many directories, for various degrees of "many". Also, performance problems outside of find become noticeable in this kind of situation, so it is possible that it is not even find that is slow, but something else. The performance and memory impact of all this depends on your directory structure and so on.
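If you want a rough idea of that impact on your own tree, one way - assuming GNU time is installed as /usr/bin/time - is to run the -depth test variant under it and look at the reported runtime and "Maximum resident set size":

# time and peak memory of the -depth variant, without deleting anything
/usr/bin/time -v find . -depth -mindepth 2 -mtime +5 -print > /dev/null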
The relevant sections from man find - see the "Warnings":
ACTIONS
-delete
Delete files; true if removal succeeded. If the removal failed,
an error message is issued. If -delete fails, find's exit status
will be nonzero (when it eventually exits). Use of -delete auto‐
matically turns on the -depth option.
Warnings: Don't forget that the find command line is evaluated as
an expression, so putting -delete first will make find try to
delete everything below the starting points you specified. When
testing a find command line that you later intend to use with
-delete, you should explicitly specify -depth in order to avoid
later surprises. Because -delete implies -depth, you cannot use‐
fully use -prune and -delete together.
[ ... ]
And, from a section further up:
OPTIONS
[ ... ]
-depth Process each directory's contents before the directory itself.
The -delete action also implies -depth.
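To spell out what that warning about expression order means in practice, here is a sketch; the first form is shown only as a comment because it would delete everything below the starting point:

# DANGER, for illustration only - with -delete placed first, it is evaluated
# (and deletes) before the tests are even looked at:
#   find . -delete -mindepth 2 -mtime +5
#
# Correct order - tests first, the -delete action last:
#   find . -mindepth 2 -mtime +5 -delete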
The faster solution to delete the files
You do not really need to delete the directories in the same run as the files, right? If we are not deleting directories, we do not need the whole -depth business; we can just find a file, delete it, and go on to the next one, as you proposed.
This time we can use the simple print variant for testing the find command, with the implicit -print.
We want to find only plain files - no symlinks, directories, special files, etc.:
find . -mindepth 2 -mtime +5 -type f
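If you first want a rough idea of how many files that matches, before deleting anything, you could simply count them:

find . -mindepth 2 -mtime +5 -type f | wc -l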
We use xargs to delete more than one file per rm process started, taking care of odd filenames by using a null byte as the separator.
Testing this command - note the echo in front of the rm, so it prints what would be run later:
find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 echo rm
The lines will be very long and hard to read; for an initial test it can help to get readable output with only three files per line by adding -n 3 as the first argument of xargs.
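For example - still with the echo safeguard in place - such a test variant could look like this:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 -n 3 echo rm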
If all looks good, remove the echo in front of the rm and run again.
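That is, once the test output looked right, the actual delete run becomes:

find . -mindepth 2 -mtime +5 -type f -print0 | xargs -0 rm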
That should be a lot faster. In case we are talking about millions of files - you wrote it is 600 million files in total - there is something more to take into account:
Most programs, including find, read directories using the library call readdir(3). That usually uses a buffer of 32 KB to read directories, which becomes a problem when the directories, containing huge lists of possibly long filenames, are big.
The way to work around it is to use the system call for reading directory entries, getdents(2), directly, and to handle the buffering in a more suitable way. For details, see "You can list a directory containing 8 million files! But not with ls.".
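If you want to check how the directory reads are actually done on your system, one possible way - assuming strace is available; the path is a placeholder, and on older systems the call may be named getdents instead of getdents64 - is to watch the system calls and their buffer size:

# trace the getdents64 calls issued while listing one large directory
strace -e trace=getdents64 find /path/to/a/big/directory -maxdepth 1 > /dev/null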
(It would be interesting if you could add details to your question on the typical number of files per directory, directories per directory, and the maximum depth of paths; also, which filesystem is used.)
(If it is still slow, you should check for filesystem performance problems.)