Find all the PDFs with at least three characters in their name
Here it's easier with standard wildcards:
find ~ -name '*???.[pP][dD][fF]'
Or with some find implementations (those that support -regex also support -iname):
find ~ -iname '*???.pdf'
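A quick way to check the portable pattern is to run it in a scratch directory. The sketch below is illustrative only: the file names are made up, and it searches a temporary directory rather than ~.

```shell
# Hypothetical sandbox: the file names below are invented for illustration.
tmp=$(mktemp -d)
touch "$tmp/ab.pdf" "$tmp/abc.pdf" "$tmp/Report.PDF" "$tmp/notes.txt"

# '*???' requires at least three characters before the extension;
# the bracket expressions make the extension match in either case.
found=$(cd "$tmp" && find . -name '*???.[pP][dD][fF]' | LC_ALL=C sort)
printf '%s\n' "$found"
rm -rf "$tmp"
```

Here ab.pdf is skipped because only two characters precede the extension, and notes.txt is skipped because of its extension.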
For an arbitrary number of characters instead of 3, you may prefer to revert to -iregex where available (see @Stephen Kitt's answer), or you could use zsh or ksh93 globs:
zsh:
set -o extendedglob # best in ~/.zshrc
printf '%s\n' ~/**/?(#c3,).(#i)pdf(D)
(the (D) to consider hidden files and files in hidden dirs, like with find)
(#cx,y) is the zsh wildcard equivalent of regexp {x,y}
(#i) for case-insensitive matching
? is the standard wildcard for any single character (like regexp .)
**/: any level of subdirectories (including 0)
ksh93:
FIGNORE='@(.|..)' # to consider hidden files
set -o globstar
printf '%s\n' **/{3,}(?).~(i:pdf)
@(x|y): extended ksh wildcard operator similar to regexp (x|y).
FIGNORE: special variable which controls what files are ignored by globs. When set, the usual ignoring of hidden files is not done, but we still want to ignore the . and .. directory entries where present.
{x,y}(z) is ksh93's equivalent of regexp z{x,y}.
~(i:...): case-insensitive matching.
Globs have some extra advantages over find here: you get a sorted list (you can disable that sorting in zsh with the oN glob qualifier, or use different sorting criteria), and they also work when filenames contain sequences of bytes that don't form valid characters. For instance, in a locale using the UTF-8 charset, the find approach would fail to report a $'St\xE9phane Chazelas - CV.pdf' file, because \xE9, not being a valid character there, is not matched by regexp . or wildcard ? or * with GNU find.
Assuming you’re using GNU find (which you probably are, since -iregex is a GNU extension to POSIX find), -regex and -iregex default to Emacs regular expressions, which don’t recognise {3,}. You need to specify a different type of regular expression using the -regextype option; in addition, you need to adjust your regular expression to the fact that it is matched against the full path:
find ~ -regextype posix-extended -iregex '.*/[^/]{3,}.pdf'
You should also escape the . so that it matches “.” rather than any character:
find ~ -regextype posix-extended -iregex '.*/[^/]{3,}\.pdf'
The regular expression can be simplified since we only care about three non-“/” characters:
find ~ -regextype posix-extended -iregex '.*[^/]{3}\.pdf'
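To see why the leading .* matters, the two forms can be compared in a scratch directory. This is only a sketch: it assumes GNU find (for -regextype), and the directory and file names are hypothetical.

```shell
# Requires GNU find (for -regextype); names below are invented for illustration.
tmp=$(mktemp -d)
mkdir "$tmp/docs"
touch "$tmp/docs/ab.pdf" "$tmp/docs/abc.PDF" "$tmp/docs/thesis.pdf"

# Without a leading '.*' the pattern must match the whole path
# (e.g. ./docs/thesis.pdf), which contains slashes, so nothing is found.
unanchored=$(cd "$tmp" && find . -regextype posix-extended -iregex '[^/]{3,}\.pdf')

# With '.*' the three non-slash characters only need to sit just before ".pdf".
anchored=$(cd "$tmp" && find . -regextype posix-extended -iregex '.*[^/]{3}\.pdf' | LC_ALL=C sort)
printf '%s\n' "$anchored"
rm -rf "$tmp"
```

The first command should print nothing, while the second lists abc.PDF and thesis.pdf but not ab.pdf, whose three characters before the extension would have to include the slash.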
For completeness, with FreeBSD or NetBSD find (another implementation that supports -iregex, though not yours, as .+ wouldn't work there without -E), you'd write:
find ~ -iregex '.*[^/]\{3\}\.pdf'
or:
find -E ~ -iregex '.*[^/]{3}\.pdf'
Without -E, that's a basic regular expression (like in grep), and with -E, an extended regular expression (like in grep -E).
With ast-open's find:
find ~ -iregex '.*[^/]{3}\.pdf'
(that's extended regexps out of the box).
How do I know they're PDFs?
You don't unless you ask. Sure, I'm being pedantic, but you didn't ask about files with .pdf in their names. Just because a file has the characters .pdf in the filename does not make it a PDF file.
In fact, let's be all-the-way pedantic about this: if the last four characters of a file's name are .pdf, then it will always have more than three characters in its name.
So doing this the wrong way, you might say:
$ find . -type f -name "*???.pdf"
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Setup_MagicISO.exe.pdf
See that second one? It's actually an executable. (I know, I changed the name.) And I'm also missing a PDF I coulda sworn was in the Documents directory...
$ ls Documents
McLaren 720s Coupe:Order Summary.pdf
Pioneer Premier DEH-P490IB CD Install Manual.PDF
Setup_MagicISO.exe.pdf
So using -iname we could find that one, but that's still turning up this not-a-PDF file.
What we really want to do in this case is examine the file's magic number using the file command. Its --mime option outputs the MIME type, which is simpler to parse. The find query then becomes a simple -name "???*".
$ find . -type f -name "???*" -print0|xargs -0 file --mime
./.bash_history: text/plain; charset=us-ascii
./.bash_logout: text/plain; charset=us-ascii
./.bashrc: text/plain; charset=us-ascii
./.profile: text/plain; charset=us-ascii
./Documents/McLaren 720s Coupe:Order Summary.pdf: application/pdf; charset=binary
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF: application/pdf; charset=binary
./Documents/Setup_MagicISO.exe.pdf: application/x-dosexec; charset=binary
./Downloads/Setup_MagicISO.exe: application/x-dosexec; charset=binary
./Downloads/WindowsUpdate.diagcab: application/vnd.ms-cab-compressed; charset=binary
Let's use the colon delimiter, and look for MIME type application/pdf, then zero out that portion and print the result. Take note, one of my files has a colon in the name, so I can't just ask awk to ($2==":"){print $1}.
$ find . -type f -name "???*" -print0|xargs -0 file --mime|awk -F: '($NF~"application/pdf"){OFS=":";$NF="";print}'|sed s/:$//
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF
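The colon handling can be tried on its own, without running file at all. The sketch below feeds two of the output lines shown earlier through the same awk and sed stages; the variable names are mine.

```shell
# Sample lines copied from the file --mime output shown earlier.
sample='./Documents/McLaren 720s Coupe:Order Summary.pdf: application/pdf; charset=binary
./Documents/Setup_MagicISO.exe.pdf: application/x-dosexec; charset=binary'

# Test only the last colon-separated field, blank it out, rebuild the record
# with ":" as the output separator, then strip the trailing colon.
pdf_names=$(printf '%s\n' "$sample" |
  awk -F: '($NF ~ "application/pdf") {OFS=":"; $NF=""; print}' |
  sed 's/:$//')
printf '%s\n' "$pdf_names"
```

Because only the last field is tested and blanked, the colon inside the McLaren file name survives intact.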
Now let's finish up by contriving to include PDF files named a and abc:
$ mkdir Documents/other
$ cp -a Documents/McLaren\ 720s\ Coupe\:Order\ Summary.pdf Documents/other/a
$ cp -a Documents/Pioneer\ Premier\ DEH-P490IB\ CD\ Install\ Manual.PDF Documents/other/abc
$ find . -type f -name "???*" -print0|xargs -0 file --mime|awk -F: '($NF~"application/pdf"){OFS=":";$NF="";print}'|sed s/:$//
./Documents/McLaren 720s Coupe:Order Summary.pdf
./Documents/Pioneer Premier DEH-P490IB CD Install Manual.PDF
./Documents/other/abc
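If parsing the name: type output feels fragile, one alternative is to ask file for the bare MIME type of each file and compare it directly. This is a sketch rather than what I ran above: it assumes a file command that supports -b and --mime-type, and it uses stub files that contain only a PDF signature.

```shell
# Sketch only: assumes file(1) is installed and supports -b and --mime-type.
tmp=$(mktemp -d)
printf '%%PDF-1.4\n' > "$tmp/abc"          # PDF signature, no .pdf extension
printf '%%PDF-1.4\n' > "$tmp/a"            # PDF signature, name too short
printf 'plain text\n' > "$tmp/notes.pdf"   # .pdf extension, not a PDF

pdfs=$(cd "$tmp" && find . -type f -name '???*' -exec sh -c '
  for f do
    # -b drops the "name:" prefix, so colons in file names are harmless
    if [ "$(file -b --mime-type "$f")" = application/pdf ]; then
      printf "%s\n" "$f"
    fi
  done' sh {} +)
printf '%s\n' "$pdfs"
rm -rf "$tmp"
```

This trades the awk and sed stages for one file invocation per file, which is slower on large trees but needs no delimiter parsing.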
That's all. I know I'll probably get dinged for being horribly pedantic, but in my job with thousands of NFS volumes to hunt and all kinds of poorly-named files, I wish more people would be pedantic.
Edited to add: in the real world, I might want to make use of updatedb to build a searchable file index, locate instead of find to read that index, and parallel instead of xargs to thread 'er up. That's somewhat outside the scope of this question though. I wrote that with a straight face, too. Why do I care so much? I might be looking for movie and audio files; or certain types of photographs; or binary executables in a project data directory.