How can I run a grep on epub/mobi files?
You can grep these files directly by passing the -a option, which makes grep treat binary files as text:
grep -a "author" *.epub *.mobi
The above works on all of my 1000+ EPUB and MOBI files, giving the expected results.
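To illustrate what -a changes, here is a minimal sketch that assumes a grep with binary-file detection (GNU or BSD grep). The temporary file is a made-up stand-in for an ebook, not a real EPUB or MOBI: a few binary bytes around one readable metadata line.

```shell
#!/bin/sh
# Stand-in for an ebook: binary bytes (including NUL) around one readable line.
sample=$(mktemp)
printf '\000\001\002\nauthor: Jane Doe\n\000\003\n' > "$sample"

# Without -a, grep typically only reports that a binary file matches
# (on stdout or stderr, depending on the grep version).
plain=$(grep "author" "$sample" 2>&1)

# With -a, grep prints the matching line itself.
as_text=$(grep -a "author" "$sample")

printf 'without -a: %s\nwith -a:    %s\n' "$plain" "$as_text"
rm -f "$sample"
```

Adding -l (grep -al "author" *.epub *.mobi) lists only the names of the matching files, which is usually easier to read when searching a large collection.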
EPUB and MOBI are both container formats. EPUB is essentially a .zip
file with some structural requirements; MOBI is a Palm Database Format file.
Both formats allow for compressed or uncompressed data to be put in the containers.
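For EPUB you can check which entries are compressed before deciding how to search. The sketch below assumes Info-ZIP unzip is installed and relies on the column layout of its verbose listing; epub_methods is a hypothetical helper name, and book.epub is a placeholder path.

```shell
#!/bin/sh
# epub_methods is a hypothetical helper, not part of unzip.
# Prints "<method> <name>" for every entry listed by `unzip -v`.
epub_methods() {
    # Entry rows have 8 columns and start with a numeric length;
    # header, separator, and total rows do not.
    # Limitation: for names containing spaces only the first word is printed.
    unzip -v "$1" | awk 'NF >= 8 && $1 ~ /^[0-9]+$/ { print $2, $8 }'
}

# Usage (book.epub is a placeholder):
#   epub_methods book.epub
```

Entries reported as Stored are uncompressed and can be matched by grep -a on the .epub itself; entries reported as Defl:N (or similar) are deflate-compressed and have to be expanded first.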
If the data you are looking for is in a "file" within the container and that file is compressed, you need to search for the compressed form of the string, not the expanded, uncompressed one. In particular, if you read the word 'abcde' in an EPUB/MOBI on an ebook reader, running grep -a 'abcde' over all your EPUB and MOBI files will generally not find it, because the contents of a book are likely stored in compressed "files" in the container (likely, but not necessarily: compression is only an efficiency measure).
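The effect is easy to reproduce outside of any ebook. The sketch below uses gzip as a stand-in for a compressed entry in a container (an assumption made for illustration; real EPUB entries are deflated inside a zip, and MOBI uses its own schemes). It counts matches in the compressed bytes and in the expanded text.

```shell
#!/bin/sh
# Stand-in for a compressed entry: one sentence repeated 50 times, then gzipped.
workdir=$(mktemp -d)
i=0
while [ "$i" -lt 50 ]; do
    echo "the reader sees the word abcde on the page"
    i=$((i + 1))
done > "$workdir/chapter.txt"
gzip -c "$workdir/chapter.txt" > "$workdir/chapter.gz"

# Count matching lines in the raw compressed bytes ...
# (grep exits non-zero when nothing matches, hence the "|| true")
raw_hits=$(grep -a -c "abcde" "$workdir/chapter.gz" || true)
# ... and in the expanded text.
expanded_hits=$(gzip -dc "$workdir/chapter.gz" | grep -c "abcde")

echo "compressed: $raw_hits, expanded: $expanded_hits"
rm -rf "$workdir"
```

The search string is the same in both cases; only the expanded stream contains it as literal bytes, which is why decompressing before grepping is the practical route for book text.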
This is not a problem of grep being incapable of searching in these files, but of you not providing the correct search string. The same would happen if you read a file with Japanese text through some Japanese-to-English translation software and then hoped to find the English words by grepping the original file. With -a and the correct Japanese (binary) word patterns, grep would work just fine.
This worked for me on Windows 7 with Cygwin to search text inside the zip archives:
c:\> zipgrep "regex" file.epub
zipgrep is a shell script located at c:/cygwin/bin/zipgrep. This also works:
c:\> unzip -p "*.epub" | grep -a --color regex
-p makes unzip write the extracted file contents to a pipe (stdout) instead of to disk.
grep-epub.sh script
#!/bin/bash
# Search the text inside EPUB files for a Perl-compatible regex.
PAT=${1:?"Usage: grep-epub PAT *.epub files to grep"}
shift
: "${1:?Need epub files to grep}"
for i in "$@"; do   # "$@" keeps file names with spaces intact
    echo "$0 $i"
    unzip -p "$i" "*.htm*" "*.xml" "*.opf" |  # unzip only html and content files to stdout
    perl -lpe 's![<][^>]{1,200}?[>]!!g;' |     # strip short html tags such as <b>
    grep -Pinaso ".{0,60}$PAT.{0,60}" |        # keep up to 60 characters of context around matches
    grep -Pi --color "$PAT"                    # color the matches
done