How can I run a grep on epub/mobi files?

You can grep these files directly by passing the -a option, which makes grep treat binary files as text:

grep -a "author" *.epub *.mobi

The above works on all of my 1000+ EPUB and MOBI files, giving the expected results.

EPUB and MOBI are both container formats. EPUB is essentially a ZIP file with some structural requirements; MOBI is a Palm Database Format file. Both formats allow compressed or uncompressed data in the container.
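
One way to check which container you are dealing with is to look at the leading bytes. The sketch below is only an illustration: ebook_kind is a made-up helper name, and it assumes the usual signatures (ZIP archives start with "PK", and MOBI files carry "BOOKMOBI" at byte offset 60 of the Palm Database header). It has not been checked against every MOBI variant.

```shell
# ebook_kind is a hypothetical helper, not part of any standard tool.
# Prints "zip", "mobi", or "unknown" for the file given as $1.
ebook_kind() {
  # ZIP (and therefore EPUB) files begin with the two bytes "PK".
  if [ "$(head -c 2 "$1" | tr -d '\0')" = "PK" ]; then
    echo zip
  # Assumption: the Palm Database type/creator field sits at offset 60,
  # so bytes 61-68 (1-based) read "BOOKMOBI" for MOBI files.
  elif [ "$(tail -c +61 "$1" | head -c 8 | tr -d '\0')" = "BOOKMOBI" ]; then
    echo mobi
  else
    echo unknown
  fi
}
```

Called as ebook_kind somebook.epub, it prints a single word, which tells you whether zip tools such as unzip apply to the file at all.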

If the data you are looking for is in a "file" within the container and that file is compressed, you would have to search for the compressed form of the string, not the expanded, uncompressed version. In particular, if you read the word 'abcde' in a book on your ebook reader, grep -a 'abcde' on your EPUB and MOBI files will generally not find it, because the contents of the book are likely stored in compressed "files" in the container (likely, but not necessarily: compression is only an efficiency measure).

This is not a problem of grep being incapable of searching in these files, but of you not providing the correct search string. The same would happen if you read a file with Japanese text using some Japanese to English translation software and then hoped you could find the English words by grepping the original file. With -a and the correct Japanese (binary) word patterns, grep would work just fine.
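
The effect of compression described above can be reproduced without any ebook at all. The following is a rough sketch, not a test of real EPUB files: it uses gzip as a stand-in (an assumption on my part; gzip and ZIP both normally use DEFLATE), and the sample sentence is made up.

```shell
# Hypothetical sample text; gzip stands in for the compression inside an EPUB.
sample='the quick abcde fox'
tmp=$(mktemp)
printf '%s\n' "$sample" | gzip -c > "$tmp"

# Search the compressed bytes directly: the plain string is not in there.
# grep -c exits non-zero when the count is 0, hence the "|| true".
raw_hits=$(grep -ac 'abcde' "$tmp" || true)

# Decompress first, then search: now the string is found.
text_hits=$(gzip -dc "$tmp" | grep -c 'abcde' || true)

rm -f "$tmp"
echo "matching lines in compressed data: $raw_hits, after decompressing: $text_hits"
```

The same reasoning applies to a compressed entry inside an EPUB: grep -a sees the compressed bytes, so the word only turns up once the entry has been decompressed.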

This worked for me on Windows 7 with Cygwin; zipgrep searches the text inside zip archives:

c:\> zipgrep "regex" file.epub

zipgrep is a shell script (c:/cygwin/bin/zipgrep under Cygwin). This also works:

c:\> unzip -p "*.epub" | grep -a --color regex

-p extracts the files to stdout (a pipe) instead of writing them to disk.

grep-epub.sh script

#!/bin/bash
# Usage: grep-epub.sh PAT file.epub [more.epub ...]
PAT=${1:?"Usage: grep-epub PAT *.epub files to grep"}
shift
: "${1:?Need epub files to grep}"
for i in "$@"; do   # "$@" keeps file names with spaces intact
  echo "$0 $i"
  unzip -p "$i" "*.htm*" "*.xml" "*.opf" |  # unzip only html and content files to stdout
    perl -lpe 's![<][^>]{1,200}?[>]!!g;' | # strip short html tags such as <b>
    grep -Pinaso  ".{0,60}$PAT.{0,60}" | # keep some context around matches
    grep -Pi --color "$PAT"              # color the matches.
done
