How to split an mbox file into n-MB big chunks using the terminal?
formail is perfectly suited for this task. You may look at formail's +skip and -total options:
Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.
Depending on the size of your mailbox and mails, you may try
formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox
etc.
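The repeated commands above can be put in a loop. This is only a sketch: the function name split_by_count is made up for illustration, and it assumes that formail writes nothing once +skip goes past the last message, which is what ends the loop.

```shell
# Hypothetical helper: split SRC into import-NN.mbox files of PER messages each.
# Usage: split_by_count google.mbox 100
split_by_count() {
    src=$1
    per=$2
    skip=0
    i=1
    while :; do
        out=$(printf 'import-%02d.mbox' "$i")
        # Same command as above, with the skip count advanced on every pass.
        formail +"$skip" -"$per" -s < "$src" > "$out"
        # Assumption: an empty result means all messages have been written.
        if [ ! -s "$out" ]; then
            rm -f "$out"
            break
        fi
        skip=$(( skip + per ))
        i=$(( i + 1 ))
    done
}
```

Note that formail rereads the whole mailbox on every pass, so this is slow on very large files, but that hardly matters for a one-off job.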
The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
To find a reasonable starting number of mails per chunk, check how large the first N messages are (the last number wc prints is the size in bytes):
formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc
You may need to experiment a bit to find numbers that suit your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on it.
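If you would rather not guess, the average message size gives a rough starting number. The sketch below is only an estimate: the function name msgs_per_chunk is made up, and it counts every line starting with "From " as a message, so unescaped "From " lines inside message bodies will skew the result.

```shell
# Hypothetical helper: suggest how many messages fit in a chunk of TARGET_BYTES.
# Usage: msgs_per_chunk MBOX TARGET_BYTES
msgs_per_chunk() {
    mbox=$1
    target=$2
    # Count message separators and the total size in bytes.
    count=$(grep -c '^From ' "$mbox")
    bytes=$(( $(wc -c < "$mbox") ))
    if [ "$count" -eq 0 ]; then
        echo "no messages found in $mbox" >&2
        return 1
    fi
    avg=$(( bytes / count ))      # average bytes per message
    n=$(( target / avg ))
    # Never suggest fewer than one message per chunk.
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}
```

For example, msgs_per_chunk google.mbox 40000000 prints a message count to try for chunks of about 40 MB. A few very large messages can still push individual chunks well over the target.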
I just improved the script from Mark Sechell's answer. As we can see, that script splits the mbox file by the number of emails per chunk. This improved script splits the mbox file by a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split it into chunks of a specified size (see the notes further down).
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):
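BEGIN { chunk = 0; filesize = 0 }
# A line starting with "From " begins a new message: the only place we split.
/^From / {
    if (filesize >= 40000000) {    # file size per chunk in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
{ filesize += length() + 1 }       # +1 for the newline, which length() does not count
{ print > ("chunk_" chunk ".txt") }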
Then run this line in that directory (the one containing both mboxsplit.txt and the mbox file):
awk -f mboxsplit.txt mbox
Please note:
- Every chunk except the last will be at least the defined size, because the size is checked only when a new message starts. How far a chunk overshoots depends on the size of the last email written to it.
- An email is never split across chunks.
- A chunk may contain only one email if that email is larger than the specified chunk size.
I suggest setting the chunk size somewhat lower than the maximum upload/import size.
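To try different limits without editing the script each time, the size can be passed in with awk's -v option. This is a sketch of the same idea, not a replacement for the script above; the function name split_mbox_by_size and the variable names max and size are made up here.

```shell
# Hypothetical wrapper: split MBOX into chunk_N.txt files of roughly MAX_BYTES.
# Usage: split_mbox_by_size MBOX MAX_BYTES
split_mbox_by_size() {
    awk -v max="$2" '
        BEGIN { chunk = 0; size = 0 }
        # Start a new chunk only at a message boundary.
        /^From / && size >= max {
            close("chunk_" chunk ".txt")
            chunk++
            size = 0
        }
        {
            size += length($0) + 1    # +1 for the newline that length() omits
            print > ("chunk_" chunk ".txt")
        }
    ' "$1"
}
```

For example, split_mbox_by_size mbox 35000000 leaves some headroom under a 40 MB import limit. As with the original script, the chunks are written to the current directory.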
If your mbox is in standard format, each message will begin with From and a space:
From [email protected]
So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it on a message-by-message basis, splitting only at the start of a message. Let's say we went for 1,000 messages per output file:
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs>1000){close("chunk_" chunk ".txt");msgs=1;chunk++}}{print > ("chunk_" chunk ".txt")}' mbox
then you will get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.
If you are unfortunate enough to be on Windows (where cmd.exe does not treat single quotes as quoting characters), you will need to save the following in a file called awk.txt:
BEGIN{chunk=0} /^From /{msgs++;if(msgs>1000){close("chunk_" chunk ".txt");msgs=1;chunk++}}{print > ("chunk_" chunk ".txt")}
and then type
awk -f awk.txt mbox
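Before deleting or importing anything, it may be worth checking that no message was lost in the split. The following is a rough sanity check, not a full comparison: the function name verify_chunks is made up, and it assumes the chunks are named chunk_*.txt and sit in the current directory.

```shell
# Hypothetical check: compare the chunks against the original mbox.
# Usage: verify_chunks MBOX   (run in the directory holding chunk_*.txt)
verify_chunks() {
    orig_msgs=$(grep -c '^From ' "$1")
    orig_bytes=$(( $(wc -c < "$1") ))
    chunk_msgs=$(cat chunk_*.txt | grep -c '^From ')
    chunk_bytes=$(( $(cat chunk_*.txt | wc -c) ))
    echo "messages: original=$orig_msgs chunks=$chunk_msgs"
    echo "bytes:    original=$orig_bytes chunks=$chunk_bytes"
    # Succeed only if the message counts agree; byte totals are informational.
    [ "$orig_msgs" -eq "$chunk_msgs" ]
}
```

The byte totals are printed only for information, since awk adds a final newline if the original file lacks one, which can make them differ by one byte.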