Compressing a folder with many duplicated files
I suggest three options that I've tried (on Windows):
- 7-Zip LZMA2 compression with a dictionary size of 1536 MB
- WinRAR "solid" archive
- 7-Zip WIM file
I had 10 folders with different versions of a website (files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1 GB (100 MB per folder on average). Standard 7-Zip or WinRAR compression gave me a file of about 400-500 MB, while these options gave me files of (1) 80 MB, (2) 100 MB and (3) 170 MB respectively.
Update: Thanks to @Griffin's suggestion in the comments, I tried applying 7-Zip LZMA2 compression (the dictionary size seems to make no difference) to the 7-Zip WIM file. Sadly, it is not the same backup I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try on your specific set of files and compare it against method 1.
New edit: My backups kept growing and now include many image files. With 30 versions of the site, method 1 weighs 6 GB, while a 7-Zip WIM file inside a 7-Zip LZMA2 archive weighs only 2 GB!
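The two-step method from the update can be sketched as a small shell script, assuming a 7z binary on the PATH. The helper names run and wim_then_7z, the DRY_RUN variable and the example paths are made up for illustration; the switches -twim, -t7z, -m0=lzma2 and -mx=9 are standard 7-Zip ones. The first step packs the folders into a WIM container, the second compresses that WIM with LZMA2.

```shell
# Hypothetical helper: print a command, and run it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Hypothetical helper: step 1 packs SRC into OUT.wim,
# step 2 compresses OUT.wim into OUT.wim.7z with LZMA2 at the ultra level.
wim_then_7z() {
    src=$1
    out=$2
    run 7z a -twim "$out.wim" "$src" &&
    run 7z a -t7z -m0=lzma2 -mx=9 "$out.wim.7z" "$out.wim"
}

# Example (assumed paths): DRY_RUN=1 only prints the two commands.
# DRY_RUN=1 wim_then_7z ./site_versions backup
```

On Windows the same two 7z invocations can be typed directly at the command prompt. Whether this beats method 1 depends on your files, so compare both results on your own data.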
The best option in your case is 7-Zip. Here are the options:
7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files
a
- add files to archive
-r
- Recurse subdirectories
-t7z
- Set type of archive (7z in your case)
-m0=lzma2
- Sets the compression method to LZMA2. LZMA is the default and general compression method of the 7z format. The main features of the LZMA method:
- High compression ratio
- Variable dictionary size (up to 4 GB)
- Compressing speed: about 1 MB/s on 2 GHz CPU
- Decompressing speed: about 10-20 MB/s on 2 GHz CPU
- Small memory requirements for decompressing (depends on the dictionary size)
- Small code size for decompressing: about 5 KB
- Supporting multi-threading and P4's hyper-threading
-mx=9
- Sets the compression level. x=0 means Copy mode (no compression); x=9 means Ultra.
-mfb=273
- Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.
-md=29
- Sets the dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum dictionary size is 1536 MB, but the 32-bit version of 7-Zip only allows dictionaries of up to 128 MB. The default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify a suffix from the set [b|k|m|g], the dictionary size is calculated as DictionarySize = 2^Size bytes, so -md=29 means 2^29 bytes = 512 MB. To decompress a file compressed by LZMA with dictionary size N, you need about N bytes of available memory (RAM).
I use md=29 because my server has only 16 GB of RAM available. With these settings 7-Zip uses only about 5 GB, whatever the size of the directory being archived. If I use a bigger dictionary size, the system starts swapping.
-ms=8g
- Enables or disables solid mode; here, 8g sets the solid block size limit to 8 GB. The default mode is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block size as big as possible.
Limiting the solid block size usually decreases the compression ratio. Updating solid .7z archives can be slow, since it can require some recompression.
-mmt=off
- Sets multithreading mode to OFF. You need to switch it off so that similar or identical files are processed by the same 7-Zip thread in one solid block. The drawback is slow archiving, no matter how many CPUs or cores your system has.
-mmtf=off
- Sets multithreading mode for filters to OFF.
-myx=9
- Sets the level of file analysis to maximum: all files are analyzed (for the Delta and executable filters). Note that this switch is not included in the command line shown above.
-mqs=on
- Sorts files by type in solid archives, so that identical files are stored together.
-bt
- show execution time statistics
-bb3
- set output log level
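As a quick check of the -md notation described above: a value without a suffix is an exponent, so the dictionary is 2^Size bytes. A minimal shell sketch (the function name dict_mib is made up for illustration) converts the exponent to megabytes:

```shell
# Hypothetical helper: convert a suffix-less -md value (an exponent)
# into the dictionary size in MB, using DictionarySize = 2^Size bytes.
dict_mib() {
    echo $(( (1 << $1) / 1048576 ))
}

dict_mib 24   # prints 16  (LZMA default in normal mode)
dict_mib 26   # prints 64  (LZMA default in ultra mode)
dict_mib 29   # prints 512 (the value used in the command above)
```

This only covers the dictionary itself; the compressor needs several times that amount of RAM, which is consistent with the roughly 5 GB reported for -md=29.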
7-Zip supports the WIM file format, which detects and 'compresses' duplicates. If you're using the 7-Zip GUI, simply select the 'wim' archive format.
If you're using command-line 7-Zip, see this answer: https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
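Before choosing a method, it can help to know how much of the folder is actually duplicated. The following is a rough sketch and not part of 7-Zip: it uses the POSIX cksum tool (CRC plus file size) to count files whose content has already been seen. The name count_duplicates is hypothetical, and a matching CRC and size is a strong hint of identical content, not proof.

```shell
# Hypothetical helper: print "<redundant files> <redundant bytes>" for a directory.
# cksum prints one line per file: <crc> <size-in-bytes> <path>
count_duplicates() {
    find "$1" -type f -exec cksum {} + |
    awk '{
            key = $1 " " $2            # same CRC and same size: treat as a duplicate
            if (key in seen) { files++; bytes += $2 }
            seen[key] = 1
         }
         END { printf "%d %.0f\n", files, bytes }'
}

# Example (assumed path):
# count_duplicates /path/to/files
```

A large second number suggests that a deduplicating container such as WIM, or a big solid block, is likely to pay off.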