What are the different use cases of joblib versus pickle?
Thanks to Gunjan for giving us this script! I modified it for Python3 results
#comapare pickle loaders
from time import time
import pickle
import os
import _pickle as cPickle
from sklearn.externals import joblib
file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
t1 = time()
lis = []
d = pickle.load(open(file,"rb"))
print("time for loading file size with pickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
cPickle.load(open(file,"rb"))
print("time for loading file size with cpickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file),"KB =>", time()-t1)
time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891
- joblib is usually significantly faster on large numpy arrays because it has a special handling for the array buffers of the numpy datastructure. To find about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling using zlib or lz4.
- joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it which makes it possible to share memory between processes.
- if you don't pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small python objects (e.g. a large dict of str objects) because the pickle module of the standard library is implemented in C while joblib is pure python.
- since PEP 574 (Pickle protocol 5) has been merged in Python 3.8, it is now much more efficient (memory-wise and cpu-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4GB or more.
- But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory mapped mode with
mmap_mode="r"
.
I came across same question, so i tried this one (with Python 2.7) as i need to load a large pickle file
#comapare pickle loaders
from time import time
import pickle
import os
try:
import cPickle
except:
print "Cannot import cPickle"
import joblib
t1 = time()
lis = []
d = pickle.load(open("classi.pickle","r"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
cPickle.load(open("classi.pickle","r"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"),"KB =>", time()-t1
Output for this is
time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392
According to this joblib works better than cPickle and Pickle module from these 3 modules. Thanks