Using python lime as a udf on spark
I'm the dill author. I agree with @Majaha, and will extend @Majaha's answer slightly. In the first link in @Majaha's answer, it's clearly pointed out that a Broadcast instance is hardwired to use pickle... so the suggestion to dill to a string, then undill afterward is a good one.
Unfortunately, the extend method probably won't work for you. In the Broadcast class, the source uses cPickle, which dill cannot extend. If you look at the source, it uses import cPickle as pickle; ... pickle.dumps for python 2, and import pickle; ... pickle.dumps for python 3. Had it used import pickle; ... pickle.dumps for python 2, and import pickle; ... pickle._dumps for python 3, then dill could extend the pickler by just doing an import dill. For example:
Python 3.6.6 (default, Jun 28 2018, 05:53:46)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pickle import _dumps
>>> import dill
>>> _dumps(lambda x:x)
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x01KCC\x04|\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x07\x00\x00\x00<stdin>q\tX\x08\x00\x00\x00<lambda>q\nK\x01C\x00q\x0b))tq\x0cRq\rc__main__\n__dict__\nh\nNN}q\x0etq\x0fRq\x10.'
You could, thus, either do what @Majaha suggests (and bookend the call to broadcast), or you could patch the code to make the replacement that I outline above (where needed, but eh...), or you may be able to make your own derived class that does the job using dill:
>>> from pyspark.broadcast import Broadcast as _Broadcast
>>>
>>> class Broadcast(_Broadcast):
... def dump(self, value, f):
... try:
... import dill
... dill.dump(value, f, pickle_protocol)
... ...[INSERT THE REST OF THE DUMP METHOD HERE]...
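To make the bookend approach from @Majaha's answer concrete: serialize the troublesome object to bytes with dill before handing it to broadcast (bytes serialize fine with the built-in pickler), then undill inside the udf. A minimal sketch of the mechanics, with spark itself left out:

```python
import dill

# "dill to a string": a lambda won't survive the built-in pickler,
# but its dill-serialized bytes will
payload = dill.dumps(lambda x: x + 1)

# payload is what you would pass to sc.broadcast(...); on the worker
# side, the udf recovers the function with dill.loads
func = dill.loads(payload)
print(func(41))  # 42
```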
If the above fails... you could still get it to work by pinpointing where the serialization failure occurs (there's dill.detect.trace to help you with that).
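For instance, turning on the trace before a dump prints each object dill visits, so you can see exactly which attribute blows up (the exact trace interface varies a bit between dill versions):

```python
import dill

# enable the pickling trace, dump, then disable it again;
# the trace output lists every object dill serializes
dill.detect.trace(True)
payload = dill.dumps(lambda x: x)
dill.detect.trace(False)
```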
If you are going to suggest to pyspark to use dill... a potentially better suggestion is to allow users to dynamically replace the serializer. This is what mpi4py and a few other packages do.
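The pattern those packages use is roughly: the library routes all serialization through a module-level object with dumps/loads attributes, which the user may reassign at runtime. A toy version (the Serializer class and names here are illustrative, not any library's actual API):

```python
import pickle

class Serializer(object):
    # defaults to the stdlib pickler; users can swap in dill,
    # cloudpickle, or anything with compatible dumps/loads
    def __init__(self, dumps=pickle.dumps, loads=pickle.loads):
        self.dumps = dumps
        self.loads = loads

# the module-level instance the library would use internally
serializer = Serializer()

# user code: replace the pickler with dill at runtime
import dill
serializer.dumps, serializer.loads = dill.dumps, dill.loads

# now objects the stdlib pickler rejects (like a lambda) round-trip
roundtrip = serializer.loads(serializer.dumps(lambda x: x * 2))
print(roundtrip(3))  # 6
```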