Using python lime as a udf on spark
I'm the dill author. I agree with @Majaha, and will extend @Majaha's answer slightly. In the first link in @Majaha's answer, it's clearly pointed out that a Broadcast instance is hardwired to use pickle... so the suggestion to dill to a string, then undill afterward is a good one.
Unfortunately, the extend method probably won't work for you. In the Broadcast class, the source uses cPickle, which dill cannot extend. If you look at the source, it uses import cPickle as pickle; ... pickle.dumps for python 2, and import pickle; ... pickle.dumps for python 3. Had it used import pickle; ... pickle.dumps for python 2, and import pickle; ... pickle._dumps for python 3, then dill could extend the pickler by just doing an import dill. For example:
Python 3.6.6 (default, Jun 28 2018, 05:53:46)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pickle import _dumps
>>> import dill
>>> _dumps(lambda x:x)
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x01KCC\x04|\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x07\x00\x00\x00<stdin>q\tX\x08\x00\x00\x00<lambda>q\nK\x01C\x00q\x0b))tq\x0cRq\rc__main__\n__dict__\nh\nNN}q\x0etq\x0fRq\x10.'
You could, thus, either do what @Majaha suggests (and bookend the call to broadcast), or you could patch the code to make the replacement that I outline above (where needed, but eh...), or you may be able to make your own derived class that does the job using dill:
>>> from pyspark.broadcast import Broadcast as _Broadcast
>>>
>>> class Broadcast(_Broadcast):
... def dump(self, value, f):
... try:
... import dill
... dill.dump(value, f, pickle_protocol)
... ...[INSERT THE REST OF THE DUMP METHOD HERE]...
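To make the bookend approach from @Majaha's answer concrete: serialize the troublesome object to bytes with dill before handing it to broadcast (bytes serialize fine with the built-in pickler), then undill inside the udf. A minimal sketch of the mechanics, with spark itself left out:

```python
import dill

# "dill to a string": a lambda won't survive the built-in pickler,
# but its dill-serialized bytes will
payload = dill.dumps(lambda x: x + 1)

# payload is what you would pass to sc.broadcast(...); on the worker
# side, the udf recovers the function with dill.loads
func = dill.loads(payload)
print(func(41))  # 42
```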
If the above fails... you could still get it to work by pinpointing where the serialization failure occurs (there's dill.detect.trace to help you with that).
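For instance, turning on the trace before a dump prints each object dill visits, so you can see exactly which attribute blows up (the exact trace interface varies a bit between dill versions):

```python
import dill

# enable the pickling trace, dump, then disable it again;
# the trace output lists every object dill serializes
dill.detect.trace(True)
payload = dill.dumps(lambda x: x)
dill.detect.trace(False)
```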
If you are going to suggest to pyspark to use dill... a potentially better suggestion is to allow users to dynamically replace the serializer. This is what mpi4py and a few other packages do.
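The pattern those packages use is roughly: the library routes all serialization through a module-level object with dumps/loads attributes, which the user may reassign at runtime. A toy version (the Serializer class and names here are illustrative, not any library's actual API):

```python
import pickle

class Serializer(object):
    # defaults to the stdlib pickler; users can swap in dill,
    # cloudpickle, or anything with compatible dumps/loads
    def __init__(self, dumps=pickle.dumps, loads=pickle.loads):
        self.dumps = dumps
        self.loads = loads

# the module-level instance the library would use internally
serializer = Serializer()

# user code: replace the pickler with dill at runtime
import dill
serializer.dumps, serializer.loads = dill.dumps, dill.loads

# now objects the stdlib pickler rejects (like a lambda) round-trip
roundtrip = serializer.loads(serializer.dumps(lambda x: x * 2))
print(roundtrip(3))  # 6
```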