Require kryo serialization in Spark (Scala)

Here is a way to quickly collect the names of all the classes that need to be registered.

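import com.esotericsoftware.kryo.{Kryo, Registration}
import com.esotericsoftware.kryo.util.DefaultClassResolver
import org.apache.commons.lang3.reflect.FieldUtils // assuming commons-lang3 is on the classpath
import org.apache.spark.serializer.KryoRegistrator

// Reflection helpers for reading and writing fields, including private ones.
// Note: an implicit class must be nested inside an object, class, or trait to compile.
implicit class FieldExtensions(private val obj: Object) extends AnyVal {
    def readFieldAs[T](fieldName: String): T = {
        FieldUtils.readField(obj, fieldName, true).asInstanceOf[T]
    }

    def writeField(fieldName: String, value: Object): Unit = {
        FieldUtils.writeField(obj, fieldName, value, true)
    }
}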

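// A class resolver that logs every class Kryo has to register implicitly,
// i.e. every class that was not registered up front.
class LogClassResolver extends DefaultClassResolver {

    override def registerImplicit(t: Class[_]): Registration = {
        // The prefix is what the grep command further down searches for.
        println(s"registerImplicitclasstype:${t.getName}")

        super.registerImplicit(t)
    }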

    def copyFrom(resolver: DefaultClassResolver): Unit = {
        this.kryo = resolver.readFieldAs("kryo")

        this.idToRegistration.putAll(resolver.readFieldAs("idToRegistration"))
        this.classToRegistration.putAll(resolver.readFieldAs("classToRegistration"))
        this.classToNameId = resolver.readFieldAs("classToNameId")
        this.nameIdToClass = resolver.readFieldAs("nameIdToClass")
        this.nameToClass = resolver.readFieldAs("nameToClass")
        this.nextNameId = resolver.readFieldAs("nextNameId")

        this.writeField("memoizedClassId", resolver.readFieldAs("memoizedClassId"))
        this.writeField("memoizedClassIdValue", resolver.readFieldAs("memoizedClassIdValue"))
        this.writeField("memoizedClass", resolver.readFieldAs("memoizedClass"))
        this.writeField("memoizedClassValue", resolver.readFieldAs("memoizedClassValue"))
    }
}

class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
        // Swap Kryo's resolver for the logging one, keeping the existing registrations.
        val newResolver = new LogClassResolver
        newResolver.copyFrom(kryo.getClassResolver.asInstanceOf[DefaultClassResolver])
        FieldUtils.writeField(kryo, "classResolver", newResolver, true)
    }
}

Then you just need to register MyRegistrator in the Spark session.

val sparkSession = SparkSession.builder()
    .appName("Your_Spark_App")
    .config("spark.kryo.registrator", classOf[MyRegistrator].getTypeName)
    .getOrCreate()
    // all your spark logic will be added here

After that, submit a small sample Spark app to the cluster. The names of all the classes that need registration will be printed to stdout. The following Linux commands then extract the list of class names:

yarn logs --applicationId {your_spark_app_id} | grep registerImplicitclasstype >> type_names.txt
sort -u type_names.txt

Then register every class name in your registrator: kryo.register(Class.forName("class name"))
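If you would rather do the filtering in Scala than with grep and sort, the extraction step could be sketched roughly as below. KryoLogParsing and classNames are hypothetical names made up for this illustration; the only thing taken from the code above is the registerImplicitclasstype: prefix that LogClassResolver prints.

```scala
// Hypothetical helper (not part of Spark or Kryo): turns raw log lines into
// a sorted, de-duplicated list of class names.
object KryoLogParsing {
  // Must match the prefix printed by LogClassResolver.registerImplicit.
  val Marker = "registerImplicitclasstype:"

  def classNames(logLines: Seq[String]): Seq[String] =
    logLines
      .filter(_.contains(Marker))                                             // keep only resolver output
      .map(line => line.substring(line.indexOf(Marker) + Marker.length).trim) // drop any log prefix
      .filter(_.nonEmpty)
      .distinct
      .sorted
}
```

Inside registerClasses, each resulting name can then be passed to kryo.register(Class.forName(name)). Array classes show up under their JVM names (for example [I for Array[Int]), which Class.forName accepts as well.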

After that, you can add config("spark.kryo.registrationRequired", "true") to the Spark conf. The YARN logs sometimes get lost; if that happens, just rerun the process above. P.S.: The code above works with Spark version 2.1.2.

Enjoy.


As I understand it, this does not actually guarantee that Kryo serialization is used; if a serializer is not available, Kryo will fall back to Java serialization.

No. If you set spark.serializer to org.apache.spark.serializer.KryoSerializer then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.

So what is this Kryo registration then?

When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That's a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.
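To get a feel for the numbers, here is a back-of-the-envelope sketch. ClassNameOverhead, nameBytes and extraBytes are hypothetical names for this illustration only. It assumes the class name is written once per record as UTF-8 text, takes 2 bytes as the size of a numeric reference, and ignores Kryo's own framing bytes, so treat the result as a rough estimate rather than an exact measurement.

```scala
// Hypothetical estimator (not a Kryo API): compares writing a class name
// with writing a small numeric reference.
object ClassNameOverhead {
  // Bytes needed to write the fully qualified class name as UTF-8 text.
  def nameBytes(cls: Class[_]): Int = cls.getName.getBytes("UTF-8").length

  // Rough extra bytes across `rows` records when the name is written
  // instead of a numeric reference of `idBytes` bytes (assumed: 2).
  def extraBytes(cls: Class[_], rows: Long, idBytes: Int = 2): Long =
    rows * (nameBytes(cls) - idBytes).toLong
}
```

For example, ClassNameOverhead.extraBytes(classOf[scala.Tuple3[_, _, _]], 1000000000L) estimates the cost of leaving Tuple3 unregistered over a billion records, and longer, package-heavy class names cost proportionally more.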

This is especially crucial when each row of an RDD is serialized with Kryo. You don't want to include the same class name for each of a billion rows. So you pre-register these classes. But it's easy to forget to register a new class and then you're wasting bytes again. The solution is to require every class to be registered:

conf.set("spark.kryo.registrationRequired", "true")

Now Kryo will never output full class names. If it encounters an unregistered class, that's a runtime error.

Unfortunately it's hard to enumerate all the classes that you are going to be serializing in advance. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]? You have to register classOf[scala.Tuple3[_, _, _]].

The list of classes that Spark registers actually includes CompactBuffer, so if you see an error for that, you're doing something wrong. You are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister or spark.kryo.registrator to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils.registerKryoClasses.)
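Putting the pieces together, a minimal configuration sketch might look like the following. The app name is a placeholder, and Tuple3 simply stands in for whatever classes your job serializes. SparkConf.registerKryoClasses fills in spark.kryo.classesToRegister for you, so Spark's own registrations are not bypassed.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-registration-sketch") // placeholder name
  // Use Kryo instead of Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail at runtime on any class that has not been registered.
  .set("spark.kryo.registrationRequired", "true")
  // Register your own classes on top of the ones Spark registers itself.
  .registerKryoClasses(Array[Class[_]](classOf[scala.Tuple3[_, _, _]]))
```

If you need more control than a plain class list, set spark.kryo.registrator to your own KryoRegistrator instead.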