Reusing a Schema from JSON within a Spark DataFrame using Scala
I recently ran into this. I'm using Spark 2.0.2 so I don't know if this solution works with earlier versions.
import scala.util.Try
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}

/** Produce a Schema string from a Dataset */
def serializeSchema(ds: Dataset[_]): String = ds.schema.json

/** Produce a StructType schema object from a JSON string */
def deserializeSchema(json: String): StructType = {
  Try(DataType.fromJson(json)).getOrElse(LegacyTypeStringParser.parse(json)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing StructType: $json")
  }
}
Note that I copied the "deserialize" function from a private function in Spark's StructType object, so I don't know how well it will be supported across versions.
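For what it's worth, here is a minimal round-trip sketch that uses only the public `schema.json` and `DataType.fromJson` calls, so it does not depend on the copied private logic or on the legacy parser fallback. The object name `SchemaRoundTrip` and the two-field schema are made up for illustration; with a real Dataset, `json` would be whatever `serializeSchema(ds)` returns.

```scala
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

object SchemaRoundTrip {
  // Hypothetical example schema, standing in for ds.schema of a real Dataset
  val original: StructType = StructType(Seq(
    StructField("gid", StringType, nullable = true),
    StructField("createHour", StringType, nullable = true)))

  // The same string serializeSchema(ds) would produce for a Dataset with this schema
  val json: String = original.json

  // DataType.fromJson returns a generic DataType, so narrow it to StructType
  val restored: StructType = DataType.fromJson(json) match {
    case s: StructType => s
    case other =>
      throw new IllegalArgumentException(s"Expected a StructType, got: ${other.typeName}")
  }
}
```

The restored schema can then be handed to a reader, for example `spark.read.schema(SchemaRoundTrip.restored).json(path)`, where `spark` is your SparkSession and `path` is a placeholder for your input location.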
Well, the error message should tell you everything you need to know here: StructType
expects a sequence of fields as its argument. So in your case the schema should look like this:
import org.apache.spark.sql.types._

StructType(Seq(
  StructField("comments", ArrayType(StructType(Seq( // <- Seq[StructField]
    StructField("comId", StringType, true),
    StructField("content", StringType, true))), true), true),
  StructField("createHour", StringType, true),
  StructField("gid", StringType, true),
  StructField("replies", ArrayType(StructType(Seq( // <- Seq[StructField]
    StructField("content", StringType, true),
    StructField("repId", StringType, true))), true), true),
  StructField("revisions", ArrayType(StructType(Seq( // <- Seq[StructField]
    StructField("modDate", StringType, true),
    StructField("revId", StringType, true))), true), true)))
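If the nesting makes the parentheses hard to follow, one option is to pull each inner struct out into its own value, so that every `StructType(...)` call visibly receives a `Seq[StructField]`. The sketch below does this for the "comments" field only; the names `NestedSchemaCheck`, `commentType` and `commentFields` are hypothetical and not part of your code.

```scala
import org.apache.spark.sql.types._

object NestedSchemaCheck {
  // Inner struct for one element of the "comments" array: built from a Seq[StructField]
  val commentType: StructType = StructType(Seq(
    StructField("comId", StringType, nullable = true),
    StructField("content", StringType, nullable = true)))

  // Outer schema (shortened here): again a Seq[StructField]
  val schema: StructType = StructType(Seq(
    StructField("comments", ArrayType(commentType, containsNull = true), nullable = true),
    StructField("gid", StringType, nullable = true)))

  // Look up the field by name and read back the field names of its element struct
  val commentFields: Seq[String] = schema("comments").dataType match {
    case ArrayType(s: StructType, _) => s.fieldNames.toSeq
    case _ => Seq.empty
  }
}
```

The "replies" and "revisions" fields could be factored out the same way.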