Spark: Programmatically creating DataFrame schema in Scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Single non-nullable array column of non-null integers
val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, containsNull = false), nullable = false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+
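Once the column is typed as an array, the usual array functions apply to it. As a minimal sketch (the alias elem is just an illustrative name), you can flatten the column with explode:

import org.apache.spark.sql.functions.explode

// One output row per element of the array column
df.select(explode(df("arr")).as("elem")).show()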
As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable = false),
      StructField("index", IntegerType, nullable = false),
      StructField("count", LongType, nullable = false),
      StructField("empties", LongType, nullable = false),
      StructField("nulls", LongType, nullable = false),
      StructField("uniqueValues", LongType, nullable = false),
      // nullable defaults to true for the remaining fields
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", ArrayType(StructType(Array(
        StructField("value", StringType),
        StructField("count", LongType)
      ))))
    )
  )
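To sanity-check a schema like this, you can build a row that matches it and create a DataFrame from it: nested structs are just nested Rows, and the ArrayType field takes a Seq. The values below are made-up sample data, purely for illustration:

import org.apache.spark.sql.Row

// Hypothetical sample row matching outputSchema; topValues is a
// Seq of Rows, one Row per (value, count) struct
val sample = Row("age", 0, 100L, 2L, 1L, 37L,
  41.5, 18.0, 95.0,
  Seq(Row("42", 12L), Row("18", 9L)))

val statsDf = sqlContext.createDataFrame(sc.parallelize(Seq(sample)), outputSchema)
statsDf.printSchema()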