Unit testing with Spark dataframes

You could use SharedSQLContext and SharedSparkSession that Spark uses for its own unit tests. Check my answer for examples.


This link shows how we can programmatically create a data frame with schema. You can keep the data in separate traits and mix it in with your tests. For instance,

// This example assumes CSV data. But same approach should work for other formats as well.

trait TestData {
  val data1 = List(
    "this,is,valid,data",
    "this,is,in-valid,data",
  )
  val data2 = ...  
}

Then with ScalaTest, we can do something like this.

class MyDFTest extends FlatSpec with Matchers {

  "method" should "perform this" in new TestData {
     // You can access your test data here. Use it to create the DataFrame.
     // Your test here.
  }
}

To create the DataFrame, you can have few util methods like below.

  def schema(types: Array[String], cols: Array[String]) = {
    val datatypes = types.map {
      case "String" => StringType
      case "Long" => LongType
      case "Double" => DoubleType
      // Add more types here based on your data.
      case _ => StringType
    }
    StructType(cols.indices.map(x => StructField(cols(x), datatypes(x))).toArray)
  }

  def df(data: List[String], types: Array[String], cols: Array[String]) = {
    val rdd = sc.parallelize(data)
    val parser = new CSVParser(',')
    val split = rdd.map(line => parser.parseLine(line))
    val rdd = split.map(arr => Row(arr(0), arr(1), arr(2), arr(3)))
    sqlContext.createDataFrame(rdd, schema(types, cols))
  }

I am not aware of any utility classes for checking specific values in a DataFrame. But I think it should be simple to write one using the DataFrame APIs.