How to use Scala DataFrameReader option method
The list of available options varies by file format. They are documented in the DataFrameReader
API docs.
For example:
def json(paths: String*): DataFrame
Loads a JSON file (one object per line) and returns the result as a DataFrame.
This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
You can set the following JSON-specific options to deal with non-standard JSON files:
primitivesAsString (default false): infers all primitive values as a string type
prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
allowComments (default false): ignores Java/C++ style comments in JSON records
allowUnquotedFieldNames (default false): allows unquoted JSON field names
allowSingleQuotes (default true): allows single quotes in addition to double quotes
allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all characters using the backslash quoting mechanism
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
  PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
  DROPMALFORMED: ignores whole corrupted records.
  FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field that holds the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
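Putting a few of those options to work, a read of slightly non-standard JSON could look like the sketch below (the file path and the particular option choices are only placeholders):
// Illustrative only: the path and option values are placeholders.
val people = sqlContext.read
  .option("primitivesAsString", "true")   // infer every primitive value as a string
  .option("allowComments", "true")        // tolerate Java/C++ style comments
  .option("mode", "DROPMALFORMED")        // drop corrupt records instead of failing
  .json("path/to/people.json")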
Spark source code
def option(key: String, value: String): DataFrameReader = {
this.extraOptions += (key -> value)
this
}
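Because option just records the key/value pair and returns this, calls can be chained fluently. Here is a minimal, made-up sketch of the same builder pattern (the MiniReader class is purely illustrative, not part of Spark):
import scala.collection.mutable

// Hypothetical stand-in for DataFrameReader that only demonstrates the pattern:
// options accumulate in a mutable Map, and returning `this` enables chaining.
class MiniReader {
  private val extraOptions = mutable.Map.empty[String, String]

  def option(key: String, value: String): MiniReader = {
    extraOptions += (key -> value)   // remember the setting
    this                             // hand the reader back for the next call
  }

  def currentOptions: Map[String, String] = extraOptions.toMap
}

val r = new MiniReader().option("header", "true").option("delimiter", ";")
// r.currentOptions == Map("header" -> "true", "delimiter" -> ";")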
In the Spark source itself, extraOptions is simply a Map, and it is used like this:
private def jdbc(
    url: String,
    table: String,
    parts: Array[Partition],
    connectionProperties: Properties): DataFrame = {
  val props = new Properties()
  // THIS
  extraOptions.foreach { case (key, value) =>
    props.put(key, value)
  }
  // connectionProperties should override settings in extraOptions
  props.putAll(connectionProperties)
  val relation = JDBCRelation(url, table, parts, props)(sqlContext)
  sqlContext.baseRelationToDataFrame(relation)
}
As you can see, it is simply a way to pass additional properties on to the JDBC
driver.
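In practice this means any connection property the driver understands can be handed over through option before calling load(). For example (the URL and credentials are placeholders):
// Placeholder connection details: url and dbtable identify the source,
// the remaining options end up in the Properties object passed to the driver.
val usersDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()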
There is also a more general options method that takes a whole Map instead of a single key/value pair; the Spark documentation shows it used like this:
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()