How to use Scala DataFrameReader option method
The list of available options varies by file format. They are documented in the DataFrameReader
API docs.
For example:
def json(paths: String*): DataFrame
Loads a JSON file (one object per line) and returns the result as a DataFrame.
This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan.
You can set the following JSON-specific options to deal with non-standard JSON files:
primitivesAsString (default false): infers all primitive values as a string type
prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
allowComments (default false): ignores Java/C++ style comments in JSON records
allowUnquotedFieldNames (default false): allows unquoted JSON field names
allowSingleQuotes (default true): allows single quotes in addition to double quotes
allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all characters using the backslash quoting mechanism
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
  PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
  DROPMALFORMED: ignores whole corrupted records.
  FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field that holds the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
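Putting a few of those options to work, a read of slightly non-standard JSON could look like the sketch below (the file path and the particular option choices are only placeholders):
// Illustrative only: the path and option values are placeholders.
val people = sqlContext.read
  .option("primitivesAsString", "true")   // infer every primitive value as a string
  .option("allowComments", "true")        // tolerate Java/C++ style comments
  .option("mode", "DROPMALFORMED")        // drop corrupt records instead of failing
  .json("path/to/people.json")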
Spark source code
def option(key: String, value: String): DataFrameReader = {
this.extraOptions += (key -> value)
this
}
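Because option just records the key/value pair and returns this, calls can be chained fluently. Here is a minimal, made-up sketch of the same builder pattern (the MiniReader class is purely illustrative, not part of Spark):
import scala.collection.mutable

// Hypothetical stand-in for DataFrameReader that only demonstrates the pattern:
// options accumulate in a mutable Map, and returning `this` enables chaining.
class MiniReader {
  private val extraOptions = mutable.Map.empty[String, String]

  def option(key: String, value: String): MiniReader = {
    extraOptions += (key -> value)   // remember the setting
    this                             // hand the reader back for the next call
  }

  def currentOptions: Map[String, String] = extraOptions.toMap
}

val r = new MiniReader().option("header", "true").option("delimiter", ";")
// r.currentOptions == Map("header" -> "true", "delimiter" -> ";")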
In the Spark source itself, extraOptions is simply a Map, and it is used like this:
private def jdbc(
    url: String,
    table: String,
    parts: Array[Partition],
    connectionProperties: Properties): DataFrame = {
  val props = new Properties()
  // THIS
  extraOptions.foreach { case (key, value) =>
    props.put(key, value)
  }
  // connectionProperties should override settings in extraOptions
  props.putAll(connectionProperties)
  val relation = JDBCRelation(url, table, parts, props)(sqlContext)
  sqlContext.baseRelationToDataFrame(relation)
}
As you can see, it is simply a way to pass additional properties on to the JDBC
driver.
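In practice this means any connection property the driver understands can be handed over through option before calling load(). For example (the URL and credentials are placeholders):
// Placeholder connection details: url and dbtable identify the source,
// the remaining options end up in the Properties object passed to the driver.
val usersDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()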
There is also a more general options method that takes a whole Map instead of a single key/value pair; the Spark documentation shows it used like this:
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()