How does foldLeft in Scala work on DataFrame?
Consider a simplified foldLeft example that mirrors your DataFrame version:
List(3, 2, 1).foldLeft("abcde")((acc, x) => acc.take(x))
If you look closely at what the function (acc, x) => acc.take(x) does in each iteration, the foldLeft is no different from the following:
"abcde".take(3).take(2).take(1)
// Result: "a"
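To see the intermediate accumulator values rather than only the final one, the same function can be passed to scanLeft, which keeps every step. This is a minimal sketch; the names FoldLeftTrace, seed, sizes, and step are made up for illustration.

```scala
object FoldLeftTrace {
  val seed = "abcde"
  val sizes = List(3, 2, 1)

  // The same function that is passed to foldLeft above
  val step: (String, Int) => String = (acc, x) => acc.take(x)

  // scanLeft keeps the seed and every intermediate accumulator
  val steps: List[String] = sizes.scanLeft(seed)(step)

  // foldLeft keeps only the last accumulator
  val result: String = sizes.foldLeft(seed)(step)

  def main(args: Array[String]): Unit = {
    steps.foreach(println) // abcde, abc, ab, a (one per line)
    println(result)        // a
  }
}
```

The last element of steps is always the value foldLeft returns.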
Going back to the foldLeft for your DataFrame:
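import org.apache.spark.sql.functions.{col, regexp_replace}

// tempDF is the accumulator; colName is the current element of stringColumns
stringColumns.foldLeft(yearDF) { (tempDF, colName) =>
  tempDF.withColumn(colName, regexp_replace(col(colName), "\n", ""))
}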
Similarly, it is no different from:
val sz = stringColumns.size
yearDF.
withColumn(stringColumns(0), regexp_replace(col(stringColumns(0)), "\n", "")).
withColumn(stringColumns(1), regexp_replace(col(stringColumns(1)), "\n", "")).
...
withColumn(stringColumns(sz - 1), regexp_replace(col(stringColumns(sz - 1)), "\n", ""))
- What value does tempDF hold? If it is the same as yearDF, how is it mapped to yearDF?
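In iteration i (i = 0, 1, 2, ...), tempDF holds the DataFrame produced by the previous iteration: it starts as yearDF itself when i = 0, and each call to withColumn(stringColumns(i), ...) returns a new DataFrame that becomes tempDF for the next iteration.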
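One way to picture how the accumulator is threaded through is to write foldLeft as an explicit loop. The helper foldLeftLoop below is hypothetical, not part of the Scala library; it is only a sketch of what foldLeft does conceptually, shown on the string example so that it runs without Spark.

```scala
object FoldLeftAsLoop {
  // Hypothetical hand-written equivalent of foldLeft, for illustration only
  def foldLeftLoop[A, B](xs: Seq[A], init: B)(f: (B, A) => B): B = {
    var acc = init                 // plays the role of tempDF; starts as the seed (yearDF)
    for (x <- xs) acc = f(acc, x)  // each result becomes the accumulator for the next element
    acc
  }

  def main(args: Array[String]): Unit = {
    val viaLoop = foldLeftLoop(List(3, 2, 1), "abcde")((acc, x) => acc.take(x))
    val viaFold = List(3, 2, 1).foldLeft("abcde")((acc, x) => acc.take(x))
    println(viaLoop == viaFold) // true
  }
}
```

In the DataFrame version, init corresponds to yearDF, xs to stringColumns, and acc to tempDF.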
- If withColumn is used in the function and the result is added to yearDF, why does it not create duplicate columns?
In withColumn(stringColumns(i), regexp_replace(col(stringColumns(i)), "\n", "")), the method withColumn returns a new DataFrame. Because the "added" column has the same name as the existing column stringColumns(i) it derives from, withColumn replaces that column instead of appending a duplicate, so the result has the same column list as the original yearDF.
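The column list can be checked directly. The sketch below assumes Spark is on the classpath and runs in local mode; the sample data, the column names id, name, and note, and the object name FoldLeftColumns are invented for illustration and do not come from your code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object FoldLeftColumns {
  // Strips newlines from each listed column; every step reuses an existing column name
  def stripNewlines(df: DataFrame, stringColumns: Seq[String]): DataFrame =
    stringColumns.foldLeft(df) { (tempDF, colName) =>
      tempDF.withColumn(colName, regexp_replace(col(colName), "\n", ""))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("foldLeft-demo")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with embedded newlines
    val yearDF = Seq((1, "a\nb", "c\nd")).toDF("id", "name", "note")
    val cleaned = stripNewlines(yearDF, Seq("name", "note"))

    println(yearDF.columns.mkString(", "))  // id, name, note
    println(cleaned.columns.mkString(", ")) // id, name, note
    cleaned.show()

    spark.stop()
  }
}
```

Both printed column lists are identical, because each withColumn call replaced a column of the same name rather than adding a new one.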