How to select all columns that start with a common label
Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
In Scala, first grab the column names with df.columns, then filter down to just the names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. The String overload of select has the signature select(String, String*), so it does not accept a single array directly. Luckily, the Column overload is select(Column*), so convert the Strings into Columns with .map(df(_)) and then pass the Array of Columns as varargs with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
The filter can be made more complex (as in pandas), although the result is rather ugly (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
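For the Python version, the same compound condition can be written as a list comprehension. This is only a minimal sketch: the list of names below is hypothetical and stands in for df.columns so that the snippet runs without Spark.

```python
# Hypothetical column names standing in for df.columns.
columns = ["colA", "colB", "colF1", "colF2", "other"]

# Keep "colA" plus every column whose name starts with "colF".
selected_columns = [c for c in columns if c == "colA" or c.startswith("colF")]

print(selected_columns)  # ['colA', 'colF1', 'colF2']
```

With a real DataFrame, the resulting list can be passed to df.select(selected_columns), as in the Python snippet at the top.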
If the list of other columns is fixed, you could also merge a fixed array of column names with the filtered array:
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
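A rough Python counterpart of the merge is sketched below. The helper name pick_columns and the sample column names are assumptions made for illustration, not part of any Spark API. The helper works on a plain list of names, so it can be tried without a cluster. Unlike the Scala one-liner, it skips a prefixed column that is already in the fixed list, so no column is selected twice.

```python
def pick_columns(all_columns, prefix, always_keep=()):
    """Return the fixed columns first, then every column starting with prefix."""
    selected = list(always_keep)
    for name in all_columns:
        # Skip names already present so no column appears twice.
        if name.startswith(prefix) and name not in selected:
            selected.append(name)
    return selected


# Hypothetical column names standing in for df.columns.
columns = ["colA", "colB", "colC", "colF1", "colF2"]
selected = pick_columns(columns, "colF", always_keep=["colA", "colB"])
print(selected)  # ['colA', 'colB', 'colF1', 'colF2']
```

With a real DataFrame this would be used as df.select(pick_columns(df.columns, "colF", ["colA", "colB"])).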