Selecting only numeric/string column names from a Spark DataFrame in PySpark
You can do what zlidme suggested to get only the string (categorical) columns. To extend on that answer, take a look at the example below. It will give you all numeric (continuous) columns in a list called continuousCols, all categorical columns in a list called categoricalCols, and all columns in a list called allCols.
import pandas as pd

# build sample data with pandas, then convert it to a Spark DataFrame
data = {'mylongint': [0, 1, 2],
        'shoes': ['blue', 'green', 'yellow'],
        'house': ['furniture', 'roof', 'foundation'],
        'C': [1, 0, 0]}
play_df = pd.DataFrame(data)
play_ddf = spark.createDataFrame(play_df)  # spark is an existing SparkSession
# store all column names in a list
allCols = [item[0] for item in play_ddf.dtypes]
# store all categorical (string) column names in a list
categoricalCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('string')]
# store all continuous (numeric) column names in a list
continuousCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('bigint')]
print(len(allCols), ' - ', len(continuousCols), ' - ', len(categoricalCols))
This will give the result: 4 - 2 - 2
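Once you have these lists, you can for example select only one group of columns (a minimal sketch using the play_ddf DataFrame from above):

# keep only the continuous columns
play_ddf.select(continuousCols).show()
# keep only the categorical columns
play_ddf.select(categoricalCols).show()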
PySpark provides a rich API related to schema types. As @DanieldePaula mentioned, you can access the fields' metadata through df.schema.fields.
Here is a different approach, based on checking the statically typed schema fields:
from pyspark.sql.types import StringType, DoubleType
df = spark.createDataFrame([
    [1, 2.3, "t1"],
    [2, 5.3, "t2"],
    [3, 2.1, "t3"],
    [4, 1.5, "t4"]
], ["cola", "colb", "colc"])
# get string
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']
# or double
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
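If you want every numeric column regardless of its exact type, you can check against the NumericType base class instead (a minimal sketch, assuming the same df as above):

from pyspark.sql.types import NumericType

# all numeric columns (long, double, decimal, ...) in one go
num_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# ['cola', 'colb']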
df.dtypes is a list of tuples (columnName, columnType), so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
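The resulting list can be passed straight to select; the same pattern also works for numeric columns by matching the relevant dtype strings (a minimal sketch, assuming the df from the previous answer and that these type names cover your numeric columns):

# keep only the string columns
df.select(columnList).show()

# collect numeric column names by their dtype strings
numericList = [item[0] for item in df.dtypes if item[1] in ('int', 'bigint', 'double', 'float')]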