performance optimization techniques pyspark code example

Example 1: What are the optimization techniques in Spark, or what optimizations have you done during your Spark project?

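# Filter once and cache the result, because it is reused in every iteration.
filtered_df = filter_input_data(initial_data)
filtered_df.persist()

for obj in list_objects:
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)

# Release the cached data once the loop is done.
filtered_df.unpersist()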

Example 2: What are the optimization techniques in Spark, or what optimizations have you done during your Spark project?

>>> df = spark.createDataFrame(
...     [('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...     ])
>>> df.rdd.getNumPartitions()
8
>>> # Now performing a group by operation
>>> group_df = df.groupBy("_1").count()
>>> group_df.show()
+---+-----+
| _1|count|
+---+-----+
|  1|    5|
|  2|    5|
+---+-----+
>>> group_df.rdd.getNumPartitions()
200
