Performance optimization techniques in PySpark: code examples
Example 1: What are optimization techniques in Spark, or what optimizations have you done in your Spark project?
# Persist a filtered DataFrame that is reused on every loop iteration,
# so Spark does not recompute the filter for each object.
filtered_df = filter_input_data(initial_data)
filtered_df.persist()

for obj in list_objects:
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)

# Release the cached data once all iterations are done.
filtered_df.unpersist()
Example 2: What are optimization techniques in Spark, or what optimizations have you done in your Spark project?
>>> df = spark.createDataFrame(
...     [('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false'),
...      ('1', 'true'), ('2', 'false')])
>>> df.rdd.getNumPartitions()
8
# Now perform a groupBy operation; it triggers a shuffle, and the result
# inherits the spark.sql.shuffle.partitions default of 200.
>>> group_df = df.groupBy("_1").count()
>>> group_df.show()
+---+-----+
| _1|count|
+---+-----+
|  1|    5|
|  2|    5|
+---+-----+
>>> group_df.rdd.getNumPartitions()
200