How to set Parquet file encoding in Spark
So I found an answer to my question on the Twitter engineering blog.
Parquet has automatic dictionary encoding that kicks in when the number of unique values in a column is below 10^5. Here is a post announcing Parquet 1.0 with self-tuning dictionary encoding.
UPD:
Dictionary encoding can be switched on or off in the SparkSession config:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("name")
  .config("parquet.enable.dictionary", "false") // "true" to enable it
  .getOrCreate()
Regarding per-column encoding, there is an open improvement issue in Parquet's Jira, created on 14th July 2017. Since dictionary encoding is the default and can only be applied to the whole table, it turns off Delta Encoding (Jira issue for this bug), which is the only suitable encoding for data like timestamps, where almost every value is unique.
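One thing worth checking (my own assumption, not from the post above): the delta encodings belong to the Parquet v2 data page format, so the writer format version may also matter. parquet-mr exposes this as parquet.writer.version; whether it helps depends on your Parquet and Spark versions, so treat the sketch below as something to verify rather than a confirmed fix:

// Assumption: the Parquet writer honours parquet.writer.version when set through the Spark config.
// "v2" selects the PARQUET_2_0 data page format, where delta encodings are available.
val sparkV2 = SparkSession.builder()
  .appName("name")
  .config("parquet.enable.dictionary", "false") // keep dictionary off, as above
  .config("parquet.writer.version", "v2")
  .getOrCreate()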
UPD2:
How can we tell which encoding was used for an output file?
I used parquet-tools for it.
-> brew install parquet-tools (for mac)
-> parquet-tools meta your_parquet_file.snappy.parquet
Output:
.column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
.column_2: BINARY SNAPPY DO:0 FPO:25526 SZ:119245/711487/1.32 VC:26900 ENC:PLAIN,RLE,BIT_PACKED
...
Where PLAIN and PLAIN_DICTIONARY are the encodings that were used for those columns.
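If you would rather check this from code instead of installing parquet-tools, the Parquet footer metadata exposes the same information. A minimal sketch using parquet-mr's ParquetFileReader (the file path is a placeholder, and readFooter is deprecated in newer Parquet releases but still works):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the footer of a single Parquet file and print the encodings used in each column chunk.
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("your_parquet_file.snappy.parquet"))
for (block <- footer.getBlocks.asScala; column <- block.getColumns.asScala) {
  println(s"${column.getPath}: ${column.getEncodings.asScala.mkString(",")}")
}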