How can I read a Parquet file in R without using Spark packages?
You can simply use the arrow package:
install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
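If you only need some of the columns, `read_parquet()` in arrow accepts a `col_select` argument. The sketch below is a self-contained round trip: it writes the built-in `mtcars` data set to a temporary file (the file and the chosen columns are just for illustration) and reads two columns back.

```r
library(arrow)

# Write a small built-in data set to a temporary Parquet file
tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)

# Read back only the columns we need
subset_df <- read_parquet(tmp, col_select = c("mpg", "cyl"))

names(subset_df)
nrow(subset_df)
```

Selecting columns at read time can help with wide files, since Parquet stores data by column.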
With reticulate you can use pandas from Python to read Parquet files. This saves you the hassle of running a Spark instance. You may lose some performance to serialization between Python and R; as a comment above mentioned, that was expected to last until Apache Arrow released its own R version (the arrow package shown above).
library(reticulate)
library(dplyr)
pandas <- import("pandas")
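# Note: this definition masks arrow::read_parquet() if arrow is also attached.
read_parquet <- function(path, columns = NULL) {
  # Expand "~" and resolve the path to an absolute one
  path <- normalizePath(path.expand(path))
  # as.list() makes sure even a single column name reaches pandas as a list
  if (!is.null(columns)) columns <- as.list(columns)
  # By default reticulate converts the pandas DataFrame to an R data.frame
  xdf <- pandas$read_parquet(path, columns = columns)
  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)
  # tbl_df() is deprecated in dplyr; as_tibble() is its replacement
  dplyr::as_tibble(xdf)
}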
read_parquet(PATH_TO_PARQUET_FILE)
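The pandas route only works if the Python environment that reticulate picks up has pandas plus a Parquet engine (pyarrow or fastparquet) installed. As a rough pre-flight check, something like the following could be run first; `parquet_backend_status()` is a hypothetical helper name, not part of reticulate, and it relies only on `py_module_available()`.

```r
library(reticulate)

# Hypothetical helper: report which Python modules needed by
# pandas.read_parquet() can be imported from the active Python.
parquet_backend_status <- function() {
  modules <- c("pandas", "pyarrow", "fastparquet")
  vapply(modules, py_module_available, logical(1))
}

status <- parquet_backend_status()
print(status)

# pandas itself plus at least one engine must be present
ready <- status[["pandas"]] && (status[["pyarrow"]] || status[["fastparquet"]])
```

If `ready` is `FALSE`, the missing modules can be installed into the Python environment before calling the wrapper above.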