Read a partitioned Parquet directory (all files) into one R dataframe with Apache Arrow
Solution for: Read partitioned Parquet files from the local file system into an R dataframe with arrow
Since I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries such as sparklyr, SparkR, or reticulate with dplyr, as described e.g. in "How do I read a Parquet in R and convert it to an R DataFrame?".
I solved my task now with your proposal, using arrow together with lapply and rbindlist:
my_df <- data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))
Looking forward to when the Apache Arrow functionality is available. Thanks!
Reading a directory of files is not something you can achieve by setting an option on the (single-)file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you can also select/filter on each one if you only need a known subset of the data; see the sketch below.
In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.
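For anyone reading this later: that multi-file dataset API has since shipped in the arrow R package as open_dataset(). A minimal sketch, assuming a release in which it is available (column names and filter again hypothetical):

library(arrow)
library(dplyr)

# open the whole partitioned directory as one dataset (nothing is read into memory yet)
ds <- open_dataset("component_mapping.parquet")

# column and row selection are pushed down to the files; collect() materializes the result
my_df <- ds %>%
  select(id, value) %>%
  filter(value > 0) %>%
  collect()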
Solution for: Read partitioned Parquet files from S3 into an R dataframe using arrow
As it took me very long to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned Parquet files from S3:
library(arrow)
library(aws.s3)
library(data.table)  # for rbindlist()

bucket <- "mybucket"
prefix <- "my_prefix"

# use the aws.s3 library to list all "part-" files (Key) of one parquet folder in the bucket for the given prefix
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket))

# concatenate all parts into one data.frame
data <- do.call(rbind, data)
What a mess, but it works.
@neal-richardson Is there a way to use arrow directly to read from S3? I couldn't find anything about it in the R documentation.