Read a partitioned Parquet directory (all files) into one R dataframe with Apache Arrow
Solution for: Read partitioned Parquet files from the local file system into an R dataframe with arrow
Since I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries such as sparklyr, SparkR, or reticulate with dplyr, as described e.g. in "How do I read a Parquet in R and convert it to an R DataFrame?".
I solved my task now with your proposal, using arrow together with lapply and rbindlist:
my_df <- data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))
Looking forward to when the Apache Arrow functionality is available. Thanks!
Reading a directory of files is not something you can achieve by setting an option on the (single-)file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you can also select/filter on each one if you only need a known subset of the data; see the sketch below.
In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.
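For anyone reading this later: that multi-file dataset API has since shipped in the arrow R package as open_dataset(). A minimal sketch, assuming a release in which it is available (column names and filter again hypothetical):

library(arrow)
library(dplyr)

# open the whole partitioned directory as one dataset (nothing is read into memory yet)
ds <- open_dataset("component_mapping.parquet")

# column and row selection are pushed down to the files; collect() materializes the result
my_df <- ds %>%
  select(id, value) %>%
  filter(value > 0) %>%
  collect()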
Solution for: Read partitioned Parquet files from S3 into an R dataframe using arrow
As it took me very long to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned Parquet files from S3:
library(arrow)
library(aws.s3)
library(data.table)  # for rbindlist()

bucket <- "mybucket"
prefix <- "my_prefix"

# use the aws.s3 library to list all "part-" files (Key) of one parquet folder in the bucket for the given prefix
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket))

# concatenate all parts into one data.frame
data <- do.call(rbind, data)
What a mess, but it works.
@neal-richardson Is there a way to use arrow directly to read from S3? I couldn't find anything about it in the R documentation.