R: possible truncation of >= 4GB file
I agree with @Sixiang.Hu's answer: R's `unzip()` won't work reliably with files greater than 4GB.
To answer the *how did you solve it?* part: I've tried a few different tricks, and in my experience anything that relies on R's built-ins (almost) invariably finds an end-of-file (EOF) marker before the actual end of the archive.
I deal with this issue in a set of files I process nightly; to handle it consistently and automatically, I wrote the function below to wrap the UNIX `unzip` utility. It's basically what you're doing with `system(unzip())`, but it gives you a bit more flexibility in its behavior and lets you check for errors more systematically.
```r
decompress_file <- function(directory, file, .file_cache = FALSE) {

    if (.file_cache == TRUE) {
       print("decompression skipped")
    } else {

      # Set working directory for decompression
      # simplifies unzip directory location behavior
      wd <- getwd()
      setwd(directory)

      # Run decompression
      decompression <-
        system2("unzip",
                args = c("-o", # include override flag
                         file),
                stdout = TRUE)

      # uncomment to delete archive once decompressed
      # file.remove(file)

      # Reset working directory
      setwd(wd); rm(wd)

      # Test for success criteria
      # change the search depending on
      # your implementation
      if (grepl("Warning message", tail(decompression, 1))) {
        print(decompression)
      }
    }
}
```
Notes:
The function does a few things, which I like and recommend:
- uses `system2` over `system`, because the documentation says "system2 is a more portable and flexible interface than system"
- separates the `directory` and `file` arguments, and moves the working directory to the `directory` argument; depending on your system, `unzip` (or your decompression tool of choice) gets really finicky about decompressing archives outside the working directory
  - it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
  - you can technically do it without this, but in my experience it's easier to make the function more verbose than to deal with generating filepaths and remembering `unzip` CLI flags
- I set it to use the `-o` flag to automatically overwrite when rerun, but you could supply any number of arguments
- includes a `.file_cache` argument which allows you to skip decompression
  - this comes in handy if you're testing a process that runs on the decompressed file, since 4GB+ files tend to take some time to decompress
- commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
- the `system2` command redirects the stdout to `decompression`, a character vector
- an `if` + `grepl` check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
Checking `?unzip`, I found the following comment in its Note section:

> It does have some support for bzip2 compression and > 2GB zip files (but not >= 4GB files pre-compression contained in a zip file: like many builds of unzip it may truncate these, in R's case with a warning if possible).
You can try to unzip it outside of R (using 7-Zip for example).