R: possible truncation of >= 4GB file

I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files larger than 4GB.
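One way to see whether an extraction was silently truncated is to compare the sizes listed in the archive's central directory against what actually lands on disk. A minimal sketch, assuming the listing itself is still readable (the paths here are hypothetical):

zipfile <- "big_archive.zip"

# list = TRUE reads the archive's table of contents without extracting
listing <- unzip(zipfile, list = TRUE)   # columns: Name, Length, Date

unzip(zipfile, exdir = "extracted")
actual <- file.size(file.path("extracted", listing$Name))

# any mismatch suggests the output was truncated
listing$Name[actual != listing$Length]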

To address the "how did you solve it?" part: I've tried a few different tricks, and in my experience anything using R's built-ins (almost) invariably misidentifies the end-of-file (EOF) marker before the actual end of the archive.

I deal with this issue in a set of files I process nightly, and to handle it consistently and automatically, I wrote the function below as a wrapper around the UNIX unzip utility. This is basically what you're already doing when you call unzip through system(), but it gives you a bit more flexibility in its behavior and lets you check for errors more systematically.

decompress_file <- function(directory, file, .file_cache = FALSE) {

  if (.file_cache == TRUE) {
    print("decompression skipped")
  } else {

    # Set working directory for decompression;
    # simplifies unzip's directory location behavior
    wd <- getwd()
    setwd(directory)

    # Restore the working directory when the function exits,
    # even if unzip fails partway through
    on.exit(setwd(wd))

    # Run decompression
    decompression <-
      system2("unzip",
              args = c("-o", # include overwrite flag
                       file),
              stdout = TRUE)

    # uncomment to delete archive once decompressed
    # file.remove(file)

    # Test for success criteria;
    # change the search depending on your implementation
    if (grepl("Warning message", tail(decompression, 1))) {
      print(decompression)
    }
  }
}
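A call then looks like this (directory and archive names are hypothetical):

decompress_file("/data/archives", "big_archive.zip")

# on repeat runs while testing downstream code, skip the slow step
decompress_file("/data/archives", "big_archive.zip", .file_cache = TRUE)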

Notes:

The function does a few things that I like and recommend:

  • uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
  • separates the directory and file arguments and moves the working directory to directory; depending on your system, unzip (or your decompression tool of choice) can get really finicky about decompressing archives outside the working directory
    • it's not pure, but restoring the working directory (here via on.exit(), so it happens even if unzip fails) keeps the function's side effects contained
    • you can technically do without this, but in my experience it's easier to make the function a bit more verbose than to deal with generating file paths and remembering unzip's CLI flags
  • I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
  • includes a .file_cache argument which allows you to skip decompression
    • this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
  • commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
  • the system2 call redirects stdout into decompression, a character vector
    • an if + grepl check at the end looks for warnings in that output and prints it when the expression matches (a stricter exit-status check is sketched just after this list)
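If you'd rather not pattern-match on the output text, system2() attaches the exit status to its return value as a "status" attribute whenever the command exits non-zero, so a stricter check might look like this (a sketch, reusing the hypothetical archive name from above):

out <- system2("unzip", args = c("-o", "big_archive.zip"), stdout = TRUE)

# attr() is NULL on success, the non-zero exit code otherwise
status <- attr(out, "status")
if (!is.null(status)) {
  stop("unzip exited with status ", status, ":\n",
       paste(out, collapse = "\n"))
}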

Checking ?unzip, I found the following comment under Note:

It does have some support for bzip2 compression and > 2GB zip files (but not >= 4GB files pre-compression contained in a zip file: like many builds of unzip it may truncate these, in R's case with a warning if possible).
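If you do stay with R's built-in unzip(), you can at least promote that truncation warning to a hard error so it can't scroll past unnoticed. A minimal sketch (archive name hypothetical):

withCallingHandlers(
  unzip("big_archive.zip", exdir = "extracted"),
  warning = function(w) {
    # e.g. "possible truncation of >= 4GB file"; fail loudly instead
    stop(conditionMessage(w))
  }
)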

You can try to unzip it outside of R (using 7-Zip for example).
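If 7-Zip is on your PATH, the call from R follows the same system2() pattern as above (the executable name varies by install, e.g. 7z or 7za, and the archive and output directory names are hypothetical):

# "x" extracts with full paths; -o<dir> sets the output directory
# (note: no space after -o); -y answers yes to any prompts
system2("7z", args = c("x", "big_archive.zip", "-oextracted", "-y"))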
