Downloading and unzipping a .zip file without writing to disk
I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
url = requests.get(file_url)
zipfile = ZipFile(BytesIO(url.content))
files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
return files.pop() if len(files) == 1 else files
My suggestion would be to use a StringIO
object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
[ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
[ ... ]
I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
for contained_file in my_zip_file.namelist():
# with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
for line in my_zip_file.open(contained_file).readlines():
print(line)
# output.write(line)
Necessary changes:
- There's no
StringIO
module in Python 3 (it's been moved toio.StringIO
). Instead, I useio.BytesIO
]2, because we will be handling a bytestream -- Docs, also this thread. - urlopen:
- "The legacy
urllib.urlopen
function from Python 2.6 and earlier has been discontinued;urllib.request.urlopen()
corresponds to the oldurllib2.urlopen
.", Docs and this thread.
- "The legacy
Note:
- In Python 3, the printed output lines will look like so:
b'some text'
. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
- I use
with ... as
instead ofzipfile = ...
according to the Docs. - The script now uses
.namelist()
to cycle through all the files in the zip and print their contents. - I moved the creation of the
ZipFile
object into thewith
statement, although I'm not sure if that's better. - I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds
"unzipped_and_read_"
to the beginning of the filename and a".file"
extension (I prefer not to use".txt"
for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.- Need to be careful here -- because we have a byte string, we use binary mode, so
"wb"
; I have a feeling that writing binary opens a can of worms anyway...
- Need to be careful here -- because we have a byte string, we use binary mode, so
- I am using an example file, the UN/LOCODE text archive:
What I didn't do:
- NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'w') as out_file:
shutil.copyfileobj(response, out_file)
Below is a code snippet I used to fetch zipped csv file, please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
zipfile = ZipFile(StringIO(resp.read()))
for line in zipfile.open(file).readlines():
print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
zipfile = ZipFile(BytesIO(resp.read()))
for line in zipfile.open(file).readlines():
print(line.decode('utf-8'))
Here file
is a string. To get the actual string that you want to pass, you can use zipfile.namelist()
. For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']