Write-streaming to Google Cloud Storage in Python
smart_open now supports GCS, as well as on-the-fly decompression.
import lzma

from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# stream from GCS
with open('gs://my_bucket/my_file.txt.xz') as fin:
    for line in fin:
        print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt.xz', 'wb') as fout:
    fout.write(b'hello world')
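If you need non-default credentials, recent versions of smart_open also accept a preconfigured GCS client via transport_params - a minimal sketch, assuming a recent smart_open version and a placeholder key-file path:

from google.cloud import storage
from smart_open import open

# 'service_account.json' is a placeholder path to your own key file.
client = storage.Client.from_service_account_json('service_account.json')
with open('gs://my_bucket/my_file.txt.xz', 'wb',
          transport_params=dict(client=client)) as fout:
    fout.write(b'hello world')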
I initially confused multipart and resumable uploads. The latter is what you need for "streaming": it effectively uploads chunks of a buffered stream. A multipart upload, by contrast, sends the data and custom metadata at once, in a single API call.
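To make the contrast concrete, here is a minimal sketch using the google-resumable-media library mentioned below (the bucket name, object name, and content type are placeholders):

import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import MultipartUpload, ResumableUpload

credentials, _ = google.auth.default()
transport = AuthorizedSession(credentials)
base = 'https://www.googleapis.com/upload/storage/v1/b/my_bucket/o'

# Multipart: data and metadata travel together in one request - no streaming.
upload = MultipartUpload(base + '?uploadType=multipart')
upload.transmit(transport, b'hello world', {'name': 'my_file.txt'}, 'text/plain')

# Resumable: initiate a session, then transmit the stream chunk by chunk
# (chunk sizes must be multiples of 256 KiB).
stream = io.BytesIO(b'hello world')
upload = ResumableUpload(base + '?uploadType=resumable', chunk_size=256 * 1024)
upload.initiate(transport, stream, {'name': 'my_file.txt'}, 'text/plain')
while not upload.finished:
    upload.transmit_next_chunk(transport)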
While I like GCSFS very much - Martin, its main contributor, is very responsive - I recently found an alternative that uses the google-resumable-media library. GCSFS is built on the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google, which stays more in sync with API changes and includes exponential backoff. The latter is really a must for large/long streams, since the connection may drop, even within GCP - we ran into this with GCF.
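For comparison, GCSFS exposes the same file-like interface for streaming - a minimal sketch, with a placeholder project id:

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')  # 'my-project' is a placeholder

# Stream content into GCS through a file-like object.
with fs.open('my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')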
On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read methods. It already has the core code.
If you too are interested in that feature in the core lib, give the issue here a thumbs up - assuming priorities are based on that.