zcat on amazon s3
From S3 REST API » Operations on Objects » GET Object:
To use GET, you must have READ access to the object. If you grant READ access to the anonymous user, you can return the object without using an authorization header.
If that's the case, you can use:
$ curl <url-of-your-object> | zcat | grep "log_id"
or
$ wget -O- <url-of-your-object> | zcat | grep "log_id"
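The two pipelines above can be wrapped in a small helper. This is a minimal sketch, assuming the object is publicly readable and gzip-compressed; the function name url_zgrep is made up for illustration. It uses gzip -dc, which does the same job as zcat on gzip data, and curl's -f flag so that an HTTP error (for example a 403 when anonymous READ is not granted) is not fed into the decompressor as if it were data.

```shell
# Hypothetical helper: grep the decompressed contents of a gzip object
# fetched over plain HTTP(S).
#   $1 = pattern to search for
#   $2 = URL of the object (must be readable without an authorization header)
url_zgrep() {
    # -f : return an error instead of printing the server's error body
    # -sS: hide the progress meter but still report errors on stderr
    curl -fsS "$2" | gzip -dc | grep -e "$1"
}
```

Usage would look like: url_zgrep "log_id" <url-of-your-object>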
However, if you haven't granted anonymous READ access on the object, you need to create and send the authorization header as part of the GET request, which is somewhat tedious to do with curl/wget. Luckily, someone has already done it: the Perl aws script by Tim Kay, as recommended by Hari. Note that you don't have to put Tim Kay's script on your path or otherwise install it (beyond making it executable), as long as you use the command versions which start with aws, e.g.:
$ ./aws cat BUCKET/OBJECT | zcat | grep "log_id"
You could also use s3cat, part of Tim Kay's command-line toolkit for AWS:
http://timkay.com/aws/
To get the equivalent of zcat FILENAME | grep "log_id", you'd do:
$ s3cat BUCKET/OBJECT | zcat - | grep "log_id"
Found this thread today, and liked Keith's answer. Fast forward to today's AWS CLI, where it's done with:
aws s3 cp s3://some-bucket/some-file.bz2 - | bzcat -c | mysql -uroot some_db
Might save someone else a tiny bit of time.
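Since the command above uses bzcat while the question is about zcat, here is a rough sketch that picks the decompressor from the object's extension. The function name s3_cat_decompressed is hypothetical, and it assumes the AWS CLI is installed and already configured with credentials; the only AWS CLI feature it relies on is the "-" destination of aws s3 cp, which writes the object to stdout.

```shell
# Hypothetical helper: stream an S3 object to stdout, decompressing it
# according to its file extension.
#   $1 = object location, e.g. s3://some-bucket/some-file.gz
s3_cat_decompressed() {
    case $1 in
        *.gz)  aws s3 cp "$1" - | gzip -dc ;;   # same effect as zcat
        *.bz2) aws s3 cp "$1" - | bzip2 -dc ;;  # same effect as bzcat
        *)     aws s3 cp "$1" - ;;              # not compressed: pass through
    esac
}
```

The zcat-style search then becomes: s3_cat_decompressed s3://some-bucket/some-file.gz | grep "log_id"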
Not exactly a zcat, but one way to use Hadoop to download large files from S3 in parallel is distcp: http://hadoop.apache.org/common/docs/current/distcp.html
hadoop distcp s3://YOUR_BUCKET/your_file /tmp/your_file
or
hadoop distcp s3://YOUR_BUCKET/your_file hdfs://master:8020/your_file
Maybe from this point you can pipe a zcat...
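One possible way to do that piping, sketched under the assumption that your_file is gzip-compressed and the hadoop command is on your PATH, is to stream the copied file with hadoop fs -cat and decompress it on the fly. The helper name hdfs_zgrep is invented for this example.

```shell
# Hypothetical helper: grep the decompressed contents of a gzip file that
# Hadoop can read.
#   $1 = pattern to search for
#   $2 = file location, e.g. hdfs://master:8020/your_file
hdfs_zgrep() {
    # hadoop fs -cat writes the raw bytes to stdout; gzip -dc acts as zcat
    hadoop fs -cat "$2" | gzip -dc | grep -e "$1"
}
```

For example: hdfs_zgrep "log_id" hdfs://master:8020/your_file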
To add your credentials, edit the core-site.xml file so that it contains:
<configuration>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>YOUR_KEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_KEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_KEY</value>
</property>
</configuration>