How to list objects by extension using the S3 API?
You don't actually need a separate database to do this for you.
S3 can list the objects in a bucket whose keys start with a given prefix. Your problem is that the ".xls" extension sits at the end of the object name, so a prefix search doesn't help you. However, when you put the file into the bucket, you can change the object name so that the prefix contains the file type (for example: XLS-myfile.xls). Then you can call the S3 API's listObjects and pass a prefix of "XLS".
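The renaming scheme above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `prefixed_key` and `list_keys_by_type` are hypothetical, the `NOEXT` prefix for files without an extension is my own assumption, and `s3_client` is assumed to be a boto3 S3 client created elsewhere.

```python
import os


def prefixed_key(file_name):
    """Build an object key whose prefix encodes the file type, e.g. 'XLS-myfile.xls'."""
    extension = os.path.splitext(file_name)[1].lstrip('.')
    # Assumption: files without an extension get a made-up 'NOEXT' prefix.
    type_prefix = extension.upper() if extension else 'NOEXT'
    return '{}-{}'.format(type_prefix, os.path.basename(file_name))


def list_keys_by_type(s3_client, bucket, extension):
    """Return every key in `bucket` that starts with the type prefix."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=extension.upper() + '-'):
        # 'Contents' is absent from a page when nothing matches the prefix.
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys


# Possible usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# s3_client = boto3.client('s3')
# s3_client.upload_file('myfile.xls', 'my-bucket', prefixed_key('myfile.xls'))
# print(list_keys_by_type(s3_client, 'my-bucket', 'xls'))
```

Passing the client in as an argument keeps the helpers independent of how credentials are configured. The trade-off of this approach is that the type lives in the key itself, so objects uploaded before the scheme was adopted would need to be copied to new keys.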
I iterate over the objects after fetching their information. The end result is a list of dicts.
import os

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')

# Get information about every object in the bucket
files = bucket.objects.all()

# Create an empty list for the final information
files_information = []

# Your known extensions; each key's extension is compared against this list
extensions = ['png', 'jpg', 'txt', 'docx']

# Iterate through 'files', build a dict per object, and add an 'extension' key
for file in files:
    # splitext handles extensions of any length (key[-3:] could never match 'docx')
    extension = os.path.splitext(file.key)[1].lstrip('.')
    if extension in extensions:
        files_information.append({'file_name': file.key, 'extension': extension})
    else:
        files_information.append({'file_name': file.key, 'extension': 'unknown'})

print(files_information)
While I do think the BEST answer is to use a database to keep track of your files for you, I also think it's an incredible pain in the ass. I was working in Python with boto3, and this is the solution I came up with.
It's not elegant, but it will work. List all the files, and then filter it down to a list of the ones with the "suffix"/"extension" that you want in code.
import boto3

s3_client = boto3.client('s3')
bucket = 'my-bucket'
prefix = 'my-prefix/foo/bar'

# list_objects_v2 returns at most 1000 keys per call, so paginate
paginator = s3_client.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)

file_names = []
for response in response_iterator:
    # 'Contents' is missing from the response when no keys match the prefix
    for object_data in response.get('Contents', []):
        key = object_data['Key']
        if key.endswith('.json'):
            file_names.append(key)

print(file_names)
I don't believe this is possible with S3.
The best solution is to 'index' S3 using a database (SQL Server, MySQL, SimpleDB, etc.) and run your queries against that.
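To illustrate the indexing idea, here is a minimal sketch that uses SQLite from the Python standard library in place of the databases named above, only because it runs without a server. The table name `s3_objects`, its columns, and both helper functions are hypothetical, and the list of keys is assumed to come from an S3 listing done elsewhere.

```python
import os
import sqlite3


def build_index(connection, keys):
    """Store each object key together with its lowercased extension."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS s3_objects '
        '(object_key TEXT PRIMARY KEY, extension TEXT)'
    )
    rows = [(key, os.path.splitext(key)[1].lstrip('.').lower()) for key in keys]
    connection.executemany('INSERT OR REPLACE INTO s3_objects VALUES (?, ?)', rows)
    connection.commit()


def keys_with_extension(connection, extension):
    """Return the indexed keys that have the given extension, sorted by key."""
    cursor = connection.execute(
        'SELECT object_key FROM s3_objects WHERE extension = ? ORDER BY object_key',
        (extension.lower(),),
    )
    return [row[0] for row in cursor]


# Possible usage with an in-memory database and made-up keys:
# connection = sqlite3.connect(':memory:')
# build_index(connection, ['a/report.xls', 'b/photo.PNG'])
# print(keys_with_extension(connection, 'xls'))
```

The index only stays useful if it is updated whenever objects are added to or removed from the bucket, which is the maintenance cost the other answers try to avoid.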