Amazon S3 concatenate small files

Edit: I missed the 5 MB minimum part size. Because of that requirement, this method will not work for small files.

From https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby:

While it is possible to download and re-upload the data to S3 through an EC2 instance, a more efficient approach would be to instruct S3 to make an internal copy using the new copy_part API operation that was introduced into the SDK for Ruby in version 1.10.0.

Code:

require 'rubygems'
require 'aws-sdk'

s3 = AWS::S3.new()
mybucket = s3.buckets['my-multipart']

# First, let's start the Multipart Upload
obj_aggregate = mybucket.objects['aggregate'].multipart_upload

# Then we will copy into the Multipart Upload all of the objects in a certain S3 directory.
mybucket.objects.with_prefix('parts/').each do |source_object|

  # Skip the directory object
  unless (source_object.key == 'parts/')
    # Note that this section is thread-safe and could greatly benefit from parallel execution.
    obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key)
  end

end

obj_completed = obj_aggregate.complete()

# Generate a signed URL to enable a trusted browser to access the new object without authenticating.
puts obj_completed.url_for(:read)

Limitations (among others)

With the exception of the last part, there is a 5 MB minimum part size.
The completed Multipart Upload object is limited to a 5 TB maximum size.
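Before starting a multipart copy, it can help to check the planned part sizes against these two limits. The following is only a minimal sketch: the function name `check_part_sizes` and both constants are made up for illustration, and the limit values are taken from the list above rather than queried from S3.

```python
# Limits as quoted in the list above (assumed values, not fetched from S3).
MIN_PART_BYTES = 5 * 1024 * 1024   # minimum size of every part except the last
MAX_OBJECT_BYTES = 5 * 1024 ** 4   # maximum size of the completed object


def check_part_sizes(part_sizes):
    """Return a list of problems; an empty list means the plan passes both checks."""
    problems = []
    # Every part except the last one must meet the minimum size.
    for index, size in enumerate(part_sizes[:-1]):
        if size < MIN_PART_BYTES:
            problems.append(
                'part {} is {} bytes, below the minimum'.format(index + 1, size)
            )
    if sum(part_sizes) > MAX_OBJECT_BYTES:
        problems.append('total size exceeds the maximum object size')
    return problems
```

For a directory of small files, nearly every part would be reported, which is why the plain copy_part approach fails there.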

Based on @wwadge's comment I wrote a Python script.

It bypasses the 5 MB limit by uploading a dummy object slightly bigger than 5 MB, then appending each small file as if it were the last part. At the end it strips the dummy part out of the merged file.

import boto3
import os

bucket_name = 'multipart-bucket'
merged_key = 'merged.json'
mini_file_0 = 'base_0.json'
mini_file_1 = 'base_1.json'
dummy_file = 'dummy_file'

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# we need to have a garbage/dummy file with size > 5MB
# so we create and upload this
# this key will also be the key of final merged file
with open(dummy_file, 'wb') as f:
    # slightly > 5MB
    f.seek(1024 * 5200) 
    f.write(b'0')

with open(dummy_file, 'rb') as f:
    s3_client.upload_fileobj(f, bucket_name, merged_key)

os.remove(dummy_file)


# get the number of bytes of the garbage/dummy-file
# needed to strip out these garbage/dummy bytes from the final merged file
bytes_garbage = s3_resource.Object(bucket_name, merged_key).content_length

# for each small file you want to concat
# when this loop has finished, merged.json will contain
# (dummy bytes + base_0.json + base_1.json)
for key_mini_file in [mini_file_0, mini_file_1]: # include more files if you want

    # initiate multipart upload with merged.json object as target
    mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)
        
    part_responses = []
    # perform multipart copy where merged.json is the first part 
    # and the small file is the second part
    for n, copy_key in enumerate([merged_key, key_mini_file]):
        part_number = n + 1
        copy_response = s3_client.upload_part_copy(
            Bucket=bucket_name,
            CopySource={'Bucket': bucket_name, 'Key': copy_key},
            Key=merged_key,
            PartNumber=part_number,
            UploadId=mpu['UploadId']
        )

        part_responses.append(
            {'ETag':copy_response['CopyPartResult']['ETag'], 'PartNumber':part_number}
        )

    # complete the multipart upload
    # content of merged will now be merged.json + mini file
    response = s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=merged_key,
        MultipartUpload={'Parts': part_responses},
        UploadId=mpu['UploadId']
    )

# get the number of bytes from the final merged file
bytes_merged = s3_resource.Object(bucket_name, merged_key).content_length

# initiate a new multipart upload
mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=merged_key)            
# do a single copy from the merged file specifying byte range where the 
# dummy/garbage bytes are excluded
response = s3_client.upload_part_copy(
    Bucket=bucket_name,
    CopySource={'Bucket': bucket_name, 'Key': merged_key},
    Key=merged_key,
    PartNumber=1,
    UploadId=mpu['UploadId'],
    CopySourceRange='bytes={}-{}'.format(bytes_garbage, bytes_merged-1)
)
# complete the multipart upload
# after this step the merged.json will contain (base_0.json + base_1.json)
response = s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=merged_key,
    MultipartUpload={'Parts': [
       {'ETag':response['CopyPartResult']['ETag'], 'PartNumber':1}
    ]},
    UploadId=mpu['UploadId']
)

If you already have a >5 MB object that you want to add smaller parts to, skip creating the dummy file and skip the final byte-range copy. Also, I have no idea how this performs on a large number of very small files. In that case it might be better to download each file, merge them locally, and then upload the result.
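The byte range for the final copy is easy to get wrong by one, because HTTP ranges are inclusive at both ends. Here is a small sketch of that calculation pulled out into a helper. The name `range_without_prefix` is hypothetical and not part of boto3.

```python
def range_without_prefix(prefix_bytes, total_bytes):
    """Build a CopySourceRange value that skips the first prefix_bytes bytes."""
    if not 0 <= prefix_bytes < total_bytes:
        raise ValueError('prefix must be shorter than the object')
    # Byte ranges are inclusive, so the last valid offset is total_bytes - 1.
    return 'bytes={}-{}'.format(prefix_bytes, total_bytes - 1)
```

In the script above this would be called as range_without_prefix(bytes_garbage, bytes_merged). The guard also catches the case where no small file was appended, since the range would then be empty.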

Amazon S3 does not provide a concatenate function. It is primarily an object storage service.

You will need some process that downloads the objects, combines them, then uploads them again. The most efficient way to do this would be to download the objects in parallel, to take full advantage of available bandwidth. However, that is more complex to code.
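A rough sketch of the parallel download, combine, and upload approach, assuming the merged result fits in memory. The function names `concat_parallel` and `merge_objects` are invented for this example, and the worker count of 8 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def concat_parallel(keys, fetch, max_workers=8):
    """Fetch every key concurrently and join the bodies in the order of keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order even if downloads finish out of order.
        return b''.join(pool.map(fetch, keys))


def merge_objects(bucket, keys, merged_key):
    """Download keys from bucket in parallel and upload the result as merged_key."""
    import boto3  # imported lazily so concat_parallel works without boto3 installed

    s3 = boto3.client('s3')

    def fetch(key):
        # Reads the whole object body into memory.
        return s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    s3.put_object(Bucket=bucket, Key=merged_key, Body=concat_parallel(keys, fetch))
```

Passing the fetch function in as an argument keeps the ordering logic separate from the S3 calls, so it can be tried out with a plain dictionary before pointing it at a real bucket.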

I would recommend doing the processing "in the cloud" to avoid downloading the objects across the Internet. Running it on Amazon EC2 or AWS Lambda would be more efficient and less costly.

