Etag definition changed in Amazon S3

Also in python...

#!/usr/bin/env python3
import hashlib
import os

# Maximum size in bytes before uploading in parts.
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts.
# Note: as of 2022-01-27 the bitnami-minio container uses 5 MiB.
AWS_UPLOAD_PART_SIZE = int(os.environ.get('AWS_UPLOAD_PART_SIZE', 5 * 1024 * 1024))

def md5sum(source_path):
    '''
    Compute the MD5-based checksum of a local file so that it matches
    the ETag Amazon S3 reports for the corresponding object.
    '''
    filesize = os.path.getsize(source_path)

    if filesize > AWS_UPLOAD_MAX_SIZE:
        # Multipart upload: the ETag is the MD5 of the concatenated
        # binary MD5 digests of the parts, suffixed with the part count.
        block_count = 0
        md5bytes = b""
        with open(source_path, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                md5bytes += hashlib.md5(block).digest()
                block_count += 1
                block = f.read(AWS_UPLOAD_PART_SIZE)
        hexdigest = hashlib.md5(md5bytes).hexdigest() + "-" + str(block_count)
    else:
        # Single-part upload: the ETag is the plain MD5 of the file.
        file_md5 = hashlib.md5()
        with open(source_path, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                file_md5.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
        hexdigest = file_md5.hexdigest()
    return hexdigest

Amazon S3 calculates the ETag with a different algorithm (not the usual MD5 sum) when you upload a file using multipart upload.

This algorithm is detailed here: http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583

"Calculate the MD5 hash for each uploaded part of the file, concatenate the hashes into a single binary string and calculate the MD5 hash of that result."

I developed a tool in Bash to calculate it, s3md5: https://github.com/Teachnova/s3md5

For example, to calculate the ETag of a file foo.bin that was uploaded using multipart upload with a chunk size of 15 MB:

# s3md5 15 foo.bin

Now you can check the integrity of very big files (bigger than 5 GB), because you can calculate the ETag of the local file and compare it with the S3 ETag.
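One wrinkle when doing that comparison: S3 returns the ETag wrapped in double quotes, so strip them before comparing. A minimal sketch (the bucket and key names are illustrative, and the boto3 call is commented out since it needs AWS credentials):

```python
def etags_match(local_etag: str, remote_etag: str) -> bool:
    """Compare a locally computed ETag with the quoted one S3 reports."""
    return local_etag == remote_etag.strip('"')

# Hypothetical usage against a real bucket:
# import boto3
# s3 = boto3.client("s3")
# remote = s3.head_object(Bucket="my-bucket", Key="foo.bin")["ETag"]
# print(etags_match(md5sum("foo.bin"), remote))
```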


If a file is uploaded with multipart upload, you will always get this type of ETag. But if you upload the whole file as a single object, you will get an ETag as before.

Bucket Explorer shows the normal ETag for multipart uploads up to 5 GB, but not beyond that.

AWS:

The ETag for an object created using the multipart upload api will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.

Reference: https://forums.aws.amazon.com/thread.jspa?messageID=203510#203510
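In other words, the two ETag flavours can be told apart by shape alone: a plain upload yields exactly 32 hex digits, while a multipart upload yields 32 hex digits plus a "-<part count>" suffix. A minimal check (the function name is mine):

```python
import re

def is_multipart_etag(etag: str) -> bool:
    """True if the ETag has the multipart form: 32-hex digest plus "-<parts>"."""
    return re.fullmatch(r"[0-9a-f]{32}-\d+", etag.strip('"')) is not None
```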
