How to parse a data URI in Python?

This may help:

import base64
import re
from lxml import html

BASE_NAME = "image_"

source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)

# Select the src attribute of every <img> that embeds its data inline.
for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    # Split "data:image/png;base64" from the base64 payload.
    image_type, image_content = image.split(',', 1)
    image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        # Python 3: str has no .decode('base64'); use base64.b64decode instead.
        f.write(base64.b64decode(image_content))
    print("[*] '{}' image found with content: {}\n".format(image_type, image_content))

Output:


[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==

[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=

This saves every base64-encoded image found in an <img> tag to its own file, using the extension taken from the data URI.

Each file name is BASE_NAME, followed by the index supplied by enumerate, followed by the image extension (here image_0.png and image_1.gif).
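If lxml is not available, roughly the same extraction can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the class name DataImageCollector and the pattern DATA_IMAGE_RE are made up for this example, and it assumes every image of interest is a base64-encoded data:image/... URI.

```python
import base64
import re
from html.parser import HTMLParser

# Hypothetical pattern: captures the image subtype and the base64 payload.
DATA_IMAGE_RE = re.compile(r"data:image/([\w.+-]+);base64,(.*)", re.DOTALL)


class DataImageCollector(HTMLParser):
    """Collect (subtype, raw_bytes) for each <img> whose src is a base64 data URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        match = DATA_IMAGE_RE.match(src)
        if match:
            subtype, encoded = match.groups()
            # By default b64decode skips characters outside the base64
            # alphabet, so line breaks inside the payload are tolerated.
            self.images.append((subtype, base64.b64decode(encoded)))


collector = DataImageCollector()
collector.feed(
    '<img src="data:image/gif;base64,'
    'R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />'
)
```

After feed() returns, collector.images holds one (subtype, bytes) pair per matching image, and the bytes can be written to files named as described above.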


Split the data URI on the comma to get the base64 encoded data without the header. Call base64.b64decode to decode that to bytes. Last, write the bytes to a file.


from base64 import b64decode

data_uri = "data:image/png;base64,iVBORw0KGg..."

# Works on any Python version, including Python 2 and Python 3 before 3.4
header, encoded = data_uri.split(",", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)
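The snippet above throws the header away. When the media type matters, or when the payload might not be base64, the header can be inspected as well. The following is a minimal sketch under that assumption; split_data_uri is a hypothetical helper name, and it only covers the basic data:[<mediatype>][;base64],<data> form.

```python
import base64
from urllib.parse import unquote_to_bytes


def split_data_uri(data_uri):
    """Return (media_type, data_bytes) for a data URI (hypothetical helper)."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = data_uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI has no comma separating header and data")
    # The header looks like "image/png;base64" or "text/plain;charset=utf-8".
    parts = header.split(";")
    media_type = parts[0] or "text/plain"  # default media type when omitted
    # Undo percent-encoding first, then base64 if the header asks for it.
    data = unquote_to_bytes(payload)
    if "base64" in parts[1:]:
        data = base64.b64decode(data)
    return media_type, data


gif_type, gif_bytes = split_data_uri(
    "data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs="
)
text_type, text_bytes = split_data_uri("data:,Hello%2C%20World")
```

The second call shows the non-base64 case, where the data is simply percent-decoded.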

Since version 3.4, Python supports data URIs directly: urllib.request.urlopen handles them through urllib.request.DataHandler.


from urllib.request import urlopen

with urlopen(data_uri) as response:
    data = response.read()
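The response object also carries headers built from the URI, so the media type can be read back without parsing the string by hand. A small sketch, using a one-pixel GIF as sample input; the variable names are only illustrative, and the extension returned by mimetypes.guess_extension depends on the platform's MIME tables.

```python
import mimetypes
from urllib.request import urlopen

# Sample input: a one-pixel GIF encoded as a data URI.
data_uri = "data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs="

with urlopen(data_uri) as response:
    # The headers are an email.message.Message with a Content-type entry.
    media_type = response.headers.get_content_type()
    data = response.read()

# Map the media type to a file extension; may be None for unknown types.
extension = mimetypes.guess_extension(media_type)
```

No network access is involved here, since the data is decoded from the URI itself.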

w3lib (a library used by Scrapy) has a function for parsing data URIs:


>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('data:image/png;base64,iVBORw0KGg==')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')
