Generate random UTF-8 string in Python
People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just extend that part of the example with the code point ranges that you want.
import random
def get_random_unicode(length):
try:
get_char = unichr
except NameError:
get_char = chr
# Update this to include code point ranges to be sampled
include_ranges = [
( 0x0021, 0x0021 ),
( 0x0023, 0x0026 ),
( 0x0028, 0x007E ),
( 0x00A1, 0x00AC ),
( 0x00AE, 0x00FF ),
( 0x0100, 0x017F ),
( 0x0180, 0x024F ),
( 0x2C60, 0x2C7F ),
( 0x16A0, 0x16F0 ),
( 0x0370, 0x0377 ),
( 0x037A, 0x037E ),
( 0x0384, 0x038A ),
( 0x038C, 0x038C ),
]
alphabet = [
get_char(code_point) for current_range in include_ranges
for code_point in range(current_range[0], current_range[1] + 1)
]
return ''.join(random.choice(alphabet) for i in range(length))
if __name__ == '__main__':
print('A random string: ' + get_random_unicode(10))
Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:
#!/usr/bin/env python3.1
# From Table 3–7 of the Unicode Standard 5.0.0
import random
def byte_range(first, last):
return list(range(first, last+1))
first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)
def random_utf8_seq():
first = random.choice(first_values)
if first <= 0x7F:
return bytes([first])
elif first <= 0xDF:
return bytes([first, random.choice(trailing_values)])
elif first == 0xE0:
return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
elif first == 0xED:
return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
elif first <= 0xEF:
return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
elif first == 0xF0:
return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
elif first <= 0xF3:
return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
elif first == 0xF4:
return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])
print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))
Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).
There is a UTF-8 stress test from Markus Kuhn you could use.
See also Really Good, Bad UTF-8 example test data.