Unicode test strings for unit tests
https://github.com/noct/cutf/tree/master/bin
Includes following files:
UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which are equivalents of English's "Quick brown fox" text, which exercise all the characters used, for a variety of languages. This page refers to a larger list of "pangrams" which used to be on Wikipedia, but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
To really test all possible conversions between formats, opposed to character conversions (i.e. towupper()
, towlower()
) you should test all characters. The following loop gives you all of those:
for(wint_t c(0); c < 0x110000; ++c)
{
if(c >= 0xD800 && c <= 0xDFFF)
{
continue;
}
// here 'c' is any one Unicode character in UTF-32
...
}
That way you can make sure you don't miss anything (i.e. 100% complete test.) This is only 1,112,065 characters, so it will be very fast with a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other feature in Unicode which would require testing character pairs which behave differently when used together. This is really not necessary here.
Also I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. The tests use loops as shown above. The tests also verify that using invalid character codes gets caught properly.