C#: Detect XML encoding from a byte array?
A solution similar to the one in this question would work here: put a Stream over the byte array, and you won't have to fiddle at the byte level. Like this:
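    using System.IO;
    using System.Text;
    using System.Xml;

    Encoding encoding;
    using (var stream = new MemoryStream(bytes))
    using (var xmlreader = new XmlTextReader(stream))
    {
        // Advancing to the first content node makes the reader consume the XML declaration.
        xmlreader.MoveToContent();
        // Encoding is only meaningful once the reader has started reading.
        encoding = xmlreader.Encoding;
    }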
You could look at the first 40-ish bytes [1]. They should contain the XML declaration (assuming the document has one), which either states the encoding or, if it doesn't, lets you assume UTF-8 or UTF-16. Which of the two it is should be obvious from how you managed to read the <?xml part. (Just check for both patterns.)
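As a rough sketch of "check for both patterns", the snippet below compares the start of the array against the bytes that <?xml produces in UTF-8 and in UTF-16. The class and method names are made up for illustration, it assumes there is no BOM in front of the declaration, and a UTF-8 match really only tells you the data is ASCII-compatible at that point.

```csharp
using System.Text;

public static class XmlPrefixSniffer
{
    // "<?xml" as encoded in UTF-8 (or any other ASCII-compatible encoding).
    private static readonly byte[] Utf8Pattern = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C };

    // "<?xml" in UTF-16 little-endian: each ASCII byte is followed by 0x00.
    private static readonly byte[] Utf16LePattern =
        { 0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00, 0x6C, 0x00 };

    // "<?xml" in UTF-16 big-endian: each ASCII byte is preceded by 0x00.
    private static readonly byte[] Utf16BePattern =
        { 0x00, 0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00, 0x6C };

    // Returns a guess based on the byte pattern, or null if no pattern matches.
    public static Encoding GuessFromPrefix(byte[] bytes)
    {
        if (StartsWith(bytes, Utf16LePattern)) return Encoding.Unicode;
        if (StartsWith(bytes, Utf16BePattern)) return Encoding.BigEndianUnicode;
        if (StartsWith(bytes, Utf8Pattern)) return Encoding.UTF8;
        return null;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result is the point where you would throw, or fall back to another strategy.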
Realistically, do you expect you'll ever get anything other than UTF-8 or UTF-16? If not, you could check for the patterns you get at the start of both of those and throw an exception if it doesn't follow either pattern. Alternatively, if you want to make another attempt, you could always try to decode the document as UTF-8, re-encode it and see if you get the same bytes back. It's not ideal, but it might just work.
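The decode-and-re-encode idea might look something like this. It is only a sketch with a hypothetical helper name, and it relies on Encoding.UTF8 replacing invalid byte sequences with U+FFFD, so that malformed input no longer re-encodes to the original bytes.

```csharp
using System.Linq;
using System.Text;

public static class Utf8RoundTrip
{
    // True if decoding as UTF-8 and encoding again reproduces the input exactly.
    public static bool SurvivesUtf8RoundTrip(byte[] bytes)
    {
        // Invalid sequences are replaced with U+FFFD during decoding...
        string decoded = Encoding.UTF8.GetString(bytes);
        // ...so re-encoding yields different bytes when the input was not valid UTF-8.
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        return bytes.SequenceEqual(reencoded);
    }
}
```

Bear in mind that pure ASCII always passes this test, and so does UTF-16 text made up only of ASCII-range characters (the zero bytes are valid UTF-8), so it makes sense to rule out the UTF-16 patterns first.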
I'm sure there are more rigorous ways of doing this, but they're likely to be finicky :)
[1] Quite possibly fewer than this. I figure 20 characters should be enough, which is 40 bytes in UTF-16.
The first 2 or 3 bytes may be a Byte Order Mark (BOM) which can tell you whether the stream is UTF-8, Unicode-LittleEndian or Unicode-BigEndian.
UTF-8 BOM is 0xEF 0xBB 0xBF
Unicode-BigEndian BOM is 0xFE 0xFF
Unicode-LittleEndian BOM is 0xFF 0xFE
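A minimal sketch of that BOM check, with an illustrative class name, could be written as follows. It only covers the three BOMs listed above and returns null when none of them is found.

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null if there is none.
    // Only the UTF-8 and UTF-16 BOMs are handled; UTF-32 is deliberately ignored here.
    public static Encoding DetectBom(byte[] bytes)
    {
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;
        return null;
    }
}
```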
If none of these are present then you can decode the start of the data as ASCII and test for <?xml (note that most modern XML generation sticks to the standard that no white space may precede the XML declaration).
ASCII is used up until ?> so you can look for encoding= and read its value.
If encoding isn't present, or the <?xml declaration is not present, then you can assume UTF-8.
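The declaration check could be sketched roughly like this, for the case where no BOM was found. The class name, the 256-byte limit and the regular expression are my own choices rather than anything prescribed, and an encoding name that Encoding.GetEncoding does not recognise simply falls back to UTF-8 here, which you may prefer to treat as an error.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    // Matches encoding="name" or encoding='name' and captures the name.
    private static readonly Regex EncodingAttribute =
        new Regex(@"encoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""']");

    public static Encoding DetectFromDeclaration(byte[] bytes)
    {
        // Decode a bounded prefix as ASCII; any non-ASCII bytes become '?'.
        int length = Math.Min(bytes.Length, 256);
        string head = Encoding.ASCII.GetString(bytes, 0, length);

        // No declaration at the very start: assume UTF-8.
        if (!head.StartsWith("<?xml", StringComparison.Ordinal)) return Encoding.UTF8;

        // Only search inside the declaration, i.e. up to the closing "?>".
        int end = head.IndexOf("?>", StringComparison.Ordinal);
        if (end < 0) return Encoding.UTF8;

        Match match = EncodingAttribute.Match(head, 0, end);
        if (!match.Success) return Encoding.UTF8;

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unavailable encoding name: fall back to the default.
            return Encoding.UTF8;
        }
    }
}
```

On .NET Core and later, code pages such as windows-1252 are only available after registering a code page provider, so those names would hit the fallback branch unless you do that first.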