Hexadecimal value 0x00 is an invalid character
Without your actual data or source, it will be hard for us to diagnose what is going wrong. However, I can make a few suggestions:
- Unicode NUL (0x00) is illegal in all versions of XML, and conforming parsers, validating or not, must reject input that contains it.
- Despite the above, real-world non-validated XML can contain any kind of ill-formed garbage bytes imaginable.
- XML 1.1 allows zero-width and nonprinting control characters (except NUL), so you cannot look at an XML 1.1 file in a text editor and tell what characters it contains.
Given what you wrote, I suspect whatever converts the database data to XML is broken; it's propagating non-XML characters.
Create some database entries with non-XML characters (NULs, DELs, control characters, etc.) and run your XML converter on them. Output the XML to a file and look at it in a hex editor. If the file contains non-XML characters, your converter is broken. Fix it or, if you cannot, create a preprocessor that rejects output with such characters.
If the converter output looks good, the problem is in your XML consumer; it's inserting non-XML characters somewhere. You will have to break your consumption process into separate steps, examine the output at each step, and narrow down what is introducing the bad characters.
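The hex-editor check described above can also be automated. The following is only a minimal sketch, assuming the converter output has already been read into a string; the class and method names are hypothetical, and `XmlConvert.IsXmlChar` tests against the XML 1.0 character range.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XmlCharScanner
{
    // Returns the index and value of every UTF-16 code unit in `text`
    // that XML 1.0 does not allow (for example NUL and most C0 controls).
    public static List<KeyValuePair<int, char>> FindIllegalChars(string text)
    {
        var hits = new List<KeyValuePair<int, char>>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // A well-formed surrogate pair encodes a legal supplementary character.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low surrogate
                continue;
            }
            if (!XmlConvert.IsXmlChar(c))
            {
                hits.Add(new KeyValuePair<int, char>(i, c));
            }
        }
        return hits;
    }
}
```

Running this on the converter output and on each intermediate step of the consumer should show which stage first reports a hit.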
Check file encoding (for UTF-16)
Update: I just ran into an example of this myself! What was happening is that the producer was encoding the XML as UTF-16 and the consumer was expecting UTF-8. Since UTF-16 uses 0x00 as the high byte for all ASCII characters and UTF-8 doesn't, the consumer was seeing every second byte as a NUL. In my case I could change the encoding, but I also suggested that all XML payloads start with a BOM.
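A consumer can guard against this mismatch by looking at the first bytes of the payload before decoding it. This is a rough sketch, not a full encoding detector: the class name is hypothetical, UTF-32 is ignored, and falling back to UTF-8 is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class XmlEncodingSniffer
{
    // Guesses the encoding from a leading byte order mark, then falls back
    // to the NUL pattern that BOM-less UTF-16 produces around the first '<'.
    public static Encoding Guess(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;            // UTF-16 little-endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;   // UTF-16 big-endian
        // No BOM: an XML document starts with '<' (0x3C).
        if (data.Length >= 2 && data[0] == 0x3C && data[1] == 0x00)
            return Encoding.Unicode;
        if (data.Length >= 2 && data[0] == 0x00 && data[1] == 0x3C)
            return Encoding.BigEndianUnicode;
        return Encoding.UTF8; // assumption: default when nothing else matches
    }
}
```

Decoding with `XmlEncodingSniffer.Guess(bytes).GetString(bytes)` rather than a hard-coded UTF-8 decoder avoids the interleaved NULs described above.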
To add to Sonz's answer above, the following worked for us.
// Instead of
XmlString = XmlString.Replace("&#x0;", "[0x00]");
// use this (Replace returns a new string, so assign the result)
XmlString = XmlString.Replace("\x00", "[0x00]");
In my case, it took some digging, but I found it.
My Context
I'm looking at exception/error logs from the website using Elmah. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine I pretty-print the XML with XmlWriter.
During a website attack, I noticed that some XML documents weren't parsing, and I was receiving the exception '.', hexadecimal value 0x00, is an invalid character.
NON-RESOLUTION: I converted the document to a byte[] and sanitized it of 0x00, but it found none.
When I scanned the xml document, I found the following:
...
<form>
...
<item name="SomeField">
<value string="C:\boot.ini&#x0;.htm" />
</item>
...
There was the NUL byte, encoded as the HTML entity &#x0;!
RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading it into my XmlDocument, because loading it will create the NUL byte and it will be difficult to sanitize it from the object. Here's my entire process:
// Requires: using System.Text; using System.Xml;
XmlDocument xml = new XmlDocument();
// Replace the entity before parsing; in my case I want to see it, otherwise just replace with ""
details.Xml = details.Xml.Replace("&#x0;", "[0x00]");
xml.LoadXml(details.Xml);
string formattedXml = null;
// I have this in a helper function, but for this example I have put it in-line
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
    OmitXmlDeclaration = true,
    Indent = true,
    IndentChars = "\t",
    NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
    xml.Save(writer);
}
// Read the buffer once the writer has been disposed, so all output is flushed
formattedXml = sb.ToString();
LESSON LEARNED: if your incoming data is HTML-encoded on entry, sanitize for illegal bytes by searching for the associated HTML entity, not just the raw byte.
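The same NUL can be written as several different character references (`&#0;`, `&#x0;`, `&#x00;`, and so on), so a single literal Replace may miss some of them. Below is a hedged sketch of a broader sanitizer; the class name, method name, and regular expression are my own assumptions, not something Elmah or .NET provides.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NulSanitizer
{
    // Matches numeric character references that resolve to NUL:
    // decimal (&#0;, &#00;) or hexadecimal (&#x0;, &#X00;).
    private static readonly Regex NulReference =
        new Regex(@"&#(?:0+|[xX]0+);", RegexOptions.CultureInvariant);

    // Replaces encoded and raw NULs in the unparsed XML text.
    public static string Sanitize(string rawXml, string replacement = "")
    {
        // A MatchEvaluator keeps '$' in the replacement from being
        // interpreted as a substitution pattern.
        string withoutReferences = NulReference.Replace(rawXml, m => replacement);
        return withoutReferences.Replace("\0", replacement);
    }
}
```

Calling `NulSanitizer.Sanitize(details.Xml, "[0x00]")` before `LoadXml` would keep the marker visible in the pretty-printed output, as in the process above.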