Testing whether or not something is parseable XML in C#

It sounds like you sometimes get back XML and sometimes get back "plain" (non-XML) text.

If that's the case you could just check that the text starts with <:

if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
{
    var doc = XDocument.Parse(str);
}

Since "plain" messages seem unlikely to start with <, this may be reasonable. The only thing you need to decide is what to do in the edge case where non-XML text starts with a <.

If it were me I would default to trying to parse it and catching the exception:

if (!string.IsNullOrEmpty(str) && str.TrimStart().StartsWith("<"))
{
    try
    {
        var doc = XDocument.Parse(str);
        return //???
    }
    catch (XmlException)
    {
        return str;
    }
}
else
{
    return str;   
}

That way the only time you have the overhead of a thrown exception is when you have a message that starts with < but is not valid XML.
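
For what it's worth, here is a self-contained sketch of that approach written as a TryParse-style helper. The names XmlSniffer and TryParseXml are made up for illustration, and it catches XmlException specifically, on the assumption that malformed markup is the only failure you want to swallow:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not from any library.
public static class XmlSniffer
{
    // Returns true and sets doc when str looks like XML and parses cleanly;
    // otherwise returns false and leaves doc null.
    public static bool TryParseXml(string str, out XDocument doc)
    {
        doc = null;

        // Cheap pre-check: skip the parser entirely for text that cannot be XML.
        if (string.IsNullOrEmpty(str) ||
            !str.TrimStart().StartsWith("<", StringComparison.Ordinal))
        {
            return false;
        }

        try
        {
            doc = XDocument.Parse(str);
            return true;
        }
        catch (XmlException)
        {
            // Started with '<' but was not well-formed XML.
            return false;
        }
    }
}
```

The caller can then branch on the result, e.g. if (XmlSniffer.TryParseXml(message, out var doc)) { ... } and otherwise treat message as plain text.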

You could try to parse the string into an XDocument. If it fails to parse, then you know that it is not valid.

string xml = "";  // the text you want to test
XDocument document = XDocument.Parse(xml);  // throws XmlException if xml is not well-formed

And if you don't want to have the ugly try/catch visible, you can throw it into an extension method on the string class...

public static bool IsValidXml(this string xml)
{
    try
    {
        XDocument.Parse(xml);
        return true;
    }
    catch
    {
        return false;
    }
}

Then your code simply looks like if (mystring.IsValidXml()) { ... }. Note that the extension method has to be declared inside a static class.

The only way you can really find out if something will actually parse is to...try and parse it.

An XML document should (but may not) have an XML declaration at the head of the file, following the BOM (if present). It should look something like this:

<?xml version="1.0" encoding="UTF-8" ?>

Though the encoding attribute is, I believe, optional (defaulting to UTF-8). It might also have a standalone attribute whose value is yes or no. If the declaration is present, that's a pretty good indicator that the document is supposed to be valid XML.
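
If you would rather inspect the declaration than guess from the raw text, LINQ to XML exposes it after parsing as XDocument.Declaration, which is null when the document has none. A small sketch, assuming the text is already known to be well-formed (DeclarationInfo and Describe are hypothetical names):

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class DeclarationInfo
{
    // Describes the XML declaration of a well-formed document, or reports that
    // there is none. XmlException propagates if text is not well-formed.
    public static string Describe(string text)
    {
        XDeclaration decl = XDocument.Parse(text).Declaration;
        if (decl == null)
        {
            return "no declaration";
        }

        // Encoding and Standalone are null when the declaration omits them.
        return "version=" + decl.Version
             + ", encoding=" + (decl.Encoding ?? "(none)")
             + ", standalone=" + (decl.Standalone ?? "(none)");
    }
}
```

Keep in mind this only tells you what the declaration says once parsing has already succeeded; it is not a substitute for the parse itself.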

Riffing on @GaryWalker's excellent answer, something like this is about as good as it gets, I think (though the settings might need some tweaking, a custom no-op resolver perhaps). Just for kicks, I generated a 300 MB random XML file using XMark xmlgen (http://www.xml-benchmark.org/): validating it with the code below takes 1.7–1.8 seconds elapsed time on my desktop machine.

using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Schema;

public static bool IsMinimallyValidXml( Stream stream )
{
  XmlReaderSettings settings = new XmlReaderSettings
    {
      CheckCharacters              = true                          ,
      ConformanceLevel             = ConformanceLevel.Document     ,
      DtdProcessing                = DtdProcessing.Ignore          ,
      IgnoreComments               = true                          ,
      IgnoreProcessingInstructions = true                          ,
      IgnoreWhitespace             = true                          ,
      ValidationFlags              = XmlSchemaValidationFlags.None ,
      ValidationType               = ValidationType.None           ,
    } ;
  bool isValid ;

  using ( XmlReader xmlReader = XmlReader.Create( stream , settings ) )
  {
    try
    {
      while ( xmlReader.Read() )
      {
        ; // This space intentionally left blank
      }
      isValid = true ;
    }
    catch (XmlException)
    {
      isValid = false ;
    }
  }
  return isValid ;
}

static void Main( string[] args )
{
  string text = "<foo>This &SomeEntity; is about as simple as it gets.</foo>" ;
  Stream stream = new MemoryStream( Encoding.UTF8.GetBytes(text) ) ;
  bool isValid = IsMinimallyValidXml( stream ) ;
  return ;
}
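
When the text is already in memory as a string, the same read-to-the-end loop can run over a StringReader, which avoids the round trip through bytes. This is only a sketch with a trimmed-down set of settings; XmlWellFormedness and IsWellFormed are hypothetical names, and treating null as "not XML" is my assumption:

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class XmlWellFormedness
{
    // Returns true when text is a well-formed XML document, false otherwise.
    public static bool IsWellFormed(string text)
    {
        // Assumption: a null input counts as "not XML" rather than an error.
        if (text == null)
        {
            return false;
        }

        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Document,
            DtdProcessing = DtdProcessing.Ignore,
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(text), settings))
            {
                // Reading every node forces the parser through the whole document.
                while (reader.Read())
                {
                }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Because nothing is materialized into a tree, this stays cheap on large inputs, at the cost of not giving you a document to work with afterwards.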
