Reading Xml with XmlReader in C#
My experience of XmlReader
is that it's very easy to accidentally read too much. I know you've said you want to read it as quickly as possible, but have you tried using a DOM model instead? I've found that LINQ to XML makes XML work much much easier.
If your document is particularly huge, you can combine XmlReader
and LINQ to XML by creating an XElement
from an XmlReader
for each of your "outer" elements in a streaming manner: this lets you do most of the conversion work in LINQ to XML, but still only need a small portion of the document in memory at any one time. Here's some sample code (adapted slightly from this blog post):
static IEnumerable<XElement> SimpleStreamAxis(string inputUrl,
string elementName)
{
using (XmlReader reader = XmlReader.Create(inputUrl))
{
reader.MoveToContent();
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
if (reader.Name == elementName)
{
XElement el = XNode.ReadFrom(reader) as XElement;
if (el != null)
{
yield return el;
}
}
}
}
}
}
I've used this to convert the StackOverflow user data (which is enormous) into another format before - it works very well.
EDIT from radarbob, reformatted by Jon - although it's not quite clear which "read too far" problem is being referred to...
This should simplify the nesting and take care of the "a read too far" problem.
using (XmlReader reader = XmlReader.Create(inputUrl))
{
reader.ReadStartElement("theRootElement");
while (reader.Name == "TheNodeIWant")
{
XElement el = (XElement) XNode.ReadFrom(reader);
}
reader.ReadEndElement();
}
This takes care of "a read too far" problem because it implements the classic while loop pattern:
initial read;
(while "we're not at the end") {
do stuff;
read;
}
Three years later, perhaps with the renewed emphasis on WebApi and xml data, I came across this question. Since codewise I am inclined to follow Skeet out of an airplane without a parachute, and seeing his initial code doubly corraborated by the MS Xml team article as well as an example in BOL Streaming Transform of Large Xml Docs, I very quickly overlooked the other comments, most specifically from 'pbz', who pointed out that if you have the same elements by name in succession, every other one is skipped because of the double read. And in fact, the BOL and MS blog articles both were parsing source documents with target elements nested deeper than second level, masking this side-effect.
The other answers address this problem. I just wanted to offer a slightly simpler revision that seems to work well so far, and takes into account that the xml might come from different sources, not just a uri, and so the extension works on the user managed XmlReader. The one assumption is that the reader is in its initial state, since otherwise the first 'Read()' might advance past a desired node:
public static IEnumerable<XElement> ElementsNamed(this XmlReader reader, string elementName)
{
reader.MoveToContent(); // will not advance reader if already on a content node; if successful, ReadState is Interactive
reader.Read(); // this is needed, even with MoveToContent and ReadState.Interactive
while(!reader.EOF && reader.ReadState == ReadState.Interactive)
{
// corrected for bug noted by Wes below...
if(reader.NodeType == XmlNodeType.Element && reader.Name.Equals(elementName))
{
// this advances the reader...so it's either XNode.ReadFrom() or reader.Read(), but not both
var matchedElement = XNode.ReadFrom(reader) as XElement;
if(matchedElement != null)
yield return matchedElement;
}
else
reader.Read();
}
}
We do this kind of XML parsing all the time. The key is defining where the parsing method will leave the reader on exit. If you always leave the reader on the next element following the element that was first read then you can safely and predictably read in the XML stream. So if the reader is currently indexing the <Account>
element, after parsing the reader will index the </Accounts>
closing tag.
The parsing code looks something like this:
public class Account
{
string _accountId;
string _nameOfKin;
Statements _statmentsAvailable;
public void ReadFromXml( XmlReader reader )
{
reader.MoveToContent();
// Read node attributes
_accountId = reader.GetAttribute( "accountId" );
...
if( reader.IsEmptyElement ) { reader.Read(); return; }
reader.Read();
while( ! reader.EOF )
{
if( reader.IsStartElement() )
{
switch( reader.Name )
{
// Read element for a property of this class
case "NameOfKin":
_nameOfKin = reader.ReadElementContentAsString();
break;
// Starting sub-list
case "StatementsAvailable":
_statementsAvailable = new Statements();
_statementsAvailable.Read( reader );
break;
default:
reader.Skip();
}
}
else
{
reader.Read();
break;
}
}
}
}
The Statements
class just reads in the <StatementsAvailable>
node
public class Statements
{
List<Statement> _statements = new List<Statement>();
public void ReadFromXml( XmlReader reader )
{
reader.MoveToContent();
if( reader.IsEmptyElement ) { reader.Read(); return; }
reader.Read();
while( ! reader.EOF )
{
if( reader.IsStartElement() )
{
if( reader.Name == "Statement" )
{
var statement = new Statement();
statement.ReadFromXml( reader );
_statements.Add( statement );
}
else
{
reader.Skip();
}
}
else
{
reader.Read();
break;
}
}
}
}
The Statement
class would look very much the same
public class Statement
{
string _satementId;
public void ReadFromXml( XmlReader reader )
{
reader.MoveToContent();
// Read noe attributes
_statementId = reader.GetAttribute( "statementId" );
...
if( reader.IsEmptyElement ) { reader.Read(); return; }
reader.Read();
while( ! reader.EOF )
{
....same basic loop
}
}
}