How can I manipulate the DOM from a string of HTML in C#?

JasonBunting already suggested this, but using a .NET wrapper around HTML Tidy and loading the result into an XmlDocument really does work.

I have used this .NET wrapper before:

http://www.codeproject.com/KB/cs/ZetaHtmlTidy.aspx

And implemented it somewhat like this:

string input = "<p>crappy html<br <img src=foo></div>";
HtmlTidy tidy = new HtmlTidy();
string output = tidy.CleanHtml(input, HtmlTidyOptions.ConvertToXhtml);
XmlDocument doc = new XmlDocument();
doc.LoadXml(output);
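
One thing to watch for once the document is loaded: as far as I know, Tidy's XHTML output declares the XHTML namespace on the root element, so a plain XPath like //img will not match anything. Here is a minimal sketch of querying with an XmlNamespaceManager. The string literal stands in for the cleaned output (it is my assumption of what that output roughly looks like), and NamespaceDemo and CountImages are hypothetical names used only for illustration.

```csharp
using System;
using System.Xml;

public static class NamespaceDemo
{
    // Hypothetical helper: counts <img> elements in a namespaced XHTML string.
    public static int CountImages(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Map a prefix of our choosing ("x") to the XHTML namespace so XPath can see the elements.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        return doc.SelectNodes("//x:img", ns).Count;
    }

    public static void Main()
    {
        // Assumed shape of the tidied markup, not actual Tidy output.
        string output = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                      + "<p>text<br /><img src=\"foo\" /></p></body></html>";
        Console.WriteLine(CountImages(output));
    }
}
```

If the cleaned markup also carries a DOCTYPE, the loader may try to resolve the DTD; setting doc.XmlResolver = null before LoadXml is one way to avoid that, though I have not checked it against every framework version.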

Sorry if considered a repost :)


I searched Google for HTML and found Html Agility Pack. I do not know yet whether it is meant for this or not; I am downloading it right now to give it a try.


Depending on what you are trying to do (maybe you can give us more details?) and on whether or not the HTML is well-formed, you could load it into an XmlDocument:

System.Xml.XmlDocument x = new System.Xml.XmlDocument();
x.LoadXml(html); // as long as html is well-formed, i.e. XHTML

Then you could manipulate it easily, without the WebBrowser instance. As for threads, I don't know enough about the implementation of XmlDocument to know the answer to that part.
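
To give an idea of what "manipulate it" could look like, here is a small sketch that only uses System.Xml. It assumes the input is well-formed and has no default namespace; XhtmlDemo and AddClassToParagraphs are hypothetical names made up for this example.

```csharp
using System;
using System.Xml;

public static class XhtmlDemo
{
    // Hypothetical helper: sets a class attribute on every <p> element
    // and returns the serialized markup.
    public static string AddClassToParagraphs(string xhtml, string className)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // SelectNodes takes an XPath expression; "//p" matches every <p> in the document.
        foreach (XmlNode node in doc.SelectNodes("//p"))
        {
            ((XmlElement)node).SetAttribute("class", className);
        }

        return doc.OuterXml;
    }

    public static void Main()
    {
        string html = "<div><p>first</p><p>second</p></div>";
        Console.WriteLine(AddClassToParagraphs(html, "note"));
    }
}
```

The same pattern works for removing nodes (node.ParentNode.RemoveChild(node)) or appending new ones (doc.CreateElement plus AppendChild), all without a WebBrowser instance.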


If the document isn't in proper form, you could use NTidy (.NET wrapper for HTML Tidy) to get it in shape first; I had to do this very thing for a project once and it really wasn't too bad.