Jsoup-like HTML parser for C++
If you are familiar with the Qt framework, the most convenient option is QWebElement from the Qt WebKit module (Reference here).
Otherwise (as another post suggests), using Tidy to convert the HTML to valid XML and then reading it with an XML parser such as libxml++ is a good option. You can find sample code showing these two steps here.
Chromium has an open source parser. Also, the Google gumbo-parser looks cool.
Unfortunately, I guess there is no parser quite like Jsoup for C++ ...
Besides the libraries already mentioned here, there is a good overview of C++ (and some C) parsers here: Free C or C++ XML Parser Libraries
For (HTML) DOM parsing I used TinyXML-2; it is a very small library (only 2 files) that runs on most operating systems, even non-desktop ones. Note that it is an XML parser, so the markup has to be well-formed (XHTML, or HTML cleaned up by something like Tidy).
LibXml
- push and pull parser (DOM, SAX)
- Validation
- XPath and XPointer support
- Cross-platform / good documentation
Apache Xerces
- push and pull parser (DOM, SAX)
- Validation
- No XPath support (but a package for this?)
- Cross-platform / good documentation
If you are on C++/CLI, check out NSoup, a Jsoup port for .NET.
Some more:
- htmlcxx - HTML and CSS APIs for C++
- MSHTML (?)
- pugixml (DOM / XPath and Unicode support)
- LibCSS (CSS Parser) / LibDOM (DOM) (however, both in C)
- hcxselect (CSS selector engine for C++)
Maybe you can combine a DOM parser with a CSS selector engine to get something close to Jsoup?
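To give an idea of what that combination could look like, here is a rough sketch in plain standard C++. Everything in it is made up for illustration: `Node`, `SimpleSelector`, `parseSelector`, `matches`, `selectAll` and `addChild` are hypothetical names, not the API of any library listed above. It assumes some parser has already produced a tree, and it only understands simple selectors of the form `tag`, `#id`, `.class` or a combination like `div.box#main` (no descendant or attribute selectors).

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node. A real parser (gumbo, htmlcxx, pugixml, ...)
// ships its own node type; you would adapt the matcher to its accessors.
struct Node {
    std::string tag;                              // element name, e.g. "div"
    std::string id;                               // id attribute, may be empty
    std::vector<std::string> classes;             // entries of the class attribute
    std::vector<std::unique_ptr<Node>> children;  // child elements in document order
};

// Parsed form of a simple selector such as "a", "#main", ".ext" or "div.box#main".
struct SimpleSelector {
    std::string tag;
    std::string id;
    std::vector<std::string> classes;
};

// Split the selector text at '#' and '.' markers.
// Assumption: no combinators, pseudo-classes or attribute selectors.
SimpleSelector parseSelector(const std::string& text) {
    SimpleSelector sel;
    std::string* target = &sel.tag;  // characters before any marker form the tag
    for (char c : text) {
        if (c == '#') {
            target = &sel.id;
        } else if (c == '.') {
            sel.classes.emplace_back();
            target = &sel.classes.back();
        } else {
            target->push_back(c);
        }
    }
    return sel;
}

// True if the node satisfies every part of the selector.
bool matches(const Node& node, const SimpleSelector& sel) {
    if (!sel.tag.empty() && sel.tag != "*" && sel.tag != node.tag) return false;
    if (!sel.id.empty() && sel.id != node.id) return false;
    for (const std::string& cls : sel.classes) {
        if (std::find(node.classes.begin(), node.classes.end(), cls) == node.classes.end())
            return false;
    }
    return true;
}

// Depth-first walk that collects matching nodes in document order.
void collect(const Node& node, const SimpleSelector& sel, std::vector<const Node*>& out) {
    if (matches(node, sel)) out.push_back(&node);
    for (const auto& child : node.children) collect(*child, sel, out);
}

// Jsoup-style entry point: returns every node under root (root included) that matches.
std::vector<const Node*> selectAll(const Node& root, const std::string& selector) {
    std::vector<const Node*> out;
    collect(root, parseSelector(selector), out);
    return out;
}

// Convenience helper for building a tree by hand (stands in for a real parser).
Node& addChild(Node& parent, std::string tag, std::string id = "",
               std::vector<std::string> classes = {}) {
    std::unique_ptr<Node> child(new Node);
    child->tag = std::move(tag);
    child->id = std::move(id);
    child->classes = std::move(classes);
    parent.children.push_back(std::move(child));
    return *parent.children.back();
}
```

With a tree in hand, `selectAll(root, "a.ext")` returns pointers to all `a` elements carrying the class `ext`. In practice you would fill the tree from whichever parser you picked, or use a ready-made selector engine such as hcxselect on top of htmlcxx instead of rolling your own.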