Command-line CSS selector tool
Use the W3C html-xml-utils tools (hxnormalize and hxselect) for HTML/XML parsing and extraction of content using CSS selectors. For example:
hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "td.data"
This will produce the desired output:
Tabular Content 1
Tabular Content 2
Using a line length of 240 characters ensures that elements with long content will not be split across multiple lines. The hxnormalize -x command creates a well-formed XML document, which can then be processed by hxselect.
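As a minimal sketch of the same pipeline against a live page (the URL is a placeholder; on Debian/Ubuntu the tools ship in the html-xml-utils package):
sudo apt-get install html-xml-utils
curl -s https://example.com/report.html \
  | hxnormalize -l 240 -x \
  | hxselect -s '\n' -c 'td.data'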
CSS Solution
The Element Finder command will partially accomplish this task:
- https://github.com/keeganstreet/element-finder
- http://keegan.st/2012/06/03/find-in-files-with-css-selectors/
For example:
elfinder -j -s td.data -x "html"
This renders the results in JSON format, from which the matched content can be extracted.
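Element Finder is a Node.js tool; assuming it is published on npm under the name element-finder, installation and a quick inspection of the JSON output could look like this (the pretty-printing step is only for readability):
npm install -g element-finder
elfinder -j -s "td.data" -x "html" | python3 -m json.tool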
XML Solution
The XML::Twig module (sudo apt-get install xml-twig-tools) comes with a tool named xml_grep that is able to do just that, provided your HTML is well-formed, of course.
I'm sorry I'm not able to test this at the moment, but something like this should work:
xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.html
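Since real-world HTML is rarely well-formed XML, one rough approach is to clean it up first, for example with HTML Tidy (file names here are placeholders, and tidy's warnings are discarded):
tidy -q -asxml file.html > file.xhtml 2>/dev/null
xml_grep -t '*/div[@class="content"]/table/tbody/tr/td[@class="data"]' file.xhtml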
pup (https://github.com/ericchiang/pup) has a CSS-based query language that conforms closely to your example. In fact, with your input, the following command:
pup "body > div.content > table > tbody > tr > td.data text{}"
produces:
Tabular Content 1
Tabular Content 2
The trailing text{} removes the HTML tags and prints only the text content.
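pup is a single Go binary; assuming a standard Go toolchain (or Homebrew on macOS), installing it and running the same query against a live page might look like this (the URL is a placeholder):
go install github.com/ericchiang/pup@latest
# or: brew install pup
curl -s https://example.com/report.html \
  | pup 'body > div.content > table > tbody > tr > td.data text{}'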
One nice feature is that the full path need not be given, so that again with your example:
$ pup 'td.data text{}' < input.html
Tabular Content 1
Tabular Content 2
One advantage of pup is that it uses the golang.org/x/net/html package for parsing HTML5.
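pup also offers display functions besides text{}; for instance, json{} serializes the matched nodes and attr{} prints a single attribute (shown here against the same hypothetical input.html):
pup 'td.data json{}' < input.html
pup 'a attr{href}' < input.html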