Text between two tags
If you only want ... of all <tr>...</tr>, do:
grep -o '<tr>.*</tr>' HTMLFILE | sed 's/\(<tr>\|<\/tr>\)//g' > NEWFILE
For multiline do:
tr "\n" "|" < HTMLFILE | grep -o '<tr>.*</tr>' | sed 's/\(<tr>\|<\/tr>\)//g;s/|/\n/g' > NEWFILE
First check whether HTMLFILE already contains the character "|" (unusual, but possible). If it does, use a different placeholder character that does not occur in the file.
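The same regex idea can be sketched in Python, which avoids the placeholder character altogether because `re.DOTALL` lets `.` match newlines. This is only a sketch under the same assumption as the grep pattern above (literal `<tr>` tags without attributes); the names `TR_PATTERN`, `rows_between_tr` and `sample` are made up for illustration. The non-greedy `.*?` keeps each row separate even when several rows share a line.

```python
import re

# Non-greedy match so each <tr>...</tr> pair is captured separately;
# re.DOTALL lets "." match newlines, so a row may span several lines.
TR_PATTERN = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)


def rows_between_tr(html_text):
    """Return the text between each literal <tr> and </tr> pair."""
    return TR_PATTERN.findall(html_text)


# Hypothetical sample input; note the "|" needs no special treatment here.
sample = "<table><tr><td>a</td></tr>\n<tr>\n<td>b|c</td>\n</tr></table>"
for row in rows_between_tr(sample):
    print(row)
```

Like the grep version, this is still a regular expression applied to HTML, so it will miss rows such as `<tr class="odd">`.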
You do have a requirement that warrants an HTML parser: you need to parse HTML. Perl's HTML::TreeBuilder, Python's BeautifulSoup and others are easy to use, easier than writing complex and brittle regular expressions.
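perl -MHTML::TreeBuilder -le '
    $html = HTML::TreeBuilder->new_from_file($ARGV[0]) or die $!;
    # Print the inner HTML of every <tr>; text children are plain strings, not objects
    foreach ($html->look_down(_tag => "tr")) {
        print map { ref($_) ? $_->as_HTML() : $_ } $_->content_list();
    }
' input.html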
or
python3 -c 'if True:
    import sys
    from bs4 import BeautifulSoup
    html = BeautifulSoup(open(sys.argv[1]).read(), "html.parser")
    # Print the inner HTML of every <tr>
    for tr in html.find_all("tr"):
        print(tr.decode_contents())
' input.html
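If installing a parser module is not an option, a rough equivalent can be put together with `html.parser` from Python's standard library. The sketch below is an illustration, not a drop-in replacement: the class name `TrContents` and the sample markup are invented here, and it assumes every `<tr>` is explicitly closed and that tables are not nested. Unlike the regex approach, it copes with attributes on the `<tr>` tag.

```python
from html.parser import HTMLParser


class TrContents(HTMLParser):
    """Collect the raw markup found inside each <tr> element."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.rows = []      # finished rows, one string per <tr>
        self._parts = None  # pieces of the current row, or None outside <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._parts = []
        elif self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "tr":
            self.rows.append("".join(self._parts))
            self._parts = None
        else:
            self._parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._parts is not None:
            self._parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self._parts is not None:
            self._parts.append("&#%s;" % name)


# Hypothetical sample input, standing in for the contents of input.html.
parser = TrContents()
parser.feed('<table><tr class="x"><td>a &amp; b</td></tr>\n'
            '<tr>\n<td>c</td>\n</tr></table>')
parser.close()
for row in parser.rows:
    print(row)
```

To run it on a real file, the string passed to `feed()` could be replaced with `open("input.html").read()`.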
sed and awk are not well suited to this task; you should rather use a proper HTML parser, for example hxselect from w3.org:
<htmlfile hxselect -s '\n' -c 'tr'