Extract part of the code and parse HTML in bash
For your purposes, a quick solution is a one-liner:
sed -n '/<table class="my-table">/,/<\/table>/p' <file>
Explanation:
sed works line by line, so this prints every line from the one containing the opening tag through the next one containing the closing tag, in this case <table>.
You could also easily put the tag in a variable, e.g. <body> or <p>, and change the output on the fly. The above solution gives what you asked for without external tools.
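The tag variable mentioned above could look like the sketch below. The function name extract_tag and its arguments are hypothetical, not part of sed, and the pattern assumes the tag name is followed by a space or > on the same line. A sed range only starts looking for the end pattern on the line after the start match, so an element that opens and closes on one line (such as the <p> lines in the sample) is not cut off correctly. This suits multi-line elements like <table> or <body>.

```shell
#!/usr/bin/env bash

# extract_tag TAG FILE (hypothetical helper)
# Prints every line from the first line containing "<TAG " or "<TAG>"
# through the next line containing "</TAG>".
# Assumption: the opening and closing tags sit on different lines.
extract_tag() {
    local tag=$1 file=$2
    sed -n "/<${tag}[ >]/,/<\/${tag}>/p" "$file"
}
```

Usage would be, for example, extract_tag table YourHTML.html or extract_tag body YourHTML.html.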
Below I break down a second approach using xmllint, which supports a --html flag for parsing HTML files.
First, you can sanity-check your HTML file by parsing it as below. xmllint prints the parsed document and reports any parse errors it finds:
$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
<p>... dolor.</p>
</body>
</html>
Here YourHTML.html is simply the input HTML file from your question.
Now for the value extraction part:
Address the table node by its path from the root node (//html/body/table) and run xmllint in HTML parser and interactive shell mode (xmllint --html --shell).
Running the command as-is produces:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html
/ > -------
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
/ >
Removing the shell prompt lines (those starting with / >) with sed '/^\/ >/d' produces:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d'
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
which is the output structure you expected. Tested with xmllint (using libxml version 20900).
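If your xmllint build has the --xpath option (an assumption; check xmllint --help), the same node can probably be printed without the interactive shell, so there are no prompt lines to strip. A minimal sketch, where table_node is a hypothetical helper name:

```shell
#!/usr/bin/env bash

# table_node FILE (hypothetical helper)
# Evaluates the XPath expression directly instead of piping "cat ..."
# into the interactive shell. Parser warnings on stderr are discarded.
# Assumption: this xmllint supports --xpath together with --html.
table_node() {
    xmllint --html --xpath '//html/body/table' "$1" 2>/dev/null
}
```

Called as table_node YourHTML.html, this should print the <table> element with no / > lines around it.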
Going one step further, if you want to fetch the values within the table tag, you can add another sed command that strips the tags:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*>//g' | xargs
Company Contact
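For completeness, the same values can be pulled out with sed alone by combining the range one-liner from the top with the tag-stripping step. This is only a sketch: table_text is a hypothetical helper name, it assumes the <table> and </table> tags are on their own lines, and it swaps xargs for paste because xargs gives quotes and backslashes in the input special treatment.

```shell
#!/usr/bin/env bash

# table_text FILE (hypothetical helper)
# 1. Keep only the lines from <table ...> through </table>.
# 2. Strip every tag, trim leading whitespace, drop lines left empty.
# 3. Join the remaining lines with single spaces.
table_text() {
    sed -n '/<table[ >]/,/<\/table>/p' "$1" |
        sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e '/^$/d' |
        paste -s -d ' ' -
}
```

Like the one-liner, this treats the HTML as plain lines of text, so it only fits simple, predictable markup.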