Bash command to convert an HTML page to a text file
The easiest way is to use a text-mode browser's dump option, which prints the text version of the viewable HTML.
Remote file:
lynx --dump www.google.com > file.txt
links -dump www.google.com > file.txt
Local file:
lynx --dump ./1.htm > file.txt
links -dump ./1.htm > file.txt
With charset conversion to UTF-8:
lynx -dump -display_charset UTF-8 ./1.htm
links -dump -codepage UTF-8 ./1.htm
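To convert several local files in one go, the commands above can be wrapped in a small shell helper. This is only a sketch: the function names html_to_text and convert_dir are made up for illustration, and it assumes the pages end in .html and that either lynx or links is installed.

```shell
#!/bin/sh
# html_to_text FILE: print a plain-text rendering of a local HTML file.
# Hypothetical helper; it only uses the lynx/links flags shown above.
html_to_text() {
    if command -v lynx >/dev/null 2>&1; then
        lynx -dump -display_charset UTF-8 "$1"
    elif command -v links >/dev/null 2>&1; then
        links -dump -codepage UTF-8 "$1"
    else
        echo "html_to_text: neither lynx nor links is installed" >&2
        return 127
    fi
}

# convert_dir [DIR]: write a .txt file next to every .html file in DIR
# (defaults to the current directory). Also a hypothetical helper.
convert_dir() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue              # glob matched nothing
        out="${f%.html}.txt"
        # Remove the partial output and stop if the conversion fails.
        html_to_text "$f" > "$out" || { rm -f "$out"; return 1; }
    done
}
```

After sourcing the file, convert_dir ./pages would leave ./pages/1.txt next to ./pages/1.html, and so on for each page in that directory.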
Alternatively, you can run html2text.py from the command line.
Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevent when -g is
                        specified as well
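As a rough example of combining these options, the wrapper below drops link and image markup and turns off line wrapping. The function name page_to_text is hypothetical, the utf-8 default is an assumption, and the script must be reachable as html2text.py in your PATH. Note that html2text produces Markdown-style text, so headings and lists keep some markup.

```shell
#!/bin/sh
# page_to_text FILE_OR_URL [ENCODING]: hypothetical wrapper around html2text.py.
page_to_text() {
    if ! command -v html2text.py >/dev/null 2>&1; then
        echo "page_to_text: html2text.py not found in PATH" >&2
        return 127
    fi
    # --ignore-links / --ignore-images: no formatting for links and images
    # -b 0: no wrapping of output lines
    # The encoding defaults to utf-8 here (an assumption, override as needed).
    html2text.py --ignore-links --ignore-images -b 0 "$1" "${2:-utf-8}"
}
```

Usage would then look like page_to_text ./1.htm > file.txt.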