Rendered HTML to plain text using Python
I was encountering the same problem trying to parse the rendered HTML. Basically it seems that BS is not the ideal package for this. @Del gives the great html2text solution.
On a differet SO question: BeautifulSoup get_text does not strip all tags and JavaScript @Helge mentioned using nltk. Unfortunately nltk appears to be discontinuing this method.
I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...
Answer from @Helge (nltk).
import nltk
%timeit nltk.clean_html(html)
was returning 153 us per loop
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
Answer above from @del
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop
BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text
. For example:
import html2text
html = open("foobar.html").read()
print html2text.html2text(html)
This outputs:
Some text more text even more text * list item * yet another list item Some other text * list item * yet another list item