Clean Up HTML in Python
I would suggest BeautifulSoup. Its parser deals with malformed tags quite gracefully. Once you've parsed the document into a tree, you can simply serialize it back out.
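from bs4 import BeautifulSoup
# bad_html holds the malformed markup as a string.
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()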
I've used this many times and it works wonders. BeautifulSoup really shines when all you need is to pull data out of bad HTML.
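As a rough illustration of pulling data out of broken markup, here is a minimal sketch. The sample string and the helper name extract_links_and_text are made up for this example, and it assumes the built-in "html.parser" backend; other parsers (lxml, html5lib) may repair the tree differently.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: unclosed <p>, <b>, and <a> tags.
bad_html = "<p>First <b>bold<p>Second <a href='/x'>link"


def extract_links_and_text(markup):
    """Return (list of href values, visible text) from possibly broken HTML."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all still locates the <a> tag even though it was never closed.
    links = [a.get("href") for a in soup.find_all("a")]
    # Join the text fragments with single spaces, stripping stray whitespace.
    text = soup.get_text(" ", strip=True)
    return links, text


links, text = extract_links_and_text(bad_html)
print(links)
print(text)
```

The point is that you never have to repair the HTML yourself: you query the parsed tree and ignore how ragged the source was.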
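An example of cleaning up HTML using the lxml.html.clean.Cleaner class.
Requires the lxml module: pip install lxml (it's a native module written in C, so it might be faster than pure-Python solutions). Note that since lxml 5.2 the cleaner has been split out into the separate lxml_html_clean package, so on newer versions you also need pip install lxml_html_clean.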
import sys
from lxml.html.clean import Cleaner

def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                      meta=True,
                      embedded=True,
                      links=True,
                      style=True,
                      processing_instructions=True,
                      inline_style=True,
                      scripts=True,
                      javascript=True,
                      comments=True,
                      frames=True,
                      forms=True,
                      annoying_tags=True,
                      remove_unknown_tags=True,
                      safe_attrs_only=True,
                      safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                      remove_tags=('span', 'font', 'div')
                      )
    return cleaner.clean_html(dirty_html)

if __name__ == '__main__':
    with open(sys.argv[1]) as fin:
        print(sanitize(fin.read()))
Check out the docs for a full list of options you can pass to the Cleaner.
There are Python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code: there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.
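One way to make that manual review easier is to diff the original markup against whatever the cleaner produced. The sketch below uses only the standard library's difflib; the helper name html_diff and the before/after strings are hypothetical stand-ins for your own input and your cleaner's output.

```python
import difflib


def html_diff(before, after):
    """Return a unified diff of two HTML strings, compared line by line."""
    return "\n".join(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="original",
            tofile="cleaned",
            lineterm="",
        )
    )


# Hypothetical before/after pair; in practice "cleaned" would come from
# BeautifulSoup, lxml's Cleaner, or Tidy.
original = "<p>Hello\n<script>alert(1)</script>\n<p>World"
cleaned = "<p>Hello</p>\n<p>World</p>"

print(html_diff(original, cleaned))
```

Lines starting with "-" were dropped or changed by the cleaner and lines starting with "+" are what it emitted instead, so unexpected removals stand out quickly.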