Remove all inline styles using BeautifulSoup
I wouldn't do this in BeautifulSoup: you'll spend a lot of time trying, testing, and working around edge cases. Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
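A minimal sketch of that whitelist idea in BeautifulSoup 4, to make the "which tags may have which attributes" step concrete. The `ALLOWED` mapping and the `sanitize` name are made up for illustration, and this is not how Bleach itself is implemented. Disallowed tags are unwrapped (their contents stay), which is only one possible policy and is not meant as a security-grade sanitizer.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: tag name -> set of attributes that tag may keep.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}


def sanitize(html, allowed=ALLOWED):
    """Return html with non-whitelisted tags unwrapped and attributes stripped."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so editing the tree
    # inside the loop does not disturb the iteration.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            # Drop the tag itself but keep its children in place.
            tag.unwrap()
            continue
        # Iterate over a copy of the keys because tag.attrs is mutated below.
        for attr in list(tag.attrs):
            if attr not in allowed[tag.name]:
                del tag.attrs[attr]
    return str(soup)


dirty = '<p style="color:red">Hi <a href="/x" onclick="evil()">link</a> <span class="c">text</span></p>'
print(sanitize(dirty))
```

Here the `style` and `onclick` attributes are dropped and the `<span>` is unwrapped, while `href` survives because it is listed for `a`.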
Here's my solution for Python3 and BeautifulSoup4:
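    def remove_attrs(soup, whitelist=()):
        """Strip every attribute not named in whitelist from all tags in soup."""
        for tag in soup.find_all(True):  # True matches every tag
            # Build the list first so tag.attrs isn't mutated while iterating.
            for attr in [attr for attr in tag.attrs if attr not in whitelist]:
                del tag[attr]
        return soup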
It supports a whitelist of attributes that should be kept. :) If no whitelist is supplied, all attributes are removed.
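For illustration, here is how the function above might be called with and without a whitelist. The function is restated so the snippet runs on its own; the sample HTML is made up, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup


def remove_attrs(soup, whitelist=()):
    for tag in soup.find_all(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup


# Made-up sample markup with inline styles.
html = '<div id="main" style="margin:0"><a href="/home" style="color:red">Home</a></div>'

# No whitelist: every attribute goes.
stripped = remove_attrs(BeautifulSoup(html, "html.parser"))
print(stripped)

# Keep href, drop everything else (including style).
kept_href = remove_attrs(BeautifulSoup(html, "html.parser"), whitelist=("href",))
print(kept_href)
```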
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
    for tag in soup():
        for attribute in ["class", "id", "name", "style"]:
            del tag[attribute]
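If only inline styles need to go, a narrower variant is to visit just the tags that actually carry a `style` attribute. This is a sketch with made-up sample HTML, assuming BeautifulSoup 4 and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up sample markup.
html = '<p style="color:red" class="note">Hi <b style="font-weight:normal">there</b></p>'
soup = BeautifulSoup(html, "html.parser")

# style=True matches every tag that has a style attribute, whatever its value.
for tag in soup.find_all(style=True):
    del tag["style"]

print(soup)
```

The `class` attribute is left alone because only `style` is deleted.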
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
    for tag in soup("script"):
        tag.decompose()
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
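A small side-by-side of the two calls, using made-up sample HTML and assuming BeautifulSoup 4 with `html.parser`: `extract()` hands back the removed tag so it can still be inspected or reused, while `decompose()` just destroys it.

```python
from bs4 import BeautifulSoup

# Made-up sample markup.
html = "<div><script>track()</script><p>Keep me</p><style>p {color: red}</style></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and returns it.
removed = soup.script.extract()
print(removed)

# decompose() removes the tag and its contents without returning anything useful.
for tag in soup("style"):
    tag.decompose()

print(soup)
```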