BeautifulSoup: Strip specified attributes, but preserve the tag and its contents

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll; I'm not sure. However, this works:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
# BeautifulSoup 3: tag.attrs is a list of (key, value) tuples
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError:
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.
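
For anyone on the newer bs4 package, a rough equivalent might look like the sketch below. It assumes bs4 is installed and uses the built-in html.parser. In bs4, tag.attrs is a dict rather than a list of pairs, so the filter becomes a dict comprehension. The shortened REMOVE_ATTRIBUTES list and the sample markup are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened list, for illustration only
REMOVE_ATTRIBUTES = ['align', 'onmouseout', 'style', 'class']

doc = ('<html><body><p id="firstpara" align="center">This is '
       '<i>paragraph</i> <a onmouseout="">one</a>.</p></body></html>')

soup = BeautifulSoup(doc, 'html.parser')

# soup.descendants yields both tags and text nodes; only tags carry attributes
for node in soup.descendants:
    if isinstance(node, Tag):
        # In bs4, attrs is a dict, so filter it with a dict comprehension
        node.attrs = {key: value for key, value in node.attrs.items()
                      if key not in REMOVE_ATTRIBUTES}

print(soup.prettify())
```

The isinstance check plays the same role as the try/except above: text nodes are skipped instead of raising AttributeError.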

For the record: the problem is that when you pass HTML attributes as keyword arguments, the keyword itself is the name of the attribute. So your code searches for tags with an attribute literally named attribute, because the variable is not expanded.

This is why hard-coding your attribute name worked[0], and why the code does not fail: the search just doesn't match any tags.
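
The keyword-name behaviour is plain Python rather than anything specific to BeautifulSoup, and a small stand-in function shows it. The function show_filters below is hypothetical and only echoes back the keyword arguments it receives.

```python
def show_filters(**kwargs):
    """Hypothetical stand-in for find_all: return the keyword filters received."""
    return kwargs

attribute = 'style'

# The keyword is taken literally; the variable named `attribute` is not expanded
literal = show_filters(attribute=True)

# Unpacking a dict uses the variable's value as the keyword name
expanded = show_filters(**{attribute: True})

print(literal)   # {'attribute': True}
print(expanded)  # {'style': True}
```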

To fix the problem, pass the attribute you are looking for as a dict:

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]
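
A self-contained version of that loop, as a sketch only: it assumes the bs4 package with the built-in html.parser, and the shortened attribute list and sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened list, for illustration only
REMOVE_ATTRIBUTES = ['align', 'onmouseout']

doc = ('<p id="firstpara" align="center">This is <i>paragraph</i> '
       '<a onmouseout="hide()">one</a>.</p>')
soup = BeautifulSoup(doc, 'html.parser')

for attribute in REMOVE_ATTRIBUTES:
    # attrs={attribute: True} matches every tag that has this attribute
    for tag in soup.find_all(attrs={attribute: True}):
        # Removes only the attribute; the tag and its contents stay in place
        del tag[attribute]

print(soup)
```

Because find_all returns a list of matches up front, deleting attributes inside the loop does not disturb the iteration.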

Hth someone in the future, dtk

[0]: Although it needs to be find_all(style=True) in your example, without the quotes, because a quoted keyword raises SyntaxError: keyword can't be an expression
