How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'
See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:
An incoming HTML or XML entity is always converted into the corresponding Unicode character.
Yes, &nbsp; is turned into a non-breaking space character. If you really want those to be space characters instead, you'll have to do a Unicode replace.
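To make the "Unicode replace" concrete, here is a minimal sketch in Python 3. It assumes the standard-library "html.parser" backend (the snippet above does not name a parser) and shows two options: replacing on the extracted text, or rewriting the text nodes inside the tree.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used here; the original snippet names no parser.
soup = BeautifulSoup("<div>a&nbsp;b</div>", "html.parser")

# Option 1: replace on the extracted text.
text = soup.div.get_text()            # the entity is now U+00A0
plain = text.replace("\xa0", " ")     # ordinary space instead

# Option 2: rewrite the text nodes in the tree itself.
for node in soup.find_all(string=True):
    if "\xa0" in node:
        node.replace_with(node.replace("\xa0", " "))

markup = str(soup)
```

Option 2 changes the document in place, so any later output from `soup` contains ordinary spaces without needing a custom formatter.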
You can simply replace the non-breaking space character (U+00A0) with a normal space.
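# Note: `soup` here must be a plain Unicode string (e.g. text already
# extracted from the document), not a BeautifulSoup object.
nonBreakSpace = u'\xa0'
soup = soup.replace(nonBreakSpace, ' ')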
A benefit of this approach is that it is an ordinary string operation, so it works even if you are not using BeautifulSoup at all.
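As an illustration of the BeautifulSoup-free route, the following sketch uses only the Python 3 standard library. The input string `raw` is a hypothetical example, and `html.unescape` is swapped in here to decode the entity; it is not part of the answer above.

```python
import html

raw = "a&nbsp;b"                        # hypothetical input containing the entity

decoded = html.unescape(raw)            # "&nbsp;" becomes U+00A0
cleaned = decoded.replace("\xa0", " ")  # swap it for an ordinary space
```

This only decodes entities and replaces characters; it does not parse or strip tags, so it suits text fragments rather than whole documents.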