Escape unescaped characters in XML with Python
If you don't care about invalid characters in the xml you could use XML parser's recover
option (see Parsing broken XML with lxml.etree.iterparse):
from lxml import etree
parser = etree.XMLParser(recover=True) # recover from bad characters.
root = etree.fromstring(broken_xml, parser=parser)
print etree.tostring(root)
Output
<root>
<element>
<name>name surname</name>
<mail>[email protected]</mail>
</element>
</root>
You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.
Even simpler, if there aren't any SGML entities (&...;
) in the code, html=html.replace('&','&')
will do the trick.
Otherwise, try this:
x ="<html><h1>Fish & Chips & Gravy</h1><p>Fish & Chips & Gravy</p>"
import re
q=re.sub(r'&([^a-zA-Z#])',r'&\1',x)
print q
Essentially the regex looks for &
not followed by alpha-numeric or # characters. It won't deal with ampersands at the end of lines, but that's probably fixable.