Escape unescaped characters in XML with Python

If you don't care about invalid characters in the xml you could use XML parser's recover option (see Parsing broken XML with lxml.etree.iterparse):

from lxml import etree

parser = etree.XMLParser(recover=True) # recover from bad characters.
root = etree.fromstring(broken_xml, parser=parser)
print etree.tostring(root)

Output

<root>
<element>
<name>name  surname</name>
<mail>[email protected]</mail>
</element>
</root>

You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.

Even simpler, if there aren't any SGML entities (&...;) in the code, html=html.replace('&','&') will do the trick.

Otherwise, try this:

x ="<html><h1>Fish & Chips & Gravy</h1><p>Fish &amp; Chips &#x0026; Gravy</p>"
import re
q=re.sub(r'&([^a-zA-Z#])',r'&amp;\1',x)
print q

Essentially the regex looks for & not followed by alpha-numeric or # characters. It won't deal with ampersands at the end of lines, but that's probably fixable.

Escape unescaped characters in XML with Python

Output

Tags:

Python

Xml

Beautifulsoup

Special Characters

Lxml

Related

Recent Posts