How do I perform HTML decoding/encoding using Python/Django?
With the standard library:
HTML Escape
try: from html import escape # python 3.x except ImportError: from cgi import escape # python 2.x print(escape("<"))
HTML Unescape
try: from html import unescape # python 3.4+ except ImportError: try: from html.parser import HTMLParser # python 3.x (<3.4) except ImportError: from HTMLParser import HTMLParser # python 2.x unescape = HTMLParser().unescape print(unescape(">"))
For html encoding, there's cgi.escape from the standard library:
>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
Replace special characters "&", "<" and ">" to HTML-safe sequences.
If the optional flag quote is true, the quotation mark character (")
is also translated.
For html decoding, I use the following:
import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39
def unescape(s):
"unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
return re.sub('&(%s);' % '|'.join(name2codepoint),
lambda m: unichr(name2codepoint[m.group(1)]), s)
For anything more complicated, I use BeautifulSoup.
Given the Django use case, there are two answers to this. Here is its django.utils.html.escape
function, for reference:
def escape(html):
"""Returns the given HTML with ampersands, quotes and carets encoded."""
return mark_safe(force_unicode(html).replace('&', '&').replace('<', '&l
t;').replace('>', '>').replace('"', '"').replace("'", '''))
To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
"""
Returns the ASCII decoded version of the given HTML string. This does
NOT remove normal HTML tags like <p>.
"""
htmlCodes = (
("'", '''),
('"', '"'),
('>', '>'),
('<', '<'),
('&', '&')
)
for code in htmlCodes:
s = s.replace(code[1], code[0])
return s
unescaped = html_decode(my_string)
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape
. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.x:
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# >= Python 3.5:
from html import unescape
unescaped = unescape(my_string)
As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}
Use daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.
I like BeautifulSoup because it can handle malformed XML/HTML :
http://www.crummy.com/software/BeautifulSoup/
for your question, there's an example in their documentation
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacré bleu!",
convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'