Decoding ampersand hash strings (&#124&#120&#97) etc
The correct format for a character reference is &#nnnn; so the trailing ; is missing in your example. You can add the ; and then use HTMLParser.unescape():
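from HTMLParser import HTMLParser  # Python 2 module
import re
x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' to each numeric reference
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)
This gives the following output:
&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;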
Blasterjaxx
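If the data may mix terminated and unterminated references, a plain substitution would double the ; on references that already have one. The following is a minimal sketch of one way to guard against that, assuming Python 3 and its html module; the helper name add_missing_semicolons and the sample string are made up for illustration and are not part of the answer above.

```python
import html
import re

# Hypothetical helper: terminate numeric character references
# that are missing their ';'.
def add_missing_semicolons(text):
    # (?![0-9;]) keeps the match from stopping in the middle of a number
    # and skips references that already end in ';'.
    return re.sub(r'(&#[0-9]+)(?![0-9;])', r'\1;', text)

raw = '&#66&#108&#97;&#115&#116'  # mixed: '&#97;' is already terminated
fixed = add_missing_semicolons(raw)
print(fixed)                 # &#66;&#108;&#97;&#115;&#116;
print(html.unescape(fixed))  # Blast
```

The negative lookahead is the design choice that matters here: without the digit in the character class, the regex engine could backtrack and insert a ; between two digits of an already valid reference.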
Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.
However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:
s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])
Output:
Blasterjaxx
The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:
''.join([chr(int(u)) for u in s.split('&#')[1:]])
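The split-based approach assumes the whole string is made up of references; any ordinary text between them would make int() fail. As a hedged alternative, here is a small Python 3 sketch that substitutes each decimal reference in place and leaves other text alone. The name decode_numeric_refs and the sample input are hypothetical, and the sketch assumes every number is a valid code point.

```python
import re

# Hypothetical helper: decode decimal references such as '&#66' or '&#66;'
# and leave all other text untouched.
def decode_numeric_refs(text):
    # The ';' is optional so both terminated and bare references match.
    return re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), text)

print(decode_numeric_refs('DJ: &#66&#108&#97&#115&#116!'))  # DJ: Blast!
```

Because chr() in Python 3 accepts any Unicode code point, this is not limited to ASCII, but an out-of-range number would still raise ValueError.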
In Python 3, use the html module:
>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '
docs: https://docs.python.org/3/library/html.html