What's the right way to decode a string that has special HTML entities in it?
This is my favourite way of decoding HTML characters. The advantage of using this code is that tags are also preserved.
function decodeHtml(html) {
var txt = document.createElement("textarea");
txt.innerHTML = html;
return txt.value;
}
Example: http://jsfiddle.net/k65s3/
Input:
Entity: Bad attempt at XSS:<script>alert('new\nline?')</script><br>
Output:
Entity: Bad attempt at XSS:<script>alert('new\nline?')</script><br>
Don’t use the DOM to do this. Using the DOM to decode HTML entities (as suggested in the currently accepted answer) leads to differences in cross-browser results.
For a robust & deterministic solution that decodes character references according to the algorithm in the HTML Standard, use the he library. From its README:
he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.
Here’s how you’d use it:
he.decode("We're unable to complete your request at this time.");
→ "We're unable to complete your request at this time."
Disclaimer: I'm the author of the he library.
See this Stack Overflow answer for some more info.
If you don't want to use html/dom, you could use regex. I haven't tested this; but something along the lines of:
function parseHtmlEntities(str) {
return str.replace(/&#([0-9]{1,3});/gi, function(match, numStr) {
var num = parseInt(numStr, 10); // read num as normal number
return String.fromCharCode(num);
});
}
[Edit]
Note: this would only work for numeric html-entities, and not stuff like &oring;.
[Edit 2]
Fixed the function (some typos), test here: http://jsfiddle.net/Be2Bd/1/