HTML and Attribute encoding
HTML encoding replaces certain characters that are semantically meaningful in HTML markup with equivalent character entities that can be displayed to the user without affecting how the markup is parsed.
The most significant and obvious characters are <, >, &, and ", which are replaced with &lt;, &gt;, &amp;, and &quot;, respectively. Additionally, an encoder may replace high-order characters with the equivalent HTML entity encoding, so content can be preserved and properly rendered even in the event the page is sent to the browser as ASCII.
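As a rough illustration of the above, here is a minimal Python sketch. It leans on the standard library's html.escape for the markup characters; the html_encode name and the ascii_only flag are my own, not part of any library, and the non-ASCII handling is just one way an encoder might treat high-order characters.

```python
import html


def html_encode(text: str, ascii_only: bool = False) -> str:
    """Encode text so it can be placed between HTML tags (hypothetical helper)."""
    # html.escape replaces &, < and >; with quote=True it also replaces " and '.
    encoded = html.escape(text, quote=True)
    if ascii_only:
        # Assumption: the page may be served as ASCII, so every non-ASCII
        # character is turned into a numeric character reference.
        encoded = encoded.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return encoded


print(html_encode('<a href="x">T&C</a>'))
print(html_encode("café", ascii_only=True))
```

The first call prints the markup with its special characters replaced by entities, so a browser would show the tag as literal text instead of rendering a link.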
HTML attribute encoding, on the other hand, only replaces the subset of those characters that matter for preventing a string from breaking out of the attribute of an HTML element. Specifically, you'd typically just replace ", &, and < with &quot;, &amp;, and &lt;. This is because attributes, the data they contain, and how they are parsed and interpreted by a browser or HTML parser differ from how an HTML document and its elements are read.
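A minimal sketch of that narrower encoding, assuming the attribute value will always be wrapped in double quotes in the final markup. The attribute_encode name is made up for this example.

```python
def attribute_encode(value: str) -> str:
    """Encode a string for use inside a double-quoted HTML attribute.

    Hypothetical helper: it only touches the three characters named above
    and is NOT safe for unquoted or single-quoted attributes.
    """
    # Replace & first, otherwise the ampersands introduced by the other
    # replacements would themselves be encoded a second time.
    return (
        value.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("<", "&lt;")
    )


print(attribute_encode('Tom & "Jerry" <3 >'))
```

Note that > is left alone here, which is the main visible difference from the HTML encoding case.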
In terms of how that relates to XSS, you want to properly sanitize strings from an outside source (such as the user) so they don't break your page, or more importantly, inject markup and script that can alter or destroy your application or affect your users' machines (by taking advantage of browser or platform vulnerabilities).
If you want to display user-generated content in your page, you'd HTML encode the string and then place it in your markup. Everything the user entered is then displayed literally, and you don't have to worry about XSS or broken markup.
If you needed to attach user-generated content to an element in an attribute (for example, a tooltip on a link), you'd attribute encode to make sure the content doesn't break the element's markup.
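To make the two cases concrete, here is a small sketch that puts user content into both the body of a link and its title attribute (the tooltip). render_link is a hypothetical helper, and it assumes href is a trusted, application-generated URL; validating user-supplied URLs is a separate problem this does not address.

```python
import html


def render_link(href: str, label: str, tooltip: str) -> str:
    """Build an <a> element from untrusted label and tooltip text."""
    # Element content: encode the markup characters &, < and >.
    safe_label = html.escape(label, quote=False)
    # Attribute values: also encode quotes so the value cannot end early.
    safe_tooltip = html.escape(tooltip, quote=True)
    safe_href = html.escape(href, quote=True)
    # The attributes are always double-quoted in the template.
    return f'<a href="{safe_href}" title="{safe_tooltip}">{safe_label}</a>'


print(render_link("/u/1", "<b>Bob</b>", 'He said "hi"'))
```

The label shows up on the page as the literal text <b>Bob</b>, and the quotes in the tooltip stay inside the title attribute instead of closing it.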
Could you just use the same function for HTML encoding to handle attribute encoding? Technically, yes. In the case of the meta question you linked, it sounds like they were taking HTML that was encoded and decoding it, then using that result as an attribute value, which results in encoded markup being displayed literally, if you follow.
I would recommend looking over OWASP XSS Prevention Rules 1 and 2.
A brief summary...
Rule 1 for HTML
Escape the following characters with HTML entity encoding ...
& --> &amp;
< --> &lt;
> --> &gt;
" --> &quot;
' --> &#x27;
/ --> &#x2F;
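If it helps, Rule 1 can be sketched in Python as a single-pass translation table. The escape_rule1 name is mine; in a real project you would more likely reach for a maintained encoder than roll your own.

```python
# Mapping taken from the Rule 1 list: each character and its entity.
RULE1_TABLE = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "/": "&#x2F;",
})


def escape_rule1(text: str) -> str:
    """Escape untrusted text before inserting it into HTML element content."""
    # str.translate visits each input character once, so the & inside an
    # entity that was just produced is never encoded again.
    return text.translate(RULE1_TABLE)


print(escape_rule1("</script>"))
```

Using translate instead of chained replace calls sidesteps the ordering problem where & has to be handled first.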
Rule 2 for HTML Common Attributes
Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute. The reason this rule is so broad is that developers frequently leave attributes unquoted. Properly quoted attributes can only be escaped with the corresponding quote. Unquoted attributes can be broken out of with many characters, including [space] % * + , - / ; < = > ^ and |.
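Here is one way Rule 2 might look in Python. This is a sketch under my own assumptions: "alphanumeric" is read as ASCII letters and digits only, every other character below 256 gets the &#xHH; form (no named entities), and characters at or above 256 pass through untouched. escape_rule2 is a hypothetical name.

```python
def escape_rule2(value: str) -> str:
    """Aggressively escape untrusted text for use in an HTML attribute."""
    out = []
    for ch in value:
        code = ord(ch)
        # Assumption: only ASCII letters and digits count as alphanumeric.
        is_ascii_alnum = ch.isascii() and ch.isalnum()
        if code < 256 and not is_ascii_alnum:
            # Two-digit uppercase hex, e.g. a space becomes &#x20;
            out.append(f"&#x{code:02X};")
        else:
            out.append(ch)
    return "".join(out)


print(escape_rule2("a b=c"))
```

Because spaces, =, and the other breakout characters are all escaped, the result stays inside the attribute even if a template author forgot the quotes.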