What characters do I need to escape in XML documents?
Perhaps this will help:
List of XML and HTML character entity references:
In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.
That article lists the following five predefined XML entities:
quot "
amp &
apos '
lt <
gt >
According to the specifications of the World Wide Web Consortium (W3C), there are five characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all other cases, these characters must be replaced with either the corresponding entity or the numeric reference, according to the following table:
Original character    XML entity replacement    XML numeric replacement
<                     &lt;                      &#60;
>                     &gt;                      &#62;
"                     &quot;                    &#34;
&                     &amp;                     &#38;
'                     &apos;                    &#39;
Note that these entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends using &#39; instead.
If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
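As a rough illustration of that point, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element name, attribute name and sample strings are made up for the example; the only claim is that the serializer escapes the data on the way out and the parser restores it on the way in, while the concatenated version is rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical element/attribute names and sample data, for illustration only.
title = 'Tom & "Jerry" <live>'
body = "if a < b && b > c then 'ok'"

root = ET.Element("note", {"title": title})
root.text = body

# The serializer escapes whatever needs escaping in text and attribute values.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Round trip: parsing the output gives back the original strings.
parsed = ET.fromstring(serialized)
assert parsed.get("title") == title
assert parsed.text == body

# The same data pasted between tags by hand is not well-formed XML.
naive = "<note>" + "if a < b && b > c" + "</note>"
try:
    ET.fromstring(naive)
    naive_is_well_formed = True
except ET.ParseError:
    naive_is_well_formed = False
print("naive concatenation parses:", naive_is_well_formed)
```

Other languages have equivalent DOM or writer APIs; the general idea is to hand raw strings to the library rather than building markup with string operations.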
XML escape characters
There are only five:
"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
Escaping characters depends on where the special character is used.
The examples can be validated at the W3C Markup Validation Service.
Text
The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:
<?xml version="1.0"?>
<valid>"'></valid>
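If you want to check this without an online validator, a small sketch along these lines should do, assuming Python's standard library. xml.sax.saxutils.escape rewrites &, < and > and leaves both kinds of quotes alone, which is enough for element text. The sample string is invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The document from the example above parses as-is: ", ' and > are literal.
literal_doc = "<valid>\"'></valid>"
assert ET.fromstring(literal_doc).text == "\"'>"

# For arbitrary text, escape() handles &, < and > and keeps the quotes.
raw = "\"'> and also < and &"
escaped = escape(raw)
print(escaped)

# The escaped text can be placed between tags and parses back to the original.
document = "<valid>" + escaped + "</valid>"
assert ET.fromstring(document).text == raw
```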
Attributes
The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:
<?xml version="1.0"?>
<valid attribute=">"/>
The ' character needn't be escaped in attributes if the quotes are ":
<?xml version="1.0"?>
<valid attribute="'"/>
Likewise, the " needn't be escaped in attributes if the quotes are ':
<?xml version="1.0"?>
<valid attribute='"'/>
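The quote-dependent rules above are what xml.sax.saxutils.quoteattr in Python's standard library automates: it picks the delimiter based on which quotes the value contains and only falls back to &quot; when the value contains both. A short sketch, with sample values chosen just for this example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

only_single = quoteattr("'")      # value with a single quote: wrapped in "
only_double = quoteattr('"')      # value with a double quote: wrapped in '
both_quotes = quoteattr("'\"")    # both present: wrapped in ", " becomes &quot;

for quoted in (only_single, only_double, both_quotes):
    print("<valid attribute=" + quoted + "/>")

# Round trip: the attribute value survives parsing unchanged.
doc = "<valid attribute=" + both_quotes + "/>"
assert ET.fromstring(doc).get("attribute") == "'\""
```

Note that quoteattr returns the value with its surrounding quotes included, so it is written directly after the equals sign.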
Comments
None of the five special characters should be escaped in comments:
<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>
CDATA
None of the five special characters should be escaped in CDATA sections:
<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
Processing instructions
None of the five special characters should be escaped in XML processing instructions:
<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>
XML vs. HTML
HTML has its own set of escape codes which cover a lot more characters.
New, simplified answer to an old, commonly asked question...
Simplified XML Escaping (prioritized, 100% complete)
Always (90% important to remember)
- Escape < as &lt; unless < is starting a <tag/> or other markup.
- Escape & as &amp; unless & is starting an &entity;.
Attribute Values (9% important to remember)
- attr=" 'Single quotes' are ok within double quotes."
- attr=' "Double quotes" are ok within single quotes.'
- Escape " as &quot; and ' as &apos; otherwise.
Comments, CDATA, and Processing Instructions (0.9% important to remember)
- <!-- Within comments --> nothing has to be escaped, but no -- strings are allowed.
- <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
- <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
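Since none of these three constructs has an escape mechanism, the forbidden strings have to be checked for (or worked around) before writing. The sketch below is one way to do that in Python; the helper names comment_ok, pi_ok and to_cdata are hypothetical, and splitting ]]> across two adjacent CDATA sections is a common workaround rather than something this answer prescribes.

```python
import xml.etree.ElementTree as ET

def comment_ok(body: str) -> bool:
    # "--" is not allowed inside a comment; a trailing "-" would produce
    # "--->" at the end, which the XML grammar also rejects.
    return "--" not in body and not body.endswith("-")

def pi_ok(body: str) -> bool:
    # "?>" would end the processing instruction early.
    return "?>" not in body

def to_cdata(text: str) -> str:
    # Split every "]]>" so that "]]" closes one section and ">" opens the next.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

raw = "a < b && x[y[0]]> 1"
doc = "<valid>" + to_cdata(raw) + "</valid>"
print(doc)

# The parser joins the adjacent sections back into the original text.
assert ET.fromstring(doc).text == raw
```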
Esoterica (0.1% important to remember)
- Escape control codes in XML 1.1 via Base64 or Numeric Character References.
- Escape ]]> as ]]&gt; unless ]]> is ending a CDATA section.
  (This rule applies to character data in general, even outside a CDATA section.)
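Putting the "Always" rules and the ]]> rule together, a minimal escaper for element text (not attribute values) might look like the sketch below. The function name escape_text_minimal is made up for this example, and in practice a library routine that simply escapes every > is the easier choice.

```python
import xml.etree.ElementTree as ET

def escape_text_minimal(text: str) -> str:
    # Replace & first so the ampersands added by later steps are not re-escaped.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("]]>", "]]&gt;"))

raw = "x[y[0]]> 1 && \"q\" < 'r'"
escaped = escape_text_minimal(raw)
print(escaped)

# Quotes and a lone > stay literal; the result still parses back to the input.
doc = "<valid>" + escaped + "</valid>"
assert ET.fromstring(doc).text == raw
```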