ElementTree in Python 2.6.2 Processing Instructions support?
Try the lxml library: it follows the ElementTree API, plus adds a lot of extras. From the compatibility overview:
ElementTree ignores comments and processing instructions when parsing XML, while etree will read them in and treat them as Comment or ProcessingInstruction elements respectively. This is especially visible where comments are found inside text content, which is then split by the Comment element.
You can disable this behaviour by passing the boolean remove_comments and/or remove_pis keyword arguments to the parser you use. For convenience and to support portable code, you can also use the etree.ETCompatXMLParser instead of the default etree.XMLParser. It tries to provide a default setup that is as close to the ElementTree parser as possible.
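To make the quoted behaviour concrete, here is a minimal sketch comparing the three parser setups. It assumes lxml is installed; the sample document and the pi_targets helper are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document with one processing instruction inside the root.
XML = b"<root><?inner data?><child/></root>"

def pi_targets(root):
    """Return the targets of all PIs found at or below the given element."""
    return [pi.target for pi in root.iter(etree.ProcessingInstruction)]

# Default lxml parser: PIs are kept as nodes in the tree.
default_root = etree.fromstring(XML)

# Explicitly ask the parser to drop PIs.
stripped_root = etree.fromstring(XML, etree.XMLParser(remove_pis=True))

# ElementTree-compatible parser: drops comments and PIs by default.
compat_root = etree.fromstring(XML, etree.ETCompatXMLParser())

print(pi_targets(default_root))
print(pi_targets(stripped_root))
print(pi_targets(compat_root))
```

Only the default parser should report the PI; the other two should come back empty.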
It's not in the stdlib, I know, but in my experience it's the best bet when you need features that the standard ElementTree doesn't provide.
With the lxml API it couldn't be easier, though it is a bit "underdocumented".
If you need a top-level processing instruction, create it like this:
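from lxml import etree
root = etree.Element("anytagname")
# addprevious() on the root element attaches the PI as a top-level sibling
root.addprevious(etree.ProcessingInstruction("anypi", "anypicontent"))
# Serialise the whole tree, not just the root element, so the PI is included
print(etree.tostring(root.getroottree(), pretty_print=True).decode())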
The resulting document will look like this:
<?anypi anypicontent?>
<anytagname />
They certainly should add this to their FAQ because IMO it is another feature that sets this fine API apart.
Yeah, I don't believe it's possible, sorry. ElementTree provides a simpler interface to (non-namespaced) element-centric XML processing than DOM, but the price for that is that it doesn't support the whole XML infoset.
There is no apparent way to represent the content that lives outside the root element (comments, PIs, the doctype and the XML declaration), and these are also discarded at parse time. (Aside: this appears to include any default attributes specified in the DTD internal subset, which makes ElementTree, strictly speaking, a non-compliant XML processor.)
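A quick way to see the parse-time loss for yourself, as a sketch using only the standard library on a current Python 3 (the sample document is made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a PI and a comment sit before the root element.
source = "<?anypi anypicontent?><!-- top-level comment --><root><child/></root>"

root = ET.fromstring(source)

# Round-trip the parsed tree back to text.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The PI and the comment are gone from the round-tripped output; only the root element and its children survive.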
You can probably work around it by subclassing or monkey-patching the Python native ElementTree implementation's write() method to call _write on your extra PIs before _writeing the _root, but it could be a bit fragile.
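A less invasive variant of the same idea, as a rough sketch: rather than patching write() and touching private methods, serialise the extra PIs separately and put them in front of the serialised root. This assumes a current Python 3 (for encoding="unicode"), and serialize_with_pis is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

def serialize_with_pis(root, pis):
    """Serialise `root`, preceded by one top-level PI per (target, text) pair.

    Hypothetical helper: ElementTree itself has no notion of top-level PIs,
    so they are written out by hand ahead of the root element.
    """
    parts = [
        ET.tostring(ET.ProcessingInstruction(target, text), encoding="unicode")
        for target, text in pis
    ]
    parts.append(ET.tostring(root, encoding="unicode"))
    return "\n".join(parts)

root = ET.Element("anytagname")
document = serialize_with_pis(root, [("anypi", "anypicontent")])
print(document)
```

The PIs still live outside the tree, so they have to be carried around alongside it and would be lost again on the next parse.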
If you need support for the full XML infoset, it's probably best to stick with DOM.