Get all text from an XML document?

Using stdlib xml.etree

import xml.etree.ElementTree as ET

tree = ET.parse('sample.xml') 
print(ET.tostring(tree.getroot(), encoding='utf-8', method='text'))

I really like BeautifulSoup, and would rather not use regex on HTML if we can avoid it.

Adapted from: [this StackOverflow Answer], [BeautifulSoup documentation]

from bs4 import BeautifulSoup
soup = BeautifulSoup(txt)    # txt is simply the a string with your XML file
pageText = soup.findAll(text=True)
print ' '.join(pageText)

Though of course, you can (and should) use BeautifulSoup to navigate the page for what you are looking for.

A solution that doesn't require an external library like BeautifulSoup, using the built-in sax parsing framework:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def parse(self, filename):
        self.text = []
        sax.parse(filename, self)
        return ''.join(self.text)

    def characters(self, data):
        self.text.append(data)

result = MyHandler().parse("yourfile.xml")

If you need all whitespace intact in the text, also define the ignorableWhitespace method in the handler class in the same way characters is defined.

Get all text from an XML document?

Tags:

Python

Xml

Lxml

Related

Recent Posts