Screen scraping with Python

There are many options when dealing with static HTML, which the other responses cover. However if you need JavaScript support and want to stay in Python I recommend using webkit to render the webpage (including the JavaScript) and then examine the resulting HTML. For example:

import sys
import signal
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _finished_loading(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print 'Usage: %s url' % sys.argv[0]
    else:
        javascript_html = Render(url).html

Beautiful soup is still probably your best bet.

If you need "JavaScript support" for the purpose of intercepting Ajax requests then you should use some sort of capture too (such as YATT) to monitor what those requests are, and then emulating / parsing them.

If you need "JavaScript support" in order to be able to see what the end result of a page with static JavaScript is, then my first choice would be to try and figure out what the JavaScript is doing on a case-by-case basis (e.g. if the JavaScript is doing something based on some Xml, then just parse the Xml directly instead)

If you really want "JavaScript support" (as in you want to see what the html is after scripts have been run on a page) then I think you will probably need to create an instance of some browser control, and then read the resulting html / dom back from the browser control once its finished loading and parse it normally with beautiful soup. That would be my last resort however.


Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/