Web scraping with Python

Have you tried Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


You can drive a browser of your choice with Selenium RC (Selenium Remote Control).


Use BeautifulSoup as a tree builder for html5lib:

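from html5lib import HTMLParser, treebuilders

# Note: the "beautifulsoup" tree builder only exists in older html5lib
# releases, used together with BeautifulSoup 3.
parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

text = "a<b>b<b>c"  # deliberately malformed: neither <b> is closed
soup = parser.parse(text)
print(soup.prettify())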

Output:

<html>
 <head>
 </head>
 <body>
  a
  <b>
   b
   <b>
    c
   </b>
  </b>
 </body>
</html>
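
On current versions the roles are usually reversed: Beautiful Soup 4 uses html5lib as its parser. A minimal sketch of the same example, assuming the bs4 and html5lib packages are both installed:

```python
from bs4 import BeautifulSoup

text = "a<b>b<b>c"  # same malformed input as above

# "html5lib" tells Beautiful Soup 4 to parse the way a browser would,
# filling in <html>, <head>, <body> and the missing closing tags.
soup = BeautifulSoup(text, "html5lib")
print(soup.prettify())
```

The resulting tree should have the same nesting as the output above, with the second <b> inside the first.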

pyWebKitGTK looks like it might be of some help.

There is also a write-up by someone who had to do the same thing but needed the page content as it looked after the JavaScript had run: "Execute JavaScript from Python using pyWebKitGTK".

pyWebKitGTK is listed on the Cheese Shop (PyPI).

You can also do this with PyQt.