Python Web Crawlers and "getting" html source code

An example with Python 3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url='http://localhost'

# in case you need a session
cd = { 'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
# r.text is the decoded HTML; r.content holds the raw bytes
print(r.text)

Now execute it and you will get the HTML source of localhost!

python3 req.py

~~Use Python 2.7, it has more 3rd party libs at the moment.~~ (Edit: see below).

I recommend using the Python 2 stdlib module urllib2; it lets you fetch web resources comfortably. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the code, have a look at BeautifulSoup.

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have
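
For the goal quoted above (swapping every img for one you already have), here is a minimal sketch. It uses only the standard library's html.parser rather than BeautifulSoup, so it runs without extra installs. The names ImgRewriter and replace_images, and the file name local.png, are made up for illustration. It assumes reasonably well-formed HTML and does not pass through processing instructions.

```python
from html import escape
from html.parser import HTMLParser


class ImgRewriter(HTMLParser):
    """Copy HTML through unchanged, except that every <img> src is replaced."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        if tag != "img":
            # Not an image: emit the tag exactly as it appeared in the source
            self.out.append(self.get_starttag_text())
            return
        parts = ["<img"]
        for name, value in attrs:
            if name == "src":
                value = self.new_src  # hypothetical replacement image
            parts.append(f" {name}" if value is None else f' {name}="{escape(value)}"')
        parts.append(closing + ">")
        self.out.append("".join(parts))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_images(html, new_src):
    """Return html with the src of every <img> tag set to new_src."""
    parser = ImgRewriter(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

You would pass the downloaded page source to replace_images and write the returned string to a file. With BeautifulSoup the same idea is shorter: find every img tag and assign to its src attribute.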

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.

If you are using Python 3.x, you don't need to install any libraries; this is built directly into the Python standard library. The old urllib2 module was split into urllib.request and urllib.error:

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
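
Instead of hard-coding the charset, you can often read it from the response's Content-Type header. The following is a sketch, not the only way to do it: the function name fetch_text and the utf-8 fallback are my own choices, and a page that declares its encoding only in a meta tag would still need separate handling.

```python
from urllib import request


def fetch_text(url, default_charset="utf-8"):
    """Download url and decode it with the charset the server declares."""
    with request.urlopen(url) as response:
        # get_content_charset() returns None when the header names no charset
        charset = response.headers.get_content_charset() or default_charset
        # errors="replace" avoids an exception on bytes that don't fit the charset
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text("https://www.google.com") then returns the page source as a str, just like page_source above.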
