Python Web Crawlers and "getting" html source code
An Example with python3
and the requests
library as mentioned by @leoluk:
pip install requests
Script req.py:
import requests
url='http://localhost'
# in case you need a session
cd = { 'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
r.content
Now,execute it and you will get the html source of localhost!
python3 req.py
Use Python 2.7, is has more 3rd party libs at the moment. (Edit: see below).
I recommend you using the stdlib module urllib2
, it will allow you to comfortably get web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
For parsing the code, have a look at BeautifulSoup
.
BTW: what exactly do you want to do:
Just for background, I need to download a page and replace any img with ones I have
Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests
is a very nice high-level library which is easier to use than urllib2
.
If you are using Python > 3.x
you don't need to install any libraries, this is directly built in the python framework. The old urllib2
package has been renamed to urllib
:
from urllib import request
response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)