How to scrape all the content of each link with scrapy?
To scaffold a basic scrapy project you can use the command:
scrapy startproject craig
Then add the spider and items:
craig/spiders/spider.py
from scrapy import Spider
from scrapy import Request
from scrapy.selector import Selector
from craig.items import CraigslistSampleItem

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2
class CraigSpider(Spider):
    name = "craig"
    start_url = "https://sfbay.craigslist.org/search/npo"

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse_results_page)

    def parse_results_page(self, response):
        sel = Selector(response)

        # Browse paging.
        page_urls = sel.xpath(""".//span[@class='buttons']/a[@class='button next']/@href""").extract()

        for page_url in page_urls + [response.url]:
            page_url = urljoin(self.start_url, page_url)

            # Yield a request for the next page of the list, with callback to this same function: self.parse_results_page().
            yield Request(page_url, callback=self.parse_results_page)

        # Browse items.
        item_urls = sel.xpath(""".//*[@id='sortable-results']//li//a/@href""").extract()

        for item_url in item_urls:
            item_url = urljoin(self.start_url, item_url)

            # Yield a request for each item page, with callback self.parse_item().
            yield Request(item_url, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)

        item = CraigslistSampleItem()

        item['title'] = sel.xpath('//*[@id="titletextonly"]').extract_first()
        item['body'] = sel.xpath('//*[@id="postingbody"]').extract_first()
        item['link'] = response.url

        yield item
craig/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
from scrapy.item import Item, Field
class CraigslistSampleItem(Item):
    title = Field()
    body = Field()
    link = Field()
craig/settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'craig'
SPIDER_MODULES = ['craig.spiders']
NEWSPIDER_MODULE = 'craig.spiders'
ITEM_PIPELINES = {
    'craig.pipelines.CraigPipeline': 300,
}
craig/pipelines.py
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CraigPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        # Connect the spider_opened/spider_closed signals through the crawler,
        # which works on current Scrapy versions (scrapy.xlib.pydispatch was removed).
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
You can run the spider with the following command, from the root of your project:
scrapy runspider craig/spiders/spider.py
It should create a craig_ads.csv file in the root of your project.
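Alternatively, assuming the project was created with startproject as above, you can use Scrapy's built-in feed export to write the CSV directly, without the custom pipeline:
scrapy crawl craig -o craig_ads.csv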
I am trying to answer your question.
First of all, you got blank results because of your incorrect XPath queries. With the XPath ".//*[@id='sortable-results']//ul//li//p" you located the relevant <p> nodes correctly, though I don't like that query expression. However, your following XPath expressions ".//*[@id='titletextonly']" and "a/@href" do not locate the link and the title as you expected. If your intention was to locate the title text and the hyperlink of the title, then I believe you need to learn XPath; please start with the HTML DOM.
I do not intend to teach XPath queries here, as there are lots of resources online, but I would like to mention some features of the Scrapy XPath selector:
- Scrapy XPath Selector is an improved wrapper around standard XPath queries.
A standard XPath query returns an array of the DOM nodes you queried. You can open your browser's developer tools (F12) and use the console command $x(x_exp) to test an expression; you can also test expressions interactively with scrapy shell, as sketched below. I highly suggest testing your XPath expressions this way: it gives you instant results and saves a lot of time. If you have time, get familiar with your browser's web development tools, which will help you quickly understand the structure of a web page and locate the elements you are looking for.
Meanwhile, Scrapy's response.xpath(x_exp) returns an array of Selector objects corresponding to the actual XPath query, which is in fact a SelectorList object. In other words, XPath results are represented by a SelectorList, and both the Selector and SelectorList classes provide some useful functions to operate on the results:
- extract: returns a list of serialized document nodes (as unicode strings)
- extract_first: returns a scalar, the first of the extract results
- re: returns a list, the result of applying a regular expression to the extract results
- re_first: returns a scalar, the first of the re results.
These functions make your programming much more convenient. One example is that you can call the xpath function directly on a SelectorList object. If you have tried lxml before, you will see how useful this is: in lxml, if you want to call xpath on the results of a former xpath query, you have to iterate over the former results. Another example is that when you are sure there is at most one element in the list, you can use extract_first to get a scalar value instead of indexing into the list (e.g. rlist[0]), which would raise an index error when no element matched. Remember that there are always surprises when you parse a web page; be careful and make your code robust.
- Absolute XPath vs. relative XPath
Keep in mind that if you are nesting XPathSelectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the XPathSelector you’re calling it from.
When you call node.xpath(x_expr): if x_expr starts with /, it is an absolute query and XPath searches from the document root; if x_expr starts with ., it is a relative query. This is also noted in the standard, 2.5 Abbreviated Syntax:
. selects the context node
.//para selects the para element descendants of the context node
.. selects the parent of the context node
../@lang selects the lang attribute of the parent of the context node
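A minimal sketch of the difference when nesting selectors (the HTML and names here are made up for illustration):
from scrapy.selector import Selector

html = "<div id='a'><p>first</p></div><div id='b'><p>second</p></div>"
sel = Selector(text=html)

for div in sel.xpath('//div'):
    # absolute: searches the whole document again, so every iteration
    # prints ['first', 'second']
    print(div.xpath('//p/text()').extract())
    # relative: searches only inside the current <div>,
    # printing ['first'] and then ['second']
    print(div.xpath('.//p/text()').extract())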
- How to follow the next page, and when to stop following.
For your application, you probably need to follow the next page. Here the next-page node is easy to locate -- there are "next" buttons. However, you also need to take care of when to stop following. Look carefully at your URL query parameters to work out the URL pattern of your application. Here, to determine when to stop following the next page, you can compare the current item range with the total number of items.
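A common alternative stop condition (a sketch only, not the approach used in the code below; it reuses the same Request and urljoin imports as the spider) is simply to stop when no next-page link is present:
def parse(self, response):
    # ... extract and yield items here ...
    next_page = response.xpath('//a[@title="next page"]/@href').extract_first()
    if next_page:  # None on the last page, so the crawl stops by itself
        yield Request(urljoin(response.url, next_page), callback=self.parse)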
Edit
I was a little confused by what "content of the link" meant. Now I understand that @student wanted to crawl the link to extract the AD content as well. The following is a solution.
- Send a Request and attach its parser.
As you may have noticed, I use Scrapy's Request class to follow the next page. Actually, the power of the Request class goes beyond that -- you can attach the desired parse function to each request by setting its callback parameter.
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
In the third point above, I did not set a callback when sending the next-page requests, as those requests should be handled by the default parse function. Now we come to the specific AD page, which is a different page from the AD list page. We therefore need to define a new page parser function, say parse_ad, and attach this parse_ad function to each AD page request we send.
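As a hedged sketch of attaching a callback (and of the "additional data" the quoted documentation mentions, passed here through the request's meta dict -- the 'list_page' key is purely illustrative):
def parse(self, response):
    for link in response.xpath("//*[@id='sortable-results']//li/p/a/@href").extract():
        url = urljoin(response.url, link)
        # parse_ad will receive the response for `url`, plus the data stashed in meta
        yield Request(url, callback=self.parse_ad, meta={'list_page': response.url})

def parse_ad(self, response):
    list_page = response.meta['list_page']   # data passed along from the list-page request
    # ... extract the ad fields here ...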
Let's go to the revised sample code that works for me:
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()

class AdItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
The spider
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapydemo.items import ScrapydemoItem
from scrapydemo.items import AdItem
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2
class MySpider(Spider):
    name = "demo"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        # locate list of each item
        s_links = response.xpath("//*[@id='sortable-results']/ul/li")
        # locate next page and extract it
        next_page = response.xpath(
            '//a[@title="next page"]/@href').extract_first()
        next_page = urljoin(response.url, next_page)
        to = response.xpath(
            '//span[@class="rangeTo"]/text()').extract_first()
        total = response.xpath(
            '//span[@class="totalcount"]/text()').extract_first()
        # test end of following
        if to is not None and total is not None and int(to) < int(total):
            # important, send request of next page
            # default parsing function is 'parse'
            yield Request(next_page)

        for s_link in s_links:
            # locate and extract
            title = s_link.xpath("./p/a/text()").extract_first()
            link = s_link.xpath("./p/a/@href").extract_first()
            if title is None or link is None:
                print('Warning: no title or link found: %s' % response.url)
            else:
                title = title.strip()
                link = urljoin(response.url, link)
                yield ScrapydemoItem(title=title, link=link)
                # important, send request of ad page
                # parsing function is 'parse_ad'
                yield Request(link, callback=self.parse_ad)

    def parse_ad(self, response):
        ad_title = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        ad_description = ''.join(response.xpath(
            '//section[@id="postingbody"]//text()').extract())
        if ad_title is not None and ad_description:
            yield AdItem(title=ad_title.strip(), description=ad_description)
        else:
            print('Warning: no title or description found: %s' % response.url)
Key Note
- Two parse functions: parse for requests of the AD list pages and parse_ad for requests of the specific AD pages.
- To extract the content of the AD post, you need some tricks; see How can I get all the plain text from a website with Scrapy. An example is sketched below.
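For instance (a rough sketch only; the helper name and the whitespace handling are my own, and Craigslist markup may change), you can join the text nodes of the posting body and normalize the whitespace:
import re

def clean_posting_body(response):
    # join every text node under the posting body, then collapse runs of whitespace
    parts = response.xpath('//section[@id="postingbody"]//text()').extract()
    text = ' '.join(parts)
    return re.sub(r'\s+', ' ', text).strip()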
A snapshot of output:
2016-11-10 21:25:14 [scrapy] DEBUG: Scraped from <200 http://sfbay.craigslist.org/eby/npo/5869108363.html>
{'description': '\n'
' \n'
' QR Code Link to This Post\n'
' \n'
' \n'
'Agency History:\n' ........
'title': 'Staff Accountant'}
2016-11-10 21:25:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 39259,
'downloader/request_count': 117,
'downloader/request_method_count/GET': 117,
'downloader/response_bytes': 711320,
'downloader/response_count': 117,
'downloader/response_status_count/200': 117,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2016, 11, 11, 2, 25, 14, 878628),
'item_scraped_count': 314,
'log_count/DEBUG': 432,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 117,
'scheduler/dequeued': 116,
'scheduler/dequeued/memory': 116,
'scheduler/enqueued': 203,
'scheduler/enqueued/memory': 203,
'start_time': datetime.datetime(2016, 11, 11, 2, 24, 59, 242456)}
2016-11-10 21:25:14 [scrapy] INFO: Spider closed (shutdown)
Thanks. I hope this is helpful, and have fun.