How to work with the scrapy contracts?
Yes, the Spiders Contracts documentation is far from clear and detailed.
I'm not an expert in writing spider contracts (I actually wrote them only once, while working on the web-scraping tutorial at newcoder.io). But whenever I needed to write tests for Scrapy spiders, I preferred to follow the approach suggested here: create a fake response from a local HTML file. It is arguable whether this is still a unit-testing procedure, but it gives you far more flexibility and robustness.
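For reference, here is a minimal sketch of that fake-response approach (the helper name, default URL, and file handling are my own illustrative choices, not from any particular library):

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_path, url="http://quotes.toscrape.com/"):
    """Build an HtmlResponse from a saved HTML file so a callback can be tested offline."""
    with open(file_path, "rb") as f:
        body = f.read()
    # Attaching a Request lets callbacks that read response.request still work.
    return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")

You can then pass that response straight into a callback inside an ordinary unittest test case and assert on whatever it yields.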
Note that you can still write contracts, but you will quickly feel the need to extend them and write custom contracts, which is pretty much OK.
Relevant links:
- Scrapy Unit Testing
- Scrapy Contracts Evolution
- Scrapy Contracts
Testing spiders
The two most basic questions in testing the spider might be:
- will/did my code change break the spider?
- will/did the spider break because the page I'm scraping changed?
Contracts
Scrapy offers a means for testing spiders: contracts.
Contracts can look a bit magical. They live in multi-line docstrings. The contract "syntax" is `@contract_name <arg>`. You can create your own contracts, which is pretty neat.
To use a contract, you prepend an `@` to the name of a contract. The name of a contract is specified by the `.name` attribute on the given contract subclass. These contract subclasses are either built-in or custom ones that you create.
Finally, the above-mentioned docstring must live in the callbacks of your spiders. Here's an example of some basic contracts living in the `parse` callback, the default callback:
def parse(self, response):
    """This function gathers the author and the quote text.

    @url http://quotes.toscrape.com/
    @returns items 1 8
    @returns requests 0 0
    @scrapes author quote_text
    """
You can run this contract via `scrapy check`; alternatively, list your contracts with `scrapy check -l`.
Contracts in more depth
The above contract is tested using three built-in contracts:
- scrapy.contracts.default.UrlContract
- scrapy.contracts.default.ReturnsContract
- scrapy.contracts.default.ScrapesContract
The `UrlContract` is mandatory and isn't really a contract, as it is not used for validation. The `@url` contract is used to set the URL that the spider will crawl when testing the spider via `scrapy check`. In this case, we're specifying `http://quotes.toscrape.com/`. But we could've specified `http://127.0.0.1:8080/home-11-05-2019-1720.html`, which is the local version of `quotes.toscrape.com` that I saved with the `scrapy view http://quotes.toscrape.com/` command.
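If you want to reproduce that local setup, here is a minimal sketch (the port and directory layout are illustrative assumptions) of serving the saved page with Python's standard library so that `@url` can point at `http://127.0.0.1:8080/home-11-05-2019-1720.html`:

import http.server
import socketserver

PORT = 8080  # must match the port used in the @url contract
handler = http.server.SimpleHTTPRequestHandler  # serves files from the current working directory

# home-11-05-2019-1720.html must sit in the directory this script is run from
with socketserver.TCPServer(("", PORT), handler) as httpd:
    httpd.serve_forever()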
The `ReturnsContract` is used to check the output of the callback you're testing. As you can see, the contract is called twice, with different args. You can't just put any ol' arg in there, though. Under the hood, there is a dictionary of expected args:
objects = {
    'request': Request,
    'requests': Request,
    'item': (BaseItem, dict),
    'items': (BaseItem, dict),
}
Our contract specifies that our spider `@returns items 1 8`. The 1 and the 8 are lower and upper bounds. The upper bound is optional; under the hood, it is set to infinity if not specified:
try:
    self.max_bound = int(self.args[2])
except IndexError:
    self.max_bound = float('inf')
But yeah, `@returns` helps you know whether your spider returns the expected number of items or requests.
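As an illustration, here are the `@returns` forms side by side in a callback docstring (the URL and numbers are placeholders, not values from the spider above):

def parse(self, response):
    """Illustrative @returns forms.

    @url http://quotes.toscrape.com/
    @returns items 10
    @returns items 1 10
    @returns requests 0 0
    """

`@returns items 10` means at least 10 items, `@returns items 1 10` means between 1 and 10 items, and `@returns requests 0 0` means exactly zero follow-up requests.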
Finally, the `@scrapes` contract is the last built-in. It is used to check the presence of fields in scraped items. It just goes through the items output by your callback and constructs a list of missing fields:
class ScrapesContract(Contract):
    """ Contract to check presence of fields in scraped items
    @scrapes page_name page_body
    """

    name = 'scrapes'

    def post_process(self, output):
        for x in output:
            if isinstance(x, (BaseItem, dict)):
                missing = [arg for arg in self.args if arg not in x]
                if missing:
                    raise ContractFail(
                        "Missing fields: %s" % ", ".join(missing))
Running contracts
Run: `scrapy check`
If all goes well, you see:
...
----------------------------------------------------------------------
Ran 3 contracts in 0.140s
OK
If something explodes, you see:
F..
======================================================================
FAIL: [example] parse (@returns post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/adnauseum/.virtualenvs/scrapy_testing-CfFR3tdG/lib/python3.7/site-packages/scrapy/contracts/__init__.py", line 151, in wrapper
self.post_process(output)
File "/Users/adnauseum/.virtualenvs/scrapy_testing-CfFR3tdG/lib/python3.7/site-packages/scrapy/contracts/default.py", line 90, in post_process
(occurrences, self.obj_name, expected))
scrapy.exceptions.ContractFail: Returned 10 items, expected 0
----------------------------------------------------------------------
Custom contracts
Let's say you want a @has_header X-CustomHeader
contract. This will ensure that your spiders check for the presence of X-CustomHeader
. Scrapy contracts are just classes that have three overridable methods: adjust_request_args
, pre_process
, and post_process
. From there, you'll need to raise ContractFail
from pre_process
or post_process
whenever expectations are not met.
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class HasHeaderContract(Contract):
    """Demo contract which checks the presence of a custom header
    @has_header X-CustomHeader
    """

    name = 'has_header'  # add the command name to the registry

    def pre_process(self, response):
        for header in self.args:
            if header not in response.headers:
                raise ContractFail(f"{header} not present")
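One thing worth adding: for `scrapy check` to pick up a custom contract, it has to be enabled in your project settings via the SPIDER_CONTRACTS setting (the dotted path below is just an example; point it at wherever your class actually lives):

# settings.py
SPIDER_CONTRACTS = {
    'myproject.contracts.HasHeaderContract': 10,
}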
Why are contracts useful?
It looks like contracts can be useful for helping you know two things:
your code changes didn't break things
- It seems like a good idea to run the spider against local copies of the page you're scraping and use contracts to validate that your code changes didn't break anything. In this case, you're controlling the page being scraped and you know it is unchanged. Thus, if your contracts fail, you know it was your code change (see the sketch after this list).
- In this approach, it might be useful to name these HTML fixtures with some kind of timestamp, for record keeping, e.g. `Site-Page-07-14-2019.html`. You can save these pages by running `scrapy view <url>`. Scrapy will open the page in your browser, but will also save an HTML file with everything you need.
the page you're scraping didn't change (in ways that affect you)
- Then you could also run your spider against the real thing and let the contracts tell you whether what you're scraping has changed.
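To make the first point concrete, here is a minimal sketch of such a test (the spider, import paths, fixture name, and expected count are all illustrative; `fake_response_from_file` is the helper sketched near the top of this answer):

import unittest

from myproject.spiders.quotes import QuotesSpider   # illustrative import path
from tests.helpers import fake_response_from_file   # the helper sketched earlier


class QuotesSpiderTest(unittest.TestCase):
    def test_parse_against_saved_fixture(self):
        spider = QuotesSpider()
        # "Site-Page-07-14-2019.html" is the timestamped local copy mentioned above
        response = fake_response_from_file("fixtures/Site-Page-07-14-2019.html")
        items = list(spider.parse(response))
        self.assertEqual(len(items), 10)   # illustrative expected count
        self.assertIn("author", items[0])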
Though contracts are useful, you'll likely have to do more to make sure your spider works correctly. For instance, the number of items you're scraping isn't guaranteed to stay constant over time. In that case, you might consider crawling a mock server and running tests against the items collected. There's a dearth of documentation and best practices, it seems.
Finally, there is a project made by Scrapinghub, Spidermon, which is useful for monitoring your spider while it's running: https://spidermon.readthedocs.io/en/latest/getting-started.html
You can validate scraped items according to model definitions and get stats on your spider (current number of items scraped, number of items that don't pass validation, etc.).