Getting "Forbidden by robots.txt": Scrapy

The first thing to check is the user agent you send with your requests: if you don't change it, Scrapy's default user agent will almost certainly be blocked.
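As a minimal sketch, you can set a project-wide user agent in settings.py. The UA string below is only an illustration; substitute a current browser string:

```python
# settings.py (Scrapy project settings)
# Browser-style user agent string -- example value, replace with a
# current one copied from your own browser.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```

You can also override the header per request by passing `headers={"User-Agent": ...}` to `scrapy.Request`.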


Netflix's Terms of Use state:

You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;

They have their robots.txt set up to block web scrapers. If you override this by setting ROBOTSTXT_OBEY = False in settings.py, you are violating their terms of use, which can result in a lawsuit.


Since Scrapy 1.1, released 2016-05-11, the crawler downloads robots.txt before crawling and obeys it by default. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

ROBOTSTXT_OBEY = False
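If you only want to disable the check for a single spider rather than the whole project, Scrapy lets a spider override project settings via its `custom_settings` class attribute. A hedged sketch (the spider name and URL are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder URL

    # custom_settings overrides the project-level settings.py
    # for this spider only.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }
```

This keeps the project default (obeying robots.txt) intact for every other spider.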

See the release notes for details.