Scrapy read list of URLs from file to scrape?

You were pretty close.

f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()

Better still, use a context manager, which guarantees the file is closed even if an exception occurs:

with open("urls.txt") as f:
    start_urls = [url.strip() for url in f]
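If the file might contain blank lines or comments, you can filter those out while stripping. A small sketch (the helper name `load_urls` and the `#`-comment convention are my own, not part of Scrapy):

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines and '#' comments."""
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        ]
```

Then `start_urls = load_urls("urls.txt")` in the spider class body works the same as before.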

Since Scrapy expects plain URLs in `start_urls`, you have to call `strip()` on each line; otherwise every URL keeps a trailing `'\n'`.

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    # relies on CPython's refcounting to close the file handle
    start_urls = [l.strip() for l in open('urls.txt').readlines()]

Example in Python 2.7

>>> open('urls.txt').readlines()
['http://site.org\n', 'http://example.org\n', 'http://example.com/page\n']
>>> [l.strip() for l in open('urls.txt').readlines()]
['http://site.org', 'http://example.org', 'http://example.com/page']
