Database storage: Why is Pipeline better than Feed Export?
This is a very late answer, but I just spent a whole afternoon and an evening trying to understand the difference between item pipelines and feed exports, which is poorly documented. I think it may be helpful to someone who is still confused.
TL;DR: FeedExport is designed for exporting items as files. It is not suitable for database storage.
Feed export is implemented as a Scrapy extension in scrapy.extensions.feedexport. Like other Scrapy extensions, it works by registering callback functions (open_spider, close_spider and item_scraped) for the corresponding Scrapy signals (spider_opened, spider_closed and item_scraped) so that it can take the necessary steps to store items.
On open_spider, FeedExporter (the actual extension class) initializes the feed storages and item exporters. Concretely, it gets a file-like object, usually a temporary file, from a FeedStorage and passes it to an ItemExporter. On item_scraped, FeedExporter simply calls export_item on the pre-initialized ItemExporter object. On close_spider, FeedExporter calls the store method on the same FeedStorage object to write the file to the filesystem, upload it to a remote FTP server, upload it to S3 storage, etc.
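To make that lifecycle concrete, here is a toy model of it using only the standard library. It is not Scrapy's actual code: every class name starting with Toy, the string signal constants, and the simplified open()/store() signatures are my own stand-ins for illustration. It only shows the shape of the flow: callbacks wired to signals, items written to a temporary file, and nothing persisted until the spider closes.

```python
import json
import tempfile

# Toy stand-ins for Scrapy's signal objects; only the names mirror the real ones.
SPIDER_OPENED = "spider_opened"
ITEM_SCRAPED = "item_scraped"
SPIDER_CLOSED = "spider_closed"


class ToySignals:
    """Minimal dispatcher: maps a signal to the callbacks connected to it."""

    def __init__(self):
        self.receivers = {}

    def connect(self, receiver, signal):
        self.receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self.receivers.get(signal, []):
            receiver(**kwargs)


class ToyJsonLinesExporter:
    """Plays the role of an ItemExporter: serializes items into a file object."""

    def __init__(self, file):
        self.file = file

    def export_item(self, item):
        self.file.write(json.dumps(item) + "\n")


class ToyMemoryStorage:
    """Plays the role of a FeedStorage: hands out a temp file, persists it at the end."""

    def __init__(self):
        self.stored = None  # stands in for the final file, FTP upload, S3 object, ...

    def open(self):
        return tempfile.TemporaryFile(mode="w+")

    def store(self, file):
        file.seek(0)
        self.stored = file.read()
        file.close()


class ToyFeedExporter:
    """Plays the role of FeedExporter: wires three callbacks to three signals."""

    def __init__(self, signals, storage):
        self.storage = storage
        signals.connect(self.open_spider, SPIDER_OPENED)
        signals.connect(self.item_scraped, ITEM_SCRAPED)
        signals.connect(self.close_spider, SPIDER_CLOSED)

    def open_spider(self):
        self.file = self.storage.open()
        self.exporter = ToyJsonLinesExporter(self.file)

    def item_scraped(self, item):
        self.exporter.export_item(item)

    def close_spider(self):
        self.storage.store(self.file)


def run_toy_crawl(items):
    """Simulate a crawl; return what the storage holds before and after closing."""
    signals, storage = ToySignals(), ToyMemoryStorage()
    ToyFeedExporter(signals, storage)
    signals.send(SPIDER_OPENED)
    for item in items:
        # Items only accumulate in the temp file here; nothing is persisted yet.
        signals.send(ITEM_SCRAPED, item=item)
    before_close = storage.stored
    signals.send(SPIDER_CLOSED)
    return before_close, storage.stored
```

Calling run_toy_crawl([{"n": 1}, {"n": 2}]) returns None for the "before close" value and the two JSON lines for the "after close" value, which is the point: in this model the destination sees nothing until the very end.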
There is a collection of built-in item exporters and storages. But as you may notice from the above, FeedExporter is by design tightly coupled with file storage. With a database, the usual way to store items is to insert each one as soon as it is scraped (possibly with some buffering).
Therefore, the proper way to use database storage seems to be writing your own FeedExporter, which you can do by registering callbacks for Scrapy signals. But that is not necessary: using an item pipeline is more straightforward and does not require awareness of such implementation details.
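For comparison, a minimal sketch of the pipeline approach, assuming SQLite and items that carry a url field. The class name SQLitePipeline, the table layout, and the db_path parameter are all made up for illustration. It uses the classic open_spider / process_item / close_spider pipeline methods with a spider argument; check the signatures against the Scrapy version you run.

```python
import json
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline: inserts each item as soon as it is scraped."""

    def __init__(self, db_path="items.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, data TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE skips rows whose url already exists (duplicate check).
        self.conn.execute(
            "INSERT OR IGNORE INTO items (url, data) VALUES (?, ?)",
            (item["url"], json.dumps(dict(item))),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        self.conn.close()
```

To enable it you would add something like ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300} to settings.py, where the module path is a placeholder for wherever the class lives in your project. Committing after every item keeps the sketch simple; batching commits would be the "buffer" variant.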
As far as I understand:
Pipelines are a universal solution: you make the DB connection, you know the DB structure, you check for duplicates, so you have control over the whole process of storing the scraped items.
The exporters are predefined ways of storing scraped data. Quote:
If you are in a hurry, and just want to use an Item Exporter to output scraped data see the Feed exports.
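As an illustration of the "in a hurry" case, in recent Scrapy versions a feed export can be switched on from settings.py alone, with no custom code. The file name items.jl below is just an example; older versions used different settings, so treat this as a sketch and check the docs for your version.

```python
# settings.py (sketch): write every scraped item to a JSON Lines file.
FEEDS = {
    "items.jl": {"format": "jsonlines"},
}
```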