Scrapy-like tool for Node.js?
I haven't seen a solution for crawling and indexing whole websites in Node.js that is as strong as Scrapy in Python, so personally I still use Scrapy for crawling websites.
But for scraping data from individual pages there is CasperJS on the JavaScript side. It is a very nice solution, and it also works on AJAX-heavy sites, e.g. AngularJS pages, which Scrapy cannot render on its own because it does not execute JavaScript. So for scraping data from one or a few pages I prefer to use CasperJS.
Cheerio is much faster than CasperJS, but it does not work with AJAX pages and its code does not end up as well structured as a CasperJS script. So I prefer CasperJS even though the cheerio package is available (a small Cheerio sketch follows the CasperJS example below for comparison).
CoffeeScript example:
# `params`, `queryUrl`, and `queryData` are assumed to be defined elsewhere.
casper.start 'https://reports.something.com/login', ->
  # fill and submit the login form
  this.fill 'form',
    username: params.username
    password: params.password
  , true

# POST the query, then click the submit input on the results page
casper.thenOpen queryUrl, {method: 'POST', data: queryData}, ->
  this.click 'input'

casper.then ->
  # helper: read column `number` from the highlighted results row
  get = (number) =>
    value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()

casper.run()
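For comparison, a minimal Cheerio sketch for a static (non-AJAX) page might look like this (my own illustration, assuming axios for the HTTP request; the URL is just a placeholder):

const axios = require('axios');
const cheerio = require('cheerio');

// Fetch static HTML and read the <title>. No JavaScript is executed,
// which is why this approach breaks down on AJAX-rendered pages.
axios.get('https://example.com/')
  .then(response => {
    const $ = cheerio.load(response.data);
    console.log($('title').text());
  })
  .catch(console.error);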
Scrapy's main selling point is that it adds asynchronous IO on top of Python. The reason we don't have something quite like it for Node is that all IO in Node is already asynchronous (unless you go out of your way to make it not be).
Here's what a Scrapy-style script might look like in Node; notice that the URLs are processed concurrently.
const cheerio = require('cheerio');
const axios = require('axios');
const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']
// this might be called a "middleware" in scrapy.
const get = async url => {
const response = await axios.get(url)
return cheerio.load(response.data)
}
// this too.
const output = item => {
console.log(item)
}
// here is parse which is the initial scrapy callback
const parse = async url => {
const $ = await get(url)
output({url, title: $('title').text()})
}
// and here is the main execution; mapping over the URLs fires all requests concurrently
Promise.all(startUrls.map(url => parse(url))).catch(console.error)
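To get a bit closer to Scrapy's callback chaining, a rough sketch (my own illustration, reusing `get`, `output`, and `parse` from above; `parseAndFollow` is a hypothetical name) could also follow links found on each page:

// crawl a page, emit its title, then follow a few of its absolute links,
// roughly like yielding new Requests from a Scrapy callback
const parseAndFollow = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
  const links = $('a[href^="http"]').map((i, el) => $(el).attr('href')).get()
  await Promise.all(links.slice(0, 5).map(link => parse(link)))
}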
Exactly the same thing? No. But both powerful and simple? Yes: crawler. A quick example:
var Crawler = require("crawler");
var c = new Crawler({
maxConnections : 10,
// This will be called for each crawled page
callback : function (error, res, done) {
if(error){
console.log(error);
}else{
var $ = res.$;
// $ is Cheerio by default
//a lean implementation of core jQuery designed specifically for the server
console.log($("title").text());
}
done();
}
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
uri: 'http://parishackers.org/',
jQuery: false,
// The global callback won't be called
callback: function (error, res, done) {
if(error){
console.log(error);
}else{
console.log('Grabbed', res.body.length, 'bytes');
}
done();
}
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
html: '<p>This is a <strong>test</strong></p>'
}]);
Some crawling functionality can also be achieved with Google's Puppeteer (a minimal sketch follows the list below). According to the documentation:
Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
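A minimal sketch (not an official example) of fetching a page and reading its title with Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  // launch headless Chromium, open a page, and wait for the network to settle
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.wikipedia.org/', { waitUntil: 'networkidle2' });
  // read the document title, then shut the browser down
  console.log(await page.title());
  await browser.close();
})();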