Crawling. For quite some time it was relatively easy: the web was curl-able, and you could basically do it in a bash script. But with the advent of single page apps, more and more ‘pages’ only exist because a script built them. Hence you need something more advanced, something that can actually execute JavaScript. CasperJS, a scripting layer on top of the headless PhantomJS browser, helps you here. They even give a hint on how to crawl Google. Let the example speak for itself:
var links = [];
var casper = require('casper').create();

function getLinks() {
    // runs in the page context: collect the href of every result link
    var links = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(links, function(e) {
        return e.getAttribute('href');
    });
}

casper.start('http://google.fr/', function() {
    // wait for the page to be loaded
    this.waitForSelector('form[action="/search"]');
});

casper.then(function() {
    // search for 'casperjs' from the google form
    this.fill('form[action="/search"]', { q: 'casperjs' }, true);
});

casper.then(function() {
    // aggregate results for the 'casperjs' search
    links = this.evaluate(getLinks);
});

casper.run(function() {
    // nothing runs until run() is called: print the collected links and exit
    this.echo(links.length + ' links found:');
    this.echo(' - ' + links.join('\n - ')).exit();
});
Once the script has run, the links array holds all the search result URLs from the first results page, and the run() callback echoes them to the console.
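To try it out, save the script under any name you like (crawl.js, say, a name picked here just for illustration) and launch it with casperjs crawl.js; the casperjs executable ships with CasperJS and takes care of driving PhantomJS for you.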
CasperJS includes a basic testing framework as well, so you can assert your truths (a small sketch follows below). Happy crawling / black-box testing :)
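A minimal sketch of such a test, reusing the Google search form from the crawl above; the expected page title and the exact selector are assumptions here, and in test mode (casperjs test yourtest.js) CasperJS provides the casper and test objects for you, so there is no create() call:

// run with: casperjs test yourtest.js
casper.test.begin('Google search form is present', 2, function(test) {
    casper.start('http://google.fr/', function() {
        // assumption: the homepage title is plain 'Google'
        test.assertTitle('Google', 'homepage has the expected title');
        test.assertExists('form[action="/search"]', 'search form is found');
    });
    casper.run(function() {
        test.done();
    });
});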