Recently, I was working with data from a website that didn't have an API. I've been in this situation before, and it really just entails building a web scraper to fetch the data. Because most of my development recently has been with PHP and Laravel, I usually use Goutte (which is really just a nicely wrapped version of Guzzle and Symfony's DomCrawler and BrowserKit components).
However, for my most recent project, the site I wanted to fetch data from used a form of AJAX pagination. Each page link called a JavaScript function that modified the DOM rather than requesting a new page. This meant I couldn't use Goutte: it doesn't execute JavaScript, so it can't trigger AJAX requests, and it isn't really a browser that can respond to changes in the DOM.
I ended up using CasperJS, which relies on [PhantomJS](http://phantomjs.org/). PhantomJS is basically a headless browser with a JavaScript API. Using CasperJS, I was able to interact with the AJAX-based pagination and gather the information I required.
I ran into a few problems working with CasperJS, mostly from forgetting that JavaScript is asynchronous. Unusually, I had a hard time finding solutions to these problems online, so I thought I'd put them here in case anyone stumbles across them.
Synchronous Actions Following HTTP Requests
CasperJS is asynchronous, which can make it a little tricky to run certain functions and actions after receiving the data.
Here's what I found works for me:
    casper.waitFor(function check() {
        return casper.open('http://example.app/');
    }, function then() {
        // executed after receiving response from page
        this.echo(this.page.title);
    });
The casper waitFor() method accepts two functions as its first two parameters: a check() function and a then() function. In short, it waits to run any code in then() until it has received a truthy value from check().
Note: You must return a truthy value from the check() function for then() to be called. As long as check() returns a falsy value, then() will not be executed; CasperJS keeps calling check() until its wait timeout (5 seconds by default) expires.
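To make the polling behaviour concrete, here is a minimal sketch in plain JavaScript of how a waitFor-style helper could work. It is a simplified model for illustration, not CasperJS's actual implementation: the name waitForSketch, the interval parameter, and the fake readiness check are all assumptions. The onTimeout and timeout arguments mirror the optional third and fourth arguments that casper.waitFor() also accepts.

```javascript
// Simplified model of a waitFor-style helper (hypothetical, not CasperJS source).
// It polls check() until it returns a truthy value, then calls then().
// If the timeout passes first, it calls onTimeout() instead.
function waitForSketch(check, then, onTimeout, timeout, interval) {
  var started = Date.now();
  timeout = timeout || 5000;  // assumed default, in milliseconds
  interval = interval || 100; // assumed polling interval, in milliseconds

  function poll() {
    if (check()) {
      then();
    } else if (Date.now() - started >= timeout) {
      if (onTimeout) { onTimeout(); }
    } else {
      setTimeout(poll, interval);
    }
  }
  poll();
}

// Usage: the pretend "page" only becomes ready on the third poll.
var polls = 0;
var events = [];
waitForSketch(
  function check() { polls += 1; return polls >= 3; },
  function then() { events.push('ready after ' + polls + ' polls'); },
  function onTimeout() { events.push('timed out'); },
  1000,
  10
);
```

The point of the sketch is that check() may be called many times, so it should be cheap and safe to repeat, while then() runs at most once.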
Looping Over Multiple Requests
Again, the asynchronous nature of CasperJS requires special care when you are looping over multiple requests. This is especially so when there are certain actions you wish to perform on the response of each request. In my case, I wanted to gather data from each paginated page request.
The following is what ended up working for me:
    casper.then(function() {
        this.each(this.pageNumbersArray, function(self, pageNumber) {
            this.waitFor(function check() {
                return this.goToPage(pageNumber);
            }, function then() {
                this.echo("Scraping page " + pageNumber);
                this.scrapePage();
            });
        });
    });
In my case, each "page" was really a JavaScript function that ran when the page number was clicked.
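One way to picture why this works: the loop itself does not visit any pages, it only queues up steps that run later, in order. The sketch below is a deliberately simplified model of a step queue in plain JavaScript, not CasperJS internals; createQueue, the log array, and the page numbers are hypothetical names made up for illustration.

```javascript
// Simplified step queue (hypothetical model, not CasperJS's implementation).
// then() only stores a function; run() executes the stored functions in order.
function createQueue() {
  var steps = [];
  return {
    then: function (fn) { steps.push(fn); return this; },
    run: function () {
      while (steps.length > 0) {
        steps.shift()();
      }
    }
  };
}

var log = [];
var queue = createQueue();
var pageNumbers = [1, 2, 3]; // hypothetical page numbers

queue.then(function () {
  pageNumbers.forEach(function (pageNumber) {
    // Each callback closes over its own pageNumber value.
    queue.then(function () { log.push('go to page ' + pageNumber); });
    queue.then(function () { log.push('scrape page ' + pageNumber); });
  });
  log.push('all steps queued');
});

queue.run();
console.log(log.join('\n'));
```

In this model, "all steps queued" is logged before any page is visited, and each page is scraped right after it is loaded because the two steps were queued back to back.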
Calling Browser-Side JavaScript Functions
This is pretty much equivalent to plugging functions into your browser's console. I used it to call the AJAX function that updated the DOM with the next page's content. It's pretty simple, and is actually well explained in the documentation, but I thought it was worth including.
    casper.evaluate(function() {
        // webpage's javascript function can be called here
    });
If, like me, you need to pass a variable into the function, you also need to include the variable as the second argument of evaluate(), after the function itself:
    casper.evaluate(function(page) {
        // webpage's javascript function can be called here
    }, page);
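The reason the variable has to be passed explicitly is that the function runs inside the page, cut off from the variables in your script, and arguments have to be serializable (roughly, anything that survives JSON). Here is a rough sketch of that isolation in plain JavaScript. It is only a model: evaluateSketch, demo, and nextPageNumber are hypothetical names, and the real evaluate() runs the function in the browser page rather than rebuilding it locally.

```javascript
// Hypothetical stand-in for evaluate(): it rebuilds the function from its
// source text and copies the arguments through JSON, so nothing from the
// calling scope is carried along. This only models the isolation.
function evaluateSketch(fn) {
  var args = Array.prototype.slice.call(arguments, 1);
  var copiedArgs = JSON.parse(JSON.stringify(args));
  var rebuilt = new Function('return (' + fn.toString() + ');')();
  return rebuilt.apply(null, copiedArgs);
}

function demo() {
  var nextPageNumber = 3; // local variable, invisible to the rebuilt function

  // Passing the value as an extra argument works.
  var passed = evaluateSketch(function (page) {
    return 'loading page ' + page;
  }, nextPageNumber);

  // Relying on the closure fails, because the rebuilt function has no
  // access to nextPageNumber.
  var closureError = null;
  try {
    evaluateSketch(function () {
      return 'loading page ' + nextPageNumber;
    });
  } catch (err) {
    closureError = err.name;
  }

  return { passed: passed, closureError: closureError };
}

var result = demo();
console.log(result.passed);       // loading page 3
console.log(result.closureError); // ReferenceError
```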
casper.evaluate() is pretty powerful: it can be used to run pretty much any JavaScript code you would normally be able to type into the console of your regular web browser.