Automating downloads from Infoshare with R

Infoshare is Stats NZ’s online system for publishing all its time-series datasets. Its user interface is rudimentary but fairly easy to use once you get used to it. Infoshare is mainly designed for manual data retrieval, but there is an “Export Direct” function where you can re-run a previously saved query, which is handy if you want to get updated data for time series that you’ve previously downloaded.

Using Export Direct involves uploading a “search file” saved from a previous query and selecting which time periods of data to download. There are a few other options but the defaults for these are fine and don’t usually need to be changed. The “search file” is just a plain text file with a list of Stats NZ time series IDs, one per line. So if you know the series IDs that you want, it’s easy to construct the relevant “search file” to download them from Export Direct.
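For example, a minimal search file can be written straight from R with writeLines(). This is just a sketch of the file format itself, nothing package-specific; the filename is illustrative and the series ID is the unemployment series used later in this post:

```r
# An Export Direct "search file" is plain text with one
# Stats NZ time series ID per line.
series_ids <- c("HLFQ.S1F3S")
sch_file <- file.path(tempdir(), "my_query.sch")  # illustrative name
writeLines(series_ids, sch_file)
```

You can check the result with readLines(sch_file), which returns the same vector of series IDs.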

I wanted to automate the few manual steps involved in downloading data from Export Direct. My first attempt used the excellent web scraping package rvest, which has form-filling functions that should be able to handle the Export Direct form. Unfortunately I couldn’t get this to work, possibly because Infoshare is built with ASP.NET, which relies heavily on JavaScript in the browser to handle the user interface and form submission. My attempts to send an appropriate POST request to Infoshare using basic httr functions also failed, I suspect for similar reasons.

After those failures, I turned to the RSelenium package, which provides an R interface to Selenium. With Selenium you can script simulated user interactions in a web browser. From the website’s point of view it’s exactly as if a user were clicking buttons or typing on the keyboard, which is perfect for automating Export Direct. The disadvantage is that it’s somewhat slow: you first need to start a Selenium server instance (which runs in Java on your computer) and then drive the browser through it.

The first step in making this work is installing Selenium (brew install selenium on a Mac; you will also need Java installed) and the RSelenium package. Then install the as.infoshare package that I made to automate Infoshare downloads (devtools::install_github("aaronschiff/as.infoshare")).

This package is very rudimentary, it doesn’t really do any error checking, and it also needs to copy and delete some files on your hard drive. Use at your own risk!

There are just three functions in this package:

  • start_selenium(): This starts and returns a connection to a local instance of the Selenium server which can be used for subsequent data requests from Infoshare. This process is kind of slow, so if you need to make multiple requests you can re-use the server connection.
  • download_infoshare(): Download one or more time series from Infoshare via a Selenium connection, and save them as CSV files.
  • stop_selenium(): Shut down the Selenium server when you are done with it.

Here’s a simple example. First, create a Selenium connection using the default browser, Chrome (this can be changed with the selenium_browser option). I found that Selenium is fussy about whether the version of the Chrome app matches what it thinks it should be. The default is for Selenium to use the ‘latest’ version of Chrome, but despite my copy of Chrome saying there were no updates, Selenium claimed that I had an incorrect version installed. This can be resolved by setting the chromever parameter directly. Have a look at the documentation for RSelenium::rsDriver() to see the other possible options for configuring the Selenium server connection.

library(as.infoshare)
ss <- start_selenium(chromever = "99.0.4844.51")

The next step is to request some data from Infoshare using this connection. Let’s request the unemployment rate. This series is quarterly, so the returned data will have one row per quarter, with a column for each requested series. Although Export Direct allows you to specify the time period for downloading data, in almost all cases you would want all of the available data, so all available periods are selected and I haven’t provided any option to change this.

dl <- download_infoshare(selenium_connection = ss,
                         series_ids = c("HLFQ.S1F3S"),
                         target_directory = "data")

The downloaded CSV files are saved in target_directory which must exist as a subdirectory of your current working directory. Behind the scenes, a temporary “search file” (.sch file) is also created in target_directory and then deleted after the data has been retrieved from Infoshare.

Since Selenium uses a web browser to interact with Infoshare, the downloaded CSV file is initially saved as ExportDirect.csv in your default downloads directory. This file is moved to target_directory and a timestamp is added to the filename to note when it was downloaded and to distinguish it from other files. You can use browser_dl_directory in the call to download_infoshare() to specify where to find the browser’s downloaded files (defaults to ~/Downloads on a Mac).
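The move-and-rename step can be sketched in base R as follows. This is an illustration of the idea rather than the package’s exact behaviour: the paths use a temp directory so the sketch is self-contained, a dummy ExportDirect.csv stands in for the browser’s download, and the timestamp format is my own choice.

```r
# Stand-ins for the browser download directory and target_directory:
browser_dl_directory <- tempdir()
target_directory <- file.path(tempdir(), "data")
dir.create(target_directory, showWarnings = FALSE)

# Pretend the browser has just saved a download:
src <- file.path(browser_dl_directory, "ExportDirect.csv")
writeLines("Quarter,HLFQ.S1F3S", src)

# Move the file to target_directory with a timestamp in the name:
stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
dest <- file.path(target_directory,
                  paste0("ExportDirect-", stamp, ".csv"))
file.rename(src, dest)
```

After this runs, the original ExportDirect.csv is gone and a timestamped copy sits in the target directory, which is essentially what download_infoshare() does for you.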

Finally, when we are done, we can close the Selenium connection (or keep re-using it for more queries).

stop_selenium(ss)

And that’s it really. I hope this is useful to someone. Feel free to re-use and modify this code!