Aaron’s blog

To pie or not to pie

Some discussion on Twitter about pie charts lately, starting from here:

During this discussion Chris made the point that pie charts can communicate “part-whole” relationships (ie distributions) better than bar charts.

A couple of days later, someone else I follow retweeted this older tweet from Max Roser:

Here’s the accompanying picture:


Max’s point is that the three pie charts in his example look pretty much the same, when in fact there are differences between the values of the five categories in each, as shown clearly in the bar charts. But the thing is, the proportion of each category relative to the total (the part-whole relationship) is pretty much the same in each of these three cases. This is what the pies tell us and as long as we respect that and don’t expect them to give us accurate comparisons of absolute values across categories, a pie is not a bad choice. Pies may in fact be a better choice than the bar charts depending on whether we want to communicate proportions or absolutes.

You can see proportions quickly and intuitively with the pies in Max’s example. To see this from the bar charts, you have to do some mental arithmetic to add up the totals across the bars and divide each one by that total. That arithmetic is not too hard in this particular example but would be a lot more difficult if some of the categories were quite different from the others and the total was not a nice number like 100.
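To make that mental arithmetic concrete, here is a minimal sketch in Python. The bar values are made up for illustration (they are not the numbers in Max’s chart), and I have deliberately picked a total that is not a round number.

```python
# Hypothetical bar-chart values for five categories
# (illustrative only; not the numbers from Max's chart).
values = [31, 27, 42, 19, 18]


def proportions(values):
    """Return each value as a share of the total, which is what a pie encodes."""
    total = sum(values)
    return [v / total for v in values]


for value, share in zip(values, proportions(values)):
    print(f"value={value:>3}  share={share:.1%}")
```

A pie chart does this division for the reader; a bar chart leaves it to them.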

Personally I’ve avoided using pie charts for a long time, wary of their well-known pitfalls. But Chris’s tweet made me realise that, used appropriately, pie charts can be a good choice for displaying proportions. It’s hard for people to infer exact values from pies, as illustrated by Max’s example, but if you just want to communicate a sense of the proportions, pie charts can be useful. The data values can always be labelled on the chart if higher accuracy is needed.

Country name disambiguation

In a previous post I whinged about lack of standards in categorical data causing headaches when merging multiple datasets. For example: South Korea; Korea, South; Republic of Korea; etc …

In an ideal world, everyone would use well-defined standards, such as ISO 3166 for country names. In reality that’s unlikely, so we need tools to help us navigate the mess.

My friend Alyona suggested that it would be possible to use Wikipedia’s disambiguation links to build a database of alternative names for each country, and use that as a lookup table to translate the various alternative names into a common standard. Very helpfully, she pointed me to DBpedia, which has turned Wikipedia into a set of data files, including a “redirects” list, which is exactly what is needed to construct the desired lookup table.

So I set out to do this. From Wikipedia I scraped the ISO 3166 country names and used that as the standard to which all other alternatives would be cross-referenced. Then I parsed the DBpedia redirects dataset to determine all the things that get redirected to each country’s Wikipedia page. This includes alternative names for the country, common mis-spellings, as well as a few other topics that are covered on the country’s page and don’t have their own Wikipedia page. The latter are irrelevant for the country name disambiguation list, but I couldn’t figure out an automatic way to exclude them, and I think the only real harm of including them is that the lookup table is a bit larger than it needs to be.
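The idea can be sketched in a few lines of Python. This is not the full code linked below, just an illustration under some assumptions: that the redirects file is in N-Triples form with a wikiPageRedirects predicate, that the standard names match Wikipedia page titles exactly, and that the sample lines (which I wrote by hand) look like the real ones.

```python
import re
from urllib.parse import unquote

# Assumed layout of one redirect line: <alias> <...wikiPageRedirects> <target> .
RESOURCE = r"<http://dbpedia\.org/resource/([^>]+)>"
REDIRECT = re.compile(RESOURCE + r"\s+<[^>]*wikiPageRedirects>\s+" + RESOURCE + r"\s*\.")

# Hand-written sample lines for illustration, not copied from the real dump.
SAMPLE = [
    "<http://dbpedia.org/resource/Enzed> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/New_Zealand> .",
    "<http://dbpedia.org/resource/Republic_of_Korea> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/South_Korea> .",
    "<http://dbpedia.org/resource/Piechart> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/Pie_chart> .",
]


def page_title(resource):
    """Turn a resource name such as 'New_Zealand' into a page title."""
    return unquote(resource).replace("_", " ")


def build_lookup(lines, standard_names):
    """Map every alias that redirects to a standard name onto that name."""
    lookup = {name: name for name in standard_names}
    for line in lines:
        match = REDIRECT.match(line.strip())
        if match is None:
            continue  # not a redirect triple
        alias, target = page_title(match.group(1)), page_title(match.group(2))
        if target in standard_names:
            lookup[alias] = target
    return lookup


lookup = build_lookup(SAMPLE, {"New Zealand", "South Korea"})
print(lookup["Enzed"])
```

Redirects to pages that are not on the standard list (Piechart in the sample) are simply ignored.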

The resulting lookup table is here in CSV format and the full code is here. The code is a bit fragile; if you have any suggestions for improvements, let me know.

So what are the alternative names for New Zealand? (I’ve manually excluded the obviously irrelevant ones here):

NewZealand, AoTeAroa, Nz, Niu Tireni, Nu Tirani, New zealand, New Zealand’s, New zeeland, New zeland, NZ, NZL, New zelanad, N.Z., Staten Landt, New Zaeland, NEW Z, N z, Kiwistan, New Zealnd, New Zeeland, Newzealand, Nova Zeelandia, Staaten land, Staten Land, N. Zealand, Enzed, Nouvelle-Zelande, NEW ZEALAND, Sheepland, Kiwiland, New Zealnad, New Zealand., New+Zealand, New-Zealand, New.Zealand, N Zealand, Maoriland, New Xealand, New Zealand, Mew Zealand, New Zealend, New Zeland, Aotearoa / New Zealand, N Z, Māoria, Neo Zealand

Not bad I think. Who put Sheepland in there? Must be an Australian, surely.

The single house zone

There has been some discussion in the media lately about the extent of the “single house zone” in Auckland. See for example this article, which makes the very good point that the single house zone does not allow anyone to live there who cannot afford a single house.

Where is this zone? The basic zones in the proposed (NB not yet finalised) Unitary Plan are here (thanks, @nzsd!). This is what the proposed single house zone looks like:

single house zone

What can you build (or not) in this zone? This PDF file summarises. There is a minimum lot size of 600 square metres on which your single house must sit. In addition there is a long list of other requirements, such as (I paraphrase):

  • Minimum yards: 5m in front and 1m at the back and sides
  • Maximum impervious area of 60% of the site
  • Maximum building coverage of 35% of the site
  • Buildings not more than 2.5m high at the boundary, and thereafter set back from the boundary by 1m for each extra 1m in height
  • 40% of the site must be a landscaped area with at least 10% planted with shrubs and one specimen tree that will become “significantly large”; 50% of the front yard must be landscaped
  • Outdoor living space of at least 80 square metres
  • If the principal living room is at ground level, part of the outdoor living space must be in a 20 square metre area that is directly accessible from the living room

None of this helps to make houses more affordable.
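To get a feel for what those percentages mean on the ground, here is a rough worked example in Python for a site of exactly the minimum size. It only covers the rules as I have paraphrased them above; the function names are mine, and the real plan may well contain other limits (such as an overall height cap) that are not modelled here.

```python
def site_limits(site_area_m2):
    """Area limits implied by the paraphrased percentage rules."""
    return {
        "max_building_coverage": site_area_m2 * 35 / 100,
        "max_impervious_area": site_area_m2 * 60 / 100,
        "min_landscaped_area": site_area_m2 * 40 / 100,
    }


def max_height_at(distance_from_boundary_m):
    """Height limit under the paraphrased boundary rule:
    2.5m at the boundary plus 1m for each 1m of setback."""
    return 2.5 + distance_from_boundary_m


for rule, area in site_limits(600).items():
    print(f"{rule}: {area:.0f} square metres")

# At the minimum 1m side yard the wall can be at most this high:
print(f"height limit 1m from the boundary: {max_height_at(1)}m")
```

So on a 600 square metre site, the house itself can cover at most 210 square metres, and 240 square metres must be landscaped.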

Some quick maps of NZ roads

As a learning exercise based on Nathan Yau’s excellent tutorials, I made some simple maps of the LINZ road centrelines data in R. If you’re interested, my code is here.

Here’s the result (click for bigger):


A day in the life of a data janitor

I’m working on a project for which I need data on the resident population of the cities served by a few hundred airports around the world. I’m going to use this to illustrate some of the basic, dull, and frustrating challenges with doing data analysis.

My list of airports is given by their 3-letter IATA codes (LAX, AKL, YVR, etc), so first I need a table that translates the codes into cities and countries. There are many commercial sources of such data but I don’t have budget for that. Fortunately, Wikipedia has a pretty good list.

The first challenge is downloading the list — Wikipedia has split it into 26 pages. I could manually copy and paste these into a spreadsheet, but after some very kind help from Chris McDowall and some hacking of my own, I made a script to automatically download the tables into a single CSV file. Neat! Now let the fun begin …

The Wikipedia table is pretty good, but it has some issues that I had to clean up before I could use it. There are some inconsistencies in country names, for example “England” appears once, while “United Kingdom” is used for all others. Similarly there is one “Madagascar.” (note the .), México and Mexico, People’s Republic of China and China, etc.
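A clean-up like this usually ends up as a hand-built table of corrections. A minimal sketch in Python, using only the examples mentioned above (the real list of fixes was longer, and the function name is just for illustration):

```python
# Corrections for the inconsistencies mentioned above.
# Both apostrophe styles are listed because scraped text can contain either.
COUNTRY_FIXES = {
    "England": "United Kingdom",
    "Madagascar.": "Madagascar",
    "México": "Mexico",
    "People’s Republic of China": "China",
    "People's Republic of China": "China",
}


def clean_country(name):
    """Trim whitespace and apply a known correction, if there is one."""
    name = name.strip()
    return COUNTRY_FIXES.get(name, name)


for raw in ["England", " Madagascar. ", "México", "New Zealand"]:
    print(repr(raw), "->", clean_country(raw))
```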

In addition, some of the airport locations are recorded in a special way, such as Barcelona’s airport, which is recorded as being in “El Prat de Llobregat (near Barcelona), Spain”. Now this is actually quite helpful for me, because I want to use the population of Barcelona rather than El Prat in my analysis. But because this information is recorded in the location field, I had to go through and identify the airports that are “near” some city and modify them accordingly.
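If the “(near …)” pattern were perfectly consistent, a regular expression could pull out the city being served. The sketch below assumes exactly that format, plus a “City, Country” layout for everything else; in practice some entries will not fit and still need checking by hand.

```python
import re

# Assumes the nearby city is always written as "(near Some City)".
NEAR = re.compile(r"\(near ([^)]+)\)")


def served_city(location):
    """Return the city an airport serves, preferring the '(near X)' hint."""
    match = NEAR.search(location)
    if match:
        return match.group(1).strip()
    # Otherwise take whatever comes before the first comma.
    return location.split(",")[0].strip()


print(served_city("El Prat de Llobregat (near Barcelona), Spain"))
print(served_city("Auckland, New Zealand"))
```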

Now for the population data! There are a few different places where you can get city population data for free. For example, Demographia publishes lists of city populations in a PDF file, which I liberated using Tabula.

Population data in hand, now comes the fun part of trying to match this with the airports. The difficulty of course comes from inconsistencies in naming of cities and countries, including city names with accents and other special characters. As Alyona Medelyan pointed out to me on Twitter, the potential problem is pretty bad:

What a mess! Suffice to say, I spent the whole afternoon messing around with this. In my next post I’ll talk about why data standards are really important, but why the system for making and communicating such standards is broken so that it’s hard to adhere to standards even if you want to.
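One partial remedy for the accent problem is to compare names on a simplified “key” rather than on the raw text. Here is a minimal sketch using Python’s standard unicodedata module; the function name is mine, and this only handles accents, case and stray spaces, not genuinely different spellings.

```python
import unicodedata


def match_key(name):
    """Build a comparison key: strip accents, ignore case and extra spaces."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


print(match_key("México") == match_key("Mexico"))
print(match_key("São  Paulo") == match_key("sao paulo"))
```

Letters such as ø or ł do not decompose this way, so they would still need their own entries in a corrections table.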

Data Counsel

Update: Wiki New Zealand is now Figure.NZ. Read about the reasons for the new name here.

I’m really excited to be appointed as Data Counsel at Figure.NZ (formerly Wiki New Zealand), and I could be your Data Counsel too. What’s a Data Counsel? I’ll explain, but first…

Figure.NZ is on a mission to democratise data by making it usable by everyone. There’s tons of fascinating public data out there, but for the most part it’s trapped in obstinate spreadsheets and clunky web tools. Figure.NZ has built some really cool software called Grace that liberates this data and turns it into friendly charts and tables, and also serves it up via an API. One of the super-talented developers, Nigel McNie, explains a bit more about Grace here.

This is really important because data only creates value when it is used. Before Figure.NZ, using New Zealand’s public data required a lot of specialised skills and knowledge. Now all you need is curiosity. This means that vastly more people will be able to use data and generate value from it.

So what is a Data Counsel? Lillian Grace, Figure.NZ’s Founder and CEO, created this term for me. It is inspired by legal counsel, who advise, solve problems, and dispense general wisdom. This is essentially what I’ll be doing for Figure.NZ, its clients and its users but obviously in relation to data instead of law. You can have your own Data Counsel too — I’m still available for hire and consulting work directly.

As well as data publishing, Figure.NZ often gets asked by companies, government, individuals, industry groups, and others for advice on how to think about or use data, and sometimes this is internal or private data that falls outside Figure.NZ’s main focus. Sometimes the guidance can be easily and freely given, sometimes it turns into a project that sees more data published on Figure.NZ, and sometimes it requires really specialised work. I’ll be helping with all of these things.

Always ahead of the curve, Lillian also recently appointed a Chief Data Officer, Andrea Carboni. The first sentence Lillian usually says about Andrea is “He’s changed my life!”. Andrea heads up all Figure.NZ’s data work full-time, guides the data team, and makes sure everything that’s published is accurate and adheres to the Figure.NZ graph usability standards. I expect we’ll be seeing lots of other organisations sprouting Chief Data Officers and engaging Data Counsel in short order.

I’m super excited and grateful to be able to help such a talented group of people who are doing important and valuable work. We have some great things coming soon, so stay tuned!

Household access to the internet in NZ

The NZ Census asks a bunch of questions about household access to various types of telecommunications services, like fixed-line phones, faxes, mobile phones, and the internet.

The internet question is a bit crude: it does not distinguish between dial-up and broadband, for example, and simply asks about access rather than use. But anyway, the following chart shows the relationship between household access to the internet and median household income, for the 2001 and 2013 Census results by area unit. In both cases I’ve used the 2013 median household income, which is strongly correlated with 2001 median income.


Obviously, household access to the internet has increased dramatically over this period. A positive correlation between income and internet access is clear, although this probably will not be entirely an income effect and probably reflects other factors like the difficulty and cost of internet access in rural areas.

There are a few areas at the low end of the income scale that have very high household internet access in 2013, likely due to things like the government’s rural broadband initiative. It also seems that, at higher income levels at least, the distribution of internet access has become more compressed. In 2001 the rate of internet access for area units with 2013 household income > $100,000 was between about 50% and 75%; in 2013 these rates are more tightly clustered around 90%.

This is part of my NZ Census Challenge series, visualising all questions in the 2013 NZ Census.

Budget 2015 visualisations

The NZ Herald published some great visualisations of Budget data yesterday. My favourite was Harkanwal Singh’s bubble plot that simply shows percentage changes in expenditure categories.


The really nice thing about this is that the bubbles move and interact with each other. The result is a slightly chaotic but still very legible breakdown of expenditure. I like this because it doesn’t try to do too much — it gives you an overall sense of the 2015 Budget and the changes since the previous Budget, but you don’t get lost in the detail. The physics makes it interesting to explore and every time you get a slightly different visual display but the overall message is the same. And believe me, that bubble physics stuff is hard to do!

Also on the NZ Herald site, Keith Ng made an interactive chart that compares past budget forecasts with actual results. This is a more traditional visualisation in that it uses line charts, and it is a really interesting and useful piece of work. Forecasting is hard to do accurately, but the one thing you really want to avoid is consistently over- or under-estimating.
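A simple way to check for that kind of consistent bias is to look at the average signed error alongside the average absolute error. The sketch below uses made-up numbers purely for illustration (they are not taken from Keith’s chart or from any Budget).

```python
def mean_error(forecasts, actuals):
    """Average signed error: positive means over-estimating on average."""
    errors = [f - a for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


def mean_absolute_error(forecasts, actuals):
    """Average size of the error, ignoring its direction."""
    errors = [abs(f - a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


# Hypothetical forecasts and outcomes, in percent.
forecasts = [5.0, 5.5, 6.0]
actuals = [5.5, 6.0, 6.5]

print("mean error:", mean_error(forecasts, actuals))
print("mean absolute error:", mean_absolute_error(forecasts, actuals))
```

If the signed error is close to zero but the absolute error is large, the forecasts are noisy but not biased; if the two are about the same size, the errors all point the same way.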

For example, the unemployment rate forecasts:


In the short term these look not too bad, although it looks like some of the earlier forecasts were too optimistic about the rate at which unemployment would fall. In the long term, all of the forecasts seem to be assuming that unemployment reverts to a long-run average (about 4.5%). This is a common forecasting assumption, which you may or may not agree with. But it shows the value of this exercise by revealing assumptions that are not otherwise obvious.

On the NBR site, Keith also published an interactive treemap that breaks down budget revenue and expenditure into all its glorious detail. One thing I like about this is that it includes the revenue side of things as well, which is often overlooked in all the emphasis on who is getting the money:


Guess what is the second biggest box? NZ Customs. Ok, it’s mostly GST on imports, but still there is $2.3 billion in revenue from customs duties, ie taxes on imported goods.

Addendum: Somehow I missed Chris McDowall’s interesting dot chart breakdown of expenditure:


As Chris says, it’s a work in progress, but I think this is quite a neat way to see where the money goes and what the government’s priorities are.

Proportion of quaxing households

The next item in the Census challenge is the number of motor vehicles per household, and I thought I’d use this to look at the rate of quaxing in New Zealand cities. By definition households with no motor vehicles must quax, so for each Census area unit I’ve calculated the proportion of such households, for 2013.

[Addendum: Obviously households with motor vehicles can quax sometimes too, so the overall rate of quaxing is certainly greater than implied by the maps below].
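The calculation itself is trivial. As a sketch in Python, assuming each area unit comes as a count of households by number of motor vehicles (the area names and counts below are invented, not Census figures):

```python
def quaxing_share(households_by_vehicles):
    """Share of households with no motor vehicle.

    households_by_vehicles maps number of vehicles -> number of households.
    Returns None when the area unit has no households recorded.
    """
    total = sum(households_by_vehicles.values())
    if total == 0:
        return None
    return households_by_vehicles.get(0, 0) / total


# Invented counts for two hypothetical area units.
area_units = {
    "Central": {0: 400, 1: 450, 2: 150},
    "Suburb": {0: 30, 1: 300, 2: 420, 3: 250},
}

for name, counts in area_units.items():
    print(f"{name}: {quaxing_share(counts):.0%} of households have no vehicle")
```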

Here’s Auckland. The city centre is obviously a hotbed of quaxing. There’s also a fair number of quaxing households in the western and southern suburbs, and an interesting pocket out east. Not so much quaxing on the North Shore however.


Wellington. Lots of quaxing households in the central area, and pockets of quaxing around the satellite centres.


Christchurch. Fewer quaxing households than I expected here, given the flat terrain.


Rent to income ratio

Yikes, it’s been too long since I did a Census challenge post. Been busy with work and stuff …

The next question is weekly rent for households in rented dwellings. One of the results reported is the median weekly rent. I decided to calculate the ratio of that to median weekly household income (for 2013) and put it on a map by Census area unit.

Note I have not normalised for the number of households or population in each area unit. So the maps below do not reflect the number of people or households that are experiencing a high or low rent to income ratio. The maps just show how these ratios vary geographically. (Also something has gone screwy with QGIS font rendering in the latest version so the labels are looking a bit weird).
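For anyone wanting to reproduce the numbers, the ratio and the bands used for colouring can be sketched like this in Python. The function names and example figures are mine, and the sketch assumes income is already expressed per week (an annual figure would need dividing by 52 first).

```python
def rent_to_income(median_weekly_rent, median_weekly_income):
    """Ratio of median weekly rent to median weekly household income."""
    return median_weekly_rent / median_weekly_income


def band(ratio, width_pct=10):
    """Label the band a ratio falls in, e.g. 0.25 -> '20-30%'."""
    lower = int((ratio * 100) // width_pct) * width_pct
    return f"{lower}-{lower + width_pct}%"


# Hypothetical area unit: $300 weekly rent, $1,200 weekly household income.
ratio = rent_to_income(300, 1200)
print(ratio, band(ratio))
```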

Here’s Auckland. The one thing that surprised me initially was the relatively high ratio of rent to income in the central city. It’s possibly explained by the high student population, some of whom probably reported low income if they are being supported by their parents or other sources. Interesting also the pockets of higher ratios here and there (but again, this does not reflect the number of households experiencing those ratios).


Welly. There seem to be fewer areas with higher ratios.


And Christchurch, fairly uniform here, given the 10% bands that I’ve set.