Aaron’s blog

Untangling CSV files

Statistics New Zealand has started publishing some raw CSV data files. Hooray! However, these files are quite large, and I’ve found it challenging to make sense of them. For example, the GDP file includes basically every GDP data series that Stats NZ publishes, and weights in at over 68,000 rows.

The different data series are distinguished by categorical (non-numeric) fields in the file. To make sense of these it’s helpful to understand what the categorical variables are and what values these variables take. For example in the GDP file there are categorical variables called “Series_reference”, “STATUS”, “UNITS”, “Subject”, “Group”, and the more cryptic “Series_title_1″, “Series_title_2″, and “Series_title_3″. If you know the values that these variables can take, you can get a sense of the coverage of data in the file.

With this in mind, I wrote a bit of R code to parse a CSV file, identify the categorical variables, and report the unique values of each variable, as well as the number of times that each value appears in the file. It will also optionally plot bar charts of the most common values for each categorical variable.

So I think this is useful for getting a quick overview of the types of data that may be in a CSV file. Something I hope to add in future is an overview of the hierarchy of values of the categorical variables, if they are part of a nested classification like ANZSIC. The end result could be something like this.