A pattern for analysing large datasets in R

This is a common situation for me:

Here is a pattern I use for this scenario in R. There are probably better ways of doing it, but this seems to work OK for me.

I use something like the following function for loading data and cleaning it:

load_data <- function() {
  with(.GlobalEnv, rm(list = setdiff(ls(), lsf.str())))

  # Data loading and cleaning code here ...
}

The point of putting the data loading code in a function is that it can be called only when required. The first line of this function removes all data, but not functions, from the global environment. I want to clear out all data to get rid of the temporary variables and other junk that I’ve created in figuring out the analysis.
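As a minimal sketch of what that first line does, the same setdiff(ls(), lsf.str()) idea can be tried on a throwaway environment instead of a real workspace. The names sandbox, raw, tmp and helper are made up for illustration, and passing envir = explicitly is a substitution for the with(.GlobalEnv, ...) wrapper used above.

```r
# Throwaway environment standing in for .GlobalEnv (all names are hypothetical)
sandbox <- new.env()
assign("raw", data.frame(x = 1:3), envir = sandbox)    # data
assign("tmp", 42, envir = sandbox)                     # leftover temporary variable
assign("helper", function(x) x + 1, envir = sandbox)   # a function

# ls() lists every object; lsf.str() lists only the functions.
# Their difference is the set of non-function objects.
data_names <- setdiff(ls(envir = sandbox), lsf.str(envir = sandbox))

# Remove the data and keep the functions
rm(list = data_names, envir = sandbox)

print(ls(envir = sandbox))
```

After this runs, only helper should be left in sandbox, which mirrors how the global environment keeps its functions but loses its data.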

The rest of the load_data() function does whatever is necessary to load the data. In this part of the code, I use <<- for assignment so that the variables holding the data are stored in the global environment rather than as local variables inside load_data(). I know some people think that is a very yucky thing to do. But my datasets typically have lots of variables, so I find <<- assignment easier than getting load_data() to return a list of variables and then unpacking that list into separate variables. Maybe there is a better technique?
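On the question of a better technique, one possible alternative, sketched here rather than taken from the workflow described in this post, is to have the loader return a named list and unpack it in one step with base R's list2env(). The function name load_data_as_list and the variables cars_clean and flowers_clean are hypothetical, and the built-in mtcars and iris datasets stand in for real data.

```r
# Hypothetical loader that returns its results instead of using <<-
load_data_as_list <- function() {
  cars_clean <- mtcars[mtcars$mpg > 20, ]             # stand-in for a cleaned dataset
  flowers_clean <- iris[iris$Species == "setosa", ]   # stand-in for a second dataset

  # Return every dataset by name
  list(cars_clean = cars_clean, flowers_clean = flowers_clean)
}

# Unpack each list element into a global variable of the same name
invisible(list2env(load_data_as_list(), envir = .GlobalEnv))

print(nrow(cars_clean))
print(nrow(flowers_clean))
```

This keeps the loader free of side effects, at the cost of naming each variable once more in the returned list.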

The rest of the code is just a call to load_data(), which I run once and then comment out while I’m doing analysis, followed by all the analytical code, which may or may not live in other functions depending on how complicated it is.

Here’s what it looks like all together:

load_data <- function() {
  # Remove all data, but not functions, from the global environment
  with(.GlobalEnv, rm(list = setdiff(ls(), lsf.str())))

  # Data loading and cleaning code here ...
}

load_data()

# Analytical code and functions here ...

So the first time I run this code, it loads the data and does the analysis. For subsequent runs, I comment out the call to load_data() so that only the analysis part runs, which is much faster. In RStudio I can tweak the analysis code, hit Command-Shift-S to run the code, see the results, and repeat very quickly. If I ever need to go back to a clean dataset, I just un-comment the call to load_data(), run the code once, and comment it out again.
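A possible variation, not part of the workflow described above, is to guard the call so that it runs only when the data is missing, which avoids editing the script between runs. This sketch assumes the loader creates a global variable called survey, a made-up name, and it uses a tiny data frame in place of a slow load.

```r
load_data <- function() {
  # Stand-in for slow loading and cleaning; `survey` is a hypothetical name
  survey <<- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
}

# Reload only when the data is not already in the global environment
if (!exists("survey", envir = .GlobalEnv)) {
  load_data()
}

# Analytical code runs every time
print(mean(survey$score))
```

To force a fresh load under this variation, remove the variable with rm(survey) or call load_data() directly before re-running the script.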
