When doing data analysis, sometimes you want to filter out (exclude) rows that meet some conditions. With dplyr
, one way to do this is to use filter()
with a logical not operator. For example:
df |> filter(!((name == "John") & (daughter == "Jenny")))
This will filter out all the rows where name
is John and daughter
is Jenny. This is fine as long as both logical expressions only return either TRUE
or FALSE
for all rows, but there is another possibility that is always lurking: NA
. With missing values we might run into trouble because filter()
only returns rows where the filter condition is TRUE
, but !NA
is still NA
, which is not the same as TRUE
.
Suppose that in our dataset, every row has name
but daughter
is NA
for some rows (some people don’t have a daughter). We would probably want these rows to be included in the results of our filter, because they don’t have daughter == "Jenny"
. However, if daughter
is NA
, then daughter == "Jenny"
also returns NA
, and our filter may not work properly.
There is an extra complication from evaluating the logical and operation inside filter()
. In R, TRUE & NA
returns NA
, but FALSE & NA
returns FALSE
. This is because R treats NA
as ‘unknown’ in logical operations. TRUE
and unknown is unknown since the result depends on the unknown thing. But FALSE
and unknown is FALSE
, because regardless of whether the unknown thing is TRUE
or FALSE
, the result will be FALSE
.
So what happens when we apply our filter when some rows are missing daughter
? It’s easier to see with an example. Let’s say our data is this:
name | daughter |
---|---|
John | Jenny |
John | Mary |
John | |
Peter | Jenny |
Peter | Alice |
Peter |
When we run the filter as above, we get:
name | daughter |
---|---|
John | Mary |
Peter | Jenny |
Peter | Alice |
Peter |
The John / NA
row has been omitted, because for this row, the test !((name == "John") & (daughter == "Jenny"))
returns NA
. However, the Peter / NA
row has been included. This is because although daughter == "Jenny"
returns NA
for this row, name == "John"
returns FALSE
, and for the reason explained above, !(FALSE & NA)
evaluates as TRUE
.
The moral of the story is, be careful with negative filtering! One way to fix the above example is to explicitly check for possible NA values:
df |> filter(!((name == "John") & (daughter == "Jenny") & !is.na(daughter)))
This works because if daughter
is missing, !is.na(daughter)
will return FALSE
, and as above this will cause the logical and expression to return FALSE
rather than NA
.