Data Wrangling Part 3: Basic and more advanced ways to filter rows

This is the third blog post in a series of dplyr tutorials. In this post, we will cover how to filter your data. Apart from the basics of filtering, it covers some more nifty ways to filter numerical columns with near() and between(), or string columns with regex.

Content:

Basic row filters
Filtering across multiple columns

The data
As per previous blog posts, many of these functions truly shine when you have a lot of columns, but to make it easy on people to copy paste code and experiment, I’m using a built-in dataset. This dataset is built into ggplot2, so if you load tidyverse you will get it. Otherwise, just add once msleep <- ggplot2::msleep argument to have the dataset available.

library(dplyr)
library(stringr)
msleep <- ggplot2::msleep

glimpse(msleep)

## Observations: 83
## Variables: 11
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...

Basic row filters

In many cases you don’t want to include all rows in your analysis but only a selection of rows. The function to use only specific rows is called filter() in dplyr. The general syntax of filter is: filter(dataset, condition). In case you filter inside a pipeline, you will only see the condition argument as the dataset is piped into the function.

Filtering rows based on a numeric variable

You can filter numeric variables based on their values. The most used operators for this are >, >=, <, <=, == and !=.

msleep %>% 
  select(name, sleep_total) %>% 
  filter(sleep_total > 18)

## # A tibble: 4 x 2
##   name                 sleep_total
##   <chr>                      <dbl>
## 1 Big brown bat               19.7
## 2 Thick-tailed opposum        19.4
## 3 Little brown bat            19.9
## 4 Giant armadillo             18.1

If you want to select a range of values you can use two logical requirements. For instance to select all animals with a total sleep time between 15 and 18 hours, I could use: filter(sleep_total >= 16, sleep_total <= 18), but there is a slightly shorter way by using the between() function.

msleep %>% 
  select(name, sleep_total) %>% 
  filter(between(sleep_total, 16, 18))

## # A tibble: 4 x 2
##   name                   sleep_total
##   <chr>                        <dbl>
## 1 Owl monkey                    17.0
## 2 Long-nosed armadillo          17.4
## 3 North American Opossum        18.0
## 4 Arctic ground squirrel        16.6

Another function that can come in handy is near(), which will select all code that is nearly a given value. You have to specify a tolerance tol to indicate how far the values can be. You can add a specific number: filter(near(sleep_total, 17, tol = 0.5)) for instance will return any rows where sleep_total is between 16.5 and 17.5, or you can add a formula.
The sample code will return all rows that are within one standard deviation of 17.

msleep %>% 
  select(name, sleep_total) %>% 
  filter(near(sleep_total, 17, tol = sd(sleep_total)))

## # A tibble: 26 x 2
##    name                       sleep_total
##    <chr>                            <dbl>
##  1 Owl monkey                        17.0
##  2 Mountain beaver                   14.4
##  3 Greater short-tailed shrew        14.9
##  4 Three-toed sloth                  14.4
##  5 Long-nosed armadillo              17.4
##  6 North American Opossum            18.0
##  7 Big brown bat                     19.7
##  8 Western american chipmunk         14.9
##  9 Thick-tailed opposum              19.4
## 10 Mongolian gerbil                  14.2
## # ... with 16 more rows

Filtering based on a exact character variable matches

If you want to select a specific group of animals for instance you can use the == comparison operator:

msleep %>% 
  select(order, name, sleep_total) %>% 
  filter(order == "Didelphimorphia")

## # A tibble: 2 x 3
##   order           name                   sleep_total
##   <chr>           <chr>                        <dbl>
## 1 Didelphimorphia North American Opossum        18.0
## 2 Didelphimorphia Thick-tailed opposum          19.4

Similarly you can use the other operators:

filter(order != "Rodentia") will select everything except the Rodentia rows.
filter(name > "v") will just select the rows with a name in the alphabet after the letter v.

If you want to select more than one animal you can use the %in% operator. The following code will just select the rows with animals belonging to the order of Didelphimorphia and Diprotodontia.

msleep %>% 
  select(order, name, sleep_total) %>% 
  filter(order %in% c("Didelphimorphia", "Diprotodontia"))

## # A tibble: 4 x 3
##   order           name                   sleep_total
##   <chr>           <chr>                        <dbl>
## 1 Didelphimorphia North American Opossum        18.0
## 2 Didelphimorphia Thick-tailed opposum          19.4
## 3 Diprotodontia   Phalanger                     13.7
## 4 Diprotodontia   Potoroo                       11.1

You can use the %in% operator to deselect certain groups as well, in this case you have to negate by adding an exclamation mark at the beginning of your filter. Making a !%in% might seem logic but it won’t work.

remove <- c("Rodentia", "Carnivora", "Primates")
msleep %>% 
  select(order, name, sleep_total) %>% 
  filter(!order %in% remove)

## # A tibble: 37 x 3
##    order           name                       sleep_total
##    <chr>           <chr>                            <dbl>
##  1 Soricomorpha    Greater short-tailed shrew       14.9 
##  2 Artiodactyla    Cow                               4.00
##  3 Pilosa          Three-toed sloth                 14.4 
##  4 Artiodactyla    Roe deer                          3.00
##  5 Artiodactyla    Goat                              5.30
##  6 Soricomorpha    Star-nosed mole                  10.3 
##  7 Soricomorpha    Lesser short-tailed shrew         9.10
##  8 Cingulata       Long-nosed armadillo             17.4 
##  9 Hyracoidea      Tree hyrax                        5.30
## 10 Didelphimorphia North American Opossum           18.0 
## # ... with 27 more rows

Filtering rows based on regex

The above options will only work if you can use the full variable content. In some cases though it will be needed to filter based on partial matches. In this case, we need a function that will evaluate regular expressions on strings and return boolean values. Whenever the statement is TRUE the row will be filtered.
There are two main options for this: base R’s grepl() function, or str_detect() from the stringr package.

Whenever you are looking for partial matches, it is important to remember that R is case sensitive. By just using filter(str_detect(name, pattern="mouse")) we would leave out any row called Mouse. In this case it does not make a difference, but it’s a good habit to create.

I used str_detect() below as it is easier to understand. For those interested, the alternative would be: filter(grepl(pattern="mouse", tolower(name))).

msleep %>% 
  select(name, sleep_total) %>% 
  filter(str_detect(tolower(name), pattern = "mouse"))

## # A tibble: 5 x 2
##   name                       sleep_total
##   <chr>                            <dbl>
## 1 Vesper mouse                      7.00
## 2 House mouse                      12.5 
## 3 Northern grasshopper mouse       14.5 
## 4 Deer mouse                       11.5 
## 5 African striped mouse             8.70

Filtering based on multiple conditions

The above examples return rows based on a single condition, but the filter option also allows AND and OR style filters:

filter(condition1, condition2) will return rows where both conditions are met.
filter(condition1, !condition2) will return all rows where condition one is true but condition 2 is not.
filter(condition1 | condition2) will return rows where condition 1 and/or condition 2 is met.
filter(xor(condition1, condition2) will return all rows where only one of the conditions is met, and not when both conditions are met.

Multiple AND, OR and NOT conditions can be combined. The sample code will return all rows with a bodywt above 100 and either have a sleep_total above 15 or are not part of the Carnivora order.

msleep %>% 
  select(name, order, sleep_total:bodywt) %>% 
  filter(bodywt > 100, (sleep_total > 15 | order != "Carnivora"))

## # A tibble: 10 x 8
##    name      order  sleep_total sleep_rem sleep_cycle awake brainwt bodywt
##    <chr>     <chr>        <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
##  1 Cow       Artio~        4.00     0.700       0.667 20.0    0.423    600
##  2 Asian el~ Probo~        3.90    NA          NA     20.1    4.60    2547
##  3 Horse     Peris~        2.90     0.600       1.00  21.1    0.655    521
##  4 Donkey    Peris~        3.10     0.400      NA     20.9    0.419    187
##  5 Giraffe   Artio~        1.90     0.400      NA     22.1   NA        900
##  6 Pilot wh~ Cetac~        2.70     0.100      NA     21.4   NA        800
##  7 African ~ Probo~        3.30    NA          NA     20.7    5.71    6654
##  8 Tiger     Carni~       15.8     NA          NA      8.20  NA        163
##  9 Brazilia~ Peris~        4.40     1.00        0.900 19.6    0.169    208
## 10 Bottle-n~ Cetac~        5.20    NA          NA     18.8   NA        173

Example with xor()

msleep %>%
  select(name, bodywt:brainwt) %>% 
  filter(xor(bodywt > 100, brainwt > 1))

## # A tibble: 5 x 3
##   name            bodywt brainwt
##   <chr>            <dbl>   <dbl>
## 1 Cow              600     0.423
## 2 Horse            521     0.655
## 3 Donkey           187     0.419
## 4 Human             62.0   1.32 
## 5 Brazilian tapir  208     0.169

Example with !:
The sample code will select all rows where brainwt is larger than 1, but bodywt does not exceed 100.

msleep %>% 
  select(name, sleep_total, brainwt, bodywt) %>% 
  filter(brainwt > 1, !bodywt > 100)

## # A tibble: 1 x 4
##   name  sleep_total brainwt bodywt
##   <chr>       <dbl>   <dbl>  <dbl>
## 1 Human        8.00    1.32   62.0

Filtering out empty rows

To filter out empty rows, you negate the is.na() function inside a filter:
The sample code will remove any rows where conservation is NA.

msleep %>% 
  select(name, conservation:sleep_cycle) %>% 
  filter(!is.na(conservation))

## # A tibble: 54 x 5
##    name                     conservation sleep_total sleep_rem sleep_cycle
##    <chr>                    <chr>              <dbl>     <dbl>       <dbl>
##  1 Cheetah                  lc                 12.1     NA          NA    
##  2 Mountain beaver          nt                 14.4      2.40       NA    
##  3 Greater short-tailed sh~ lc                 14.9      2.30        0.133
##  4 Cow                      domesticated        4.00     0.700       0.667
##  5 Northern fur seal        vu                  8.70     1.40        0.383
##  6 Dog                      domesticated       10.1      2.90        0.333
##  7 Roe deer                 lc                  3.00    NA          NA    
##  8 Goat                     lc                  5.30     0.600      NA    
##  9 Guinea pig               domesticated        9.40     0.800       0.217
## 10 Grivet                   lc                 10.0      0.700      NA    
## # ... with 44 more rows

Filtering across multiple columns

The dplyr package has a few powerful variants to filter across multiple columns in one go:

filter_all() will filter all columns based on your further instructions
filter_if() requires a function that returns a boolean to indicate which columns to filter on. If that is true, the filter instructions will be followed for those columns.
filter_at() requires you to specify columns inside a vars() argument for which the filtering will be done.

In these cases, there is a general syntax: first you specify which columns, then you mention the condition for the filter. In many cases you will need a . operator within the condition which refers to the values we are looking at.

Filter all

Admittedly, msleep is not the best database to showcase this power, but imagine you have a database with a few columns and you want to select all rows that have a certain word in either column. Take a financial dataframe for instance and you want to select all rows with ‘food’, whether food is mentioned in the main category column, the subcategory column, the comments column or the place you’ve spent it.
You could make a long filter statement with 4 different conditions wrapped inside OR statements. Or you just filter across all columns for the string “food”.

In the sample code below I’m searching for the string “Ca” across all columns. I want to keep rows where the string “Ca” is present in ANY of the variables, so I will wrap the condition in any_vars().
The below code basically asks to retain any rows where any of the variables has the pattern “Ca” inside.

msleep %>% 
  select(name:order, sleep_total, -vore) %>% 
  filter_all(any_vars(str_detect(., pattern = "Ca")))

## # A tibble: 16 x 4
##    name              genus        order        sleep_total
##    <chr>             <chr>        <chr>              <dbl>
##  1 Cheetah           Acinonyx     Carnivora          12.1 
##  2 Northern fur seal Callorhinus  Carnivora           8.70
##  3 Vesper mouse      Calomys      Rodentia            7.00
##  4 Dog               Canis        Carnivora          10.1 
##  5 Roe deer          Capreolus    Artiodactyla        3.00
##  6 Goat              Capri        Artiodactyla        5.30
##  7 Guinea pig        Cavis        Rodentia            9.40
##  8 Domestic cat      Felis        Carnivora          12.5 
##  9 Gray seal         Haliochoerus Carnivora           6.20
## 10 Tiger             Panthera     Carnivora          15.8 
## 11 Jaguar            Panthera     Carnivora          10.4 
## 12 Lion              Panthera     Carnivora          13.5 
## 13 Caspian seal      Phoca        Carnivora           3.50
## 14 Genet             Genetta      Carnivora           6.30
## 15 Arctic fox        Vulpes       Carnivora          12.5 
## 16 Red fox           Vulpes       Carnivora           9.80

The same can be done for numerical values: This code will retain any rows that has any value below 0.1:

msleep %>%  
  select(name, sleep_total:bodywt) %>% 
  filter_all(any_vars(. < 0.1))

## # A tibble: 47 x 7
##    name           sleep_total sleep_rem sleep_cycle awake  brainwt  bodywt
##    <chr>                <dbl>     <dbl>       <dbl> <dbl>    <dbl>   <dbl>
##  1 Owl monkey           17.0      1.80       NA      7.00  1.55e-2 4.80e-1
##  2 Greater short~       14.9      2.30        0.133  9.10  2.90e-4 1.90e-2
##  3 Vesper mouse          7.00    NA          NA     17.0  NA       4.50e-2
##  4 Dog                  10.1      2.90        0.333 13.9   7.00e-2 1.40e+1
##  5 Roe deer              3.00    NA          NA     21.0   9.82e-2 1.48e+1
##  6 Guinea pig            9.40     0.800       0.217 14.6   5.50e-3 7.28e-1
##  7 Chinchilla           12.5      1.50        0.117 11.5   6.40e-3 4.20e-1
##  8 Star-nosed mo~       10.3      2.20       NA     13.7   1.00e-3 6.00e-2
##  9 African giant~        8.30     2.00       NA     15.7   6.60e-3 1.00e+0
## 10 Lesser short-~        9.10     1.40        0.150 14.9   1.40e-4 5.00e-3
## # ... with 37 more rows

The any_vars() statement is equivalent to OR, so of course there is an equivalent for AND statements as well:all_vars(). The below code will retain any rows where all values are above 1.

msleep %>%  
  select(name, sleep_total:bodywt, -awake) %>% 
  filter_all(all_vars(. > 1))

## # A tibble: 1 x 6
##   name  sleep_total sleep_rem sleep_cycle brainwt bodywt
##   <chr>       <dbl>     <dbl>       <dbl>   <dbl>  <dbl>
## 1 Human        8.00      1.90        1.50    1.32   62.0

Filter if

The filter_all() function can sometimes go a bit wild. The msleep dataset has a set of sleep and weight measurements where some data is missing - there is nothing I can do to add data there. But the first few set of columns just contain info on animals. The vore of Vesper Mouse is missing, but that is info I can still dig up and add to the dataframe if I wanted.
So imagine I want to find out all data rows where we NA in the first few columns. filter_all(any_vars(is.na(.))) will be quite useless because it would return 27 rows, many of which are missing data in the measurement section.

In this case: filter_if() comes in handy. The describing columns are all character columns, while the measurement data is numeric. So using filter_if() I can specify that I want to just filter on character variables. In this case I only get 7 rows.

msleep %>% 
  select(name:order, sleep_total:sleep_rem) %>% 
  filter_if(is.character, any_vars(is.na(.)))

## # A tibble: 7 x 6
##   name            genus       vore  order          sleep_total sleep_rem
##   <chr>           <chr>       <chr> <chr>                <dbl>     <dbl>
## 1 Vesper mouse    Calomys     <NA>  Rodentia              7.00    NA    
## 2 Desert hedgehog Paraechinus <NA>  Erinaceomorpha       10.3      2.70 
## 3 Deer mouse      Peromyscus  <NA>  Rodentia             11.5     NA    
## 4 Phalanger       Phalanger   <NA>  Diprotodontia        13.7      1.80 
## 5 Rock hyrax      Procavia    <NA>  Hyracoidea            5.40     0.500
## 6 Mole rat        Spalax      <NA>  Rodentia             10.6      2.40 
## 7 Musk shrew      Suncus      <NA>  Soricomorpha         12.8      2.00

Similarly, you can add is.numeric, is.integer, is.double, is.logical, is.factor. If you have data columns, you can load the lubridate package, and use is.POSIXt or is.Date.

Filter at

One of the more powerful functions is filter_at(): it does not filter all columns, nor does it need you to specify the type of column, you can just select columns to which the change should happen via the vars() argument. This argument allows anything that can be done within a select statement: so you can refer to them by name, but also by logical numerical functions, regex, etc (See my first blog post for select options).

The second argument is the condition for selection. Similar to the examples above, you can use all_vars() if all columns need to return TRUE (AND equivalent), or any_vars() in case just one variable needs to return TRUE (OR equivalent).

Example: refer to columns by their name:

msleep %>% 
  select(name, sleep_total:sleep_rem, brainwt:bodywt) %>% 
  filter_at(vars(sleep_total, sleep_rem), all_vars(.>5))

## # A tibble: 2 x 5
##   name                 sleep_total sleep_rem brainwt bodywt
##   <chr>                      <dbl>     <dbl>   <dbl>  <dbl>
## 1 Thick-tailed opposum        19.4      6.60 NA       0.370
## 2 Giant armadillo             18.1      6.10  0.0810 60.0

Example: using another select option:

msleep %>% 
  select(name, sleep_total:sleep_rem, brainwt:bodywt) %>% 
  filter_at(vars(contains("sleep")), all_vars(.>5))

## # A tibble: 2 x 5
##   name                 sleep_total sleep_rem brainwt bodywt
##   <chr>                      <dbl>     <dbl>   <dbl>  <dbl>
## 1 Thick-tailed opposum        19.4      6.60 NA       0.370
## 2 Giant armadillo             18.1      6.10  0.0810 60.0

Data Wrangling Part 3: Basic and more advanced ways to filter rows

Basic row filters

Filtering rows based on a numeric variable

Filtering based on a exact character variable matches

Filtering rows based on regex

Filtering based on multiple conditions

Filtering out empty rows

Filtering across multiple columns

Filter all

Filter if

Filter at

Want to learn more?

Suzan Baert

Data Wrangling Part 3: Basic and more advanced ways to filter rows

Basic row filters

Filtering rows based on a numeric variable

Filtering based on a exact character variable matches

Filtering rows based on regex

Filtering based on multiple conditions

Filtering out empty rows

Filtering across multiple columns

Filter all

Filter if

Filter at

Want to learn more?

Suzan Baert

Data Wrangling Part 4: Summarizing and slicing your data

Reflections on 4 months of GitHub: my advice to beginners

Data Wrangling Part 3: Basic and more advanced ways to filter rows

Happy Galentine's Day!

Data Wrangling Part 2: Transforming your columns into the right shape

Data Wrangling Part 1: Basic to Advanced Ways to Select Columns

The Postmortem Airtime Effect

Who's on everyone's 2017 "hit list"?

Bilingual Town Names

Any Joke's left in Belgium?