Introduction
In this article, we’ll explore a common data manipulation problem: checking whether the value in one column matches the name of another column. We’ll use R and cover three approaches: the tidyverse (reshaping with tidyr and dplyr), dplyr’s rowwise, and base R.
The goal is to create a new column that indicates whether a person’s preferred pet (from a pet column) is available in the store (from corresponding pet_ columns). We’ll assume that the availability of pets varies across different regions or stores.
Background
In our dataset, we have information about three people and their preferences for pets:
| person | pet | pet_cat | pet_dog | pet_llama |
|---|---|---|---|---|
| Jack | dog | 0 | 0 | 1 |
| Jill | cat | 1 | 1 | 1 |
| Bill | zebra | 0 | 1 | 1 |
We want to create a new column, match, that contains the value from the pet_ column corresponding to the person’s preferred pet (for example, pet_dog for a person who prefers a dog). If no such column exists, as with the zebra, match should be 0.
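To make the examples reproducible, the table can be built as a data frame. This is a minimal sketch: the column names person, pet, pet_cat, pet_dog and pet_llama, and the lowercase pet values, are assumptions chosen so that the values in pet match the suffixes of the pet_ columns exactly.

```r
# Example data (assumed layout): one row per person, one 0/1 flag per pet type.
# Lowercase pet values are an assumption so they match the column suffixes.
data <- data.frame(
  person    = c("Jack", "Jill", "Bill"),
  pet       = c("dog", "cat", "zebra"),
  pet_cat   = c(0, 1, 0),
  pet_dog   = c(0, 1, 1),
  pet_llama = c(1, 1, 1)
)

# The match column we are aiming for, worked out by hand from the table:
# Jack -> pet_dog is 0, Jill -> pet_cat is 1, Bill -> no pet_zebra column.
expected <- c(0, 1, 0)
```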
Tidyverse Approach
The first approach uses tidyr to reshape the data into long format and dplyr to manipulate it. Here’s how you can do this:
Step 1: Pivot Long Format
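We first reshape the data from wide to long format with pivot_longer, so that each pet_ column becomes its own row, with the column name stored in name and the 0/1 flag stored in value. We also load stringr, which provides the str_detect function used in a later step.
library(dplyr)
library(tidyr)
library(stringr)
data %>%
  pivot_longer(cols = contains('_')) %>%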
Step 2: Filter Rows Where Value is 1
We’ll keep only the rows where value equals 1, which corresponds to available pets.
  filter(value == 1) %>%
Step 3: Create Logical Column by Matching 'pet' Against 'name'
Next, we create a logical column (match) that checks whether the value in pet appears in name, the column created by pivot_longer that holds the original column names (pet_cat, pet_dog, pet_llama). We use str_detect for the comparison.
  mutate(match = str_detect(name, pet)) %>%
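To see how this comparison behaves, the following minimal sketch calls str_detect directly on a few hand-written strings (the strings are illustrative and not part of the dataset). It suggests that the check is a case-sensitive substring test, which matters if the values in pet are capitalized differently from the column names.

```r
library(stringr)

# str_detect(string, pattern): is the pattern found anywhere in the string?
exact      <- str_detect("pet_cat", "cat")          # TRUE
wrong_case <- str_detect("pet_cat", "Cat")          # FALSE: matching is case-sensitive
partial    <- str_detect("pet_caterpillar", "cat")  # TRUE: a substring is enough

# One possible way to ignore case if the data are not normalized
ignore_case <- str_detect("pet_cat", regex("Cat", ignore_case = TRUE))  # TRUE
```

If partial matches are a concern, comparing name with paste0("pet_", pet) using == would be a stricter alternative.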
Step 4: Group by Pet and Check if Any TRUE in Match
Now we group the data by the pet column and check whether any value in match is TRUE for that pet. The unary + converts the logical result to 1 or 0.
  group_by(pet) %>%
  summarise(match = +(any(match))) %>%
Step 5: Join with Original Dataset
We join the summary back onto the original dataset with right_join, which keeps every row of the original data. A temporary row number (rn) restores the original row order, and select keeps the original columns plus the new match column.
  right_join(data %>%
    mutate(rn = row_number())) %>%
  arrange(rn) %>%
  select(names(data), match)
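The five steps above can be assembled into a single pipeline. The sketch below assumes the example data frame with lowercase pet values and adds an explicit by = "pet" to the join; the object name result is only for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Assumed example data (lowercase pet values match the column suffixes)
data <- data.frame(
  person    = c("Jack", "Jill", "Bill"),
  pet       = c("dog", "cat", "zebra"),
  pet_cat   = c(0, 1, 0),
  pet_dog   = c(0, 1, 1),
  pet_llama = c(1, 1, 1)
)

result <- data %>%
  pivot_longer(cols = contains("_")) %>%      # one row per person and pet_ column
  filter(value == 1) %>%                      # keep available pets only
  mutate(match = str_detect(name, pet)) %>%   # does the column name contain the pet?
  group_by(pet) %>%
  summarise(match = +(any(match))) %>%        # 1 if any available column matched
  right_join(data %>% mutate(rn = row_number()), by = "pet") %>%
  arrange(rn) %>%                             # restore the original row order
  select(all_of(names(data)), match)

result$match
```

Two limitations are worth keeping in mind. Because the summary is grouped by pet, two people who prefer the same pet but face different availability would share one result; grouping by a row identifier instead would avoid that. A person whose pet_ columns are all 0 is dropped by the filter step and would come back from the join with NA rather than 0.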
The final result will look like this:
| person | pet | pet_cat | pet_dog | pet_llama | match |
|---|---|---|---|---|---|
| Jack | dog | 0 | 0 | 1 | 0 |
| Jill | cat | 1 | 1 | 1 | 1 |
| Bill | zebra | 0 | 1 | 1 | 0 |
Rowwise Approach
The second approach uses the rowwise function from dplyr. Here’s how you can implement it:
Step 1: Create rowwise Attribute
First, we call rowwise(), which treats each row as its own group so that the operations that follow are applied one row at a time.
data %>%
rowwise() %>%
Step 2: Use c_across to Create Logical Vector
Next, we use the c_across function to collect the values of the pet_ columns in the current row and compare them with 1. The result is a logical vector that flags which pets are available for that person.
c_across(contains("_")) == 1
Step 3: Remove Substring ‘pet_’ from Column Names
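Then we take the names of the pet_ columns, keep only those flagged as available by the logical vector, and strip everything up to the underscore so that the names can be compared with pet. Note that cur_data() is deprecated in recent dplyr releases in favor of pick().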
str_remove(names(select(cur_data(), contains('_')))[c_across(contains("_")) == 1], ".*_")
Step 4: Concatenate Matching Pet Columns into a Single Vector
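We collapse the available pet names into a single regular expression with str_c, separating the alternatives with |, and test pet against it with str_detect. The unary + converts the result to 1 or 0.
mutate(match = +(str_detect(pet, str_c(str_remove(names(select(cur_data(), contains('_')))[c_across(contains("_")) == 1], ".*_"), collapse = "|")))) %>%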
Step 5: Remove Group Attribute (ungroup)
Finally, we drop the rowwise grouping with ungroup so that later operations work on the whole data frame again.
ungroup()
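For comparison, here is a compact rowwise sketch that runs end to end. It is a variation rather than a transcription of the steps above: the pet_ column names are computed once up front (in a hypothetical helper vector pet_cols) instead of through cur_data(), and pet is compared with %in% for an exact match instead of a regular expression. The example data with lowercase pet values is an assumption.

```r
library(dplyr)

# Assumed example data (lowercase pet values match the column suffixes)
data <- data.frame(
  person    = c("Jack", "Jill", "Bill"),
  pet       = c("dog", "cat", "zebra"),
  pet_cat   = c(0, 1, 0),
  pet_dog   = c(0, 1, 1),
  pet_llama = c(1, 1, 1)
)

# Names of the availability columns, computed once outside the pipeline
pet_cols <- names(data)[startsWith(names(data), "pet_")]

result <- data %>%
  rowwise() %>%
  mutate(
    # pets available in this row, with the "pet_" prefix stripped
    match = +(pet %in% sub("^pet_", "", pet_cols[c_across(all_of(pet_cols)) == 1]))
  ) %>%
  ungroup()

result$match
```

Using %in% also covers rows where no pet is available: the set of available names is then empty and the comparison simply yields FALSE.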
The final result will look like this:
| person | pet | pet_cat | pet_dog | pet_llama | match |
|---|---|---|---|---|---|
| Jack | dog | 0 | 0 | 1 | 0 |
| Jill | cat | 1 | 1 | 1 | 1 |
| Bill | zebra | 0 | 1 | 1 | 0 |
Base R Approach
The third approach uses only base R functions. Here’s how you can implement it:
Step 1: Select Names Containing ‘pet_’ (nm1)
First, we’ll select the column names that start with 'pet_' and store them in nm1.
nm1 <- names(data)[startsWith(names(data), "pet_")]
Step 2: Create Sequence Column and Get Corresponding Elements
We build a two-column index matrix with cbind: the first column is the row number (seq_len(nrow(data))) and the second is the position of each person’s pet among the pet_ columns, found with match after stripping the prefix. Indexing the pet_ columns with this matrix picks one value per row; a pet with no column yields NA.
data$match <- as.data.frame(data[nm1])[cbind(seq_len(nrow(data)), match(data$pet, sub("pet_", "", nm1)))]
Step 3: Replace Elements That Are NA with 0
We replace the elements that are NA (no matching column) with 0.
data$match[is.na(data$match)] <- 0
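Put together, the base R steps can be run as one short script. This sketch assumes the example data with lowercase pet values and uses as.matrix for the indexing step; row_idx and col_idx are illustrative names introduced here for readability.

```r
# Assumed example data (lowercase pet values match the column suffixes)
data <- data.frame(
  person    = c("Jack", "Jill", "Bill"),
  pet       = c("dog", "cat", "zebra"),
  pet_cat   = c(0, 1, 0),
  pet_dog   = c(0, 1, 1),
  pet_llama = c(1, 1, 1)
)

# Step 1: names of the availability columns
nm1 <- names(data)[startsWith(names(data), "pet_")]

# Step 2: for each row, which pet_ column corresponds to the preferred pet?
row_idx <- seq_len(nrow(data))
col_idx <- match(data$pet, sub("^pet_", "", nm1))  # NA when no column exists

# Matrix indexing with (row, column) pairs picks one cell per row;
# a pair containing NA produces NA in the result
data$match <- as.matrix(data[nm1])[cbind(row_idx, col_idx)]

# Step 3: no matching column means the pet is not available
data$match[is.na(data$match)] <- 0

data$match
```

Because match compares whole strings, this version requires an exact match between pet and the column suffix, unlike the substring test performed by str_detect.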
The final result will look like this:
| person | pet | pet_cat | pet_dog | pet_llama | match |
|---|---|---|---|---|---|
| Jack | dog | 0 | 0 | 1 | 0 |
| Jill | cat | 1 | 1 | 1 | 1 |
| Bill | zebra | 0 | 1 | 1 | 0 |
The three approaches achieve the same result: a new column indicating whether each person’s preferred pet is available in the store. Depending on your personal preference and familiarity with R packages, one of these methods may be more suitable for you.
In conclusion, when working with data manipulation tasks like this one, it’s essential to consider various methods, evaluate their efficiency and readability, and choose the most appropriate approach based on the problem requirements and available resources.
Last modified on 2025-04-06