Introduction to Dataframe Operations with R
In this article, we will explore how to count events over time and group by conditions based on datetimes using Dataframes in R. We will dive into the world of data manipulation, exploring various techniques for handling missing values, merging datasets, and performing statistical analysis.
We’ll begin by examining a real-world scenario involving two datasets: df1 and df2. These datasets contain information about purchases made at a clothing store and customer calls to the CX service line, respectively. Our ultimate goal is to create a new dataset that combines these two datasets while meeting specific requirements.
Understanding the Data
Let’s take a closer look at the structure of each dataset:
# Loading required libraries
library(dplyr)
library(tibble)
# Creating df1 and df2
df1 <- tibble::tribble(
~NAME, ~PRODUCT, ~AGENT, ~DATE_PURCHASE,
"Karen", "M_14", "X_1", "8-25-2021 18:21:28",
"Jean", "M_78", "X_3", "8-26-2021 18:11:06",
"Jean", "M_71", "X_4", "8-26-2021 18:21:01",
"Jean", "M_64", "X_4", "8-27-2021 20:21:59",
"Keith", "M_57", "X_4", "8-27-2021 20:21:02",
"Alba", "M_50", "X_1", "8-28-2021 20:21:03",
"Alba", "M_43", "X_3", "8-29-2021 20:21:04",
"Alex", "M_36", "X_2", "8-25-2021 20:21:05"
)
df2 <- tibble::tribble(
~NAME, ~TYPE, ~DATE_OF_CALL,
"Karen", "COMPLAIN", "8-26-2021 18:21:28",
"Jean", "CX_SERVICE", "8-27-2021 18:11:06",
"Jean", "COMPLAIN", "8-28-2021 18:21:01",
"Jean", "CX_SERVICE", "8-29-2021 20:21:59",
"Keith", "CX_SERVICE", "8-29-2021 20:21:02",
"Alba", "COMPLAIN", "8-30-2021 20:21:03",
"Alex", "CX_SERVICE", "8-25-2021 21:21:05",
)
Preparing the Data for Analysis
Before we can merge these two datasets, we need to perform some data manipulation.
# Convert date columns to POSIXct format
df1 <- df1 %>%
mutate(DATE_PURCHASE = as.POSIXct(DATE_PURCHASE, format = "%m-%d-%Y %H:%M:%S"))
df2 <- df2 %>%
mutate(DATE_OF_CALL = as.POSIXct(DATE_OF_CALL, format = "%m-%d-%Y %H:%M:%S"))
Grouping and Summarizing the Data
Next, we will group each dataset by name and summarize the data. We’ll retrieve the product with the most recent purchase date and the most recent call date.
# Group df1 by name and summarize the data
df1_mod <- df1 %>%
group_by(NAME) %>%
summarise(
product = PRODUCT[DATE_PURCHASE == max(DATE_PURCHASE)],
DATE_PURCHASE = max(DATE_PURCHASE),
.groups = "drop"
)
# Group df2 by name and summarize the data
df2_mod <- df2 %>%
group_by(NAME) %>%
summarise(
`x attempt` = n(),
TYPE = TYPE[DATE_OF_CALL == max(DATE_OF_CALL)],
DATE_OF_CALL = max(DATE_OF_CALL),
.groups = "drop"
)
Merging the Datasets
Now that we have prepared our data, it’s time to merge the two datasets.
# Merge df1_mod and df2_mod by name
final_dataset <- left_join(df1_mod, df2_mod, by = "NAME")
Final Output
After performing these operations, our final output will be a dataset containing all necessary information:
| NAME | product | DATE_PURCHASE x attempt TYPE DATE_OF_CALL | |||
|---|---|---|---|---|---|
| Alba | M_43 | 2021-08-29 20:21:04 1 COMPLAIN 2021-08-30 20:21:03 | |||
| Alex | M_36 | 2021-08-25 20:21:05 1 CX_SERVICE 2021-08-25 21:21:05 | |||
| Jean | M_64 | 2021-08-27 20:21:59 3 CX_SERVICE 2021-08-29 20:21:59 | |||
| Karen | M_14 | 2021-08-25 18:21:28 1 COMPLAIN 2021-08-26 18:21:28 | |||
| Keith | M_57 | 2021-08-27 20:21:02 1 CX_SERVICE 2021-08-29 20:21:02 |
This final dataset contains all necessary information, including the product with the most recent purchase date and the type of call with the most recent date.
Conclusion
In this article, we have demonstrated how to count events over time and group by conditions based on datetimes using Dataframes in R. We performed data manipulation, grouping, summarizing, and merging operations to create a final dataset containing all necessary information.
We covered various techniques for handling missing values and performing statistical analysis while providing an example of real-world application.
Last modified on 2024-10-23