Dataframe Operations with R: Merging Datasets for Comprehensive Analysis

Introduction to Dataframe Operations with R

In this article, we will explore how to count events over time and group by conditions based on datetimes using Dataframes in R. We will dive into the world of data manipulation, exploring various techniques for handling missing values, merging datasets, and performing statistical analysis.

We’ll begin by examining a real-world scenario involving two datasets: df1 and df2. These datasets contain information about purchases made at a clothing store and customer calls to the CX service line, respectively. Our ultimate goal is to create a new dataset that combines these two datasets while meeting specific requirements.

Understanding the Data

Let’s take a closer look at the structure of each dataset:

# Loading required libraries
library(dplyr)
library(tibble)

# Creating df1 and df2
df1 <- tibble::tribble(
  ~NAME, ~PRODUCT, ~AGENT,        ~DATE_PURCHASE,
  "Karen",   "M_14",  "X_1", "8-25-2021 18:21:28",
  "Jean",   "M_78",  "X_3", "8-26-2021 18:11:06",
  "Jean",   "M_71",  "X_4", "8-26-2021 18:21:01",
  "Jean",   "M_64",  "X_4", "8-27-2021 20:21:59",
  "Keith",   "M_57",  "X_4", "8-27-2021 20:21:02",
  "Alba",   "M_50",  "X_1", "8-28-2021 20:21:03",
  "Alba",   "M_43",  "X_3", "8-29-2021 20:21:04",
  "Alex",   "M_36",  "X_2", "8-25-2021 20:21:05"
)

df2 <- tibble::tribble(
  ~NAME,        ~TYPE,         ~DATE_OF_CALL,       
  "Karen",   "COMPLAIN", "8-26-2021 18:21:28", 
  "Jean", "CX_SERVICE", "8-27-2021 18:11:06",
  "Jean",   "COMPLAIN", "8-28-2021 18:21:01", 
  "Jean", "CX_SERVICE", "8-29-2021 20:21:59",
  "Keith", "CX_SERVICE", "8-29-2021 20:21:02", 
  "Alba",   "COMPLAIN", "8-30-2021 20:21:03", 
  "Alex", "CX_SERVICE", "8-25-2021 21:21:05", 
)

Preparing the Data for Analysis

Before we can merge these two datasets, we need to perform some data manipulation.

# Convert date columns to POSIXct format
df1 <- df1 %>% 
  mutate(DATE_PURCHASE = as.POSIXct(DATE_PURCHASE, format = "%m-%d-%Y %H:%M:%S"))

df2 <- df2 %>% 
  mutate(DATE_OF_CALL = as.POSIXct(DATE_OF_CALL, format = "%m-%d-%Y %H:%M:%S"))

Grouping and Summarizing the Data

Next, we will group each dataset by name and summarize the data. We’ll retrieve the product with the most recent purchase date and the most recent call date.

# Group df1 by name and summarize the data
df1_mod <- df1 %>% 
  group_by(NAME) %>% 
  summarise(
    product = PRODUCT[DATE_PURCHASE == max(DATE_PURCHASE)],
    DATE_PURCHASE = max(DATE_PURCHASE),
    .groups = "drop"
  )

# Group df2 by name and summarize the data
df2_mod <- df2 %>% 
  group_by(NAME) %>% 
  summarise(
    `x attempt` = n(),
    TYPE = TYPE[DATE_OF_CALL == max(DATE_OF_CALL)],
    DATE_OF_CALL = max(DATE_OF_CALL),
    .groups = "drop"
  )

Merging the Datasets

Now that we have prepared our data, it’s time to merge the two datasets.

# Merge df1_mod and df2_mod by name
final_dataset <- left_join(df1_mod, df2_mod, by = "NAME")

Final Output

After performing these operations, our final output will be a dataset containing all necessary information:

NAME	product	DATE_PURCHASE `x attempt` TYPE DATE_OF_CALL
Alba	M_43	2021-08-29 20:21:04 1 COMPLAIN 2021-08-30 20:21:03
Alex	M_36	2021-08-25 20:21:05 1 CX_SERVICE 2021-08-25 21:21:05
Jean	M_64	2021-08-27 20:21:59 3 CX_SERVICE 2021-08-29 20:21:59
Karen	M_14	2021-08-25 18:21:28 1 COMPLAIN 2021-08-26 18:21:28
Keith	M_57	2021-08-27 20:21:02 1 CX_SERVICE 2021-08-29 20:21:02

This final dataset contains all necessary information, including the product with the most recent purchase date and the type of call with the most recent date.

Conclusion

In this article, we have demonstrated how to count events over time and group by conditions based on datetimes using Dataframes in R. We performed data manipulation, grouping, summarizing, and merging operations to create a final dataset containing all necessary information.

We covered various techniques for handling missing values and performing statistical analysis while providing an example of real-world application.

Last modified on 2024-10-23