This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).
We’ll be investigating data sets of FitBit usage to make high-level marketing recommendations for Bellabeat—a high-tech company that manufactures health-focused smart products.
Bellabeat was founded in 2013 and has grown to become a tech-driven wellness company for women. Since 2016, they have opened offices around the globe and launched multiple products. Their apps and devices collect data on activity, sleep, stress, and reproductive health, allowing Bellabeat to empower women with knowledge about their own health and habits.
Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals. There products include:
The guiding questions for this case study are as follows:
The following analysis makes use of the following tools and techniques:
ggplot2
,
tibble
, tidyr
, readr
,
purrr
, dplyr
, stringr
,
lubridate
and forcats
Based on the analysis, the following are three recommendations (detailed below):
Based on the observations from the data, the largest trend that stands out is the increase in activity on Tuesday, Wednesday, and Thursday. Participants were most active on these days (both in terms of step count and calories burned). These also coincided with the days participants got the most sleep and spent the most time in bed. Further, we also know that Sunday is the least active day.
In order to help Bellabeat customers reach their wellness goals, it is recommended that the product team implement the ablity to add reminders when customers have lower level of activities. In particular, if we know that Sunday (and indeed Friday and Satury to some degree) is the least active day the product team could add the ability for customers to increase their wellness score by more points when they are active on Sunday. For example, perhaps activities on Sundays count for 1.5 times or 2 times the regular amount of points awarded when calculating the Wellness Score. This is intended to incentivize customers to be more active on this day.
We know that Tuesday, Wednesday, and Thursday are the most active days. The product team could implement notifications on the following days. These notifications could include reminders about how active the individual has been the past day (or previous days) and offer encouragement to keep staying active.
The Wellness Score in the Bellabeat app is a number calculated on a number of different factors, including activity. We know from the data that Sunday is the least active day. Perhaps, instead of encouraging individuals to be active on this day, rest should be encourage. Bellabeat takes a hollistic approach to wellness and while it is important to be active when achieiving fitness goals rest is also an important factor. In this case, perhaps the activities on Sundays (or on another rest day specified by the customer) would count for silently less so that the overall Wellness Score isn’t adversely affected.
Continue reading for the full details of this analysis that led to these recommendations.
We want to analyze the smart device usage data in order to gain insights into how people are already using their smart devices and make high-level recommends for how these trends can inform Bellabeat marketing strategy. We’ll focus on the following guiding questions:
The following data includes FitBit Fitness tracker data from 30 individuals over a month long period from April 12, 2016 to May 12, 2016. The data includes metrics on daily activities, calories, intensities, steps, heart rate, sleep, weight, and METs.
The data is available as a public data set by Mobius on Kaggle. It was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. This data is available for us under the CC0: Public Domain License.
The data was collected from 30 FitBit users; this seems limiting and any trends found might not align with larger groups. Further, there is no demographic data within the data so it’s unclear what gender, ethnicity, age, etc. these users fall into. Bellabeat is a company focused on providing smart devices specifically for women and this data might not align with the demographics of Bellabeat users.
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dplyr")
library("ggplot2")
library("lubridate")
library("corrplot")
## corrplot 0.92 loaded
daily_activity <- read.csv("datasets/dailyActivity_merged.csv")
daily_calories <- read.csv("datasets/dailyCalories_merged.csv")
hourly_calories <- read.csv("datasets/hourlyCalories_merged.csv")
daily_steps <- read.csv("datasets/dailySteps_merged.csv")
hourly_steps <- read.csv("datasets/hourlySteps_merged.csv")
sleep_day <- read.csv("datasets/sleepDay_merged.csv")
weight <- read.csv("datasets/weightLogInfo_merged.csv")
hr <- read.csv("datasets/heartrate_seconds_merged.csv")
For the FitBit Fitness Tracker data set we’ll focus on the following metrics for our analysis: daily activity, calories, steps, sleep, weight, and heart rate.
head(daily_activity)
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
sapply(daily_activity, class)
## Id ActivityDate TotalSteps
## "numeric" "character" "integer"
## TotalDistance TrackerDistance LoggedActivitiesDistance
## "numeric" "numeric" "numeric"
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## "numeric" "numeric" "numeric"
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## "numeric" "integer" "integer"
## LightlyActiveMinutes SedentaryMinutes Calories
## "integer" "integer" "integer"
Summary Statistics
daily_activity %>%
select(TotalSteps, TotalDistance, Calories, SedentaryMinutes) %>%
summary()
## TotalSteps TotalDistance Calories SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:1828 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :2134 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean :2304 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:2793 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :4900 Max. :1440.0
head(daily_calories)
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
sapply(daily_calories, class)
## Id ActivityDay Calories
## "numeric" "character" "integer"
Summary Statistics
daily_calories %>%
select(Calories) %>%
summary()
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
head(hourly_calories)
colnames(hourly_calories)
## [1] "Id" "ActivityHour" "Calories"
sapply(hourly_calories, class)
## Id ActivityHour Calories
## "numeric" "character" "integer"
Summary Statistics
hourly_calories %>%
select(Calories) %>%
summary()
## Calories
## Min. : 42.00
## 1st Qu.: 63.00
## Median : 83.00
## Mean : 97.39
## 3rd Qu.:108.00
## Max. :948.00
head(daily_steps)
colnames(daily_steps)
## [1] "Id" "ActivityDay" "StepTotal"
sapply(daily_steps, class)
## Id ActivityDay StepTotal
## "numeric" "character" "integer"
Summary Statistics
daily_steps %>%
select(StepTotal) %>%
summary()
## StepTotal
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
head(hourly_steps)
colnames(hourly_steps)
## [1] "Id" "ActivityHour" "StepTotal"
sapply(hourly_steps, class)
## Id ActivityHour StepTotal
## "numeric" "character" "integer"
Summary Statistics
hourly_steps %>%
select(StepTotal) %>%
summary()
## StepTotal
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 40.0
## Mean : 320.2
## 3rd Qu.: 357.0
## Max. :10554.0
head(sleep_day)
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
sapply(sleep_day, class)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## "numeric" "character" "integer" "integer"
## TotalTimeInBed
## "integer"
Summary Statistics
sleep_day %>%
select(TotalMinutesAsleep, TotalTimeInBed) %>%
summary()
## TotalMinutesAsleep TotalTimeInBed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.0
## Median :433.0 Median :463.0
## Mean :419.5 Mean :458.6
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
head(weight)
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
sapply(weight, class)
## Id Date WeightKg WeightPounds Fat
## "numeric" "character" "numeric" "numeric" "integer"
## BMI IsManualReport LogId
## "numeric" "character" "numeric"
Summary Statistics
weight %>%
select(WeightKg, WeightPounds, Fat, BMI) %>%
summary()
## WeightKg WeightPounds Fat BMI
## Min. : 52.60 Min. :116.0 Min. :22.00 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96
## Median : 62.50 Median :137.8 Median :23.50 Median :24.39
## Mean : 72.04 Mean :158.8 Mean :23.50 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :133.50 Max. :294.3 Max. :25.00 Max. :47.54
## NA's :65
head(hr)
colnames(hr)
## [1] "Id" "Time" "Value"
sapply(hr, class)
## Id Time Value
## "numeric" "character" "integer"
Summary Statistics
hr %>%
select(Value) %>%
summary()
## Value
## Min. : 36.00
## 1st Qu.: 63.00
## Median : 73.00
## Mean : 77.33
## 3rd Qu.: 88.00
## Max. :203.00
Let’s inspect our data to see if there are any duplicate information.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
colnames(daily_steps)
## [1] "Id" "ActivityDay" "StepTotal"
colnames(hourly_calories)
## [1] "Id" "ActivityHour" "Calories"
colnames(hourly_steps)
## [1] "Id" "ActivityHour" "StepTotal"
colnames(hr)
## [1] "Id" "Time" "Value"
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
identical(daily_activity["Calories"], daily_calories["Calories"])
## [1] TRUE
identical(daily_activity["TotalSteps"], daily_steps["StepTotal"])
## [1] FALSE
A TRUE
value is returned for the ‘Calories’ columns of
‘daily_activity’ and ‘daily_calories’, thus both columns in the data
frames have identical values.
For ‘daily_activity’ and ‘daily_steps’ the columns for the number of
steps returns a FALSE
value indicating the values are
not the same in these two coloumns. However,
when we view the data frames (see below), at first glance, it looks like
the values are identical. They have have the same number of distinct
rows for ‘Id’.
head(daily_activity)
nrow(daily_activity)
## [1] 940
head(daily_steps)
nrow(daily_steps)
## [1] 940
This seems strange, so let’s compare the summary statistics and plot the data to see if they are indeed the same.
daily_activity %>%
select(TotalSteps) %>%
summary()
## TotalSteps
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
daily_steps %>%
select(StepTotal) %>%
summary()
## StepTotal
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
Observations
Summary statistics are identical.
ggplot(data = daily_activity) +
geom_point(mapping = aes(x = Id, y = TotalSteps)) +
xlab("Participant ID") +
ylab("Total Number of Steps") +
ggtitle("Total Steps for 'daily_activity'")
ggplot(data = daily_steps) +
geom_point(mapping = aes(x = Id, y = StepTotal)) +
xlab("Participant ID") +
ylab("Total Number of Steps") +
ggtitle("StepTotal from 'daily_steps'")
Observations
Plot data is also identical for the columns ‘TotalSteps’ and ‘StepTotal’ for the data in ‘daily_activity’ and ‘daily_steps’, respectively. I’m not sure why the ‘identical()’ function returned a ‘FALSE’ statement previously, but let’s confirm using the ‘all.equal()’ function.
all.equal(daily_activity$TotalSteps, daily_steps$StepTotal)
## [1] TRUE
The function ‘all.equal()’ does indeed return a ‘TRUE’ value for these columns. Based on this, it would seem that the data in these two columns is indeed the same but in a different order.
Let’s inspect the data for unique values.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(hr$Id)
## [1] 14
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
The following data frames have the same number of distinct rows: ‘daily_activity’, ‘daily_steps’, ‘daily_calories’, ‘hourly_calories’, and ‘hourly_steps’. However, the remaining data frames (‘hr’, ‘sleep_day’, and ‘weight’) have different distinct values. This could indicate that these data sets contained fewer participants or fewer measurements for those values.
Let’s check to see if there’s any missing data values.
any(is.na(daily_activity))
## [1] FALSE
any(is.na(daily_calories))
## [1] FALSE
any(is.na(daily_steps))
## [1] FALSE
any(is.na(hourly_calories))
## [1] FALSE
any(is.na(hourly_steps))
## [1] FALSE
any(is.na(hr))
## [1] FALSE
any(is.na(sleep_day))
## [1] FALSE
any(is.na(weight))
## [1] TRUE
There are some NA values in ‘weight’ let’s take a closer look:
is.na(weight)
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [46,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [47,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [48,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [49,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [50,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [51,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [52,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [53,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [54,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [55,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [56,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [57,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [58,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [59,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [60,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [61,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [62,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [63,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [64,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [65,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [66,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [67,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Observations
I’m not sure what the ‘Fat’ column is measuring (body fat percentage?), but since there are only two values provided for the entire column it’s likely that this column will need to be excluded during the analysis.
The data frames ‘daily_activity’ and ‘daily_calories’ both have a column called ‘Calories’ and this is the same data. We can ignore the data frame ‘daily_calories’ since it’s repetitive.
Some of the columns have date/time data stored as a chr
(character strings) and will need to change it to a proper datetime
format:
Further, the following data frames have a column called ‘Id’ with identical data: ‘daily_activity’, ‘daily_calories’, ‘daily_steps’, ‘hourly_calories’, and ‘hourly_steps’. This is useful if it becomes necessary to combine the data.
After inspecting the data, the following issues need to be resolved before the analysis can begin:
chr
to a datetime formatIt was determined that the ‘Fat’ column in the ‘weight’ data frame was not needed (since it only has two values). We’ll create a new data frame without it.
# create subset
wt <- subset(weight, select = -c(Fat))
# check that column was removed
colnames(wt)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "BMI" "IsManualReport" "LogId"
The following data sets have a date column that needs to be reformatted: ‘daily_activity’, ‘daily_calories’, and ‘daily_steps’.
# convert col 'ActivityDate' in 'daily_activity'
daily_activity$Date <- as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_activity$Date)
## [1] "double"
# convert col 'ActivityDay' in 'daily_calories'
daily_calories$Date <- as.POSIXct(daily_calories$ActivityDay, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_calories$Date)
## [1] "double"
# convert col 'ActivityDay' in 'daily_steps'
daily_steps$Date <- as.POSIXct(daily_steps$ActivityDay, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_steps$Date)
## [1] "double"
# convert col 'Date' in 'weight'
weight$DateTime <- as.POSIXct(weight$Date, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(weight$DateTime)
## [1] "double"
Now that we have the new column with the date, let’s create a column for the day of the week too. This will allow us to anlayze the usage by the day of the week and see if there are any interesting patterns that emege.
daily_activity$WeekDay <- wday(daily_activity$Date, label = TRUE, abbr = TRUE)
head(daily_activity)
daily_calories$WeekDay <- wday(daily_calories$Date, label = TRUE, abbr = TRUE)
head(daily_calories)
daily_steps$WeekDay <- wday(daily_steps$Date, label = TRUE, abbr = TRUE)
head(daily_steps)
The following data sets have a datetime column that needs to be reformatted: ‘hourly_calories’, ‘hourly_steps’, ‘hr’, ‘sleep_day’, and ‘wt’.
# convert col 'ActivityHour' in 'hourly_calories'
hourly_calories$ActHour <- as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hourly_calories$ActHour)
## [1] "double"
# convert col 'ActivityHour' in 'hourly_steps'
hourly_steps$ActHour <- as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hourly_steps$ActHour)
## [1] "double"
# convert col 'Time' in 'hr'
hr$DateTime <- as.POSIXct(hr$Time, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hr$DateTime)
## [1] "double"
# convert col 'SleepDay' in 'sleep_day'
sleep_day$DateTime <- as.POSIXct(sleep_day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(sleep_day$DateTime)
## [1] "double"
# convert col 'Date' in 'wt'
wt$DateTime <- as.POSIXct(wt$Date, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(wt$DateTime)
## [1] "double"
wt$WeekDay <- wday(wt$DateTime, label = TRUE, abbr = TRUE)
head(wt)
Similarly to creating a day of the week column, we want to create an hour column for these data sets so that we can analyze the data usage by hour.
hourly_calories$Hour <- hour(hourly_calories$ActHour)
head(hourly_calories)
hourly_steps$Hour <- hour(hourly_calories$ActHour)
head(hourly_steps)
hr$Hour <- hour(hr$DateTime)
head(hr)
wt$Hour <- hour(wt$DateTime)
head(wt)
The final step in our cleaning phase is to merge the data sets along the ‘Id’ parameter. This will give us a combined data set which we can easily access and analyze.
# combine data frames
stats_daily <- merge(sleep_day, daily_activity, by = "Id")
stats_hourly <- hourly_calories %>%
select(Id, Calories) %>%
left_join(hourly_steps, by = "Id")
## Warning in left_join(., hourly_steps, by = "Id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# check for missing values
any(is.na(stats_daily))
## [1] FALSE
any(is.na(stats_hourly))
## [1] FALSE
# confirm combination
head(stats_daily)
head(stats_hourly)
Now that we’ve successfully combined the data, we’ll focus our analysis on the following data frames: ‘stats_daily’, ‘stats_hourly’, ‘hr’, and ‘wt’.
We’ll start by setting plot variables for each of data frames we plan to visualize. This will make it easier to access the plot our data.
plt_daily <- ggplot(data = stats_daily)
plt_hourly <- ggplot(data = stats_hourly)
plt_hr <- ggplot(data = hr)
plt_wt <- ggplot(data = wt)
Let’s take a look at how heart rate (hr) changes throughout the course of the day.
This plot displays each hour of the day compared to the 5-number summary distribution (box plot) of heart rate measurements from participants.
plt_hr +
geom_boxplot(mapping = aes(group = Hour, x = Hour, y = Value)) +
xlab("Hour of the Day") +
ylab("Heart Rate") +
scale_x_continuous(breaks = seq(0, 23, 2)) +
scale_y_continuous(breaks = seq(0, 200, 25)) +
ggtitle("What time of day are participants most active?")
Observations
As to be expected, the heart rates of participants are more active during that day than at night. Interestingly, we do see outliers of both high and low heart rates throughout the day. This shows that while most people follow expectations of elevating their heart rates (presumably through exercise) at normal times of days, there are those who do not follow this expectation.
plt_wt +
geom_histogram(mapping = aes(x = WeightPounds), bins = 100) +
xlab("Weight (lb)") +
ylab("Number of Participants") +
scale_x_continuous(breaks = seq(0, 300, 10)) +
scale_y_continuous(breaks = seq(0, 20, 2)) +
ggtitle("Distribution of Weight for Participants")
Observations Most individuals fall between 100 and 200 pounds, however there is one individual who weighs 300 pounds.
plt_wt +
geom_point(mapping = aes(x = WeightPounds, y = BMI)) +
xlab("Weight (lb.)") +
ylab("BMI") +
scale_x_continuous(breaks = seq(0, 300, 10)) +
scale_y_continuous(breaks = seq(0, 50, 5)) +
ggtitle("Weight vs. BMI")
Observations
As expected, individuals with a higher weight also have a higher BMI.
plt_wt +
geom_line(mapping = aes(x = Date, y = WeightPounds, group = Id, color = Id)) +
labs(x = "Day", y = "Weight", title = "Weight Changes Over 30 Days", subtitle = "Individuals' Weight Changes Tracked Over Time", color = "Individual") +
coord_cartesian(ylim = c(100, 200))
Observations
Unfortunately, only a limited number of participants successfully tracked their weight changes over the 30 day period. However, we do see that during this period for those who tracked their changes their weight hardly changed during this time period.
While more data is needed to fully understand how weight changes over a 30-day time period, this could be useful information for individuals when setting their expectations around their weight goals.
plt_daily +
geom_col(mapping = aes(x = WeekDay, y = TotalSteps)) +
xlab("Day of the Week") +
ylab("Total Steps") +
ggtitle("Which day has the most steps?")
Observations
This graph represents the total steps for each day of the week for all 30 participants. Surprisingly, the most steps were taken on Tuesday and Wednesday with Sunday having the least amount of steps.
plt_daily +
geom_boxplot(mapping = aes(x = WeekDay, y = TotalSteps)) +
xlab("Day of the Week") +
ylab("Total Steps") +
scale_y_continuous(breaks = seq(0, 24000, 2000)) +
ggtitle("Week Day vs. Steps")
Observations
Saturday has the widest range in step count. On average, Monday and Tuesday have more step counts than other days. Sunday seems to be the least active day.
plt_daily +
geom_point(mapping = aes(x = TotalSteps, y = Calories)) +
xlab("Steps") +
ylab("Calories") +
ggtitle("Step Total vs. Calories Burned")
Observations
As expected, the more steps taken the more calories burned.
plt_daily +
geom_col(mapping = aes(x = WeekDay, y = Calories)) +
xlab("Day of the Week") +
ylab("Calories") +
ggtitle("Which day has the most Calories burned?")
Observations
Similar to the bar graph comparing the day of the week and total steps, we see that Tuesday is the most active day for the participants with Sunday being the least active. It makes sense that Sunday could be the least active since this is the last day of the week and the day before the start of a new work week. However, I would have expected Saturday to be the most active day, not Tuesday.
My only guess as to why Tuesday is the most active is (perhaps) after having three days (Friday, Saturday, Sunday) of low activity, participants decided to be more active on Tuesday and Wednesday before slowing down again.
Further, it could also be the the weekend (Fri-Sun) tends to involve more leisure activities such as going to the movies, restaurants, etc. instead of activities that involve more physical moment. The high activity on Tuesday and Wednesday are then an attempt to make up for the lack of exercise from the previous weekend or anticipation of the upcoming weekend of leisure.
plt_daily +
geom_boxplot(mapping = aes(x = WeekDay, y = Calories)) +
xlab("Day of the Week") +
ylab("Calories") +
scale_y_continuous(breaks = seq(0, 5000, 500)) +
ggtitle("Week Day vs. Calories")
Observations
Sunday has the least amount of calories burned (with a few exceptions) while the amount of calories burned from the remaining days in the week are approximately the same. This correlates with the pattern seen previously in number of steps and calories burned throughout the week.
plt_daily +
geom_col(mapping = aes(x = WeekDay, y = TotalMinutesAsleep)) +
xlab("Day of the Week") +
ylab("Total Minutes Asleep") +
ggtitle("Which day gets the most sleep?")
Observations
Tuesday, Wednesday, and Thursday have the most (total) amount of sleep minutes. This is the same as the days that have the most steps and calories burned. It could be that since these days required more energy they also required more sleep.
plt_daily +
geom_col(mapping = aes(x = WeekDay, y = TotalTimeInBed)) +
xlab("Day of the Week") +
ylab("Total Time in Bed") +
ggtitle("Which day has the most time spent in bed?")
Observations
Tuesday, Wednesday, and Thursday have the most time spent in bed. This coincides with the observations about time spent asleep. It’s also reasonable that the more time one spends in bed, the more time they’ll be asleep.