Introduction

This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).

We’ll be investigating data sets of FitBit usage to make high-level marketing recommendations for Bellabeat—a high-tech company that manufactures health-focused smart products.

Table of Contents

Summary

Bellabeat was founded in 2013 and has grown to become a tech-driven wellness company for women. Since 2016, they have opened offices around the globe and launched multiple products. Their apps and devices collect data on activity, sleep, stress, and reproductive health, allowing Bellabeat to empower women with knowledge about their own health and habits.

Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals. There products include:

Guiding questions

The guiding questions for this case study are as follows:

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Tools and Techniques

The following analysis makes use of the following tools and techniques:

  • R programming language and libraries; ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, lubridate and forcats
  • Data transformations: joins, visualizations, summary statistics
  • Data inspection: removal of duplicate/unnessary data, change format/datatype, verify unique values

Recommendations

Based on the analysis, the following are three recommendations (detailed below):

  • Double the Points Earned on Sundays
  • Reminder Notifcations Following Active Days
  • Encourage Rest Days

Double the Points Earned on Sundays

Based on the observations from the data, the largest trend that stands out is the increase in activity on Tuesday, Wednesday, and Thursday. Participants were most active on these days (both in terms of step count and calories burned). These also coincided with the days participants got the most sleep and spent the most time in bed. Further, we also know that Sunday is the least active day.

In order to help Bellabeat customers reach their wellness goals, it is recommended that the product team implement the ablity to add reminders when customers have lower level of activities. In particular, if we know that Sunday (and indeed Friday and Satury to some degree) is the least active day the product team could add the ability for customers to increase their wellness score by more points when they are active on Sunday. For example, perhaps activities on Sundays count for 1.5 times or 2 times the regular amount of points awarded when calculating the Wellness Score. This is intended to incentivize customers to be more active on this day.

Reminder Notifcations Following Active Days

We know that Tuesday, Wednesday, and Thursday are the most active days. The product team could implement notifications on the following days. These notifications could include reminders about how active the individual has been the past day (or previous days) and offer encouragement to keep staying active.

Encourage Rest Days

The Wellness Score in the Bellabeat app is a number calculated on a number of different factors, including activity. We know from the data that Sunday is the least active day. Perhaps, instead of encouraging individuals to be active on this day, rest should be encourage. Bellabeat takes a hollistic approach to wellness and while it is important to be active when achieiving fitness goals rest is also an important factor. In this case, perhaps the activities on Sundays (or on another rest day specified by the customer) would count for silently less so that the overall Wellness Score isn’t adversely affected.

Continue reading for the full details of this analysis that led to these recommendations.

Guiding Questions

We want to analyze the smart device usage data in order to gain insights into how people are already using their smart devices and make high-level recommends for how these trends can inform Bellabeat marketing strategy. We’ll focus on the following guiding questions:

Prepare Data

Data Description

The following data includes FitBit Fitness tracker data from 30 individuals over a month long period from April 12, 2016 to May 12, 2016. The data includes metrics on daily activities, calories, intensities, steps, heart rate, sleep, weight, and METs.

License

The data is available as a public data set by Mobius on Kaggle. It was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. This data is available for us under the CC0: Public Domain License.

Limitations

The data was collected from 30 FitBit users; this seems limiting and any trends found might not align with larger groups. Further, there is no demographic data within the data so it’s unclear what gender, ethnicity, age, etc. these users fall into. Bellabeat is a company focused on providing smart devices specifically for women and this data might not align with the demographics of Bellabeat users.

Process Data

Install/Open Libraries

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dplyr")
library("ggplot2")
library("lubridate")
library("corrplot")
## corrplot 0.92 loaded

Set up Data Frames

daily_activity <- read.csv("datasets/dailyActivity_merged.csv")
daily_calories <- read.csv("datasets/dailyCalories_merged.csv")
hourly_calories <- read.csv("datasets/hourlyCalories_merged.csv")
daily_steps <- read.csv("datasets/dailySteps_merged.csv")
hourly_steps <- read.csv("datasets/hourlySteps_merged.csv")
sleep_day <- read.csv("datasets/sleepDay_merged.csv")
weight <- read.csv("datasets/weightLogInfo_merged.csv")
hr <- read.csv("datasets/heartrate_seconds_merged.csv")

Inspect Data Frames

For the FitBit Fitness Tracker data set we’ll focus on the following metrics for our analysis: daily activity, calories, steps, sleep, weight, and heart rate.

Daily Activity

head(daily_activity)
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
sapply(daily_activity, class)
##                       Id             ActivityDate               TotalSteps 
##                "numeric"              "character"                "integer" 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                "numeric"                "numeric"                "numeric" 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                "numeric"                "numeric"                "numeric" 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                "numeric"                "integer"                "integer" 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                "integer"                "integer"                "integer"

Summary Statistics

daily_activity %>%
  select(TotalSteps, TotalDistance, Calories, SedentaryMinutes) %>%
  summary()
##    TotalSteps    TotalDistance       Calories    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :2134   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :4900   Max.   :1440.0

Calories

Daily (Calories)
head(daily_calories)
colnames(daily_calories)
## [1] "Id"          "ActivityDay" "Calories"
sapply(daily_calories, class)
##          Id ActivityDay    Calories 
##   "numeric" "character"   "integer"

Summary Statistics

daily_calories %>%
  select(Calories) %>%
  summary()
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
Hourly (Calories)
head(hourly_calories)
colnames(hourly_calories)
## [1] "Id"           "ActivityHour" "Calories"
sapply(hourly_calories, class)
##           Id ActivityHour     Calories 
##    "numeric"  "character"    "integer"

Summary Statistics

hourly_calories %>%
  select(Calories) %>%
  summary()
##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00

Steps

Daily (Steps)
head(daily_steps)
colnames(daily_steps)
## [1] "Id"          "ActivityDay" "StepTotal"
sapply(daily_steps, class)
##          Id ActivityDay   StepTotal 
##   "numeric" "character"   "integer"

Summary Statistics

daily_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019
Hourly (Steps)
head(hourly_steps)
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"
sapply(hourly_steps, class)
##           Id ActivityHour    StepTotal 
##    "numeric"  "character"    "integer"

Summary Statistics

hourly_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0

Daily Sleep

head(sleep_day)
colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
sapply(sleep_day, class)
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##          "numeric"        "character"          "integer"          "integer" 
##     TotalTimeInBed 
##          "integer"

Summary Statistics

sleep_day %>%
  select(TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.0  
##  Median :433.0      Median :463.0  
##  Mean   :419.5      Mean   :458.6  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

Weight

head(weight)
colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
sapply(weight, class)
##             Id           Date       WeightKg   WeightPounds            Fat 
##      "numeric"    "character"      "numeric"      "numeric"      "integer" 
##            BMI IsManualReport          LogId 
##      "numeric"    "character"      "numeric"

Summary Statistics

weight %>%
  select(WeightKg, WeightPounds, Fat, BMI) %>%
  summary()
##     WeightKg       WeightPounds        Fat             BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :22.00   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :23.50   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :23.50   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :25.00   Max.   :47.54  
##                                   NA's   :65

Heart rate

head(hr)
colnames(hr)
## [1] "Id"    "Time"  "Value"
sapply(hr, class)
##          Id        Time       Value 
##   "numeric" "character"   "integer"

Summary Statistics

hr %>%
  select(Value) %>%
  summary()
##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

Duplicate Data

Let’s inspect our data to see if there are any duplicate information.

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(daily_calories)
## [1] "Id"          "ActivityDay" "Calories"
colnames(daily_steps)
## [1] "Id"          "ActivityDay" "StepTotal"
colnames(hourly_calories)
## [1] "Id"           "ActivityHour" "Calories"
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"
colnames(hr)
## [1] "Id"    "Time"  "Value"
colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
identical(daily_activity["Calories"], daily_calories["Calories"])
## [1] TRUE
identical(daily_activity["TotalSteps"], daily_steps["StepTotal"])
## [1] FALSE

A TRUE value is returned for the ‘Calories’ columns of ‘daily_activity’ and ‘daily_calories’, thus both columns in the data frames have identical values.

For ‘daily_activity’ and ‘daily_steps’ the columns for the number of steps returns a FALSE value indicating the values are not the same in these two coloumns. However, when we view the data frames (see below), at first glance, it looks like the values are identical. They have have the same number of distinct rows for ‘Id’.

head(daily_activity)
nrow(daily_activity)
## [1] 940
head(daily_steps)
nrow(daily_steps)
## [1] 940

This seems strange, so let’s compare the summary statistics and plot the data to see if they are indeed the same.

Summary Statistics: Daily Activity and Daily Steps

daily_activity %>%
  select(TotalSteps) %>%
  summary()
##    TotalSteps   
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019
daily_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019

Observations

Summary statistics are identical.

Comparision Plots

‘TotalSteps’ from ‘daily_activity’
ggplot(data = daily_activity) +
  geom_point(mapping = aes(x = Id, y = TotalSteps)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("Total Steps for 'daily_activity'")

‘StepTotal’ from ‘daily_steps’
ggplot(data = daily_steps) +
  geom_point(mapping = aes(x = Id, y = StepTotal)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("StepTotal from 'daily_steps'")

Observations

Plot data is also identical for the columns ‘TotalSteps’ and ‘StepTotal’ for the data in ‘daily_activity’ and ‘daily_steps’, respectively. I’m not sure why the ‘identical()’ function returned a ‘FALSE’ statement previously, but let’s confirm using the ‘all.equal()’ function.

All Equal?

all.equal(daily_activity$TotalSteps, daily_steps$StepTotal)
## [1] TRUE

Conclusion

The function ‘all.equal()’ does indeed return a ‘TRUE’ value for these columns. Based on this, it would seem that the data in these two columns is indeed the same but in a different order.

Distinct Data

Let’s inspect the data for unique values.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(hr$Id)
## [1] 14
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8

The following data frames have the same number of distinct rows: ‘daily_activity’, ‘daily_steps’, ‘daily_calories’, ‘hourly_calories’, and ‘hourly_steps’. However, the remaining data frames (‘hr’, ‘sleep_day’, and ‘weight’) have different distinct values. This could indicate that these data sets contained fewer participants or fewer measurements for those values.

Missing Data

Let’s check to see if there’s any missing data values.

any(is.na(daily_activity))
## [1] FALSE
any(is.na(daily_calories))
## [1] FALSE
any(is.na(daily_steps))
## [1] FALSE
any(is.na(hourly_calories))
## [1] FALSE
any(is.na(hourly_steps))
## [1] FALSE
any(is.na(hr))
## [1] FALSE
any(is.na(sleep_day))
## [1] FALSE
any(is.na(weight))
## [1] TRUE

There are some NA values in ‘weight’ let’s take a closer look:

is.na(weight)
##          Id  Date WeightKg WeightPounds   Fat   BMI IsManualReport LogId
##  [1,] FALSE FALSE    FALSE        FALSE FALSE FALSE          FALSE FALSE
##  [2,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [3,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [4,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [5,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [6,] FALSE FALSE    FALSE        FALSE FALSE FALSE          FALSE FALSE
##  [7,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [8,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
##  [9,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [10,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [11,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [12,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [13,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [14,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [15,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [16,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [17,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [18,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [19,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [20,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [21,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [22,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [23,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [24,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [25,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [26,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [27,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [28,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [29,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [30,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [31,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [32,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [33,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [34,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [35,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [36,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [37,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [38,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [39,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [40,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [41,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [42,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [43,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [44,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [45,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [46,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [47,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [48,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [49,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [50,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [51,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [52,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [53,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [54,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [55,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [56,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [57,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [58,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [59,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [60,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [61,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [62,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [63,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [64,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [65,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [66,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE
## [67,] FALSE FALSE    FALSE        FALSE  TRUE FALSE          FALSE FALSE

Observations

I’m not sure what the ‘Fat’ column is measuring (body fat percentage?), but since there are only two values provided for the entire column it’s likely that this column will need to be excluded during the analysis.

Final Observations from the Data Inspection

The data frames ‘daily_activity’ and ‘daily_calories’ both have a column called ‘Calories’ and this is the same data. We can ignore the data frame ‘daily_calories’ since it’s repetitive.

Some of the columns have date/time data stored as a chr (character strings) and will need to change it to a proper datetime format:

  • The following data sets have a date column that needs to change: ‘daily_activity’, ‘daily_calories’, and ‘daily_steps’
  • The following data sets have a datetime column that needs to change: ‘hourly_calories’, ‘hourly_steps’, ‘hr’, ‘sleep_day’, and ‘weight’.

Further, the following data frames have a column called ‘Id’ with identical data: ‘daily_activity’, ‘daily_calories’, ‘daily_steps’, ‘hourly_calories’, and ‘hourly_steps’. This is useful if it becomes necessary to combine the data.

Cleaning & Formatting the Data

Cleaning Checklist

After inspecting the data, the following issues need to be resolved before the analysis can begin:

  • Remove ‘Fat’ column from ‘weight’ data frame
  • Change date/time columns (e.g., ‘ActivityDate’) from chr to a datetime format
  • Merge datasets along the ‘Id’ parameter for easy access and analysis

Drop Unnecessary Data

It was determined that the ‘Fat’ column in the ‘weight’ data frame was not needed (since it only has two values). We’ll create a new data frame without it.

# create subset
wt <- subset(weight, select = -c(Fat))
# check that column was removed
colnames(wt)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "BMI"            "IsManualReport" "LogId"

Change Datatype

Reformat Date Columns

The following data sets have a date column that needs to be reformatted: ‘daily_activity’, ‘daily_calories’, and ‘daily_steps’.

# convert col 'ActivityDate' in 'daily_activity'
daily_activity$Date <- as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_activity$Date)
## [1] "double"
# convert col 'ActivityDay' in 'daily_calories'
daily_calories$Date <- as.POSIXct(daily_calories$ActivityDay, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_calories$Date)
## [1] "double"
# convert col 'ActivityDay' in 'daily_steps'
daily_steps$Date <- as.POSIXct(daily_steps$ActivityDay, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(daily_steps$Date)
## [1] "double"
# convert col 'Date' in 'weight'
weight$DateTime <- as.POSIXct(weight$Date, format = "%m/%d/%Y")
# confirm accurate conversion
typeof(weight$DateTime)
## [1] "double"
Add column for day of the week

Now that we have the new column with the date, let’s create a column for the day of the week too. This will allow us to anlayze the usage by the day of the week and see if there are any interesting patterns that emege.

daily_activity$WeekDay <- wday(daily_activity$Date, label = TRUE, abbr = TRUE)
head(daily_activity)
daily_calories$WeekDay <- wday(daily_calories$Date, label = TRUE, abbr = TRUE)
head(daily_calories)
daily_steps$WeekDay <- wday(daily_steps$Date, label = TRUE, abbr = TRUE)
head(daily_steps)
Create Datetime Column

The following data sets have a datetime column that needs to be reformatted: ‘hourly_calories’, ‘hourly_steps’, ‘hr’, ‘sleep_day’, and ‘wt’.

# convert col 'ActivityHour' in 'hourly_calories'
hourly_calories$ActHour <- as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hourly_calories$ActHour)
## [1] "double"
# convert col 'ActivityHour' in 'hourly_steps'
hourly_steps$ActHour <- as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hourly_steps$ActHour)
## [1] "double"
# convert col 'Time' in 'hr'
hr$DateTime <- as.POSIXct(hr$Time, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(hr$DateTime)
## [1] "double"
# convert col 'SleepDay' in 'sleep_day'
sleep_day$DateTime <- as.POSIXct(sleep_day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(sleep_day$DateTime)
## [1] "double"
# convert col 'Date' in 'wt'
wt$DateTime <- as.POSIXct(wt$Date, format = "%m/%d/%Y %I:%M:%S %p")
# confirm accurate conversion
typeof(wt$DateTime)
## [1] "double"
wt$WeekDay <- wday(wt$DateTime, label = TRUE, abbr = TRUE)
head(wt)
Create Hour Column

Similarly to creating a day of the week column, we want to create an hour column for these data sets so that we can analyze the data usage by hour.

hourly_calories$Hour <- hour(hourly_calories$ActHour)
head(hourly_calories)
hourly_steps$Hour <- hour(hourly_calories$ActHour)
head(hourly_steps)
hr$Hour <- hour(hr$DateTime)
head(hr)
wt$Hour <- hour(wt$DateTime)
head(wt)

Merge Data Sets

The final step in our cleaning phase is to merge the data sets along the ‘Id’ parameter. This will give us a combined data set which we can easily access and analyze.

# combine data frames
stats_daily <- merge(sleep_day, daily_activity, by = "Id")
stats_hourly <- hourly_calories %>%
  select(Id, Calories) %>%
  left_join(hourly_steps, by = "Id")
## Warning in left_join(., hourly_steps, by = "Id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# check for missing values
any(is.na(stats_daily))
## [1] FALSE
any(is.na(stats_hourly))
## [1] FALSE
# confirm combination
head(stats_daily)
head(stats_hourly)

Now that we’ve successfully combined the data, we’ll focus our analysis on the following data frames: ‘stats_daily’, ‘stats_hourly’, ‘hr’, and ‘wt’.

Analysis

Visualizations

We’ll start by setting plot variables for each of data frames we plan to visualize. This will make it easier to access the plot our data.

plt_daily <- ggplot(data = stats_daily)
plt_hourly <- ggplot(data = stats_hourly)
plt_hr <- ggplot(data = hr)
plt_wt <- ggplot(data = wt)

Heart Rate

Let’s take a look at how heart rate (hr) changes throughout the course of the day.

Boxplot: Hour vs. Heart Rate

This plot displays each hour of the day compared to the 5-number summary distribution (box plot) of heart rate measurements from participants.

plt_hr +
  geom_boxplot(mapping = aes(group = Hour, x = Hour, y = Value)) +
  xlab("Hour of the Day") +
  ylab("Heart Rate") +
  scale_x_continuous(breaks = seq(0, 23, 2)) +
  scale_y_continuous(breaks = seq(0, 200, 25)) +
  ggtitle("What time of day are participants most active?")

Observations

As to be expected, the heart rates of participants are more active during that day than at night. Interestingly, we do see outliers of both high and low heart rates throughout the day. This shows that while most people follow expectations of elevating their heart rates (presumably through exercise) at normal times of days, there are those who do not follow this expectation.

Weight

Histogram: Weight

plt_wt +
  geom_histogram(mapping = aes(x = WeightPounds), bins = 100) +
  xlab("Weight (lb)") +
  ylab("Number of Participants") +
  scale_x_continuous(breaks = seq(0, 300, 10)) +
  scale_y_continuous(breaks = seq(0, 20, 2)) +
  ggtitle("Distribution of Weight for Participants")

Observations Most individuals fall between 100 and 200 pounds, however there is one individual who weighs 300 pounds.

Scatter Plot: Weight vs. BMI

plt_wt +
  geom_point(mapping = aes(x = WeightPounds, y = BMI)) +
  xlab("Weight (lb.)") +
  ylab("BMI") +
  scale_x_continuous(breaks = seq(0, 300, 10)) +
  scale_y_continuous(breaks = seq(0, 50, 5)) +
  ggtitle("Weight vs. BMI")

Observations

As expected, individuals with a higher weight also have a higher BMI.

Weight Changes Tracked Over Time

plt_wt +
  geom_line(mapping = aes(x = Date, y = WeightPounds, group = Id, color = Id)) +
  labs(x = "Day", y = "Weight", title = "Weight Changes Over 30 Days", subtitle = "Individuals' Weight Changes Tracked Over Time", color = "Individual") +
  coord_cartesian(ylim = c(100, 200))

Observations

Unfortunately, only a limited number of participants successfully tracked their weight changes over the 30 day period. However, we do see that during this period for those who tracked their changes their weight hardly changed during this time period.

While more data is needed to fully understand how weight changes over a 30-day time period, this could be useful information for individuals when setting their expectations around their weight goals.

Daily Activity

Steps

Bar Graph: Week Day vs. Steps
plt_daily +
  geom_col(mapping = aes(x = WeekDay, y = TotalSteps)) +
  xlab("Day of the Week") +
  ylab("Total Steps") +
  ggtitle("Which day has the most steps?")

Observations

This graph represents the total steps for each day of the week for all 30 participants. Surprisingly, the most steps were taken on Tuesday and Wednesday with Sunday having the least amount of steps.

Boxplot: Week Day vs. Steps
plt_daily +
  geom_boxplot(mapping = aes(x = WeekDay, y = TotalSteps)) +
  xlab("Day of the Week") +
  ylab("Total Steps") +
  scale_y_continuous(breaks = seq(0, 24000, 2000)) +
  ggtitle("Week Day vs. Steps")

Observations

Saturday has the widest range in step count. On average, Monday and Tuesday have more step counts than other days. Sunday seems to be the least active day.

Calories

Scatter Plot: Steps vs. Calories
plt_daily +
  geom_point(mapping = aes(x = TotalSteps, y = Calories)) +
  xlab("Steps") +
  ylab("Calories") +
  ggtitle("Step Total vs. Calories Burned")

Observations

As expected, the more steps taken the more calories burned.

Bar Graph: Week Day vs. Calories
plt_daily +
  geom_col(mapping = aes(x = WeekDay, y = Calories)) +
  xlab("Day of the Week") +
  ylab("Calories") +
  ggtitle("Which day has the most Calories burned?")

Observations

Similar to the bar graph comparing the day of the week and total steps, we see that Tuesday is the most active day for the participants with Sunday being the least active. It makes sense that Sunday could be the least active since this is the last day of the week and the day before the start of a new work week. However, I would have expected Saturday to be the most active day, not Tuesday.

My only guess as to why Tuesday is the most active is (perhaps) after having three days (Friday, Saturday, Sunday) of low activity, participants decided to be more active on Tuesday and Wednesday before slowing down again.

Further, it could also be the the weekend (Fri-Sun) tends to involve more leisure activities such as going to the movies, restaurants, etc. instead of activities that involve more physical moment. The high activity on Tuesday and Wednesday are then an attempt to make up for the lack of exercise from the previous weekend or anticipation of the upcoming weekend of leisure.

Boxplot: Week Day vs. Calories
plt_daily +
  geom_boxplot(mapping = aes(x = WeekDay, y = Calories)) +
  xlab("Day of the Week") +
  ylab("Calories") +
  scale_y_continuous(breaks = seq(0, 5000, 500)) +
  ggtitle("Week Day vs. Calories")

Observations

Sunday has the least amount of calories burned (with a few exceptions) while the amount of calories burned from the remaining days in the week are approximately the same. This correlates with the pattern seen previously in number of steps and calories burned throughout the week.

Sleep

Bar Graph: Week Day vs. Total Sleep
plt_daily +
  geom_col(mapping = aes(x = WeekDay, y = TotalMinutesAsleep)) +
  xlab("Day of the Week") +
  ylab("Total Minutes Asleep") +
  ggtitle("Which day gets the most sleep?")

Observations

Tuesday, Wednesday, and Thursday have the most (total) amount of sleep minutes. This is the same as the days that have the most steps and calories burned. It could be that since these days required more energy they also required more sleep.

Bar Graph: Week Day vs. Total Time In Bed
plt_daily +
  geom_col(mapping = aes(x = WeekDay, y = TotalTimeInBed)) +
  xlab("Day of the Week") +
  ylab("Total Time in Bed") +
  ggtitle("Which day has the most time spent in bed?")

Observations

Tuesday, Wednesday, and Thursday have the most time spent in bed. This coincides with the observations about time spent asleep. It’s also reasonable that the more time one spends in bed, the more time they’ll be asleep.

Summary of Observations

  • Most participants have elevated heart rates between 6am and 8pm
  • Most individuals fall between 100 and 200 pounds
  • Only eight individuals tracked their weight, but over a 30-day period it mostly stayed the same
  • Tuesday, Wednesday, and Thursday are the most active days
  • Tuesday, Wednesday, and Thursday are also the days with the most sleep and time spent in bed
  • Sunday is the least active day

References

Contact