Load data

1: Data

  1. Generalizability: The data surveys approximately 400,000 individuals across 50 states randomly. Thus the generalizability criteria is satisfied, since individual observations are most likely independent.
  2. Causality: cannot be established since there may be more variables required for predictive purposes. Potential drawbacks include lack of key variables for comprehensive analysis.

2: Research questions

3. Exploratory data analysis

Research quesion 1: We compare sleep with bad physical and mental health days.

Physical health summary
Sleep Categories Mean Sd N
0-3 19.09 10.56 77
3-5 16.77 11.34 448
5-7 11.44 11.02 900
7-9 14.26 11.72 457
Mental health summary
Sleep Categories Mean Sd N
0-3 21.34 9.94 77
3-5 17.25 11.01 448
5-7 11.50 10.44 900
7-9 11.95 10.55 457

In the above boxplots, the line in the box specifies the median and the diamond box represents the mean. Please note that only individuals who reported as having at least one bad physical (physhlth) and mental (menthlth) day per year were included in the analysis. We observe that lower sleep leads to higher mental and physical problems. While the variances are higher for sleep categories, medians of both mental and physical health is above 10 for individuals getting less than five hours of sleep. Individuals earning less than 3 hours of sleep cab expect a median of 20 bad physical days and more than 25 bad mental days. Interestingly, we observe that people sleeping more that 7 hours are also in some cases likely to witness health problems.

Research quesion 2:

In this question, we examine if exercise level is tied to health outcomes i.e. are individuals who exercise more are likely to have better health? Note that we cannot establish causality, since health outcomes are tied to several indicators besides education. We use exerany2 (exercise in past 30 days) and compare it with physhlth and menthlth variables.

Physical health summary
Exercise in past 30 days Mean Sd N
Yes 3.3 7.69 7539
No 7.5 11.28 3202
Mental health summary
Exercise in past 30 days Mean Sd N
Yes 2.95 7.14 7539
No 5.29 9.78 3202

While the median bad physical/mental for both yes and no is zero, we do observe higher variance for individuals not exercising and witnessing bad physical/mental days. The visualizations and summary tables confirm that individuals not exercising are more likely to have more bad physical and mental days.

Research quesion 3:

Next we test the relation between income level and physical and mental health. The intuition follows from the fact that more financially sound individuals should have better physical and mental health outcomes. This may due to indirect effects such as an individual’s ability to afford better quality food, housing and other amenities leading to a comfortable and healthy lifestyle.

#Reinitialize data
data1 <- as_tibble(brfss2013)

#Data for Ohio (sleep 9 or less is selected since most individuals for under those categories)
data_ohio <- data1 %>% 
  filter(X_state=="Ohio") %>% 
  mutate(inc_cat = income2) %>% 
  mutate(inc_cat = replace(inc_cat, inc_cat == "Less than $10,000"|inc_cat == "Less than $15,000"|
                            inc_cat == "Less than $20,000"|inc_cat == "Less than $25,000", 
                            "Less than $25,000")) %>%
  mutate(inc_cat = replace(inc_cat, inc_cat == "Less than $35,000"|inc_cat == "Less than $50,000", "Less than $50,000"))

#Drop NAs
datamain_plot <- data_ohio[!is.na(data_ohio$physhlth), ]
datamain_plot <- datamain_plot[!is.na(datamain_plot$menthlth), ]
datamain_plot <- datamain_plot[!is.na(datamain_plot$inc_cat), ]

#Make boxplots
health1 <- c("physhlth", "menthlth")
p <- lapply(health1, function(x){
  datamain <- data.frame("y_axis" = datamain_plot[[x]], "x_axis" = datamain_plot$inc_cat)
  ggplot(datamain, aes(y=y_axis,x=x_axis))+
    geom_boxplot(notch=F, show.legend = F, outlier.shape = NA, aes(fill = factor(x_axis)))+
    labs(title = paste(ifelse(x=="physhlth","Physical health","Mental health")), 
         x = "Income", 
         y = paste(ifelse(x=="physhlth","Bad physical health days","Bad mental health days")))+
     scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 10, simplify = FALSE), paste, collapse="\n"))+
    scale_fill_manual(values = rev(brewer.pal(7,"Blues")))+
    theme(plot.title = element_text(hjust=0.5, face = "bold"), 
          text = element_text(size=8, face = "bold"),
          axis.text.x = element_text(size = 8, face = "plain"))+
    stat_summary(fun=mean, geom="point", shape=23, size=2)

})

#Plot them in a grid
plot_grid(plotlist = p)

Physical health summary
Income Categories Mean Sd N
Less than $25,000 7.73 11.29 3204
Less than $50,000 3.78 8.27 2781
Less than $75,000 2.93 7.21 1583
$75,000 or more 1.79 5.39 2438
Mental health summary
Income Categories Mean Sd N
Less than $25,000 6.41 10.49 3204
Less than $50,000 2.89 7.00 2781
Less than $75,000 2.53 6.48 1583
$75,000 or more 1.70 5.14 2438

The visualizations and the summary table show that individuals earning below $25,000 are more likely to have bad physical and mental days. Thus, suggesting that lower income can lead to poor health outcomes. Conversely, individuals earning more than $75,000 generally are free from bad physical and mental days.

Conclusion

We have done an exploratory analysis on three questions, which examine individual relationships between a response and explanatory variable. While our analysis, shows promising trends, we must be aware of some shortcomings. Note that health outcomes are a function of several co-variates, thus it is difficult to draw causal conclusions from our analysis. A proper predictive model of health outcomes would require multiple linear regression.