Load Data

1: Data

  1. Generalizability: The introductory code manual states (page 8) that each sample is independently drawn using a Block quota sampling strategy. Therefore, the data is randomly sampled. However, we should be careful about certain data collection scenarios. Since quota sampling is typically non-random, the samples might be biased and thus cause difficulties in inference. The data collectors’ approach is justified since they target individuals who are most likely to result in rich datapoints and who are simultaneously representative of the underlying population. For example, the surveyors used random sampling once they divided population individuals into quotas.

  2. *Causality: Since the data was gathered to address numerous social science problems, we should be careful about causality when establishing relationships between variables.

2: Research question

Question:

Importance:

Since obtaining higher education is expensive in the US, if we find that there are no significant differences in income outcomes across different educational levels, then an individual is perhaps better off by choosing an education level, which is financially most viable.

3: Exploratory data analysis

We begin by plotting the relationship between total income (coninc;inflation adjusted) and highest degree (degree) in a box plot.

The diamond-shaped box in each box plot is the mean of the distribution and the notch displays a confidence interval around the median (\(median \pm 1.57*\frac{IQR}{\sqrt{n}}\)). If the two boxes’ notches do not overlap, then evidence of medians differing exists. In the 2012 plot, we observe an upward trend in median and mean as educational attainment increases. In 1974, the situation is more mixed with “graduate” degree holders doing better than everyone else, yet it is difficult to decipher whether there are income differences between “high-school” and “Bachelor’s” degree holders.

Next we present some summary statistics by comparing proportions of educational attainment across different total income categories for both 1974 and 2012.

1974
Lt High School High School Junior College Bachelor Graduate
$0-25k 0.545 0.375 0.019 0.056 0.005
$25-50k 0.311 0.544 0.014 0.100 0.031
$50-75k 0.192 0.615 0.029 0.077 0.087
$75-100k 0.126 0.621 0.000 0.155 0.097
$100k & above 0.070 0.435 0.043 0.243 0.209
2012
Lt High School High School Junior College Bachelor Graduate
$0-25k 0.234 0.590 0.066 0.081 0.029
$25-50k 0.092 0.562 0.103 0.173 0.070
$50-75k 0.059 0.520 0.072 0.214 0.135
$75-100k 0.041 0.308 0.081 0.349 0.221
$100k & above 0.016 0.206 0.058 0.407 0.312

In the tables above, the sum of each row equals one. We observe that in 1974, most people earning between $0-25k went to LT highschool, whereas maximum proportions of individuals earning more than $25k all had atleast a High school degree. However, even among individuals earning $100k and above, most went to a High school. This is in sharp contrast to 2012, where most high income earners ($100k and above) have a Bachelor’s or a Graduate degree. However most people earning below $75k still have High school as their highest degree. The summary tables show a clear contrast between the income outcomes of 1974 and 2012. This data seems to suggest that it is more important to attain a high-level of education in 2012 than 1974 to earn more income.

1974
Degree N Mean SD
Lt High School 425 29028.89 21682.58
High School 661 45885.92 25848.49
Junior College 25 47550.20 35095.39
Bachelor 132 55175.46 32771.54
Graduate 70 71875.06 29296.42
2012
Degree N Mean SD
Lt High School 230 21214.41 22435.00
High School 881 37453.19 34974.57
Junior College 132 46362.46 39287.03
Bachelor 324 75174.48 55495.55
Graduate 185 89988.10 57720.44

The above table clearly shows a pattern, where the mean income increases as an individual’s educational attainement increases in both 1974 and 2012. Notice how the standard deviations of college degree earners are higher than those in 1974.

4. Inference

Hypothesis:

To address our first research question, we hypothesize that average income is the same across all education levels. Mathematically, this is: \[H_0: \mu_1=\mu_2 ... =\mu_5\] Where \(k=5\) is the number of education groups. The alternative hypothesis can be written as: the average incomes differ between at least one pair of education categories. \[H_a: \mu_i\neq\mu_j\]

Note that instead of using the inference function, I developed my own analysis of variance (ANOVA) function since there seems to be a bug in the statr package. I verify my results using R’s aov package, which also computes one-way ANOVA. The results are the same and can be viewed in the appendix after the Conclusion.

yr <- c(1974,2012)
results_table <- lapply(yr, function(x){
#remove NAs
  datamain <- as_tibble(gss)
  datamain <- datamain[!is.na(datamain$degree),]
  datamain <- datamain[!is.na(datamain$coninc),]

#find group means and n: (by degree)
  dfi <- datamain %>%
    select(year, coninc, degree) %>%
    filter(year == x) %>%
    group_by(degree) %>%
    summarise(mean_inc = mean(coninc), nj = n())
  #transopose data frame to usable form 
  dfi2<- as.data.frame(t(dfi))
  #Assignm column names
  names(dfi2) <- as.matrix(dfi2[1, ])
  dfi2 <- dfi2[-1, ]
  dfi2[] <- lapply(dfi2, function(x) type.convert(as.character(x)))

#create function to find SST.
  datamain <- datamain %>% filter(year == x)
  y <- c(datamain$coninc)
  sst.i <- sapply(y,
                  function(x){
                    sst.dev2 = (x - mean(y))^2
                    })
  sst <- sum(sst.i)

#create function to find SSG
  ys <- c(names(dfi2))
  ssg.j <- sapply(ys, 
                  function(x){
                      ssg.dev2 = dfi2[[x]][2]*(dfi2[[x]][1]-mean(y))^2
                  })
  ssg <- sum(ssg.j)
  
#Find degrees of freedom, SSE, MSG, MSE, f-stat & p-vale
  degf_e <- ((data.frame(datamain %>% summarise(n()))[[1]]-1)-(ncol(dfi2)-1))
  degf_g <- ncol(dfi2)-1
  sse <- sst - ssg
  msg <- ssg/degf_g
  mse <- sse/degf_e
  fstat <- msg/mse
  pval <- pf(fstat, degf_g,degf_e, lower.tail = F)
  
#Prepare final table
  df <- data.frame("DF"=c(degf_g,degf_e), "Sum of squares"=c(ssg,sse), "MeanSq"=c(msg,mse),"F-value"=c(round(fstat,4),""),
                   "P-Value"=c(round(pval,4),""))
  df[,2:3] <- df[,2:3]/1e6  
  rownames(df) <- c("Education level","Residuals")
  colnames(df) <- c("DF","Sum of squares (in millions)","Mean Squares (in millions)","F Value","P Value")
  
#Print final table
  print(kable(df, format = "html", caption = paste(x))%>%
         kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                                  font_size = 12.5) %>% 
                 row_spec(0, font_size=10))
}
)
1974
DF Sum of squares (in millions) Mean Squares (in millions) F Value P Value
Education level 4 166847.9 41711.984 62.7273 0
Residuals 1308 869784.7 664.973
2012
DF Sum of squares (in millions) Mean Squares (in millions) F Value P Value
Education level 4 828317.3 207079.313 120.5219 0
Residuals 1747 3001675.4 1718.189

Our results show that we reject the null hypothesis of equal income across education levels in both 1974 and 2012, since the p-value is \(<0.0001\). We thus conclude that at least one pair of educational levels lead to different average incomes. Our results verify our perceptions from the figure shown in the previous section’s plots and tables, where we observe a trend of higher income emerge as an individual’s education level increases.

Paired t-test

Next, we conduct a paired t-test to determine whether college-educated individuals earn a higher income than non-college income earners. Note that we must conduct a paired t-test here because income earned is correlated across different education levels. For example, an individual cannot join college without completing high school. We create a new variable college_educ, which gathers all individuals who either hold a Bachelor or Graduate degree. Since the number of college educated individuals are not the same as non-college individuals, we randomly sample from the group which has the higher sample size in order to match n from both categories. The null hypothesis is given as follows: \[H_0: \mu_{diff} = 0\] and the alternative hypothesis is \[H_a:\mu_{diff}>0\], where \[diff = college\_educ -non\_college\].

yr <- c(1974, 2012)
results_table2 <- lapply(yr, function(x){
#Prepare data
datamain <- as_tibble(gss)

#Remove NAs
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]

#Select and create college_degree variable
df_main <- datamain %>%
  select(coninc, year, degree) %>%
  filter(year==x) %>%
  mutate(college_degree = ifelse(degree == "Bachelor"|degree == "Graduate", "yes","no"))

#Separate college and non-college into their own datasets
df_main_col <- df_main %>%
  filter(college_degree == "yes") 
df_main_no_col <- df_main %>%
  filter(college_degree == "no") 

#Determine sampling rule: since equal 'n' is needed we sample from the dataset with the higher 'n'
if(length(df_main_col$coninc)<length(df_main_no_col$coninc)){
  #sample from data frame
  df_main_no_col <- df_main_no_col[sample(nrow(df_main_no_col),length(df_main_col$coninc)),]
  
  #Take difference of incomes of each group
  diff <- df_main_col$coninc - df_main_no_col$coninc
  
  #create t_stat, degrees of freedom and p_value
  t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
  deg_ft <- length(diff)-1
  p_value <- pt(t_stat, deg_ft, lower.tail = F)
  
  #Prepare final table
   print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
  `colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")), 
  format = "html", caption = paste(1974)) %>%
    kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                  font_size = 12.5) %>% 
    row_spec(0, font_size=10))
} else {
  #sample from data frame
  df_main_col <- df_main_col[sample(nrow(df_main_col),length(df_main_no_col$coninc)),]
  
  #Take difference of incomes of each group
  diff <- df_main_col$coninc - df_main_no_col$coninc
  
  #create t_stat, degrees of freedom and p_value
  t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
  deg_ft <- length(diff)-1
  p_value <- pt(t_stat, deg_ft, lower.tail = F)
  
  #Prepare final table
  print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
  `colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")), 
  format = "html", caption = paste(1974)) %>%
    kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                  font_size = 12.5) %>% 
    row_spec(0, font_size=10))
}
})
1974
n mean diff Std. error t_stat degrees of freedom P-Value
202 24398.42 2891.428 8.43819 201 0
1974
n mean diff Std. error t_stat degrees of freedom P-Value
509 42618.28 3116.182 13.67644 508 0

Our results show that we reject the null hypothesis in favor of the alternative, for both years 1974 and 2012. Thus our analysis suggests that going to college can lead to higher income outcomes for individuals.

5. Conclusion

In this project, we examine the impact of educational attainment on income earned for two years 1974 and 2012. We employ ANOVA and discover that average income for different education levels is indeed not identical. Next we use a paired t-test to determine whether average income for college attendees is similar to non-attendees and discover evidence against that hypothesis.

We must also be cognizant of one important drawback in our analysis. Income earned is a function of several socio-economic and macro factors such as age, gender, geographical location and job-type. Since those factors were not controlled in this analysis, a more comprehensive econometric model can reveal more precise results.

Appendix

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## degree         4 1.668e+11 4.171e+10   62.73 <2e-16 ***
## Residuals   1308 8.698e+11 6.650e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## degree         4 8.283e+11 2.071e+11   120.5 <2e-16 ***
## Residuals   1747 3.002e+12 1.718e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see the results obtained using the aov function exactly match our analysis in the ANOVA table above.