Statistical inference with the GSS data

Load Data

load("gss.Rdata")

1: Data

Generalizability: The introductory code manual states (page 8) that each sample is independently drawn using a Block quota sampling strategy. Therefore, the data is randomly sampled. However, we should be careful about certain data collection scenarios. Since quota sampling is typically non-random, the samples might be biased and thus cause difficulties in inference. The data collectors’ approach is justified since they target individuals who are most likely to result in rich datapoints and who are simultaneously representative of the underlying population. For example, the surveyors used random sampling once they divided population individuals into quotas.
*Causality: Since the data was gathered to address numerous social science problems, we should be careful about causality when establishing relationships between variables.

2: Research question

Question:

We are interested in understanding how different levels of educational attainment influence total income earned by an individual. We compare our results for two separate years 1974 and 2012. We expect individuals with higher-level degrees to earn more income than their less educated counterparts. Note that our analysis can only identify one explanatory-response relationship since in reality there are several variables, which could potentially impact the total income of an individual besides educational attainment.
Next, we ask if the data provide convincing evidence that getting a college degree (Bachelor’s and/or graduate) leads to higher average income.

Importance:

Since obtaining higher education is expensive in the US, if we find that there are no significant differences in income outcomes across different educational levels, then an individual is perhaps better off by choosing an education level, which is financially most viable.

3: Exploratory data analysis

We begin by plotting the relationship between total income (coninc;inflation adjusted) and highest degree (degree) in a box plot.

datamain <- as_tibble(gss)
#drop NA values for degree.
datamain_plot <- datamain[!is.na(datamain$degree), ]

#create comparative graphs
var1 <- c(1974,2012)
P = list()
for(i in var1){
  datamain_plot_loop <- datamain_plot %>%
    filter(year == i)
  p = ggplot(datamain_plot_loop, aes(x=degree, y = coninc))+
    labs(title =paste(i), x = "Highest degree", y = "Total income (cons. $)")+
    geom_boxplot(notch=T, aes(fill = factor(degree)), show.legend = F, outlier.shape = NA)+
    ylim(0, 100000)+
    scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 10, simplify = FALSE), paste, collapse="\n"))+
    scale_fill_manual(values = brewer.pal(7,"Blues"))+
    theme(plot.title = element_text(hjust=0.5, face = "bold"), 
          text = element_text(size=8, face = "bold"),
          axis.text.x = element_text(size = 5, face = "plain"))+
    stat_summary(fun=mean, geom="point", shape=23, size=2)
  P = c(P, list(p))
}

plot_grid(plotlist = P, ncol=2)

The diamond-shaped box in each box plot is the mean of the distribution and the notch displays a confidence interval around the median ($median \pm 1.57*\frac{IQR}{\sqrt{n}}$). If the two boxes’ notches do not overlap, then evidence of medians differing exists. In the 2012 plot, we observe an upward trend in median and mean as educational attainment increases. In 1974, the situation is more mixed with “graduate” degree holders doing better than everyone else, yet it is difficult to decipher whether there are income differences between “high-school” and “Bachelor’s” degree holders.

Next we present some summary statistics by comparing proportions of educational attainment across different total income categories for both 1974 and 2012.

yr <- c(1974,2012)

for( i in 1:2){
  datamain1 <- datamain %>%
  select(year, coninc, degree) %>%
  filter(year == yr[[i]]) %>%
  mutate(
    categorical_income = cut(coninc, breaks=c(0, 25000, 50000, 75000, 100000, Inf), 
                             labels=c("$0-25k","$25-50k","$50-75k","$75-100k","$100k & above"))
    )
  print(landscape(kable(round(prop.table(table(datamain1$categorical_income,datamain1$degree), margin = 1),3), 
                        format = "html",caption = paste(yr[[i]])) %>%
                    kable_styling(full_width = T, latex_options = c("striped","hover", "condensed"), 
                                  font_size = 12.5) %>% 
                    row_spec(0, font_size=10)))
}

1974
	Lt High School	High School	Junior College	Bachelor	Graduate
$0-25k	0.545	0.375	0.019	0.056	0.005
$25-50k	0.311	0.544	0.014	0.100	0.031
$50-75k	0.192	0.615	0.029	0.077	0.087
$75-100k	0.126	0.621	0.000	0.155	0.097
$100k & above	0.070	0.435	0.043	0.243	0.209

2012
	Lt High School	High School	Junior College	Bachelor	Graduate
$0-25k	0.234	0.590	0.066	0.081	0.029
$25-50k	0.092	0.562	0.103	0.173	0.070
$50-75k	0.059	0.520	0.072	0.214	0.135
$75-100k	0.041	0.308	0.081	0.349	0.221
$100k & above	0.016	0.206	0.058	0.407	0.312

In the tables above, the sum of each row equals one. We observe that in 1974, most people earning between $0-25k went to LT highschool, whereas maximum proportions of individuals earning more than $25k all had atleast a High school degree. However, even among individuals earning $100k and above, most went to a High school. This is in sharp contrast to 2012, where most high income earners ($100k and above) have a Bachelor’s or a Graduate degree. However most people earning below $75k still have High school as their highest degree. The summary tables show a clear contrast between the income outcomes of 1974 and 2012. This data seems to suggest that it is more important to attain a high-level of education in 2012 than 1974 to earn more income.

yr <- c(1974,2012)
tablex <- lapply(yr, 
       function(x){
         datamain <- datamain[!is.na(datamain$degree),]
         datamain <- datamain[!is.na(datamain$coninc),]
         datamain2 <- datamain %>%
           filter(year==x) %>%
           select(coninc, year, degree) %>%
           group_by(degree) %>%
           summarise(n = n(), mean_inc = mean(coninc), sd_inc = sd(coninc))
         print(kable(data.frame("Degree"=datamain2$degree,"N"=datamain2$n,"Mean"=datamain2$mean_inc,
                                "SD"=datamain2$sd_inc),format = "html", caption = paste(x)) %>%
                 kable_styling(full_width = T, latex_options = c("striped","hover", "condensed"), 
                                  font_size = 12.5) %>% 
                 row_spec(0, font_size=10))
}
)

1974
Degree	N	Mean	SD
Lt High School	425	29028.89	21682.58
High School	661	45885.92	25848.49
Junior College	25	47550.20	35095.39
Bachelor	132	55175.46	32771.54
Graduate	70	71875.06	29296.42

2012
Degree	N	Mean	SD
Lt High School	230	21214.41	22435.00
High School	881	37453.19	34974.57
Junior College	132	46362.46	39287.03
Bachelor	324	75174.48	55495.55
Graduate	185	89988.10	57720.44

The above table clearly shows a pattern, where the mean income increases as an individual’s educational attainement increases in both 1974 and 2012. Notice how the standard deviations of college degree earners are higher than those in 1974.

4. Inference

Hypothesis:

To address our first research question, we hypothesize that average income is the same across all education levels. Mathematically, this is: \[H_0: \mu_1=\mu_2 ... =\mu_5\] Where $k=5$ is the number of education groups. The alternative hypothesis can be written as: the average incomes differ between at least one pair of education categories. \[H_a: \mu_i\neq\mu_j\]

Note that instead of using the inference function, I developed my own analysis of variance (ANOVA) function since there seems to be a bug in the statr package. I verify my results using R’s aov package, which also computes one-way ANOVA. The results are the same and can be viewed in the appendix after the Conclusion.

yr <- c(1974,2012)
results_table <- lapply(yr, function(x){
#remove NAs
  datamain <- as_tibble(gss)
  datamain <- datamain[!is.na(datamain$degree),]
  datamain <- datamain[!is.na(datamain$coninc),]

#find group means and n: (by degree)
  dfi <- datamain %>%
    select(year, coninc, degree) %>%
    filter(year == x) %>%
    group_by(degree) %>%
    summarise(mean_inc = mean(coninc), nj = n())
  #transopose data frame to usable form 
  dfi2<- as.data.frame(t(dfi))
  #Assignm column names
  names(dfi2) <- as.matrix(dfi2[1, ])
  dfi2 <- dfi2[-1, ]
  dfi2[] <- lapply(dfi2, function(x) type.convert(as.character(x)))

#create function to find SST.
  datamain <- datamain %>% filter(year == x)
  y <- c(datamain$coninc)
  sst.i <- sapply(y,
                  function(x){
                    sst.dev2 = (x - mean(y))^2
                    })
  sst <- sum(sst.i)

#create function to find SSG
  ys <- c(names(dfi2))
  ssg.j <- sapply(ys, 
                  function(x){
                      ssg.dev2 = dfi2[[x]][2]*(dfi2[[x]][1]-mean(y))^2
                  })
  ssg <- sum(ssg.j)
  
#Find degrees of freedom, SSE, MSG, MSE, f-stat & p-vale
  degf_e <- ((data.frame(datamain %>% summarise(n()))[[1]]-1)-(ncol(dfi2)-1))
  degf_g <- ncol(dfi2)-1
  sse <- sst - ssg
  msg <- ssg/degf_g
  mse <- sse/degf_e
  fstat <- msg/mse
  pval <- pf(fstat, degf_g,degf_e, lower.tail = F)
  
#Prepare final table
  df <- data.frame("DF"=c(degf_g,degf_e), "Sum of squares"=c(ssg,sse), "MeanSq"=c(msg,mse),"F-value"=c(round(fstat,4),""),
                   "P-Value"=c(round(pval,4),""))
  df[,2:3] <- df[,2:3]/1e6  
  rownames(df) <- c("Education level","Residuals")
  colnames(df) <- c("DF","Sum of squares (in millions)","Mean Squares (in millions)","F Value","P Value")
  
#Print final table
  print(kable(df, format = "html", caption = paste(x))%>%
         kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                                  font_size = 12.5) %>% 
                 row_spec(0, font_size=10))
}
)

1974
	DF	Sum of squares (in millions)	Mean Squares (in millions)	F Value	P Value
Education level	4	166847.9	41711.984	62.7273	0
Residuals	1308	869784.7	664.973

2012
	DF	Sum of squares (in millions)	Mean Squares (in millions)	F Value	P Value
Education level	4	828317.3	207079.313	120.5219	0
Residuals	1747	3001675.4	1718.189

Our results show that we reject the null hypothesis of equal income across education levels in both 1974 and 2012, since the p-value is $<0.0001$. We thus conclude that at least one pair of educational levels lead to different average incomes. Our results verify our perceptions from the figure shown in the previous section’s plots and tables, where we observe a trend of higher income emerge as an individual’s education level increases.

Paired t-test

Next, we conduct a paired t-test to determine whether college-educated individuals earn a higher income than non-college income earners. Note that we must conduct a paired t-test here because income earned is correlated across different education levels. For example, an individual cannot join college without completing high school. We create a new variable college_educ, which gathers all individuals who either hold a Bachelor or Graduate degree. Since the number of college educated individuals are not the same as non-college individuals, we randomly sample from the group which has the higher sample size in order to match n from both categories. The null hypothesis is given as follows: \[H_0: \mu_{diff} = 0\] and the alternative hypothesis is \[H_a:\mu_{diff}>0\], where \[diff = college\_educ -non\_college\].

yr <- c(1974, 2012)
results_table2 <- lapply(yr, function(x){
#Prepare data
datamain <- as_tibble(gss)

#Remove NAs
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]

#Select and create college_degree variable
df_main <- datamain %>%
  select(coninc, year, degree) %>%
  filter(year==x) %>%
  mutate(college_degree = ifelse(degree == "Bachelor"|degree == "Graduate", "yes","no"))

#Separate college and non-college into their own datasets
df_main_col <- df_main %>%
  filter(college_degree == "yes") 
df_main_no_col <- df_main %>%
  filter(college_degree == "no") 

#Determine sampling rule: since equal 'n' is needed we sample from the dataset with the higher 'n'
if(length(df_main_col$coninc)<length(df_main_no_col$coninc)){
  #sample from data frame
  df_main_no_col <- df_main_no_col[sample(nrow(df_main_no_col),length(df_main_col$coninc)),]
  
  #Take difference of incomes of each group
  diff <- df_main_col$coninc - df_main_no_col$coninc
  
  #create t_stat, degrees of freedom and p_value
  t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
  deg_ft <- length(diff)-1
  p_value <- pt(t_stat, deg_ft, lower.tail = F)
  
  #Prepare final table
   print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
  `colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")), 
  format = "html", caption = paste(1974)) %>%
    kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                  font_size = 12.5) %>% 
    row_spec(0, font_size=10))
} else {
  #sample from data frame
  df_main_col <- df_main_col[sample(nrow(df_main_col),length(df_main_no_col$coninc)),]
  
  #Take difference of incomes of each group
  diff <- df_main_col$coninc - df_main_no_col$coninc
  
  #create t_stat, degrees of freedom and p_value
  t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
  deg_ft <- length(diff)-1
  p_value <- pt(t_stat, deg_ft, lower.tail = F)
  
  #Prepare final table
  print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
  `colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")), 
  format = "html", caption = paste(1974)) %>%
    kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"), 
                  font_size = 12.5) %>% 
    row_spec(0, font_size=10))
}
})

1974
n	mean diff	Std. error	t_stat	degrees of freedom	P-Value
202	24398.42	2891.428	8.43819	201	0

1974
n	mean diff	Std. error	t_stat	degrees of freedom	P-Value
509	42618.28	3116.182	13.67644	508	0

Our results show that we reject the null hypothesis in favor of the alternative, for both years 1974 and 2012. Thus our analysis suggests that going to college can lead to higher income outcomes for individuals.

5. Conclusion

In this project, we examine the impact of educational attainment on income earned for two years 1974 and 2012. We employ ANOVA and discover that average income for different education levels is indeed not identical. Next we use a paired t-test to determine whether average income for college attendees is similar to non-attendees and discover evidence against that hypothesis.

We must also be cognizant of one important drawback in our analysis. Income earned is a function of several socio-economic and macro factors such as age, gender, geographical location and job-type. Since those factors were not controlled in this analysis, a more comprehensive econometric model can reveal more precise results.

Appendix

yr <- c(1974, 2012)
results_appendix <- lapply(yr, function(x){
  datamain <- as_tibble(gss)
  datamain <- datamain[!is.na(datamain$degree),]
  datamain <- datamain[!is.na(datamain$coninc),]
  dt4 <- datamain %>% filter(year==x) 
  print(summary(aov(coninc~degree, data = dt4)))
})

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## degree         4 1.668e+11 4.171e+10   62.73 <2e-16 ***
## Residuals   1308 8.698e+11 6.650e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## degree         4 8.283e+11 2.071e+11   120.5 <2e-16 ***
## Residuals   1747 3.002e+12 1.718e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see the results obtained using the aov function exactly match our analysis in the ANOVA table above.