Generalizability: The introductory code manual states (page 8) that each sample is independently drawn using a Block quota sampling strategy. Therefore, the data is randomly sampled. However, we should be careful about certain data collection scenarios. Since quota sampling is typically non-random, the samples might be biased and thus cause difficulties in inference. The data collectors’ approach is justified since they target individuals who are most likely to result in rich datapoints and who are simultaneously representative of the underlying population. For example, the surveyors used random sampling once they divided population individuals into quotas.
*Causality: Since the data was gathered to address numerous social science problems, we should be careful about causality when establishing relationships between variables.
Question:
Importance:
Since obtaining higher education is expensive in the US, if we find that there are no significant differences in income outcomes across different educational levels, then an individual is perhaps better off by choosing an education level, which is financially most viable.
We begin by plotting the relationship between total income (coninc
;inflation adjusted) and highest degree (degree
) in a box plot.
datamain <- as_tibble(gss)
#drop NA values for degree.
datamain_plot <- datamain[!is.na(datamain$degree), ]
#create comparative graphs
var1 <- c(1974,2012)
P = list()
for(i in var1){
datamain_plot_loop <- datamain_plot %>%
filter(year == i)
p = ggplot(datamain_plot_loop, aes(x=degree, y = coninc))+
labs(title =paste(i), x = "Highest degree", y = "Total income (cons. $)")+
geom_boxplot(notch=T, aes(fill = factor(degree)), show.legend = F, outlier.shape = NA)+
ylim(0, 100000)+
scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 10, simplify = FALSE), paste, collapse="\n"))+
scale_fill_manual(values = brewer.pal(7,"Blues"))+
theme(plot.title = element_text(hjust=0.5, face = "bold"),
text = element_text(size=8, face = "bold"),
axis.text.x = element_text(size = 5, face = "plain"))+
stat_summary(fun=mean, geom="point", shape=23, size=2)
P = c(P, list(p))
}
plot_grid(plotlist = P, ncol=2)
The diamond-shaped box in each box plot is the mean of the distribution and the notch displays a confidence interval around the median (\(median \pm 1.57*\frac{IQR}{\sqrt{n}}\)). If the two boxes’ notches do not overlap, then evidence of medians differing exists. In the 2012 plot, we observe an upward trend in median and mean as educational attainment increases. In 1974, the situation is more mixed with “graduate” degree holders doing better than everyone else, yet it is difficult to decipher whether there are income differences between “high-school” and “Bachelor’s” degree holders.
Next we present some summary statistics by comparing proportions of educational attainment across different total income categories for both 1974 and 2012.
yr <- c(1974,2012)
for( i in 1:2){
datamain1 <- datamain %>%
select(year, coninc, degree) %>%
filter(year == yr[[i]]) %>%
mutate(
categorical_income = cut(coninc, breaks=c(0, 25000, 50000, 75000, 100000, Inf),
labels=c("$0-25k","$25-50k","$50-75k","$75-100k","$100k & above"))
)
print(landscape(kable(round(prop.table(table(datamain1$categorical_income,datamain1$degree), margin = 1),3),
format = "html",caption = paste(yr[[i]])) %>%
kable_styling(full_width = T, latex_options = c("striped","hover", "condensed"),
font_size = 12.5) %>%
row_spec(0, font_size=10)))
}
Lt High School | High School | Junior College | Bachelor | Graduate | |
---|---|---|---|---|---|
$0-25k | 0.545 | 0.375 | 0.019 | 0.056 | 0.005 |
$25-50k | 0.311 | 0.544 | 0.014 | 0.100 | 0.031 |
$50-75k | 0.192 | 0.615 | 0.029 | 0.077 | 0.087 |
$75-100k | 0.126 | 0.621 | 0.000 | 0.155 | 0.097 |
$100k & above | 0.070 | 0.435 | 0.043 | 0.243 | 0.209 |
Lt High School | High School | Junior College | Bachelor | Graduate | |
---|---|---|---|---|---|
$0-25k | 0.234 | 0.590 | 0.066 | 0.081 | 0.029 |
$25-50k | 0.092 | 0.562 | 0.103 | 0.173 | 0.070 |
$50-75k | 0.059 | 0.520 | 0.072 | 0.214 | 0.135 |
$75-100k | 0.041 | 0.308 | 0.081 | 0.349 | 0.221 |
$100k & above | 0.016 | 0.206 | 0.058 | 0.407 | 0.312 |
In the tables above, the sum of each row equals one. We observe that in 1974, most people earning between $0-25k went to LT highschool
, whereas maximum proportions of individuals earning more than $25k all had atleast a High school
degree. However, even among individuals earning $100k and above, most went to a High school
. This is in sharp contrast to 2012, where most high income earners ($100k and above) have a Bachelor
’s or a Graduate
degree. However most people earning below $75k still have High school
as their highest degree. The summary tables show a clear contrast between the income outcomes of 1974 and 2012. This data seems to suggest that it is more important to attain a high-level of education in 2012 than 1974 to earn more income.
yr <- c(1974,2012)
tablex <- lapply(yr,
function(x){
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]
datamain2 <- datamain %>%
filter(year==x) %>%
select(coninc, year, degree) %>%
group_by(degree) %>%
summarise(n = n(), mean_inc = mean(coninc), sd_inc = sd(coninc))
print(kable(data.frame("Degree"=datamain2$degree,"N"=datamain2$n,"Mean"=datamain2$mean_inc,
"SD"=datamain2$sd_inc),format = "html", caption = paste(x)) %>%
kable_styling(full_width = T, latex_options = c("striped","hover", "condensed"),
font_size = 12.5) %>%
row_spec(0, font_size=10))
}
)
Degree | N | Mean | SD |
---|---|---|---|
Lt High School | 425 | 29028.89 | 21682.58 |
High School | 661 | 45885.92 | 25848.49 |
Junior College | 25 | 47550.20 | 35095.39 |
Bachelor | 132 | 55175.46 | 32771.54 |
Graduate | 70 | 71875.06 | 29296.42 |
Degree | N | Mean | SD |
---|---|---|---|
Lt High School | 230 | 21214.41 | 22435.00 |
High School | 881 | 37453.19 | 34974.57 |
Junior College | 132 | 46362.46 | 39287.03 |
Bachelor | 324 | 75174.48 | 55495.55 |
Graduate | 185 | 89988.10 | 57720.44 |
The above table clearly shows a pattern, where the mean income increases as an individual’s educational attainement increases in both 1974 and 2012. Notice how the standard deviations of college degree earners are higher than those in 1974.
To address our first research question, we hypothesize that average income is the same across all education levels. Mathematically, this is: \[H_0: \mu_1=\mu_2 ... =\mu_5\] Where \(k=5\) is the number of education groups. The alternative hypothesis can be written as: the average incomes differ between at least one pair of education categories. \[H_a: \mu_i\neq\mu_j\]
Note that instead of using the inference
function, I developed my own analysis of variance (ANOVA) function since there seems to be a bug in the statr
package. I verify my results using R’s aov
package, which also computes one-way ANOVA. The results are the same and can be viewed in the appendix after the Conclusion.
yr <- c(1974,2012)
results_table <- lapply(yr, function(x){
#remove NAs
datamain <- as_tibble(gss)
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]
#find group means and n: (by degree)
dfi <- datamain %>%
select(year, coninc, degree) %>%
filter(year == x) %>%
group_by(degree) %>%
summarise(mean_inc = mean(coninc), nj = n())
#transopose data frame to usable form
dfi2<- as.data.frame(t(dfi))
#Assignm column names
names(dfi2) <- as.matrix(dfi2[1, ])
dfi2 <- dfi2[-1, ]
dfi2[] <- lapply(dfi2, function(x) type.convert(as.character(x)))
#create function to find SST.
datamain <- datamain %>% filter(year == x)
y <- c(datamain$coninc)
sst.i <- sapply(y,
function(x){
sst.dev2 = (x - mean(y))^2
})
sst <- sum(sst.i)
#create function to find SSG
ys <- c(names(dfi2))
ssg.j <- sapply(ys,
function(x){
ssg.dev2 = dfi2[[x]][2]*(dfi2[[x]][1]-mean(y))^2
})
ssg <- sum(ssg.j)
#Find degrees of freedom, SSE, MSG, MSE, f-stat & p-vale
degf_e <- ((data.frame(datamain %>% summarise(n()))[[1]]-1)-(ncol(dfi2)-1))
degf_g <- ncol(dfi2)-1
sse <- sst - ssg
msg <- ssg/degf_g
mse <- sse/degf_e
fstat <- msg/mse
pval <- pf(fstat, degf_g,degf_e, lower.tail = F)
#Prepare final table
df <- data.frame("DF"=c(degf_g,degf_e), "Sum of squares"=c(ssg,sse), "MeanSq"=c(msg,mse),"F-value"=c(round(fstat,4),""),
"P-Value"=c(round(pval,4),""))
df[,2:3] <- df[,2:3]/1e6
rownames(df) <- c("Education level","Residuals")
colnames(df) <- c("DF","Sum of squares (in millions)","Mean Squares (in millions)","F Value","P Value")
#Print final table
print(kable(df, format = "html", caption = paste(x))%>%
kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"),
font_size = 12.5) %>%
row_spec(0, font_size=10))
}
)
DF | Sum of squares (in millions) | Mean Squares (in millions) | F Value | P Value | |
---|---|---|---|---|---|
Education level | 4 | 166847.9 | 41711.984 | 62.7273 | 0 |
Residuals | 1308 | 869784.7 | 664.973 |
DF | Sum of squares (in millions) | Mean Squares (in millions) | F Value | P Value | |
---|---|---|---|---|---|
Education level | 4 | 828317.3 | 207079.313 | 120.5219 | 0 |
Residuals | 1747 | 3001675.4 | 1718.189 |
Our results show that we reject the null hypothesis of equal income across education levels in both 1974 and 2012, since the p-value is \(<0.0001\). We thus conclude that at least one pair of educational levels lead to different average incomes. Our results verify our perceptions from the figure shown in the previous section’s plots and tables, where we observe a trend of higher income emerge as an individual’s education level increases.
Next, we conduct a paired t-test to determine whether college-educated individuals earn a higher income than non-college income earners. Note that we must conduct a paired t-test here because income earned is correlated across different education levels. For example, an individual cannot join college without completing high school. We create a new variable college_educ
, which gathers all individuals who either hold a Bachelor
or Graduate
degree. Since the number of college educated individuals are not the same as non-college individuals, we randomly sample from the group which has the higher sample size in order to match n
from both categories. The null hypothesis is given as follows: \[H_0: \mu_{diff} = 0\] and the alternative hypothesis is \[H_a:\mu_{diff}>0\], where \[diff = college\_educ -non\_college\].
yr <- c(1974, 2012)
results_table2 <- lapply(yr, function(x){
#Prepare data
datamain <- as_tibble(gss)
#Remove NAs
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]
#Select and create college_degree variable
df_main <- datamain %>%
select(coninc, year, degree) %>%
filter(year==x) %>%
mutate(college_degree = ifelse(degree == "Bachelor"|degree == "Graduate", "yes","no"))
#Separate college and non-college into their own datasets
df_main_col <- df_main %>%
filter(college_degree == "yes")
df_main_no_col <- df_main %>%
filter(college_degree == "no")
#Determine sampling rule: since equal 'n' is needed we sample from the dataset with the higher 'n'
if(length(df_main_col$coninc)<length(df_main_no_col$coninc)){
#sample from data frame
df_main_no_col <- df_main_no_col[sample(nrow(df_main_no_col),length(df_main_col$coninc)),]
#Take difference of incomes of each group
diff <- df_main_col$coninc - df_main_no_col$coninc
#create t_stat, degrees of freedom and p_value
t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
deg_ft <- length(diff)-1
p_value <- pt(t_stat, deg_ft, lower.tail = F)
#Prepare final table
print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
`colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")),
format = "html", caption = paste(1974)) %>%
kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"),
font_size = 12.5) %>%
row_spec(0, font_size=10))
} else {
#sample from data frame
df_main_col <- df_main_col[sample(nrow(df_main_col),length(df_main_no_col$coninc)),]
#Take difference of incomes of each group
diff <- df_main_col$coninc - df_main_no_col$coninc
#create t_stat, degrees of freedom and p_value
t_stat <- mean(diff)/(sd(diff)/sqrt(length(diff)))
deg_ft <- length(diff)-1
p_value <- pt(t_stat, deg_ft, lower.tail = F)
#Prepare final table
print(kable(data.frame(length(df_main_col$coninc),mean(diff),(sd(diff)/sqrt(length(diff))), t_stat, deg_ft, p_value) %>%
`colnames<-`(c("n","mean diff","Std. error", "t_stat","degrees of freedom", "P-Value")),
format = "html", caption = paste(1974)) %>%
kable_styling(full_width = F, latex_options = c("striped","hover", "condensed"),
font_size = 12.5) %>%
row_spec(0, font_size=10))
}
})
n | mean diff | Std. error | t_stat | degrees of freedom | P-Value |
---|---|---|---|---|---|
202 | 24398.42 | 2891.428 | 8.43819 | 201 | 0 |
n | mean diff | Std. error | t_stat | degrees of freedom | P-Value |
---|---|---|---|---|---|
509 | 42618.28 | 3116.182 | 13.67644 | 508 | 0 |
Our results show that we reject the null hypothesis in favor of the alternative, for both years 1974 and 2012. Thus our analysis suggests that going to college can lead to higher income outcomes for individuals.
In this project, we examine the impact of educational attainment on income earned for two years 1974 and 2012. We employ ANOVA and discover that average income for different education levels is indeed not identical. Next we use a paired t-test to determine whether average income for college attendees is similar to non-attendees and discover evidence against that hypothesis.
We must also be cognizant of one important drawback in our analysis. Income earned is a function of several socio-economic and macro factors such as age, gender, geographical location and job-type. Since those factors were not controlled in this analysis, a more comprehensive econometric model can reveal more precise results.
yr <- c(1974, 2012)
results_appendix <- lapply(yr, function(x){
datamain <- as_tibble(gss)
datamain <- datamain[!is.na(datamain$degree),]
datamain <- datamain[!is.na(datamain$coninc),]
dt4 <- datamain %>% filter(year==x)
print(summary(aov(coninc~degree, data = dt4)))
})
## Df Sum Sq Mean Sq F value Pr(>F)
## degree 4 1.668e+11 4.171e+10 62.73 <2e-16 ***
## Residuals 1308 8.698e+11 6.650e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## degree 4 8.283e+11 2.071e+11 120.5 <2e-16 ***
## Residuals 1747 3.002e+12 1.718e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the results obtained using the aov
function exactly match our analysis in the ANOVA table above.