# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
# install.packages("tidycensus")
# install.packages("tidyverse")
# install.packages("knitr")
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
# census_api_key("eb283fc5f20283cf588e1f363afe8bc73cf9c354", install = TRUE)
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Michigan"Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Michigan Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto parses it as text.
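For instance, a minimal navbar entry in _quarto.yml might look like this (the file path and menu label are placeholders):

```yaml
website:
  navbar:
    left:
      - text: Labs
        menu:
          - href: labs/lab_1/lab_1.qmd
            # the colon inside the label requires double quotes
            text: "Lab 1: Census Data Exploration"
```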
Setup
State Selection: I have chosen Michigan for this analysis because I was born there, lived there until I was six, and returned for high school.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
acsVariables <- c(med_HH_inc = "B19013_001", totpop = "B01003_001")
MI_2022 <- get_acs(geography="county", variables = acsVariables,
state = "MI", survey = "acs5", year = 2022, output = "wide")
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
MI_2022 <- MI_2022 %>%
mutate(NAME = str_remove(NAME, " County, Michigan"))
# Display the first few rows
glimpse(MI_2022)
Rows: 83
Columns: 6
$ GEOID <chr> "26001", "26003", "26005", "26007", "26009", "26011", "260…
$ NAME <chr> "Alcona", "Alger", "Allegan", "Alpena", "Antrim", "Arenac"…
$ med_HH_incE <dbl> 50295, 55528, 75543, 49133, 68850, 53487, 51911, 75182, 57…
$ med_HH_incM <dbl> 2243, 2912, 2369, 2119, 3115, 2018, 2904, 2704, 2395, 4258…
$ totpopE <dbl> 10238, 8866, 120189, 28911, 23662, 15031, 8245, 62581, 103…
$ totpopM <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
# Calculate MOE percentage and reliability categories using mutate()
#without mutate()
#MI_2022$MOE_per <- (MI_2022$med_HH_incM/MI_2022$med_HH_incE)*100
MI_2022 <- MI_2022 %>%
mutate(MOE_per = (med_HH_incM/med_HH_incE)*100)
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
MI_2022 <- MI_2022 %>%
mutate(reli_cat = case_when(MOE_per < 5 ~ 'High Confidence',
MOE_per > 10 ~ 'Low Confidence',
TRUE ~ 'Moderate Confidence'))
reli_cat_freq <- MI_2022 %>%
count(reli_cat)
reli_cat_freq
# A tibble: 3 × 2
reli_cat n
<chr> <int>
1 High Confidence 56
2 Low Confidence 2
3 Moderate Confidence 25
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
#arrange(MI_2022, desc(MOE_per))
#MOE_min <- MI_2022 %>% slice_min(MOE_per, n=5)
MOE_max <- MI_2022 %>% slice_max(MOE_per, n=5)
MOE_max
# A tibble: 5 × 8
GEOID NAME med_HH_incE med_HH_incM totpopE totpopM MOE_per reli_cat
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 26083 Keweenaw 55560 7301 2088 NA 13.1 Low Confide…
2 26153 Schoolcraft 55071 6328 8062 NA 11.5 Low Confide…
3 26053 Gogebic 47913 4766 14597 NA 9.95 Moderate Co…
4 26137 Otsego 62865 5910 25221 NA 9.40 Moderate Co…
5 26119 Montmorency 46345 3796 9261 NA 8.19 Moderate Co…
# Format as table with kable() - include appropriate column names and caption
kable(MOE_max[, c(2,3,4,7,8)], "simple", col.names = c("Name", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the highest median income margins of error")

| Name | Median Income | Median Income Margin of Error | Margin of Error Percentage | Reliability |
|---|---|---|---|---|
| Keweenaw | 55560 | 7301 | 13.140749 | Low Confidence |
| Schoolcraft | 55071 | 6328 | 11.490621 | Low Confidence |
| Gogebic | 47913 | 4766 | 9.947196 | Moderate Confidence |
| Otsego | 62865 | 5910 | 9.401098 | Moderate Confidence |
| Montmorency | 46345 | 3796 | 8.190743 | Moderate Confidence |
# Does the consequence bear hardest on the smallest counties?
pop_min <- MI_2022 %>% slice_min(totpopE, n=5)
kable(pop_min[, c(2,5,3,4,7,8)], "simple", col.names = c("Name", "Total Population", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the lowest population")

| Name | Total Population | Median Income | Median Income Margin of Error | Margin of Error Percentage | Reliability |
|---|---|---|---|---|---|
| Keweenaw | 2088 | 55560 | 7301 | 13.140749 | Low Confidence |
| Luce | 5442 | 51015 | 3742 | 7.335097 | Moderate Confidence |
| Ontonagon | 5862 | 48316 | 3844 | 7.955957 | Moderate Confidence |
| Schoolcraft | 8062 | 55071 | 6328 | 11.490621 | Low Confidence |
| Baraga | 8245 | 51911 | 2904 | 5.594190 | Moderate Confidence |
Data Quality Commentary:
These results show that even in a federal survey, some estimates carry very high margins of error, which has policy implications: affected counties may be over- or underestimated on key variables. Here, counties with high margins of error may be estimated as more affluent than they actually are, causing them to miss aid or funding they need, or as less wealthy than they actually are, needlessly diverting resources to them when others need them more. The second table suggests that population size has at least some effect, since the two “Low Confidence” counties are both among the five smallest counties in Michigan.
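One quick way to probe the population-size hypothesis is a rank correlation between county population and the income MOE percentage. This is a sketch, assuming the MI_2022 data frame built above is still in scope:

```r
# Spearman rank correlation between county population and income MOE%;
# a clearly negative value supports "smaller counties have noisier estimates"
cor(MI_2022$totpopE, MI_2022$MOE_per,
    method = "spearman", use = "complete.obs")
```

Spearman is used rather than Pearson because county populations are heavily right-skewed.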
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
filtered_MI <- filter(MI_2022, NAME == "Oakland" | NAME == "Mackinac" | NAME == "Keweenaw")
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
kable(filtered_MI[, c(2,3,7,8)], "simple", col.names = c("Name", "Median Income", "Margin of Error Percentage", "Reliability"), caption = "3 counties representing different data reliability levels")

| Name | Median Income | Margin of Error Percentage | Reliability |
|---|---|---|---|
| Keweenaw | 55560 | 13.140749 | Low Confidence |
| Mackinac | 60620 | 5.679644 | Moderate Confidence |
| Oakland | 92620 | 1.028935 | High Confidence |
Comment on the output: The wide range of the margins of error is quite interesting, as all three counties come from one state, and presumably county lines are drawn to be as representative as possible. Of course it can’t always be helped, but it’s interesting to see regardless. Also, Oakland County is where I grew up!
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
# Define your race/ethnicity variables with descriptive names
race_eth_vars <- c(White = "B03002_003", Black = "B03002_004", Hisp_Lat = "B03002_012",
Total_Pop = "B03002_001")
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
race_eth_2022 <- get_acs(geography = "tract", variables = race_eth_vars,
    state = "MI", county = c("083", "097", "125"), # 3-digit county FIPS codes, quoted to preserve leading zeros
    survey = "acs5", year = 2022, output = "wide")
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
race_eth_2022 <- race_eth_2022 %>%
mutate(perW = (WhiteE/Total_PopE)*100,
perB = (BlackE/Total_PopE)*100,
perH = (Hisp_LatE/Total_PopE)*100)
# Add readable tract and county name columns using str_extract() or similar
race_eth_2022 <- race_eth_2022 %>%
mutate(tract = str_extract(NAME, "[0-9.]+"),
NAME = str_extract(NAME, "Oakland|Mackinac|Keweenaw"))
race_eth_2022
# A tibble: 359 × 14
GEOID NAME WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26083000100 Kewee… 1999 6 0 11 33 4 2086
2 26083980100 Kewee… 2 1 0 11 0 11 2
3 26083990100 Kewee… 0 11 0 11 0 11 0
4 26097950100 Macki… 1536 238 0 11 0 11 1594
5 26097950200 Macki… 1858 216 0 11 49 42 2399
6 26097950300 Macki… 1485 185 3 4 13 12 1761
7 26097950400 Macki… 1788 187 141 107 115 59 2900
8 26097950500 Macki… 1126 142 264 119 75 42 2189
9 26097990000 Macki… 0 11 0 11 0 11 0
10 26125120000 Oakla… 2910 279 20 21 20 24 3074
# ℹ 349 more rows
# ℹ 5 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
# tract <chr>
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
hisp_max <- race_eth_2022 %>% slice_max(perH, n = 1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
avg_demo <- race_eth_2022 %>%
group_by(NAME) %>%
summarize(avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
# Create a nicely formatted table of your results using kable()
kable(avg_demo[, c(1,5,2,3,4)], "simple", col.names = c("County", "Number of Tracts", "Average Percentage of White Residents", "Average Percentage of Black Residents", "Average Percentage of Hispanic Residents"), caption = "Average demographics of each county")

| County | Number of Tracts | Average Percentage of White Residents | Average Percentage of Black Residents | Average Percentage of Hispanic Residents |
|---|---|---|---|---|
| Keweenaw | 3 | 97.91467 | 0.000000 | 0.7909875 |
| Mackinac | 6 | 74.24631 | 3.418546 | 2.0344948 |
| Oakland | 350 | 69.26839 | 14.115251 | 4.6550570 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
race_eth_2022 <- race_eth_2022 %>%
mutate(MOE_perW = (WhiteM/WhiteE)*100,
MOE_perB = (BlackM/BlackE)*100,
MOE_perH = (Hisp_LatM/Hisp_LatE)*100)
# Replace infinite MOE percentages (caused by zero estimates) with NA
race_eth_2022[sapply(race_eth_2022, is.infinite)] <- NA
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
# First interpretation: flag tracts where ANY of the three variables has MOE > 15%
race_eth_2022 <- race_eth_2022 %>%
mutate(MOE_cat = ifelse(MOE_perW > 15 | MOE_perB > 15 | MOE_perH > 15,
"High MOE", "Low MOE"))
race_eth_2022$MOE_cat[is.na(race_eth_2022$MOE_cat)] <- "No Certainty"
# Second version: flag only tracts where the Black-resident estimate has high MOE
race_eth_2022 <- race_eth_2022 %>%
mutate(MOE_catB = ifelse(MOE_perB > 15,
"High Black MOE", "Low Black MOE"))
race_eth_2022$MOE_catB[is.na(race_eth_2022$MOE_catB)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 19
GEOID NAME WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26083000100 Kewee… 1999 6 0 11 33 4 2086
2 26083980100 Kewee… 2 1 0 11 0 11 2
3 26083990100 Kewee… 0 11 0 11 0 11 0
4 26097950100 Macki… 1536 238 0 11 0 11 1594
5 26097950200 Macki… 1858 216 0 11 49 42 2399
6 26097950300 Macki… 1485 185 3 4 13 12 1761
7 26097950400 Macki… 1788 187 141 107 115 59 2900
8 26097950500 Macki… 1126 142 264 119 75 42 2189
9 26097990000 Macki… 0 11 0 11 0 11 0
10 26125120000 Oakla… 2910 279 20 21 20 24 3074
# ℹ 349 more rows
# ℹ 10 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
# tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
# MOE_catB <chr>
# Create summary statistics showing how many tracts have data quality issues
data_qual_n <- race_eth_2022 %>%
group_by(MOE_cat) %>%
count(MOE_cat)
data_qual_n
# A tibble: 2 × 2
# Groups: MOE_cat [2]
MOE_cat n
<chr> <int>
1 High MOE 351
2 No Certainty 8
data_qual_n_B <- race_eth_2022 %>%
group_by(MOE_catB) %>%
count(MOE_catB)
data_qual_n_B
# A tibble: 3 × 2
# Groups: MOE_catB [3]
MOE_catB n
<chr> <int>
1 High Black MOE 322
2 Low Black MOE 6
3 No Certainty 31
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_qual_examine <- race_eth_2022 %>%
group_by(MOE_cat) %>%
summarize(avg_pop = mean(Total_PopE, na.rm = TRUE),
avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
data_qual_examine
# A tibble: 2 × 6
MOE_cat avg_pop avg_W_per avg_B_per avg_H_per count
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 High MOE 3656. 69.4 13.9 4.60 351
2 No Certainty 261. 95.8 0 1.58 8
data_qual_W <- race_eth_2022 %>%
group_by(MOE_perW > 15) %>%
summarize(avg_pop = mean(Total_PopE, na.rm = TRUE),
avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
data_qual_W
# A tibble: 3 × 6
`MOE_perW > 15` avg_pop avg_W_per avg_B_per avg_H_per count
<lgl> <dbl> <dbl> <dbl> <dbl> <int>
1 FALSE 3834. 81.8 4.01 3.86 165
2 TRUE 3489. 58.7 22.6 5.25 187
3 NA 0 NaN NaN NaN 7
data_qual_B <- race_eth_2022 %>%
group_by(MOE_perB > 15) %>%
summarize(avg_pop = mean(Total_PopE, na.rm = TRUE),
avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
data_qual_B
# A tibble: 3 × 6
`MOE_perB > 15` avg_pop avg_W_per avg_B_per avg_H_per count
<lgl> <dbl> <dbl> <dbl> <dbl> <int>
1 FALSE 3526. 20.6 73.2 1.11 6
2 TRUE 3705. 68.9 13.8 4.81 322
3 NA 2289. 90.2 0 2.62 31
data_qual_H <- race_eth_2022 %>%
group_by(MOE_perH > 15) %>%
summarize(avg_pop = mean(Total_PopE, na.rm = TRUE),
avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
data_qual_H
# A tibble: 3 × 6
`MOE_perH > 15` avg_pop avg_W_per avg_B_per avg_H_per count
<lgl> <dbl> <dbl> <dbl> <dbl> <int>
1 FALSE 2482 70.8 12.4 13.6 2
2 TRUE 3690. 69.8 13.6 4.65 342
3 NA 1212. 58.2 25.1 0 15
# Note: & binds tighter than |, so each pairwise condition needs parentheses
race_eth_2022 <- race_eth_2022 %>%
mutate(indv_MOE_cat = case_when(MOE_perW > 15 & MOE_perB > 15 & MOE_perH > 15 ~ 'All High MOE',
MOE_perW < 15 & MOE_perB < 15 & MOE_perH < 15 ~ 'All Low MOE',
(MOE_perW > 15 & MOE_perB > 15) | (MOE_perW > 15 & MOE_perH > 15) | (MOE_perB > 15 & MOE_perH > 15) ~ 'Two High MOE',
is.na(MOE_perW) | is.na(MOE_perH) | is.na(MOE_perB) ~ 'At least one NA',
TRUE ~ 'One High MOE'))
race_eth_2022
# A tibble: 359 × 20
GEOID NAME WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26083000100 Kewee… 1999 6 0 11 33 4 2086
2 26083980100 Kewee… 2 1 0 11 0 11 2
3 26083990100 Kewee… 0 11 0 11 0 11 0
4 26097950100 Macki… 1536 238 0 11 0 11 1594
5 26097950200 Macki… 1858 216 0 11 49 42 2399
6 26097950300 Macki… 1485 185 3 4 13 12 1761
7 26097950400 Macki… 1788 187 141 107 115 59 2900
8 26097950500 Macki… 1126 142 264 119 75 42 2189
9 26097990000 Macki… 0 11 0 11 0 11 0
10 26125120000 Oakla… 2910 279 20 21 20 24 3074
# ℹ 349 more rows
# ℹ 11 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
# tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
# MOE_catB <chr>, indv_MOE_cat <chr>
detail_MOE_cat <- race_eth_2022 %>%
group_by(indv_MOE_cat) %>%
summarize(avg_pop = mean(Total_PopE, na.rm = TRUE),
avg_W_per = mean(perW, na.rm = TRUE),
avg_B_per = mean(perB, na.rm = TRUE),
avg_H_per = mean(perH, na.rm = TRUE),
count = n())
detail_MOE_cat
# A tibble: 3 × 6
indv_MOE_cat avg_pop avg_W_per avg_B_per avg_H_per count
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 All High MOE 3577. 58.6 21.6 5.61 165
2 At least one NA 261. 95.8 0 1.58 8
3 Two High MOE 3725. 79.0 7.12 3.71 186
# Just White residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
mutate(MOE_catW = ifelse(MOE_perW > 15,
"High White MOE", "Low White MOE"))
race_eth_2022$MOE_catW[is.na(race_eth_2022$MOE_catW)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 21
GEOID NAME WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26083000100 Kewee… 1999 6 0 11 33 4 2086
2 26083980100 Kewee… 2 1 0 11 0 11 2
3 26083990100 Kewee… 0 11 0 11 0 11 0
4 26097950100 Macki… 1536 238 0 11 0 11 1594
5 26097950200 Macki… 1858 216 0 11 49 42 2399
6 26097950300 Macki… 1485 185 3 4 13 12 1761
7 26097950400 Macki… 1788 187 141 107 115 59 2900
8 26097950500 Macki… 1126 142 264 119 75 42 2189
9 26097990000 Macki… 0 11 0 11 0 11 0
10 26125120000 Oakla… 2910 279 20 21 20 24 3074
# ℹ 349 more rows
# ℹ 12 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
# tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
# MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>
data_qual_n_W <- race_eth_2022 %>%
group_by(MOE_catW) %>%
count(MOE_catW)
data_qual_n_W
# A tibble: 3 × 2
# Groups: MOE_catW [3]
MOE_catW n
<chr> <int>
1 High White MOE 187
2 Low White MOE 165
3 No Certainty 7
# Just Hispanic residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
mutate(MOE_catH = ifelse(MOE_perH > 15,
"High Hispanic MOE", "Low Hispanic MOE"))
race_eth_2022$MOE_catH[is.na(race_eth_2022$MOE_catH)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 22
GEOID NAME WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26083000100 Kewee… 1999 6 0 11 33 4 2086
2 26083980100 Kewee… 2 1 0 11 0 11 2
3 26083990100 Kewee… 0 11 0 11 0 11 0
4 26097950100 Macki… 1536 238 0 11 0 11 1594
5 26097950200 Macki… 1858 216 0 11 49 42 2399
6 26097950300 Macki… 1485 185 3 4 13 12 1761
7 26097950400 Macki… 1788 187 141 107 115 59 2900
8 26097950500 Macki… 1126 142 264 119 75 42 2189
9 26097990000 Macki… 0 11 0 11 0 11 0
10 26125120000 Oakla… 2910 279 20 21 20 24 3074
# ℹ 349 more rows
# ℹ 13 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
# tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
# MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>, MOE_catH <chr>
data_qual_n_H <- race_eth_2022 %>%
group_by(MOE_catH) %>%
count(MOE_catH)
data_qual_n_H
# A tibble: 3 × 2
# Groups: MOE_catH [3]
MOE_catH n
<chr> <int>
1 High Hispanic MOE 342
2 Low Hispanic MOE 2
3 No Certainty 15
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
race <- c("White", "Black", "Hispanic")
high_MOE_n <- c(187, 322, 342)
low_MOE_n <- c(165, 6, 2)
no_certain_n <- c(7, 31, 15)
race_n_high_MOE <- data.frame(race, high_MOE_n, low_MOE_n, no_certain_n)
race_n_high_MOE <- race_n_high_MOE %>%
mutate(per_High = high_MOE_n/(high_MOE_n+low_MOE_n)*100)
race_n_high_MOE
      race high_MOE_n low_MOE_n no_certain_n per_High
1 White 187 165 7 53.12500
2 Black 322 6 31 98.17073
3 Hispanic 342 2 15 99.41860
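The hard-coded counts above could instead be derived directly from the tract data, which avoids transcription errors if the inputs change. A sketch, assuming the race_eth_2022 data frame with its MOE_perW/MOE_perB/MOE_perH columns is in scope:

```r
# Count high-MOE, low-MOE, and NA tracts for each demographic variable
race_eth_2022 %>%
  summarize(across(c(MOE_perW, MOE_perB, MOE_perH),
                   list(high = ~ sum(.x > 15, na.rm = TRUE),
                        low  = ~ sum(.x <= 15, na.rm = TRUE),
                        na   = ~ sum(is.na(.x)))))
```

The result is one wide row (e.g. MOE_perW_high, MOE_perW_low, ...) that can be reshaped with pivot_longer() into the same race-by-category layout as the manual table.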
Pattern Analysis: The starkest pattern is that, while all three groups’ population estimates carry quite high margins of error overall, the effect is significantly stronger for Black and Hispanic/Latino residents. White residents have 187 tracts where the margin of error exceeds 15%, which amounts to 53.13% of the tracts in Oakland, Mackinac, and Keweenaw counties with a defined MOE percentage. Black residents, on the other hand, have 322 such tracts (98.17%), and Hispanic residents have 342 (99.42%). This shows that the data for Black and Hispanic residents are much less reliable than the data for White residents. This might be because there are fewer Black and Hispanic people in these counties in general, so small counts produce huge relative margins of error from tract to tract, or because they are chronically undercounted in data collection due to human error, bias, or prejudice.
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
- Overall Pattern Identification
- Equity Assessment
- Root Cause Analysis
- Strategic Recommendations
Executive Summary:
Overall, my analyses show several systematic patterns. First, smaller counties with fewer people generally have less reliable data, largely because smaller sample sizes make it difficult to extrapolate findings and leave statistical relationships of uncertain significance. Another pattern was touched on briefly in the previous section: among the three counties (Oakland, Mackinac, Keweenaw), the data for Black and Hispanic residents were clearly less reliable. This may be due to sample size, because few Black and Hispanic people live in those counties, or to something more sinister, where they are systematically and purposefully undercounted or statistically misrepresented.
All of this shows that Black and Hispanic communities are at great risk of algorithmic bias. The data technically exist, but the large margins of error make it clear that they should be used sparingly and cautiously. Though cautious data usage is good practice, it is a disadvantage that these communities’ data can only be used after taking further precautions. This added difficulty in accessing and analyzing the data may lead policymakers to avoid using it altogether, whether out of fear of misrepresenting their constituents or out of laziness.
It is likely that few Black and Hispanic people live in these areas, whether by chance or through systemic disadvantages, as seen in the case of redlining. This can reinforce Euro- and white-centric social norms, with policies reflecting and amplifying them. One cause of both the poor data quality and the bias risk may be human bias, where some White surveyors are unwilling, consciously or not, to approach or collect data from non-White residents. If done subconsciously, this could foster the belief that the data is impartial and representative, when that may not be the case.
There are plenty of psychology studies showing that simply acknowledging you might be biased leads you to think through your actions more consciously and reduces personal bias. Employee workshops that build this kind of self-awareness could be a simple but effective way to start addressing these systematic issues. Another step would be to take a closer look at all the data, including variables beyond income and across other intersections of identity, to make sure no one gets swept under the rug. Closer examination would also reveal which results are outliers versus actual patterns; if a county appears as an outlier in every single analysis, it deserves further scrutiny to resolve the disparity.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability <- MI_2022[,c(2,3,7,8)]
county_reliability
# A tibble: 83 × 4
NAME med_HH_incE MOE_per reli_cat
<chr> <dbl> <dbl> <chr>
1 Alcona 50295 4.46 High Confidence
2 Alger 55528 5.24 Moderate Confidence
3 Allegan 75543 3.14 High Confidence
4 Alpena 49133 4.31 High Confidence
5 Antrim 68850 4.52 High Confidence
6 Arenac 53487 3.77 High Confidence
7 Baraga 51911 5.59 Moderate Confidence
8 Barry 75182 3.60 High Confidence
9 Bay 57887 4.14 High Confidence
10 Benzie 71327 5.97 Moderate Confidence
# ℹ 73 more rows
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
county_reliability <- county_reliability %>%
mutate(algo_rec = case_when(reli_cat == "High Confidence" ~ "Safe for algorithmic decisions",
reli_cat == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
reli_cat == "Low Confidence" ~ "Requires manual review or additional data"))
# Format as a professional table with kable()
kable(county_reliability, "simple", col.names = c("County", "Median Income Estimate", "Margin of Error Percentage", "Reliability Category", "Algorithm Recommendation"), caption = "Algorithm Implementation Decision Framework")

| County | Median Income Estimate | Margin of Error Percentage | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|
| Alcona | 50295 | 4.4596878 | High Confidence | Safe for algorithmic decisions |
| Alger | 55528 | 5.2442011 | Moderate Confidence | Use with caution - monitor outcomes |
| Allegan | 75543 | 3.1359623 | High Confidence | Safe for algorithmic decisions |
| Alpena | 49133 | 4.3127837 | High Confidence | Safe for algorithmic decisions |
| Antrim | 68850 | 4.5243282 | High Confidence | Safe for algorithmic decisions |
| Arenac | 53487 | 3.7728794 | High Confidence | Safe for algorithmic decisions |
| Baraga | 51911 | 5.5941901 | Moderate Confidence | Use with caution - monitor outcomes |
| Barry | 75182 | 3.5966056 | High Confidence | Safe for algorithmic decisions |
| Bay | 57887 | 4.1373711 | High Confidence | Safe for algorithmic decisions |
| Benzie | 71327 | 5.9696889 | Moderate Confidence | Use with caution - monitor outcomes |
| Berrien | 60379 | 2.6101790 | High Confidence | Safe for algorithmic decisions |
| Branch | 60600 | 3.7656766 | High Confidence | Safe for algorithmic decisions |
| Calhoun | 58191 | 2.6309910 | High Confidence | Safe for algorithmic decisions |
| Cass | 65183 | 4.1222405 | High Confidence | Safe for algorithmic decisions |
| Charlevoix | 69764 | 3.4358695 | High Confidence | Safe for algorithmic decisions |
| Cheboygan | 59557 | 4.0566180 | High Confidence | Safe for algorithmic decisions |
| Chippewa | 58408 | 3.0766333 | High Confidence | Safe for algorithmic decisions |
| Clare | 47816 | 4.3311862 | High Confidence | Safe for algorithmic decisions |
| Clinton | 82594 | 2.7520159 | High Confidence | Safe for algorithmic decisions |
| Crawford | 57998 | 5.9346874 | Moderate Confidence | Use with caution - monitor outcomes |
| Delta | 53852 | 6.3303127 | Moderate Confidence | Use with caution - monitor outcomes |
| Dickinson | 59651 | 4.3637156 | High Confidence | Safe for algorithmic decisions |
| Eaton | 77158 | 2.6089323 | High Confidence | Safe for algorithmic decisions |
| Emmet | 69690 | 5.7181805 | Moderate Confidence | Use with caution - monitor outcomes |
| Genesee | 58594 | 2.2408438 | High Confidence | Safe for algorithmic decisions |
| Gladwin | 53717 | 2.7589031 | High Confidence | Safe for algorithmic decisions |
| Gogebic | 47913 | 9.9471960 | Moderate Confidence | Use with caution - monitor outcomes |
| Grand Traverse | 75553 | 2.9899541 | High Confidence | Safe for algorithmic decisions |
| Gratiot | 57934 | 3.8630165 | High Confidence | Safe for algorithmic decisions |
| Hillsdale | 59425 | 3.9713925 | High Confidence | Safe for algorithmic decisions |
| Houghton | 52736 | 4.8600576 | High Confidence | Safe for algorithmic decisions |
| Huron | 54475 | 2.5810005 | High Confidence | Safe for algorithmic decisions |
| Ingham | 62548 | 2.9641236 | High Confidence | Safe for algorithmic decisions |
| Ionia | 71720 | 3.4021194 | High Confidence | Safe for algorithmic decisions |
| Iosco | 46224 | 5.2808065 | Moderate Confidence | Use with caution - monitor outcomes |
| Iron | 52241 | 6.6805766 | Moderate Confidence | Use with caution - monitor outcomes |
| Isabella | 52638 | 4.2706790 | High Confidence | Safe for algorithmic decisions |
| Jackson | 62581 | 2.5247280 | High Confidence | Safe for algorithmic decisions |
| Kalamazoo | 67905 | 2.3061630 | High Confidence | Safe for algorithmic decisions |
| Kalkaska | 56380 | 5.7343029 | Moderate Confidence | Use with caution - monitor outcomes |
| Kent | 76247 | 1.6721969 | High Confidence | Safe for algorithmic decisions |
| Keweenaw | 55560 | 13.1407487 | Low Confidence | Requires manual review or additional data |
| Lake | 45946 | 5.1299351 | Moderate Confidence | Use with caution - monitor outcomes |
| Lapeer | 75402 | 3.0556219 | High Confidence | Safe for algorithmic decisions |
| Leelanau | 82345 | 5.3202987 | Moderate Confidence | Use with caution - monitor outcomes |
| Lenawee | 65484 | 2.6739356 | High Confidence | Safe for algorithmic decisions |
| Livingston | 96135 | 1.9628647 | High Confidence | Safe for algorithmic decisions |
| Luce | 51015 | 7.3350975 | Moderate Confidence | Use with caution - monitor outcomes |
| Mackinac | 60620 | 5.6796437 | Moderate Confidence | Use with caution - monitor outcomes |
| Macomb | 73876 | 1.2250257 | High Confidence | Safe for algorithmic decisions |
| Manistee | 59467 | 4.3217247 | High Confidence | Safe for algorithmic decisions |
| Marquette | 63115 | 4.0133090 | High Confidence | Safe for algorithmic decisions |
| Mason | 60744 | 3.7024233 | High Confidence | Safe for algorithmic decisions |
| Mecosta | 54132 | 5.8763763 | Moderate Confidence | Use with caution - monitor outcomes |
| Menominee | 54074 | 5.3870622 | Moderate Confidence | Use with caution - monitor outcomes |
| Midland | 73643 | 4.3615822 | High Confidence | Safe for algorithmic decisions |
| Missaukee | 57667 | 6.6051641 | Moderate Confidence | Use with caution - monitor outcomes |
| Monroe | 72573 | 2.7930498 | High Confidence | Safe for algorithmic decisions |
| Montcalm | 61250 | 3.8742857 | High Confidence | Safe for algorithmic decisions |
| Montmorency | 46345 | 8.1907433 | Moderate Confidence | Use with caution - monitor outcomes |
| Muskegon | 61347 | 2.1810357 | High Confidence | Safe for algorithmic decisions |
| Newaygo | 59065 | 3.1558453 | High Confidence | Safe for algorithmic decisions |
| Oakland | 92620 | 1.0289354 | High Confidence | Safe for algorithmic decisions |
| Oceana | 60691 | 4.9315384 | High Confidence | Safe for algorithmic decisions |
| Ogemaw | 50377 | 4.9189114 | High Confidence | Safe for algorithmic decisions |
| Ontonagon | 48316 | 7.9559566 | Moderate Confidence | Use with caution - monitor outcomes |
| Osceola | 54875 | 3.8159453 | High Confidence | Safe for algorithmic decisions |
| Oscoda | 48692 | 5.3006654 | Moderate Confidence | Use with caution - monitor outcomes |
| Otsego | 62865 | 9.4010976 | Moderate Confidence | Use with caution - monitor outcomes |
| Ottawa | 83932 | 2.1791450 | High Confidence | Safe for algorithmic decisions |
| Presque Isle | 55986 | 6.3730218 | Moderate Confidence | Use with caution - monitor outcomes |
| Roscommon | 49898 | 6.2567638 | Moderate Confidence | Use with caution - monitor outcomes |
| Saginaw | 56579 | 2.4832535 | High Confidence | Safe for algorithmic decisions |
| St. Clair | 66887 | 2.5819666 | High Confidence | Safe for algorithmic decisions |
| St. Joseph | 62281 | 2.6348325 | High Confidence | Safe for algorithmic decisions |
| Sanilac | 55740 | 3.9415142 | High Confidence | Safe for algorithmic decisions |
| Schoolcraft | 55071 | 11.4906212 | Low Confidence | Requires manual review or additional data |
| Shiawassee | 62498 | 4.5153445 | High Confidence | Safe for algorithmic decisions |
| Tuscola | 59815 | 3.1563989 | High Confidence | Safe for algorithmic decisions |
| Van Buren | 65531 | 5.1822801 | Moderate Confidence | Use with caution - monitor outcomes |
| Washtenaw | 84245 | 2.7977922 | High Confidence | Safe for algorithmic decisions |
| Wayne | 57223 | 0.9838701 | High Confidence | Safe for algorithmic decisions |
| Wexford | 58652 | 5.8071336 | Moderate Confidence | Use with caution - monitor outcomes |
safe_data_counties <- county_reliability %>% filter(algo_rec == "Safe for algorithmic decisions")
cautious_data_counties <- county_reliability %>% filter(algo_rec == "Use with caution - monitor outcomes")
need_more_data_counties <- county_reliability %>% filter(algo_rec == "Requires manual review or additional data")

Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
- Counties suitable for immediate algorithmic implementation:
[1] “Alcona” “Allegan” “Alpena” “Antrim” “Arenac”
[6] “Barry” “Bay” “Berrien” “Branch” “Calhoun”
[11] “Cass” “Charlevoix” “Cheboygan” “Chippewa” “Clare”
[16] “Clinton” “Dickinson” “Eaton” “Genesee” “Gladwin”
[21] “Grand Traverse” “Gratiot” “Hillsdale” “Houghton” “Huron”
[26] “Ingham” “Ionia” “Isabella” “Jackson” “Kalamazoo”
[31] “Kent” “Lapeer” “Lenawee” “Livingston” “Macomb”
[36] “Manistee” “Marquette” “Mason” “Midland” “Monroe”
[41] “Montcalm” “Muskegon” “Newaygo” “Oakland” “Oceana”
[46] “Ogemaw” “Osceola” “Ottawa” “Saginaw” “St. Clair”
[51] “St. Joseph” “Sanilac” “Shiawassee” “Tuscola” “Washtenaw”
[56] “Wayne”
- Counties requiring additional oversight:
[1] “Alger” “Baraga” “Benzie” “Crawford” “Delta” “Emmet”
[7] “Gogebic” “Iosco” “Iron” “Kalkaska” “Lake” “Leelanau”
[13] “Luce” “Mackinac” “Mecosta” “Menominee” “Missaukee” “Montmorency”
[19] “Ontonagon” “Oscoda” “Otsego” “Presque Isle” “Roscommon” “Van Buren”
[25] “Wexford”
These are counties where we are only moderately confident in the data. The data should not be used without monitoring the results for unusual patterns or distributions; if odd results appear, they should be examined more closely to identify what caused the outlier. Before conducting any analysis, the raw data for these counties should also be inspected for its shape: if the distribution, skew, or peaks look abnormal, or there are too many outliers, that county may not be suitable for that specific analysis. Judgments should be made on a case-by-case basis.
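A minimal sketch of that screening step, assuming the reliability table carries columns named `median_income` for the estimate (adjust the name to match your earlier pipeline):

library(tidyverse)

# Flag counties in the "use with caution" group whose income estimate sits
# far from the group mean -- a simple 2-SD screen, not a formal outlier test.
cautious_check <- cautious_data_counties %>%
  mutate(
    income_z = (median_income - mean(median_income)) / sd(median_income),
    flag_outlier = abs(income_z) > 2
  ) %>%
  arrange(desc(abs(income_z)))

# Inspect the shape of the estimates before any modeling decision.
ggplot(cautious_data_counties, aes(x = median_income)) +
  geom_histogram(bins = 10) +
  labs(title = "Median income, moderate-confidence counties")

Counties flagged here would then get the closer case-by-case look described above.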
- Counties needing alternative approaches:
[1] “Keweenaw” “Schoolcraft”
There are only two counties in Michigan where we are not confident in the reliability of the data at all, and more data from these counties should be collected if possible. First, we should try to determine why these two counties in particular are unreliable, for example by examining their sample sizes. As seen in the earlier analysis, these counties are among the five smallest in Michigan, which likely drives the unreliability. Because the ACS is a sample, collecting additional data from these counties could make the estimates more reliable, and we should go out of our way to do so. If that is not feasible, we could consider merging these counties with neighboring counties, especially if there are no major differences between the areas. If there are large differences, we should take a closer look at these counties to understand why such disparities occur and work on addressing them.
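The sample-size hypothesis could be checked directly. A sketch, assuming the reliability table has a `county` column with plain county names and a `moe_pct` column for the MOE percentage (rename to match your own pipeline); `get_acs()` requires a Census API key:

library(tidycensus)
library(tidyverse)

# Pull total population (ACS variable B01003_001) for Michigan counties.
mi_pop <- get_acs(
  geography = "county",
  variables = "B01003_001",
  state = "MI",
  year = 2022,
  survey = "acs5"
) %>%
  transmute(county = str_remove(NAME, " County, Michigan"),
            population = estimate)

# Join population onto the reliability table and check whether smaller
# counties systematically show larger MOE percentages.
reliability_vs_size <- county_reliability %>%
  left_join(mi_pop, by = "county")

cor(reliability_vs_size$population, reliability_vs_size$moe_pct,
    use = "complete.obs")

A strongly negative correlation would support sample size as the main driver; a weak one would point toward other factors worth investigating.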
Questions for Further Investigation
[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]
How does data unreliability change when we look at different variables (for example, income versus education or housing)?
Is sample size the biggest factor driving reliability, or is something else at work? Could this be examined by comparing reliability across different variables?
Michigan has ethnic enclaves (Japanese, Chinese, Korean, Middle Eastern) that the three racial categories in these analyses don't cover. How would including those groups change the analyses done here?
Technical Notes
Data Sources: - U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on [date]
Reproducibility: - All analysis conducted in R version [your version] - Census API key required for replication - Complete code and documentation available at: [your portfolio URL]
Methodology Notes: In 4.2, for the last question, I entered the table data manually rather than computing it from reproducible variables.
Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html