Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Angie Kwon

Published

February 24, 2026

Assignment Overview

Scenario

You are a data analyst for the Michigan Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If a value contains a special character such as a colon, wrap it in double quotes so that Quarto parses it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
# install.packages("tidycensus")
# install.packages("tidyverse")
# install.packages("knitr")
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key (keep the actual key out of published documents)
# census_api_key("YOUR_API_KEY", install = TRUE)

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Michigan"

State Selection: I have chosen Michigan for this analysis because I was born in Michigan, lived there until I was 6, and returned for high school.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
acsVariables <- c(med_HH_inc = "B19013_001", totpop = "B01003_001")
MI_2022 <- get_acs(geography="county", variables = acsVariables,
                        state = "MI", survey = "acs5", year = 2022, output = "wide")

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
MI_2022 <- MI_2022 %>%
  mutate(NAME = str_remove(NAME, " County, Michigan"))

# Display the first few rows
glimpse(MI_2022)
Rows: 83
Columns: 6
$ GEOID       <chr> "26001", "26003", "26005", "26007", "26009", "26011", "260…
$ NAME        <chr> "Alcona", "Alger", "Allegan", "Alpena", "Antrim", "Arenac"…
$ med_HH_incE <dbl> 50295, 55528, 75543, 49133, 68850, 53487, 51911, 75182, 57…
$ med_HH_incM <dbl> 2243, 2912, 2369, 2119, 3115, 2018, 2904, 2704, 2395, 4258…
$ totpopE     <dbl> 10238, 8866, 120189, 28911, 23662, 15031, 8245, 62581, 103…
$ totpopM     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    - High Confidence: MOE < 5%
    - Moderate Confidence: MOE 5-10%
    - Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
#without mutate()
#MI_2022$MOE_per <- (MI_2022$med_HH_incM/MI_2022$med_HH_incE)*100

MI_2022 <- MI_2022 %>%
  mutate(MOE_per = (med_HH_incM/med_HH_incE)*100)

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
MI_2022 <- MI_2022 %>% 
  mutate(reli_cat = case_when(MOE_per < 5 ~ 'High Confidence',
                                   MOE_per > 10 ~ 'Low Confidence',
                                   TRUE ~ 'Moderate Confidence'))
reli_cat_freq <- MI_2022 %>%
  count(reli_cat)
reli_cat_freq
# A tibble: 3 × 2
  reli_cat                n
  <chr>               <int>
1 High Confidence        56
2 Low Confidence          2
3 Moderate Confidence    25

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
#arrange(MI_2022, desc(MOE_per))
#MOE_min <- MI_2022 %>% slice_min(MOE_per, n=5)
MOE_max <- MI_2022 %>% slice_max(MOE_per, n=5)
MOE_max
# A tibble: 5 × 8
  GEOID NAME        med_HH_incE med_HH_incM totpopE totpopM MOE_per reli_cat    
  <chr> <chr>             <dbl>       <dbl>   <dbl>   <dbl>   <dbl> <chr>       
1 26083 Keweenaw          55560        7301    2088      NA   13.1  Low Confide…
2 26153 Schoolcraft       55071        6328    8062      NA   11.5  Low Confide…
3 26053 Gogebic           47913        4766   14597      NA    9.95 Moderate Co…
4 26137 Otsego            62865        5910   25221      NA    9.40 Moderate Co…
5 26119 Montmorency       46345        3796    9261      NA    8.19 Moderate Co…
# Format as table with kable() - include appropriate column names and caption
kable(MOE_max[, c(2,3,4,7,8)], "simple", col.names = c("Name", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the highest median income margins of error")
Top 5 counties with the highest median income margins of error
Name          Median Income   Median Income Margin of Error   Margin of Error Percentage   Reliability
-----------   -------------   -----------------------------   --------------------------   --------------------
Keweenaw              55560                            7301                    13.140749   Low Confidence
Schoolcraft           55071                            6328                    11.490621   Low Confidence
Gogebic               47913                            4766                     9.947196   Moderate Confidence
Otsego                62865                            5910                     9.401098   Moderate Confidence
Montmorency           46345                            3796                     8.190743   Moderate Confidence
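To make these margins concrete, recall that ACS margins of error are published at the 90% confidence level, so estimate ± MOE gives a 90% confidence interval directly; a quick sketch using the Keweenaw figures from the table above:

```r
# ACS MOEs are published at the 90% confidence level, so
# estimate +/- MOE is a 90% confidence interval directly.
estimate <- 55560   # Keweenaw median household income (from the table above)
moe      <- 7301
ci <- c(lower = estimate - moe, upper = estimate + moe)
ci
```

The interval runs from 48,259 to 62,861 dollars, a spread wide enough to move Keweenaw across a funding threshold in either direction.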
#consequence bearing hardest on smallest counties?
pop_min <- MI_2022 %>% slice_min(totpopE, n=5)
kable(pop_min[, c(2,5,3,4,7,8)], "simple", col.names = c("Name", "Total Population", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the lowest population")
Top 5 counties with the lowest population
Name          Total Population   Median Income   Median Income Margin of Error   Margin of Error Percentage   Reliability
-----------   ----------------   -------------   -----------------------------   --------------------------   --------------------
Keweenaw                  2088           55560                            7301                    13.140749   Low Confidence
Luce                      5442           51015                            3742                     7.335097   Moderate Confidence
Ontonagon                 5862           48316                            3844                     7.955957   Moderate Confidence
Schoolcraft               8062           55071                            6328                    11.490621   Low Confidence
Baraga                    8245           51911                            2904                     5.594190   Moderate Confidence

Data Quality Commentary:

These results show that even in a nationwide government survey, some margins of error are very high, which has policy implications: these counties may be over- or underestimated on various variables. Counties with higher margins of error may be overestimated as more affluent than they actually are, which could keep them from receiving the aid or funding they need, or underestimated as less wealthy than they actually are, which could divert resources to them needlessly when others need them more. The second table suggests that population size plays at least some role, since the two counties rated “Low Confidence” are both among the five smallest counties in Michigan.

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
filtered_MI <- filter(MI_2022, NAME == "Oakland" | NAME == "Mackinac" | NAME == "Keweenaw")

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
kable(filtered_MI[, c(2,3,7,8)], "simple", col.names = c("Name", "Median Income", "Margin of Error Percentage", "Reliability"), caption = "3 counties representing different data reliability levels")
3 counties representing different data reliability levels
Name       Median Income   Margin of Error Percentage   Reliability
--------   -------------   --------------------------   --------------------
Keweenaw           55560                    13.140749   Low Confidence
Mackinac           60620                     5.679644   Moderate Confidence
Oakland            92620                     1.028935   High Confidence

Comment on the output: The wide range of the margins of error is quite interesting, since all three counties come from one state, and presumably county boundaries are drawn to be as representative as possible. Of course that can’t always be helped, but it’s interesting to see regardless. Also, Oakland County is where I grew up!

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
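For the county-code challenge, note that a county GEOID is the two-digit state FIPS code followed by the three-digit county code, so the codes can be pulled straight out of the county-level GEOIDs (a sketch using the three GEOIDs selected below):

```r
# A county GEOID = 2-digit state FIPS + 3-digit county FIPS,
# so characters 3-5 are the county code ("26" is Michigan).
geoids <- c("26083", "26097", "26125")   # Keweenaw, Mackinac, Oakland
county_codes <- substr(geoids, 3, 5)
county_codes   # "083" "097" "125"
```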

# Define your race/ethnicity variables with descriptive names
race_eth_vars <- c(White = "B03002_003", Black = "B03002_004", Hisp_Lat = "B03002_012",
                   Total_Pop = "B03002_001")

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
race_eth_2022 <- get_acs(geography = "tract", variables = race_eth_vars,
                        # pass county FIPS codes as strings so leading zeros survive
                        state = "MI", county = c("083", "097", "125"),
                        survey = "acs5", year = 2022, output = "wide")

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
race_eth_2022 <- race_eth_2022 %>%
  mutate(perW = (WhiteE/Total_PopE)*100,
         perB = (BlackE/Total_PopE)*100,
         perH = (Hisp_LatE/Total_PopE)*100)

# Add readable tract and county name columns using str_extract() or similar
race_eth_2022 <- race_eth_2022 %>%
  mutate(tract = str_extract(NAME, "[0-9.]+"),
         NAME = str_extract(NAME, "Oakland|Mackinac|Keweenaw"))

race_eth_2022
# A tibble: 359 × 14
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 5 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>
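The percentage columns computed above carry uncertainty from both numerator and denominator. The Census Bureau’s approximation for the MOE of a derived proportion can be computed directly; this is a sketch of the standard ACS handbook formula (tidycensus also ships it as moe_prop()), and the example numbers are illustrative rather than taken from the lab data:

```r
# MOE of p = num/denom per the Census ACS handbook:
# sqrt(moe_num^2 - p^2 * moe_denom^2) / denom, falling back to the
# ratio formula (+ instead of -) when the radicand is negative.
moe_proportion <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  sqrt(radicand) / denom
}

# Illustrative values: 264 +/- 119 residents out of 2189 total (MOE 0)
moe_proportion(264, 2189, 119, 0)
```

With a zero denominator MOE this reduces to moe_num/denom, about a 5.4 percentage-point margin on a 12% share, which is why small tract counts look so noisy.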

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
hisp_max <- race_eth_2022 %>% slice_max(perH, n = 1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
avg_demo <- race_eth_2022 %>%
  group_by(NAME) %>%
  summarize(avg_W_per = mean(perW, na.rm = TRUE), 
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())

# Create a nicely formatted table of your results using kable()
kable(avg_demo[, c(1,5,2,3,4)], "simple", col.names = c("County", "Number of Tracts", "Average Percentage of White Residents", "Average Percentage of Black Residents", "Average Percentage of Hispanic Residents"), caption = "Average demographics of each county")
Average demographics of each county
County     Number of Tracts   Average Percentage of White Residents   Average Percentage of Black Residents   Average Percentage of Hispanic Residents
--------   ----------------   -------------------------------------   -------------------------------------   ----------------------------------------
Keweenaw                  3                                97.91467                                0.000000                                  0.7909875
Mackinac                  6                                74.24631                                3.418546                                  2.0344948
Oakland                 350                                69.26839                               14.115251                                  4.6550570

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_perW = (WhiteM/WhiteE)*100,
         MOE_perB = (BlackM/BlackE)*100,
         MOE_perH = (Hisp_LatM/Hisp_LatE)*100)
# get rid of all infinites
race_eth_2022[sapply(race_eth_2022, is.infinite)] <- NA

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement

#misunderstood this as any variable, as in any of the 3 is high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_cat = ifelse(MOE_perW > 15 | MOE_perB > 15 | MOE_perH > 15, 
                          "High MOE", "Low MOE"))
race_eth_2022$MOE_cat[is.na(race_eth_2022$MOE_cat)] <- "No Certainty"

#so second version where it's just black residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catB = ifelse(MOE_perB > 15,
                          "High Black MOE", "Low Black MOE"))
race_eth_2022$MOE_catB[is.na(race_eth_2022$MOE_catB)] <- "No Certainty"

race_eth_2022
# A tibble: 359 × 19
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 10 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>
# Create summary statistics showing how many tracts have data quality issues
data_qual_n <- race_eth_2022 %>%
  group_by(MOE_cat) %>%
  count(MOE_cat)
data_qual_n
# A tibble: 2 × 2
# Groups:   MOE_cat [2]
  MOE_cat          n
  <chr>        <int>
1 High MOE       351
2 No Certainty     8
data_qual_n_B <- race_eth_2022 %>%
  group_by(MOE_catB) %>%
  count(MOE_catB)
data_qual_n_B
# A tibble: 3 × 2
# Groups:   MOE_catB [3]
  MOE_catB           n
  <chr>          <int>
1 High Black MOE   322
2 Low Black MOE      6
3 No Certainty      31
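The base-R Inf cleanup used above (indexing with sapply(..., is.infinite)) can also be written dplyr-style with across(), which may read more naturally alongside the rest of the pipeline; a sketch on a toy tibble, not the lab data:

```r
library(dplyr)

# Toy data: an Inf arises when a zero estimate sits in a denominator
toy <- tibble::tibble(MOE_perB = c(12.5, Inf, 8.0))

# Replace infinite values with NA across every numeric column
toy <- toy %>%
  mutate(across(where(is.numeric), ~ ifelse(is.infinite(.x), NA, .x)))

toy$MOE_perB   # 12.5 NA 8.0
```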

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_qual_examine <- race_eth_2022 %>%
  group_by(MOE_cat) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_examine
# A tibble: 2 × 6
  MOE_cat      avg_pop avg_W_per avg_B_per avg_H_per count
  <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
1 High MOE       3656.      69.4      13.9      4.60   351
2 No Certainty    261.      95.8       0        1.58     8
data_qual_W <- race_eth_2022 %>%
  group_by(MOE_perW > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_W
# A tibble: 3 × 6
  `MOE_perW > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             3834.      81.8      4.01      3.86   165
2 TRUE              3489.      58.7     22.6       5.25   187
3 NA                   0      NaN      NaN       NaN        7
data_qual_B <- race_eth_2022 %>%
  group_by(MOE_perB > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_B
# A tibble: 3 × 6
  `MOE_perB > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             3526.      20.6      73.2      1.11     6
2 TRUE              3705.      68.9      13.8      4.81   322
3 NA                2289.      90.2       0        2.62    31
data_qual_H <- race_eth_2022 %>%
  group_by(MOE_perH > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_H
# A tibble: 3 × 6
  `MOE_perH > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             2482       70.8      12.4     13.6      2
2 TRUE              3690.      69.8      13.6      4.65   342
3 NA                1212.      58.2      25.1      0       15
# Note: & binds tighter than |, so the pairwise "two high" tests need explicit parentheses
race_eth_2022 <- race_eth_2022 %>% 
  mutate(indv_MOE_cat = case_when(MOE_perW > 15 & MOE_perB > 15 & MOE_perH > 15 ~ 'All High MOE',
                                  MOE_perW < 15 & MOE_perB < 15 & MOE_perH < 15 ~ 'All Low MOE',
                                  (MOE_perW > 15 & MOE_perB > 15) |
                                    (MOE_perW > 15 & MOE_perH > 15) |
                                    (MOE_perB > 15 & MOE_perH > 15) ~ 'Two High MOE',
                                  is.na(MOE_perW) | is.na(MOE_perH) | is.na(MOE_perB) ~ 'At least one NA',
                                  TRUE ~ 'One High MOE'))
race_eth_2022
# A tibble: 359 × 20
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 11 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>
detail_MOE_cat <- race_eth_2022 %>%
  group_by(indv_MOE_cat) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
detail_MOE_cat
# A tibble: 3 × 6
  indv_MOE_cat    avg_pop avg_W_per avg_B_per avg_H_per count
  <chr>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 All High MOE      3577.      58.6     21.6       5.61   165
2 At least one NA    261.      95.8      0         1.58     8
3 Two High MOE      3725.      79.0      7.12      3.71   186
#just white residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catW = ifelse(MOE_perW > 15,
                          "High White MOE", "Low White MOE"))
race_eth_2022$MOE_catW[is.na(race_eth_2022$MOE_catW)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 21
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 12 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>
data_qual_n_W <- race_eth_2022 %>%
  group_by(MOE_catW) %>%
  count(MOE_catW)
data_qual_n_W
# A tibble: 3 × 2
# Groups:   MOE_catW [3]
  MOE_catW           n
  <chr>          <int>
1 High White MOE   187
2 Low White MOE    165
3 No Certainty       7
#just hispanic residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catH = ifelse(MOE_perH > 15,
                          "High Hispanic MOE", "Low Hispanic MOE"))
race_eth_2022$MOE_catH[is.na(race_eth_2022$MOE_catH)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 22
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 13 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>, MOE_catH <chr>
data_qual_n_H <- race_eth_2022 %>%
  group_by(MOE_catH) %>%
  count(MOE_catH)
data_qual_n_H
# A tibble: 3 × 2
# Groups:   MOE_catH [3]
  MOE_catH              n
  <chr>             <int>
1 High Hispanic MOE   342
2 Low Hispanic MOE      2
3 No Certainty         15
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
race <- c("White", "Black", "Hispanic")
high_MOE_n <- c(187, 322, 342)
low_MOE_n <- c(165, 6, 2)
no_certain_n <- c(7, 31, 15)
race_n_high_MOE <- data.frame(race, high_MOE_n, low_MOE_n, no_certain_n)

race_n_high_MOE <- race_n_high_MOE %>%
  mutate(per_High = high_MOE_n/(high_MOE_n+low_MOE_n)*100)
race_n_high_MOE
      race high_MOE_n low_MOE_n no_certain_n per_High
1    White        187       165            7 53.12500
2    Black        322         6           31 98.17073
3 Hispanic        342         2           15 99.41860

Pattern Analysis: The starkest pattern is that, while the population estimates for all three groups have quite high margins of error overall, the effect is significantly stronger for Black and Hispanic/Latino residents. White residents have 187 tracts where the margin of error is higher than 15%, which amounts to 53.13% of all the tracts in Oakland, Mackinac, and Keweenaw counties. Black residents, on the other hand, have 322, which is 98.17% of all tracts, and Hispanic residents have 342, which is 99.42% of all tracts. This shows that the data for Black and Hispanic residents are much less reliable than the data for White residents. This might be because there are fewer Black and Hispanic people in these counties in general, leading to huge swings in their estimates from tract to tract, or because they are chronically undercounted in data collection due to human error, bias, or prejudice.
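The small-population explanation can be seen mechanically: absolute ACS MOEs do not shrink in proportion to the estimate, so the percentage MOE explodes as group counts get small. A toy illustration with made-up numbers, not ACS data:

```r
# With a roughly fixed absolute MOE, percentage MOE blows up
# as the estimated count shrinks (values are illustrative).
estimates <- c(2000, 200, 20)
moe       <- 11
moe_pct   <- moe / estimates * 100
moe_pct
```

The same ±11-person margin is negligible for a 2,000-person group but overwhelms a 20-person one, which is exactly the situation for small Black and Hispanic populations in these rural tracts.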

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  • Overall Pattern Identification
  • Equity Assessment
  • Root Cause Analysis
  • Strategic Recommendations

Executive Summary:

Overall, my analyses show several systematic patterns. First, smaller counties with fewer people generally have less reliable data. This is largely because smaller sample sizes make it difficult to extrapolate findings, leaving the significance of any statistical relationship in doubt. Another pattern is something briefly discussed in the previous section: among the three counties (Oakland, Mackinac, and Keweenaw), Black and Hispanic residents’ data were clearly less reliable. This may be due to sample size, or because few Black and Hispanic people live in those counties, or because something more sinister is going on, where they are systematically and purposefully undercounted or statistically misrepresented.

All of this shows that Black and Hispanic communities are at great risk of algorithmic bias. The data technically exist, but the large margins of error make it clear that they should be used sparingly and cautiously. Though cautious data usage is good practice, it is disadvantageous that data about these communities of color can only be used after taking extra precautions. This added difficulty in accessing and analyzing the data may lead policymakers to avoid using it altogether, either out of fear of misrepresenting their constituents or out of laziness.

It’s likely that few Black and Hispanic people live in these areas, whether by chance or because of systemic disadvantages, as in the case of redlining. This risks reinforcing white-centric social norms, with policies reflecting and amplifying them. The cause of both the poor data quality and the bias risk may be human bias, where some White surveyors might be unwilling to approach or collect data from non-White residents, consciously or not. If done subconsciously, this could foster the belief that the data is impartial and representative when that may not be the case.

There are plenty of psychology studies showing that simply acknowledging that you might be biased leads you to think through your actions more consciously and reduces personal bias. Employee workshops that build this kind of self-awareness could be a simple but effective way to start addressing these systematic issues. Another step would be to take a closer look at all the data, including variables beyond income, and across all other intersections of identity, to make sure no one gets swept under the rug. A closer look would also reveal which results are outliers versus actual patterns. If a county appears to be an outlier in every single analysis, it deserves closer scrutiny to resolve any disparities.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability <- MI_2022[,c(2,3,7,8)]
county_reliability
# A tibble: 83 × 4
   NAME    med_HH_incE MOE_per reli_cat           
   <chr>         <dbl>   <dbl> <chr>              
 1 Alcona        50295    4.46 High Confidence    
 2 Alger         55528    5.24 Moderate Confidence
 3 Allegan       75543    3.14 High Confidence    
 4 Alpena        49133    4.31 High Confidence    
 5 Antrim        68850    4.52 High Confidence    
 6 Arenac        53487    3.77 High Confidence    
 7 Baraga        51911    5.59 Moderate Confidence
 8 Barry         75182    3.60 High Confidence    
 9 Bay           57887    4.14 High Confidence    
10 Benzie        71327    5.97 Moderate Confidence
# ℹ 73 more rows
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
county_reliability <- county_reliability %>%
  mutate(algo_rec = case_when(reli_cat == "High Confidence" ~ "Safe for algorithmic decisions",
                              reli_cat == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
                              reli_cat == "Low Confidence" ~ "Requires manual review or additional data"))

# Format as a professional table with kable()
kable(county_reliability, "simple", col.names = c("County", "Median Income Estimate", "Margin of Error Percentage", "Reliability Category", "Algorithm Recommendation"), caption = "Algorithm Implementation Decision Framework")
Algorithm Implementation Decision Framework
County Median Income Estimate Margin of Error Percentage Reliability Category Algorithm Recommendation
Alcona 50295 4.4596878 High Confidence Safe for algorithmic decisions
Alger 55528 5.2442011 Moderate Confidence Use with caution - monitor outcomes
Allegan 75543 3.1359623 High Confidence Safe for algorithmic decisions
Alpena 49133 4.3127837 High Confidence Safe for algorithmic decisions
Antrim 68850 4.5243282 High Confidence Safe for algorithmic decisions
Arenac 53487 3.7728794 High Confidence Safe for algorithmic decisions
Baraga 51911 5.5941901 Moderate Confidence Use with caution - monitor outcomes
Barry 75182 3.5966056 High Confidence Safe for algorithmic decisions
Bay 57887 4.1373711 High Confidence Safe for algorithmic decisions
Benzie 71327 5.9696889 Moderate Confidence Use with caution - monitor outcomes
Berrien 60379 2.6101790 High Confidence Safe for algorithmic decisions
Branch 60600 3.7656766 High Confidence Safe for algorithmic decisions
Calhoun 58191 2.6309910 High Confidence Safe for algorithmic decisions
Cass 65183 4.1222405 High Confidence Safe for algorithmic decisions
Charlevoix 69764 3.4358695 High Confidence Safe for algorithmic decisions
Cheboygan 59557 4.0566180 High Confidence Safe for algorithmic decisions
Chippewa 58408 3.0766333 High Confidence Safe for algorithmic decisions
Clare 47816 4.3311862 High Confidence Safe for algorithmic decisions
Clinton 82594 2.7520159 High Confidence Safe for algorithmic decisions
Crawford 57998 5.9346874 Moderate Confidence Use with caution - monitor outcomes
Delta 53852 6.3303127 Moderate Confidence Use with caution - monitor outcomes
Dickinson 59651 4.3637156 High Confidence Safe for algorithmic decisions
Eaton 77158 2.6089323 High Confidence Safe for algorithmic decisions
Emmet 69690 5.7181805 Moderate Confidence Use with caution - monitor outcomes
Genesee 58594 2.2408438 High Confidence Safe for algorithmic decisions
Gladwin 53717 2.7589031 High Confidence Safe for algorithmic decisions
Gogebic 47913 9.9471960 Moderate Confidence Use with caution - monitor outcomes
Grand Traverse 75553 2.9899541 High Confidence Safe for algorithmic decisions
Gratiot 57934 3.8630165 High Confidence Safe for algorithmic decisions
Hillsdale 59425 3.9713925 High Confidence Safe for algorithmic decisions
Houghton 52736 4.8600576 High Confidence Safe for algorithmic decisions
Huron 54475 2.5810005 High Confidence Safe for algorithmic decisions
Ingham 62548 2.9641236 High Confidence Safe for algorithmic decisions
Ionia 71720 3.4021194 High Confidence Safe for algorithmic decisions
Iosco 46224 5.2808065 Moderate Confidence Use with caution - monitor outcomes
Iron 52241 6.6805766 Moderate Confidence Use with caution - monitor outcomes
Isabella 52638 4.2706790 High Confidence Safe for algorithmic decisions
Jackson 62581 2.5247280 High Confidence Safe for algorithmic decisions
Kalamazoo 67905 2.3061630 High Confidence Safe for algorithmic decisions
Kalkaska 56380 5.7343029 Moderate Confidence Use with caution - monitor outcomes
Kent 76247 1.6721969 High Confidence Safe for algorithmic decisions
Keweenaw 55560 13.1407487 Low Confidence Requires manual review or additional data
Lake 45946 5.1299351 Moderate Confidence Use with caution - monitor outcomes
Lapeer 75402 3.0556219 High Confidence Safe for algorithmic decisions
Leelanau 82345 5.3202987 Moderate Confidence Use with caution - monitor outcomes
Lenawee 65484 2.6739356 High Confidence Safe for algorithmic decisions
Livingston 96135 1.9628647 High Confidence Safe for algorithmic decisions
Luce 51015 7.3350975 Moderate Confidence Use with caution - monitor outcomes
Mackinac 60620 5.6796437 Moderate Confidence Use with caution - monitor outcomes
Macomb 73876 1.2250257 High Confidence Safe for algorithmic decisions
Manistee 59467 4.3217247 High Confidence Safe for algorithmic decisions
Marquette 63115 4.0133090 High Confidence Safe for algorithmic decisions
Mason 60744 3.7024233 High Confidence Safe for algorithmic decisions
Mecosta 54132 5.8763763 Moderate Confidence Use with caution - monitor outcomes
Menominee 54074 5.3870622 Moderate Confidence Use with caution - monitor outcomes
Midland 73643 4.3615822 High Confidence Safe for algorithmic decisions
Missaukee 57667 6.6051641 Moderate Confidence Use with caution - monitor outcomes
Monroe 72573 2.7930498 High Confidence Safe for algorithmic decisions
Montcalm 61250 3.8742857 High Confidence Safe for algorithmic decisions
Montmorency 46345 8.1907433 Moderate Confidence Use with caution - monitor outcomes
Muskegon 61347 2.1810357 High Confidence Safe for algorithmic decisions
Newaygo 59065 3.1558453 High Confidence Safe for algorithmic decisions
Oakland 92620 1.0289354 High Confidence Safe for algorithmic decisions
Oceana 60691 4.9315384 High Confidence Safe for algorithmic decisions
Ogemaw 50377 4.9189114 High Confidence Safe for algorithmic decisions
Ontonagon 48316 7.9559566 Moderate Confidence Use with caution - monitor outcomes
Osceola 54875 3.8159453 High Confidence Safe for algorithmic decisions
Oscoda 48692 5.3006654 Moderate Confidence Use with caution - monitor outcomes
Otsego 62865 9.4010976 Moderate Confidence Use with caution - monitor outcomes
Ottawa 83932 2.1791450 High Confidence Safe for algorithmic decisions
Presque Isle 55986 6.3730218 Moderate Confidence Use with caution - monitor outcomes
Roscommon 49898 6.2567638 Moderate Confidence Use with caution - monitor outcomes
Saginaw 56579 2.4832535 High Confidence Safe for algorithmic decisions
St. Clair 66887 2.5819666 High Confidence Safe for algorithmic decisions
St. Joseph 62281 2.6348325 High Confidence Safe for algorithmic decisions
Sanilac 55740 3.9415142 High Confidence Safe for algorithmic decisions
Schoolcraft 55071 11.4906212 Low Confidence Requires manual review or additional data
Shiawassee 62498 4.5153445 High Confidence Safe for algorithmic decisions
Tuscola 59815 3.1563989 High Confidence Safe for algorithmic decisions
Van Buren 65531 5.1822801 Moderate Confidence Use with caution - monitor outcomes
Washtenaw 84245 2.7977922 High Confidence Safe for algorithmic decisions
Wayne 57223 0.9838701 High Confidence Safe for algorithmic decisions
Wexford 58652 5.8071336 Moderate Confidence Use with caution - monitor outcomes
safe_data_counties <- county_reliability %>% filter(algo_rec == "Safe for algorithmic decisions")
cautious_data_counties <- county_reliability %>% filter(algo_rec == "Use with caution - monitor outcomes")
need_more_data_counties <- county_reliability %>% filter(algo_rec == "Requires manual review or additional data")
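A quick tally of the `county_reliability` table built above shows how many of Michigan's 83 counties fall into each bucket, which is useful context for the recommendations that follow:

```r
# Count counties per recommendation category
# (per the lists below: 56 safe, 25 monitored, 2 manual review)
county_reliability %>%
  count(algo_rec, name = "n_counties")
```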

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation:

[1] “Alcona” “Allegan” “Alpena” “Antrim” “Arenac”
[6] “Barry” “Bay” “Berrien” “Branch” “Calhoun”
[11] “Cass” “Charlevoix” “Cheboygan” “Chippewa” “Clare”
[16] “Clinton” “Dickinson” “Eaton” “Genesee” “Gladwin”
[21] “Grand Traverse” “Gratiot” “Hillsdale” “Houghton” “Huron”
[26] “Ingham” “Ionia” “Isabella” “Jackson” “Kalamazoo”
[31] “Kent” “Lapeer” “Lenawee” “Livingston” “Macomb”
[36] “Manistee” “Marquette” “Mason” “Midland” “Monroe”
[41] “Montcalm” “Muskegon” “Newaygo” “Oakland” “Oceana”
[46] “Ogemaw” “Osceola” “Ottawa” “Saginaw” “St. Clair”
[51] “St. Joseph” “Sanilac” “Shiawassee” “Tuscola” “Washtenaw”
[56] “Wayne”

  2. Counties requiring additional oversight:

[1] “Alger” “Baraga” “Benzie” “Crawford” “Delta” “Emmet”
[7] “Gogebic” “Iosco” “Iron” “Kalkaska” “Lake” “Leelanau”
[13] “Luce” “Mackinac” “Mecosta” “Menominee” “Missaukee” “Montmorency”
[19] “Ontonagon” “Oscoda” “Otsego” “Presque Isle” “Roscommon” “Van Buren”
[25] “Wexford”

These are counties where we are only moderately confident in the data. The data should not be used without monitoring the results for unusual patterns or distributions; if odd results appear, they should be examined more closely to determine what caused the outlier. The shape of the raw data should also be examined before conducting any analysis. If the distribution, skew, or peaks look abnormal, or there are too many outliers, that county should be excluded from that specific analysis. Judgments should be made on a case-by-case basis.
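The distribution check described above can be sketched as follows, using the `cautious_data_counties` table built earlier (ggplot2 is loaded via tidyverse):

```r
# Inspect the spread of income estimates for the moderate-confidence counties
summary(cautious_data_counties$med_HH_incE)

# Look at the shape of the distribution before relying on these estimates
ggplot(cautious_data_counties, aes(x = med_HH_incE)) +
  geom_histogram(bins = 10) +
  labs(
    title = "Median household income, moderate-confidence counties",
    x = "Median household income (2022 dollars)",
    y = "Number of counties"
  )
```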

  3. Counties needing alternative approaches:

[1] “Keweenaw” “Schoolcraft”

There are only two counties in Michigan where we have low confidence in the reliability of the data, and more data from these counties should be collected if possible. First, we should try to find out why these two counties in particular are unreliable, for example by examining the sample size. As seen in earlier analysis, these counties are among the five smallest in Michigan, which likely contributes to the unreliability. Because the ACS is a sample, additional data collection in these counties could make the estimates more reliable. If that is not feasible, we might consider pooling these counties with neighboring counties, especially if there are no major differences between the areas. If there are large differences, then these counties deserve a closer look to understand why such disparities exist, and work toward ameliorating them.
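To test the sample-size explanation directly, one could join the reliability table with total county population (ACS table B01003) and look at the relationship. This is a sketch under the same API-key assumptions as earlier; note that `get_acs()` returns names like "Alcona County, Michigan", so the names are cleaned before joining:

```r
# Pull total population per county and strip the name suffix to match
# the NAME column in county_reliability (assumed to hold bare county names)
mi_pop <- get_acs(
  geography = "county",
  variables = c(total_pop = "B01003_001"),
  state = "MI",
  year = 2022,
  survey = "acs5"
) %>%
  mutate(NAME = str_remove(NAME, " County, Michigan"))

reliability_vs_size <- county_reliability %>%
  left_join(mi_pop %>% select(NAME, total_pop = estimate), by = "NAME")

# If small samples drive unreliability, MOE% should fall as population grows
ggplot(reliability_vs_size, aes(x = total_pop, y = MOE_per)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "County population (log scale)", y = "MOE as % of estimate")
```

A clear downward trend would support treating population (and hence sample size) as the main driver of reliability; points far above the trend would flag counties with other data-quality problems.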

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

  1. How does data reliability change when we examine different ACS variables for the same counties?

  2. Is sample size the biggest factor driving reliability, or are other factors at work, and can this be examined by comparing reliability patterns across different variables?

  3. In Michigan, there are ethnic enclaves (Japanese, Chinese, Korean, Middle Eastern) that the three racial categories in these analyses don't cover. How would including those groups change the analyses done here?

Technical Notes

Data Sources: - U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on [date]

Reproducibility: - All analysis conducted in R version [your version] - Census API key required for replication - Complete code and documentation available at: [your portfolio URL]

Methodology Notes: For the last question in section 4.2, I entered the table data manually instead of computing it from reproducible variables.

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html