Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Angie Kwon

Published

February 24, 2026

Assignment Overview

Scenario

You are a data analyst for the Michigan Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If a value contains a special character such as a colon, wrap it in double quotes so that Quarto parses it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
# install.packages("tidycensus")
# install.packages("tidyverse")
# install.packages("knitr")
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key (keep the actual key out of published documents)
# census_api_key("YOUR_API_KEY", install = TRUE)

# Choose your state for analysis - assign it to a variable called my_state
my_state <- "Michigan"

State Selection: I have chosen Michigan for this analysis because I was born in Michigan, lived there until I was 6, and returned for high school.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
acsVariables <- c(med_HH_inc = "B19013_001", totpop = "B01003_001")
MI_2022 <- get_acs(geography="county", variables = acsVariables,
                        state = "MI", survey = "acs5", year = 2022, output = "wide")

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
MI_2022 <- MI_2022 %>%
  mutate(NAME = str_remove(NAME, " County, Michigan"))

# Display the first few rows
glimpse(MI_2022)
Rows: 83
Columns: 6
$ GEOID       <chr> "26001", "26003", "26005", "26007", "26009", "26011", "260…
$ NAME        <chr> "Alcona", "Alger", "Allegan", "Alpena", "Antrim", "Arenac"…
$ med_HH_incE <dbl> 50295, 55528, 75543, 49133, 68850, 53487, 51911, 75182, 57…
$ med_HH_incM <dbl> 2243, 2912, 2369, 2119, 3115, 2018, 2904, 2704, 2395, 4258…
$ totpopE     <dbl> 10238, 8866, 120189, 28911, 23662, 15031, 8245, 62581, 103…
$ totpopM     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
    - High Confidence: MOE < 5%
    - Moderate Confidence: MOE 5-10%
    - Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
#without mutate()
#MI_2022$MOE_per <- (MI_2022$med_HH_incM/MI_2022$med_HH_incE)*100

MI_2022 <- MI_2022 %>%
  mutate(MOE_per = (med_HH_incM/med_HH_incE)*100)

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
MI_2022 <- MI_2022 %>% 
  mutate(reli_cat = case_when(MOE_per < 5 ~ 'High Confidence',
                                   MOE_per > 10 ~ 'Low Confidence',
                                   TRUE ~ 'Moderate Confidence'))
reli_cat_freq <- MI_2022 %>%
  count(reli_cat)
reli_cat_freq
# A tibble: 3 × 2
  reli_cat                n
  <chr>               <int>
1 High Confidence        56
2 Low Confidence          2
3 Moderate Confidence    25

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
#arrange(MI_2022, desc(MOE_per))
#MOE_min <- MI_2022 %>% slice_min(MOE_per, n=5)
MOE_max <- MI_2022 %>% slice_max(MOE_per, n=5)
MOE_max
# A tibble: 5 × 8
  GEOID NAME        med_HH_incE med_HH_incM totpopE totpopM MOE_per reli_cat    
  <chr> <chr>             <dbl>       <dbl>   <dbl>   <dbl>   <dbl> <chr>       
1 26083 Keweenaw          55560        7301    2088      NA   13.1  Low Confide…
2 26153 Schoolcraft       55071        6328    8062      NA   11.5  Low Confide…
3 26053 Gogebic           47913        4766   14597      NA    9.95 Moderate Co…
4 26137 Otsego            62865        5910   25221      NA    9.40 Moderate Co…
5 26119 Montmorency       46345        3796    9261      NA    8.19 Moderate Co…
# Format as table with kable() - include appropriate column names and caption
kable(MOE_max[, c(2,3,4,7,8)], "simple", col.names = c("Name", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the highest median income margins of error")
Top 5 counties with the highest median income margins of error
Name          Median Income   Median Income Margin of Error   Margin of Error Percentage   Reliability
-----------   -------------   -----------------------------   --------------------------   --------------------
Keweenaw              55560                            7301                    13.140749   Low Confidence
Schoolcraft           55071                            6328                    11.490621   Low Confidence
Gogebic               47913                            4766                     9.947196   Moderate Confidence
Otsego                62865                            5910                     9.401098   Moderate Confidence
Montmorency           46345                            3796                     8.190743   Moderate Confidence
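To make these margins concrete, recall that ACS margins of error are published at the 90% confidence level, so estimate ± MOE gives a 90% confidence interval directly; a quick sketch using the Keweenaw figures from the table above:

```r
# ACS MOEs are published at the 90% confidence level, so
# estimate +/- MOE is a 90% confidence interval directly.
estimate <- 55560   # Keweenaw median household income (from the table above)
moe      <- 7301
ci <- c(lower = estimate - moe, upper = estimate + moe)
ci
```

The interval runs from 48,259 to 62,861 dollars, a spread wide enough to move Keweenaw across a funding threshold in either direction.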
#consequence bearing hardest on smallest counties?
pop_min <- MI_2022 %>% slice_min(totpopE, n=5)
kable(pop_min[, c(2,5,3,4,7,8)], "simple", col.names = c("Name", "Total Population", "Median Income", "Median Income Margin of Error", "Margin of Error Percentage", "Reliability"), caption = "Top 5 counties with the lowest population")
Top 5 counties with the lowest population
Name          Total Population   Median Income   Median Income Margin of Error   Margin of Error Percentage   Reliability
-----------   ----------------   -------------   -----------------------------   --------------------------   --------------------
Keweenaw                  2088           55560                            7301                    13.140749   Low Confidence
Luce                      5442           51015                            3742                     7.335097   Moderate Confidence
Ontonagon                 5862           48316                            3844                     7.955957   Moderate Confidence
Schoolcraft               8062           55071                            6328                    11.490621   Low Confidence
Baraga                    8245           51911                            2904                     5.594190   Moderate Confidence

Data Quality Commentary:

These results show that even in a nationwide government survey, some margins of error are very high, which has policy implications: these counties may be over- or underestimated on various variables. Counties with higher margins of error may be overestimated as more affluent than they actually are, which could keep them from receiving the aid or funding they need, or underestimated as less wealthy than they actually are, which could divert resources to them needlessly when others need them more. The second table suggests that population size plays at least some role, since the two counties rated “Low Confidence” are both among the five smallest counties in Michigan.

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
filtered_MI <- filter(MI_2022, NAME == "Oakland" | NAME == "Mackinac" | NAME == "Keweenaw")

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
kable(filtered_MI[, c(2,3,7,8)], "simple", col.names = c("Name", "Median Income", "Margin of Error Percentage", "Reliability"), caption = "3 counties representing different data reliability levels")
3 counties representing different data reliability levels
Name       Median Income   Margin of Error Percentage   Reliability
--------   -------------   --------------------------   --------------------
Keweenaw           55560                    13.140749   Low Confidence
Mackinac           60620                     5.679644   Moderate Confidence
Oakland            92620                     1.028935   High Confidence

Comment on the output: The wide range of the margins of error is quite interesting, since all three counties come from one state, and presumably county boundaries are drawn to be as representative as possible. Of course that can’t always be helped, but it’s interesting to see regardless. Also, Oakland County is where I grew up!

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
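For the county-code challenge, note that a county GEOID is the two-digit state FIPS code followed by the three-digit county code, so the codes can be pulled straight out of the county-level GEOIDs (a sketch using the three GEOIDs selected below):

```r
# A county GEOID = 2-digit state FIPS + 3-digit county FIPS,
# so characters 3-5 are the county code ("26" is Michigan).
geoids <- c("26083", "26097", "26125")   # Keweenaw, Mackinac, Oakland
county_codes <- substr(geoids, 3, 5)
county_codes   # "083" "097" "125"
```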

# Define your race/ethnicity variables with descriptive names
race_eth_vars <- c(White = "B03002_003", Black = "B03002_004", Hisp_Lat = "B03002_012",
                   Total_Pop = "B03002_001")

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
race_eth_2022 <- get_acs(geography = "tract", variables = race_eth_vars,
                        # pass county FIPS codes as strings so leading zeros survive
                        state = "MI", county = c("083", "097", "125"),
                        survey = "acs5", year = 2022, output = "wide")

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
race_eth_2022 <- race_eth_2022 %>%
  mutate(perW = (WhiteE/Total_PopE)*100,
         perB = (BlackE/Total_PopE)*100,
         perH = (Hisp_LatE/Total_PopE)*100)

# Add readable tract and county name columns using str_extract() or similar
race_eth_2022 <- race_eth_2022 %>%
  mutate(tract = str_extract(NAME, "[0-9.]+"),
         NAME = str_extract(NAME, "Oakland|Mackinac|Keweenaw"))

race_eth_2022
# A tibble: 359 × 14
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 5 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>
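The percentage columns computed above carry uncertainty from both numerator and denominator. The Census Bureau’s approximation for the MOE of a derived proportion can be computed directly; this is a sketch of the standard ACS handbook formula (tidycensus also ships it as moe_prop()), and the example numbers are illustrative rather than taken from the lab data:

```r
# MOE of p = num/denom per the Census ACS handbook:
# sqrt(moe_num^2 - p^2 * moe_denom^2) / denom, falling back to the
# ratio formula (+ instead of -) when the radicand is negative.
moe_proportion <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  sqrt(radicand) / denom
}

# Illustrative values: 264 +/- 119 residents out of 2189 total (MOE 0)
moe_proportion(264, 2189, 119, 0)
```

With a zero denominator MOE this reduces to moe_num/denom, about a 5.4 percentage-point margin on a 12% share, which is why small tract counts look so noisy.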

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
hisp_max <- race_eth_2022 %>% slice_max(perH, n = 1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
avg_demo <- race_eth_2022 %>%
  group_by(NAME) %>%
  summarize(avg_W_per = mean(perW, na.rm = TRUE), 
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())

# Create a nicely formatted table of your results using kable()
kable(avg_demo[, c(1,5,2,3,4)], "simple", col.names = c("County", "Number of Tracts", "Average Percentage of White Residents", "Average Percentage of Black Residents", "Average Percentage of Hispanic Residents"), caption = "Average demographics of each county")
Average demographics of each county
County     Number of Tracts   Average Percentage of White Residents   Average Percentage of Black Residents   Average Percentage of Hispanic Residents
--------   ----------------   -------------------------------------   -------------------------------------   ----------------------------------------
Keweenaw                  3                                97.91467                                0.000000                                  0.7909875
Mackinac                  6                                74.24631                                3.418546                                  2.0344948
Oakland                 350                                69.26839                               14.115251                                  4.6550570

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_perW = (WhiteM/WhiteE)*100,
         MOE_perB = (BlackM/BlackE)*100,
         MOE_perH = (Hisp_LatM/Hisp_LatE)*100)
# get rid of all infinites
race_eth_2022[sapply(race_eth_2022, is.infinite)] <- NA

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement

#misunderstood this as any variable, as in any of the 3 is high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_cat = ifelse(MOE_perW > 15 | MOE_perB > 15 | MOE_perH > 15, 
                          "High MOE", "Low MOE"))
race_eth_2022$MOE_cat[is.na(race_eth_2022$MOE_cat)] <- "No Certainty"

#so second version where it's just black residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catB = ifelse(MOE_perB > 15,
                          "High Black MOE", "Low Black MOE"))
race_eth_2022$MOE_catB[is.na(race_eth_2022$MOE_catB)] <- "No Certainty"

race_eth_2022
# A tibble: 359 × 19
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 10 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>
# Create summary statistics showing how many tracts have data quality issues
data_qual_n <- race_eth_2022 %>%
  group_by(MOE_cat) %>%
  count(MOE_cat)
data_qual_n
# A tibble: 2 × 2
# Groups:   MOE_cat [2]
  MOE_cat          n
  <chr>        <int>
1 High MOE       351
2 No Certainty     8
data_qual_n_B <- race_eth_2022 %>%
  group_by(MOE_catB) %>%
  count(MOE_catB)
data_qual_n_B
# A tibble: 3 × 2
# Groups:   MOE_catB [3]
  MOE_catB           n
  <chr>          <int>
1 High Black MOE   322
2 Low Black MOE      6
3 No Certainty      31
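The base-R Inf cleanup used above (indexing with sapply(..., is.infinite)) can also be written dplyr-style with across(), which may read more naturally alongside the rest of the pipeline; a sketch on a toy tibble, not the lab data:

```r
library(dplyr)

# Toy data: an Inf arises when a zero estimate sits in a denominator
toy <- tibble::tibble(MOE_perB = c(12.5, Inf, 8.0))

# Replace infinite values with NA across every numeric column
toy <- toy %>%
  mutate(across(where(is.numeric), ~ ifelse(is.infinite(.x), NA, .x)))

toy$MOE_perB   # 12.5 NA 8.0
```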

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_qual_examine <- race_eth_2022 %>%
  group_by(MOE_cat) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_examine
# A tibble: 2 × 6
  MOE_cat      avg_pop avg_W_per avg_B_per avg_H_per count
  <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
1 High MOE       3656.      69.4      13.9      4.60   351
2 No Certainty    261.      95.8       0        1.58     8
data_qual_W <- race_eth_2022 %>%
  group_by(MOE_perW > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_W
# A tibble: 3 × 6
  `MOE_perW > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             3834.      81.8      4.01      3.86   165
2 TRUE              3489.      58.7     22.6       5.25   187
3 NA                   0      NaN      NaN       NaN        7
data_qual_B <- race_eth_2022 %>%
  group_by(MOE_perB > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_B
# A tibble: 3 × 6
  `MOE_perB > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             3526.      20.6      73.2      1.11     6
2 TRUE              3705.      68.9      13.8      4.81   322
3 NA                2289.      90.2       0        2.62    31
data_qual_H <- race_eth_2022 %>%
  group_by(MOE_perH > 15) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
data_qual_H
# A tibble: 3 × 6
  `MOE_perH > 15` avg_pop avg_W_per avg_B_per avg_H_per count
  <lgl>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 FALSE             2482       70.8      12.4     13.6      2
2 TRUE              3690.      69.8      13.6      4.65   342
3 NA                1212.      58.2      25.1      0       15
# Note: & binds tighter than |, so the pairwise "two high" tests need explicit parentheses
race_eth_2022 <- race_eth_2022 %>% 
  mutate(indv_MOE_cat = case_when(MOE_perW > 15 & MOE_perB > 15 & MOE_perH > 15 ~ 'All High MOE',
                                  MOE_perW < 15 & MOE_perB < 15 & MOE_perH < 15 ~ 'All Low MOE',
                                  (MOE_perW > 15 & MOE_perB > 15) |
                                    (MOE_perW > 15 & MOE_perH > 15) |
                                    (MOE_perB > 15 & MOE_perH > 15) ~ 'Two High MOE',
                                  is.na(MOE_perW) | is.na(MOE_perH) | is.na(MOE_perB) ~ 'At least one NA',
                                  TRUE ~ 'One High MOE'))
race_eth_2022
# A tibble: 359 × 20
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 11 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>
detail_MOE_cat <- race_eth_2022 %>%
  group_by(indv_MOE_cat) %>%
  summarize(avg_pop = mean(Total_PopE, na.rm = TRUE), 
            avg_W_per = mean(perW, na.rm = TRUE),
            avg_B_per = mean(perB, na.rm = TRUE),
            avg_H_per = mean(perH, na.rm = TRUE),
            count = n())
detail_MOE_cat
# A tibble: 3 × 6
  indv_MOE_cat    avg_pop avg_W_per avg_B_per avg_H_per count
  <chr>             <dbl>     <dbl>     <dbl>     <dbl> <int>
1 All High MOE      3577.      58.6     21.6       5.61   165
2 At least one NA    261.      95.8      0         1.58     8
3 Two High MOE      3725.      79.0      7.12      3.71   186
#just white residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catW = ifelse(MOE_perW > 15,
                          "High White MOE", "Low White MOE"))
race_eth_2022$MOE_catW[is.na(race_eth_2022$MOE_catW)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 21
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 12 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>
data_qual_n_W <- race_eth_2022 %>%
  group_by(MOE_catW) %>%
  count(MOE_catW)
data_qual_n_W
# A tibble: 3 × 2
# Groups:   MOE_catW [3]
  MOE_catW           n
  <chr>          <int>
1 High White MOE   187
2 Low White MOE    165
3 No Certainty       7
#just hispanic residents having high MOE
race_eth_2022 <- race_eth_2022 %>%
  mutate(MOE_catH = ifelse(MOE_perH > 15,
                          "High Hispanic MOE", "Low Hispanic MOE"))
race_eth_2022$MOE_catH[is.na(race_eth_2022$MOE_catH)] <- "No Certainty"
race_eth_2022
# A tibble: 359 × 22
   GEOID       NAME   WhiteE WhiteM BlackE BlackM Hisp_LatE Hisp_LatM Total_PopE
   <chr>       <chr>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>     <dbl>      <dbl>
 1 26083000100 Kewee…   1999      6      0     11        33         4       2086
 2 26083980100 Kewee…      2      1      0     11         0        11          2
 3 26083990100 Kewee…      0     11      0     11         0        11          0
 4 26097950100 Macki…   1536    238      0     11         0        11       1594
 5 26097950200 Macki…   1858    216      0     11        49        42       2399
 6 26097950300 Macki…   1485    185      3      4        13        12       1761
 7 26097950400 Macki…   1788    187    141    107       115        59       2900
 8 26097950500 Macki…   1126    142    264    119        75        42       2189
 9 26097990000 Macki…      0     11      0     11         0        11          0
10 26125120000 Oakla…   2910    279     20     21        20        24       3074
# ℹ 349 more rows
# ℹ 13 more variables: Total_PopM <dbl>, perW <dbl>, perB <dbl>, perH <dbl>,
#   tract <chr>, MOE_perW <dbl>, MOE_perB <dbl>, MOE_perH <dbl>, MOE_cat <chr>,
#   MOE_catB <chr>, indv_MOE_cat <chr>, MOE_catW <chr>, MOE_catH <chr>
data_qual_n_H <- race_eth_2022 %>%
  group_by(MOE_catH) %>%
  count(MOE_catH)
data_qual_n_H
# A tibble: 3 × 2
# Groups:   MOE_catH [3]
  MOE_catH              n
  <chr>             <int>
1 High Hispanic MOE   342
2 Low Hispanic MOE      2
3 No Certainty         15
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
race <- c("White", "Black", "Hispanic")
high_MOE_n <- c(187, 322, 342)
low_MOE_n <- c(165, 6, 2)
no_certain_n <- c(7, 31, 15)
race_n_high_MOE <- data.frame(race, high_MOE_n, low_MOE_n, no_certain_n)

race_n_high_MOE <- race_n_high_MOE %>%
  mutate(per_High = high_MOE_n/(high_MOE_n+low_MOE_n)*100)
race_n_high_MOE
      race high_MOE_n low_MOE_n no_certain_n per_High
1    White        187       165            7 53.12500
2    Black        322         6           31 98.17073
3 Hispanic        342         2           15 99.41860

Pattern Analysis: The starkest pattern is that, while the population estimates for all three groups have quite high margins of error overall, the effect is significantly stronger for Black and Hispanic/Latino residents. White residents have 187 tracts where the margin of error is higher than 15%, which amounts to 53.13% of all the tracts in Oakland, Mackinac, and Keweenaw counties. Black residents, on the other hand, have 322, which is 98.17% of all tracts, and Hispanic residents have 342, which is 99.42% of all tracts. This shows that the data for Black and Hispanic residents are much less reliable than the data for White residents. This might be because there are fewer Black and Hispanic people in these counties in general, leading to huge swings in their estimates from tract to tract, or because they are chronically undercounted in data collection due to human error, bias, or prejudice.
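The small-population explanation can be seen mechanically: absolute ACS MOEs do not shrink in proportion to the estimate, so the percentage MOE explodes as group counts get small. A toy illustration with made-up numbers, not ACS data:

```r
# With a roughly fixed absolute MOE, percentage MOE blows up
# as the estimated count shrinks (values are illustrative).
estimates <- c(2000, 200, 20)
moe       <- 11
moe_pct   <- moe / estimates * 100
moe_pct
```

The same ±11-person margin is negligible for a 2,000-person group but overwhelms a 20-person one, which is exactly the situation for small Black and Hispanic populations in these rural tracts.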

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  • Overall Pattern Identification
  • Equity Assessment
  • Root Cause Analysis
  • Strategic Recommendations

Executive Summary:

Overall, my analyses show several systematic patterns. First, smaller counties with fewer people generally have less reliable data. This is largely because smaller sample sizes make it difficult to extrapolate findings, leaving the significance of any statistical relationship in doubt. Another pattern is something briefly discussed in the previous section: among the three counties (Oakland, Mackinac, and Keweenaw), Black and Hispanic residents’ data were clearly less reliable. This may be due to sample size, or because few Black and Hispanic people live in those counties, or because something more sinister is going on, where they are systematically and purposefully undercounted or statistically misrepresented.

All of this shows that Black and Hispanic communities are at great risk of algorithmic bias. The data technically exist, but the large margins of error make it clear that they should be used sparingly and cautiously. Though cautious data usage is good practice, it is disadvantageous that data about these communities of color can only be used after taking extra precautions. This added difficulty in accessing and analyzing the data may lead policymakers to avoid using it altogether, either out of fear of misrepresenting their constituents or out of laziness.

It’s likely that few Black and Hispanic people live in these areas, whether by chance or because of systemic disadvantages, as in the case of redlining. This risks reinforcing white-centric social norms, with policies reflecting and amplifying them. The cause of both the poor data quality and the bias risk may be human bias, where some White surveyors might be unwilling to approach or collect data from non-White residents, consciously or not. If done subconsciously, this could foster the belief that the data is impartial and representative when that may not be the case.

There are plenty of psychology studies showing that simply acknowledging that you might be biased leads you to think through your actions more consciously and reduces personal bias. Employee workshops that build this kind of self-awareness could be a simple but effective way to start addressing these systematic issues. Another step would be to take a closer look at all the data, including variables beyond income, and across all other intersections of identity, to make sure no one gets swept under the rug. A closer look would also reveal which results are outliers versus actual patterns. If a county appears to be an outlier in every single analysis, it deserves closer scrutiny to resolve any disparities.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_reliability <- MI_2022[,c(2,3,7,8)]
county_reliability
# A tibble: 83 × 4
   NAME    med_HH_incE MOE_per reli_cat           
   <chr>         <dbl>   <dbl> <chr>              
 1 Alcona        50295    4.46 High Confidence    
 2 Alger         55528    5.24 Moderate Confidence
 3 Allegan       75543    3.14 High Confidence    
 4 Alpena        49133    4.31 High Confidence    
 5 Antrim        68850    4.52 High Confidence    
 6 Arenac        53487    3.77 High Confidence    
 7 Baraga        51911    5.59 Moderate Confidence
 8 Barry         75182    3.60 High Confidence    
 9 Bay           57887    4.14 High Confidence    
10 Benzie        71327    5.97 Moderate Confidence
# ℹ 73 more rows
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"
county_reliability <- county_reliability %>%
  mutate(algo_rec = case_when(reli_cat == "High Confidence" ~ "Safe for algorithmic decisions",
                              reli_cat == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
                              reli_cat == "Low Confidence" ~ "Requires manual review or additional data"))

# Format as a professional table with kable()
kable(county_reliability, "simple", col.names = c("County", "Median Income Estimate", "Margin of Error Percentage", "Reliability Category", "Algorithm Recommendation"), caption = "Algorithm Implementation Decision Framework")
Algorithm Implementation Decision Framework
County Median Income Estimate Margin of Error Percentage Reliability Category Algorithm Recommendation
Alcona 50295 4.4596878 High Confidence Safe for algorithmic decisions
Alger 55528 5.2442011 Moderate Confidence Use with caution - monitor outcomes
Allegan 75543 3.1359623 High Confidence Safe for algorithmic decisions
Alpena 49133 4.3127837 High Confidence Safe for algorithmic decisions
Antrim 68850 4.5243282 High Confidence Safe for algorithmic decisions
Arenac 53487 3.7728794 High Confidence Safe for algorithmic decisions
Baraga 51911 5.5941901 Moderate Confidence Use with caution - monitor outcomes
Barry 75182 3.5966056 High Confidence Safe for algorithmic decisions
Bay 57887 4.1373711 High Confidence Safe for algorithmic decisions
Benzie 71327 5.9696889 Moderate Confidence Use with caution - monitor outcomes
Berrien 60379 2.6101790 High Confidence Safe for algorithmic decisions
Branch 60600 3.7656766 High Confidence Safe for algorithmic decisions
Calhoun 58191 2.6309910 High Confidence Safe for algorithmic decisions
Cass 65183 4.1222405 High Confidence Safe for algorithmic decisions
Charlevoix 69764 3.4358695 High Confidence Safe for algorithmic decisions
Cheboygan 59557 4.0566180 High Confidence Safe for algorithmic decisions
Chippewa 58408 3.0766333 High Confidence Safe for algorithmic decisions
Clare 47816 4.3311862 High Confidence Safe for algorithmic decisions
Clinton 82594 2.7520159 High Confidence Safe for algorithmic decisions
Crawford 57998 5.9346874 Moderate Confidence Use with caution - monitor outcomes
Delta 53852 6.3303127 Moderate Confidence Use with caution - monitor outcomes
Dickinson 59651 4.3637156 High Confidence Safe for algorithmic decisions
Eaton 77158 2.6089323 High Confidence Safe for algorithmic decisions
Emmet 69690 5.7181805 Moderate Confidence Use with caution - monitor outcomes
Genesee 58594 2.2408438 High Confidence Safe for algorithmic decisions
Gladwin 53717 2.7589031 High Confidence Safe for algorithmic decisions
Gogebic 47913 9.9471960 Moderate Confidence Use with caution - monitor outcomes
Grand Traverse 75553 2.9899541 High Confidence Safe for algorithmic decisions
Gratiot 57934 3.8630165 High Confidence Safe for algorithmic decisions
Hillsdale 59425 3.9713925 High Confidence Safe for algorithmic decisions
Houghton 52736 4.8600576 High Confidence Safe for algorithmic decisions
Huron 54475 2.5810005 High Confidence Safe for algorithmic decisions
Ingham 62548 2.9641236 High Confidence Safe for algorithmic decisions
Ionia 71720 3.4021194 High Confidence Safe for algorithmic decisions
Iosco 46224 5.2808065 Moderate Confidence Use with caution - monitor outcomes
Iron 52241 6.6805766 Moderate Confidence Use with caution - monitor outcomes
Isabella 52638 4.2706790 High Confidence Safe for algorithmic decisions
Jackson 62581 2.5247280 High Confidence Safe for algorithmic decisions
Kalamazoo 67905 2.3061630 High Confidence Safe for algorithmic decisions
Kalkaska 56380 5.7343029 Moderate Confidence Use with caution - monitor outcomes
Kent 76247 1.6721969 High Confidence Safe for algorithmic decisions
Keweenaw 55560 13.1407487 Low Confidence Requires manual review or additional data
Lake 45946 5.1299351 Moderate Confidence Use with caution - monitor outcomes
Lapeer 75402 3.0556219 High Confidence Safe for algorithmic decisions
Leelanau 82345 5.3202987 Moderate Confidence Use with caution - monitor outcomes
Lenawee 65484 2.6739356 High Confidence Safe for algorithmic decisions
Livingston 96135 1.9628647 High Confidence Safe for algorithmic decisions
Luce 51015 7.3350975 Moderate Confidence Use with caution - monitor outcomes
Mackinac 60620 5.6796437 Moderate Confidence Use with caution - monitor outcomes
Macomb 73876 1.2250257 High Confidence Safe for algorithmic decisions
Manistee 59467 4.3217247 High Confidence Safe for algorithmic decisions
Marquette 63115 4.0133090 High Confidence Safe for algorithmic decisions
Mason 60744 3.7024233 High Confidence Safe for algorithmic decisions
Mecosta 54132 5.8763763 Moderate Confidence Use with caution - monitor outcomes
Menominee 54074 5.3870622 Moderate Confidence Use with caution - monitor outcomes
Midland 73643 4.3615822 High Confidence Safe for algorithmic decisions
Missaukee 57667 6.6051641 Moderate Confidence Use with caution - monitor outcomes
Monroe 72573 2.7930498 High Confidence Safe for algorithmic decisions
Montcalm 61250 3.8742857 High Confidence Safe for algorithmic decisions
Montmorency 46345 8.1907433 Moderate Confidence Use with caution - monitor outcomes
Muskegon 61347 2.1810357 High Confidence Safe for algorithmic decisions
Newaygo 59065 3.1558453 High Confidence Safe for algorithmic decisions
Oakland 92620 1.0289354 High Confidence Safe for algorithmic decisions
Oceana 60691 4.9315384 High Confidence Safe for algorithmic decisions
Ogemaw 50377 4.9189114 High Confidence Safe for algorithmic decisions
Ontonagon 48316 7.9559566 Moderate Confidence Use with caution - monitor outcomes
Osceola 54875 3.8159453 High Confidence Safe for algorithmic decisions
Oscoda 48692 5.3006654 Moderate Confidence Use with caution - monitor outcomes
Otsego 62865 9.4010976 Moderate Confidence Use with caution - monitor outcomes
Ottawa 83932 2.1791450 High Confidence Safe for algorithmic decisions
Presque Isle 55986 6.3730218 Moderate Confidence Use with caution - monitor outcomes
Roscommon 49898 6.2567638 Moderate Confidence Use with caution - monitor outcomes
Saginaw 56579 2.4832535 High Confidence Safe for algorithmic decisions
St. Clair 66887 2.5819666 High Confidence Safe for algorithmic decisions
St. Joseph 62281 2.6348325 High Confidence Safe for algorithmic decisions
Sanilac 55740 3.9415142 High Confidence Safe for algorithmic decisions
Schoolcraft 55071 11.4906212 Low Confidence Requires manual review or additional data
Shiawassee 62498 4.5153445 High Confidence Safe for algorithmic decisions
Tuscola 59815 3.1563989 High Confidence Safe for algorithmic decisions
Van Buren 65531 5.1822801 Moderate Confidence Use with caution - monitor outcomes
Washtenaw 84245 2.7977922 High Confidence Safe for algorithmic decisions
Wayne 57223 0.9838701 High Confidence Safe for algorithmic decisions
Wexford 58652 5.8071336 Moderate Confidence Use with caution - monitor outcomes
safe_data_counties <- county_reliability %>% filter(algo_rec == "Safe for algorithmic decisions")
cautious_data_counties <- county_reliability %>% filter(algo_rec == "Use with caution - monitor outcomes")
need_more_data_counties <- county_reliability %>% filter(algo_rec == "Requires manual review or additional data")
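A quick tally of the `county_reliability` table built above shows how many of Michigan's 83 counties fall into each bucket, which is useful context for the recommendations that follow:

```r
# Count counties per recommendation category
# (per the lists below: 56 safe, 25 monitored, 2 manual review)
county_reliability %>%
  count(algo_rec, name = "n_counties")
```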

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation:

[1] “Alcona” “Allegan” “Alpena” “Antrim” “Arenac”
[6] “Barry” “Bay” “Berrien” “Branch” “Calhoun”
[11] “Cass” “Charlevoix” “Cheboygan” “Chippewa” “Clare”
[16] “Clinton” “Dickinson” “Eaton” “Genesee” “Gladwin”
[21] “Grand Traverse” “Gratiot” “Hillsdale” “Houghton” “Huron”
[26] “Ingham” “Ionia” “Isabella” “Jackson” “Kalamazoo”
[31] “Kent” “Lapeer” “Lenawee” “Livingston” “Macomb”
[36] “Manistee” “Marquette” “Mason” “Midland” “Monroe”
[41] “Montcalm” “Muskegon” “Newaygo” “Oakland” “Oceana”
[46] “Ogemaw” “Osceola” “Ottawa” “Saginaw” “St. Clair”
[51] “St. Joseph” “Sanilac” “Shiawassee” “Tuscola” “Washtenaw”
[56] “Wayne”

  2. Counties requiring additional oversight:

[1] “Alger” “Baraga” “Benzie” “Crawford” “Delta” “Emmet”
[7] “Gogebic” “Iosco” “Iron” “Kalkaska” “Lake” “Leelanau”
[13] “Luce” “Mackinac” “Mecosta” “Menominee” “Missaukee” “Montmorency”
[19] “Ontonagon” “Oscoda” “Otsego” “Presque Isle” “Roscommon” “Van Buren”
[25] “Wexford”

These are counties where we are only moderately confident in the data. The data should not be used without monitoring the results for unusual patterns or distributions; if odd results appear, they should be examined more closely to determine what caused the outlier. The shape of the raw data should also be examined before conducting any analysis. If the distribution, skew, or peaks look abnormal, or there are too many outliers, that county should be excluded from that specific analysis. Judgments should be made on a case-by-case basis.
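The distribution check described above can be sketched as follows, using the `cautious_data_counties` table built earlier (ggplot2 is loaded via tidyverse):

```r
# Inspect the spread of income estimates for the moderate-confidence counties
summary(cautious_data_counties$med_HH_incE)

# Look at the shape of the distribution before relying on these estimates
ggplot(cautious_data_counties, aes(x = med_HH_incE)) +
  geom_histogram(bins = 10) +
  labs(
    title = "Median household income, moderate-confidence counties",
    x = "Median household income (2022 dollars)",
    y = "Number of counties"
  )
```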

  3. Counties needing alternative approaches:

[1] “Keweenaw” “Schoolcraft”

There are only two counties in Michigan where we have low confidence in the reliability of the data, and more data from these counties should be collected if possible. First, we should try to find out why these two counties in particular are unreliable, for example by examining the sample size. As seen in earlier analysis, these counties are among the five smallest in Michigan, which likely contributes to the unreliability. Because the ACS is a sample, additional data collection in these counties could make the estimates more reliable. If that is not feasible, we might consider pooling these counties with neighboring counties, especially if there are no major differences between the areas. If there are large differences, then these counties deserve a closer look to understand why such disparities exist, and work toward ameliorating them.
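To test the sample-size explanation directly, one could join the reliability table with total county population (ACS table B01003) and look at the relationship. This is a sketch under the same API-key assumptions as earlier; note that `get_acs()` returns names like "Alcona County, Michigan", so the names are cleaned before joining:

```r
# Pull total population per county and strip the name suffix to match
# the NAME column in county_reliability (assumed to hold bare county names)
mi_pop <- get_acs(
  geography = "county",
  variables = c(total_pop = "B01003_001"),
  state = "MI",
  year = 2022,
  survey = "acs5"
) %>%
  mutate(NAME = str_remove(NAME, " County, Michigan"))

reliability_vs_size <- county_reliability %>%
  left_join(mi_pop %>% select(NAME, total_pop = estimate), by = "NAME")

# If small samples drive unreliability, MOE% should fall as population grows
ggplot(reliability_vs_size, aes(x = total_pop, y = MOE_per)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "County population (log scale)", y = "MOE as % of estimate")
```

A clear downward trend would support treating population (and hence sample size) as the main driver of reliability; points far above the trend would flag counties with other data-quality problems.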

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

  1. How does data reliability change when we examine different ACS variables for the same counties?

  2. Is sample size the biggest factor driving reliability, or are other factors at work, and can this be examined by comparing reliability patterns across different variables?

  3. In Michigan, there are ethnic enclaves (Japanese, Chinese, Korean, Middle Eastern) that the three racial categories in these analyses don't cover. How would including those groups change the analyses done here?

Technical Notes

Data Sources: - U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on [date]

Reproducibility: - All analysis conducted in R version [your version] - Census API key required for replication - Complete code and documentation available at: [your portfolio URL]

Methodology Notes: For the last question in section 4.2, I entered the table data manually instead of computing it from reproducible variables.

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.]


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html