Lab 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Angie Kwon

Published

March 17, 2026

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences


Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

Questions to answer: - How many hospitals are in your dataset? 223 - How many census tracts? 3445 - What coordinate reference system is each dataset in? WGS 84 / pseudo mercator, WGS 84, NAD –> standardized to hospital dataset’s CRS which is WGS 84.


Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)
acs_vars <- c(
  total_pop = "B01003_001",
  med_HH_income = "B19013_001",
  m_6566 = "B01001_020",
  m_6769 = "B01001_021",
  m_7074 = "B01001_022", 
  m_7579 = "B01001_023", 
  m_8084 = "B01001_024", 
  m_85over = "B01001_025",
  f_6566 = "B01001_044", 
  f_6769 = "B01001_045", 
  f_7074 = "B01001_046", 
  f_7579 = "B01001_047", 
  f_8084 = "B01001_048", 
  f_85over = "B01001_049"
)

demo_data <- get_acs(geography = "tract",
  variables = acs_vars,
  state = "PA", survey = "acs5", year = 2022, output = "wide"
  ) %>%
  mutate(over_65E = m_6566E+m_6769E+m_7074E+m_7579E+m_8084E+m_85overE
         +f_6566E+f_6769E+f_7074E+f_7579E+f_8084E+f_85overE) %>%
  select(-c(m_6566E,m_6769E,m_7074E,m_7579E,m_8084E,m_85overE,
         f_6566E,f_6769E,f_7074E,f_7579E,f_8084E,f_85overE,
         m_6566M,m_6769M,m_7074M,m_7579M,m_8084M,m_85overM,
         f_6566M,f_6769M,f_7074M,f_7579M,f_8084M,f_85overM))
# %>%
#   mutate(NAME = str_extract(NAME, "Census Tract [0-9.]+"))

#wasn't joining because there was one census tract difference --> check what's in demo_data but not 
#census_tracts --> one census with total pop of 0 --> get rid of and then join again
diff <- setdiff(demo_data$GEOID, census_tracts$GEOID)
diff_row <- which(demo_data$GEOID == diff)
demo_data <- demo_data[-diff_row,]

# Join to tract boundaries
# tracts_demodata <- census_tracts %>%
#   left_join(demo_data %>% select(total_popE, total_popM, med_HH_incomeE, med_HH_incomeM, over_65E), 
#             by = "GEOID")
tracts_demodata <- census_tracts %>%
  left_join(demo_data %>% select(-c(NAME)), 
            by = "GEOID")

#how many tracts have missing income data
missing_income_count <- tracts_demodata %>%
  count(is.na(med_HH_incomeE))
missing_income_tracts <- tracts_demodata %>% filter(is.na(med_HH_incomeE))

#calculate mean of all median incomes
all_med_inc <- mean(tracts_demodata$med_HH_incomeE, na.rm = TRUE)

Questions to answer: - What year of ACS data are you using? 2022 - How many tracts have missing income data? 62 - What is the median income across all PA census tracts? $77527.23


Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria

#taking out rows with missing median income data
missing_inc_rows <- which(is.na(tracts_demodata$med_HH_incomeE))
avail_inc_data <- tracts_demodata[-missing_inc_rows,]

# 1. chose <60K as low median HH income
# 2. adding column of percentage of elderly population --> chose 17% as sig
# elderly pop threshold
avail_inc_data <- avail_inc_data %>%
  mutate(elderly_per = over_65E/total_popE*100)

#filter out vulnerable tracts
vul_pop <- avail_inc_data %>% filter(med_HH_incomeE < 60000 & elderly_per >= 17)

#percent of PA CT that are vulnerable
PA_per_vul <- nrow(vul_pop)/nrow(census_tracts)*100

Questions to answer: - What income threshold did you choose and why? $60,000 because ____ - What elderly population threshold did you choose and why? 17% because the national proportion for population over 65 is around 16.85% - How many tracts meet your vulnerability criteria? 574 - What percentage of PA census tracts are considered vulnerable by your definition? **16.67%*


Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS
st_crs(vul_pop)
# in WGS 84

# Calculate distance from each tract centroid to nearest hospital
vul_tract_centroid <- st_centroid(vul_pop)

nearest_index <- st_nearest_feature(vul_tract_centroid, hospitals)

nearest_dist <- st_distance(vul_tract_centroid, hospitals[nearest_index,],
                            by_element = TRUE)

vul_pop$nearest_hosp <- as.numeric(nearest_dist/1609.344) #make it in km's

# avg distance to hospitals for all vulnerable tracts
avg_hosp_dist <- mean(vul_pop$nearest_hosp, na.rm = TRUE)

# maximum distance
max_dist <- vul_pop %>% slice_max(nearest_hosp, n=1)

# >15 miles
more_15 <- vul_pop %>% filter(nearest_hosp > 15)
nrow(more_15)

Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection

Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? 4.94 miles - What is the maximum distance? 29.06 miles - How many vulnerable tracts are more than 15 miles from the nearest hospital? 34 tracts


Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable
underserved <- more_15

#how many
nrow(more_15)

#per_underserved
per_und <- nrow(underserved)/nrow(vul_pop)*100

Questions to answer: - How many tracts are underserved? 34 tracts - What percentage of vulnerable tracts are underserved? 5.92% of vulnerable tracts are underserved - Does this surprise you? Why or why not? This surprises me a little bit, as it was quite low and I assumed tracts with vulnerable populations would be generally underserved in most aspects, as one of the criteria was low income. However, I see that this may happen because despite the low income, counties and local governments might have prepared ahead for the vulnerable populations, knowing they would need the hospitals more.


Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

#add underserved status to vul_pop
vul_pop <- vul_pop %>% 
  mutate(underserved = case_when(nearest_hosp > 15 ~ 'Underserved',
                                   TRUE ~ 'Appropriately Served'))
vul_pop <- st_transform(vul_pop, st_crs(hospitals))
#hospital buffer of 15 miles
hospital_service_areas <- hospitals %>%
  st_buffer(dist = 24140)
hospital_service_areas <- st_transform(hospital_service_areas, st_crs(hospitals))
combined_service_area <- hospital_service_areas %>%
  st_union() %>%
  st_make_valid()

#outside hospital service area
outside_hosp <- st_difference(pa_counties, combined_service_area)

# ggplot() +
#   geom_sf(data = pa_counties, fill = "grey", color = "gray") +
#   geom_sf(data = outside_hosp, fill = "lightblue", alpha = 0.4) +
#   geom_sf(data = hospitals, color = "red", size = 2) +
#   labs(
#     title = paste("outside hospital service area"),
#     subtitle = "Red points = Hospitals, Blue areas = 15-mile service zones"
#   ) +
#   theme_void()

#ppl living outside hospital service area
pop_summary <- vul_pop %>%
  st_join(outside_hosp %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    pop_outHosp = sum(total_popE, na.rm = TRUE),
    .groups = "drop"
  )
pop_outside_hosp <- pa_counties %>%
  select(COUNTY_NAM) %>%
  st_drop_geometry() %>%
  left_join(pop_summary, by = "COUNTY_NAM") %>%
  mutate(
    pop_outHosp = coalesce(pop_outHosp, 0)
  )

# #ppl over 65 living outside hospital service area
over65_summary <- vul_pop %>%
  st_join(outside_hosp %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    total_vul_pop = sum(over_65E, na.rm = TRUE),
    .groups = "drop"
  )
over65_outside_hosp <- pa_counties %>%
  select(COUNTY_NAM) %>%
  st_drop_geometry() %>%
  left_join(over65_summary, by = "COUNTY_NAM") %>%
  mutate(
    total_vul_pop = coalesce(total_vul_pop, 0)
  )

# Spatial join tracts to counties
tracts_by_county <- vul_pop %>%
  st_join(pa_counties %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    tot_pop = sum(total_popE),
    n_tracts = n_distinct(GEOID),
    tract_ids = paste(unique(NAMELSAD), collapse = ", "),
    n_underserved = sum(underserved == "Underserved"),
    avg_hosp_dist = mean(nearest_hosp),
  ) %>%
  arrange(desc(n_tracts))

# Aggregate statistics by county

# percent of vul tracts that are underserved
#total
34/574*100
#per county
tracts_by_county <- tracts_by_county %>%
  mutate(per_underserved = n_underserved/n_tracts*100)

# total vul pop in PA
sum(vul_pop$total_popE)

# top 5 counties w highest percentage of underserved vul tracts
highest_per_und_vul <- tracts_by_county %>% slice_max(per_underserved, n=5)
highest_per_und_vul_10 <- tracts_by_county %>% slice_max(per_underserved, n=10)

# counties w the most vul ppl living far from hospitals
tracts_by_county <- tracts_by_county %>% left_join(pop_outside_hosp, by = "COUNTY_NAM")
# tracts_by_county <- tracts_by_county %>% left_join(over65_outside_hosp, by = "COUNTY_NAM")
most_vul <- tracts_by_county %>% arrange(desc(pop_outHosp))

Required county-level statistics: - Number of vulnerable tracts 574 vulnerable tracts - Number of underserved tracts 34 underserved tracts - Percentage of vulnerable tracts that are underserved 5.92% are underserved - Average distance to nearest hospital for vulnerable tracts 4.94 miles - Total vulnerable population in PA 1802220

Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? - Which counties have the most vulnerable people living far from hospitals? - Are there any patterns in where underserved counties are located? For high proportion of underserved tracts, it makes sense that most of the counties are smaller ones, with small total number of tracts. For counties with the most vulnerable people living far from hospitals, it makes sense that it would be slightly bigger counties, where people are more spread apart.

kable(highest_per_und_vul[, c(1,3,5,7)], "simple", 
      col.names = c("County Name", "# of Tracts", "# of Underserved Tracts", 
                    "% of Underserved Tracts"), 
      caption = "Which 5 counties have the highest percentage of underserved vulnerable tracts?")
Which 5 counties have the highest percentage of underserved vulnerable tracts?
County Name # of Tracts # of Underserved Tracts % of Underserved Tracts
CHESTER 2 2 100
PERRY 2 2 100
SULLIVAN 4 3 75
CAMERON 5 3 60
WYOMING 2 1 50
kable(most_vul[, c(1,8,2, 3,5,7,6)], "simple", 
      col.names = c("County Name", "Population >15 miles of Hospitals", "Total Population", 
                    "# of Tracts", "# of Underserved Tracts", "% of Underserved Tracts",
                    "Average Distance to Hospitals"), 
      caption = "Which counties have the most vulnerable people living far from hospitals?")
Which counties have the most vulnerable people living far from hospitals?
County Name Population >15 miles of Hospitals Total Population # of Tracts # of Underserved Tracts % of Underserved Tracts Average Distance to Hospitals
CLEARFIELD 43768 68766 19 8 42.105263 12.4081771
POTTER 26155 34726 11 4 36.363636 11.6764763
ARMSTRONG 24002 49547 15 1 6.666667 9.4372737
TIOGA 21444 38118 10 3 30.000000 10.7140760
BRADFORD 17895 38067 10 3 30.000000 11.1379861
CLARION 17556 41135 13 2 15.384615 10.2108888
BEDFORD 16699 46941 13 1 7.692308 9.9544813
ELK 12347 31332 10 4 40.000000 12.4208411
CENTRE 11192 14433 4 1 25.000000 13.1074563
MIFFLIN 11158 30041 10 1 10.000000 9.0377807
LYCOMING 10326 46037 12 1 8.333333 6.3662169
WARREN 10287 35229 11 2 18.181818 7.8595061
CAMERON 9910 11467 5 3 60.000000 15.7720394
CHESTER 9859 9859 2 2 100.000000 21.4548423
HUNTINGDON 9770 39896 11 1 9.090909 10.0597386
MCKEAN 9477 25483 10 1 10.000000 9.7918995
JUNIATA 8600 12340 4 1 25.000000 11.4765289
FOREST 8582 13270 5 2 40.000000 13.1445039
CRAWFORD 8337 42157 13 2 15.384615 6.9826609
FULTON 8061 12883 4 0 0.000000 10.0509533
WYOMING 7984 7984 2 1 50.000000 18.7004949
SULLIVAN 7766 11676 4 3 75.000000 19.8950644
SOMERSET 7631 73998 20 1 5.000000 7.6192410
COLUMBIA 6338 16917 5 1 20.000000 7.3457774
SNYDER 5847 12246 3 1 33.333333 12.3849397
ERIE 5815 48523 14 2 14.285714 4.2480374
PERRY 5800 5800 2 2 100.000000 17.5551908
CLINTON 5724 17194 5 2 40.000000 11.7694914
VENANGO 4684 46359 16 1 6.250000 7.8399110
MERCER 4684 36410 12 0 0.000000 3.6181024
PIKE 4419 11301 5 2 40.000000 14.3310500
WAYNE 4155 27940 11 2 18.181818 9.0944780
LANCASTER 4135 24349 6 1 16.666667 8.4634847
NORTHUMBERLAND 4018 45555 14 1 7.142857 10.3628216
DAUPHIN 4018 21553 6 1 16.666667 5.0769104
BUTLER 3457 20504 6 0 0.000000 6.1528363
MONROE 3222 17814 7 2 28.571429 7.8351757
SUSQUEHANNA 2770 17175 6 0 0.000000 6.2452393
FAYETTE 2138 80604 26 0 0.000000 5.6397156
FRANKLIN 1782 27834 7 1 14.285714 3.7903890
LUZERNE 918 69884 23 1 4.347826 4.4826659
ALLEGHENY 0 240251 89 0 0.000000 2.5217422
PHILADELPHIA 0 214680 54 0 0.000000 1.1988649
WESTMORELAND 0 125411 45 0 0.000000 4.0749669
CAMBRIA 0 70339 25 0 0.000000 5.8635941
LACKAWANNA 0 60528 21 1 4.761905 3.5336646
BEAVER 0 43017 18 0 0.000000 3.6778290
WASHINGTON 0 42123 16 0 0.000000 5.2466567
INDIANA 0 51398 15 1 6.666667 8.5408979
MONTGOMERY 0 64651 15 0 0.000000 1.5956215
DELAWARE 0 41086 14 0 0.000000 1.3205300
JEFFERSON 0 48421 14 3 21.428571 7.1570702
BLAIR 0 37533 13 0 0.000000 5.2396095
SCHUYLKILL 0 28231 10 0 0.000000 4.7552457
LAWRENCE 0 23267 9 0 0.000000 3.7203412
YORK 0 27860 9 0 0.000000 1.8255643
LEHIGH 0 29629 8 0 0.000000 2.5973411
NORTHAMPTON 0 27929 8 0 0.000000 4.3091026
BERKS 0 21520 7 0 0.000000 5.0556084
UNION 0 27687 6 1 16.666667 5.7454115
CARBON 0 17838 5 0 0.000000 5.9278555
LEBANON 0 11420 3 0 0.000000 4.2112314
MONTOUR 0 13033 3 0 0.000000 4.4809895
CUMBERLAND 0 6796 2 0 0.000000 1.6419706
GREENE 0 9495 2 0 0.000000 10.5464881
ADAMS 0 3908 1 0 0.000000 0.1890303

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create and format priority counties table
kable(highest_per_und_vul_10[, c(1,2,3,5,7,6)], "simple", 
      col.names = c("County Name","Total Population", "# of Tracts", "# of Underserved Tracts", 
                    "% of Underserved Tracts", "Average Distance to Hospitals"), 
      caption = "Top 10 priority counties for healthcare investment")
Top 10 priority counties for healthcare investment
County Name Total Population # of Tracts # of Underserved Tracts % of Underserved Tracts Average Distance to Hospitals
CHESTER 9859 2 2 100.00000 21.45484
PERRY 5800 2 2 100.00000 17.55519
SULLIVAN 11676 4 3 75.00000 19.89506
CAMERON 11467 5 3 60.00000 15.77204
WYOMING 7984 2 1 50.00000 18.70049
CLEARFIELD 68766 19 8 42.10526 12.40818
ELK 31332 10 4 40.00000 12.42084
CLINTON 17194 5 2 40.00000 11.76949
FOREST 13270 5 2 40.00000 13.14450
PIKE 11301 5 2 40.00000 14.33105

Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:


Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:


Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:


Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Your Analysis

Your Task:

  1. Find and load additional data
# Load your additional dataset
od_deaths <- st_read(here("labs/lab_2/data/Estimated Drug Overdose Deaths CY 2012-Current County Health_20260223.shp"))
ems_stations <- st_read(here("labs/lab_2/data/Fire_and_Emergency_Medical_Service__EMS__Stations_pt.shp"))

od_deaths <- st_transform(od_deaths, st_crs(hospitals))
ems_stations <- st_transform(ems_stations, st_crs(hospitals))

Questions to answer: - What dataset did you choose and why? Estimated Drug Overdose Deaths, because I wanted to examine whether there is a correlative relationship between vulnerable populations and the opioids crisis. - What is the data source and date? PA Open Data, and the data’s year ranges from 2012-2024. - How many features does it contain? 884 data points - What CRS is it in? Did you need to transform it? WGS 84, transformed just in case


  1. Pose a research question

Are there more opioid overdoses in vulnerable areas?


  1. Conduct spatial analysis
# Your spatial analysis

# ems station buffer
ems_buffer <- ems_stations %>%
  st_buffer(dist = 10000)
ems_buffer <- st_transform(ems_buffer, st_crs(hospitals))
combined_ems_buffer <- ems_buffer %>%
  st_union() %>%
  st_make_valid()

# capitalize county names
od_deaths$county <- toupper(od_deaths$county)

#summarize od deaths by county
od_by_county <- od_deaths %>%
  filter(year == 2024) %>%
  st_join(pa_counties %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    n_od_2024 = sum(count, na.rm = TRUE)
  )

#join od deaths to counties
tracts_by_county <- tracts_by_county %>% left_join(od_by_county, by = "COUNTY_NAM")

#percentage of vulnerable ppl of total pop & od deaths:vulnerable ppl ratio
tracts_by_county <- tracts_by_county %>%
  mutate(od_vul_ratio = n_od_2024/tot_pop*100)
tracts_by_county$od_vul_ratio[is.infinite(tracts_by_county$od_vul_ratio)] <- 0
tracts_by_county$od_vul_ratio[is.nan(tracts_by_county$od_vul_ratio)] <- 0
Top 10 priority counties for healthcare investment
County Name Total Vulnerable Population # of OD in 2024 # of OD to Vulnerable Population Ratio % Underserved Tracts Geometry
CHESTER 9859 66 0.6694391 100.00000 MULTIPOLYGON (((-76.1449 39…
PHILADELPHIA 214680 1058 0.4928265 0.00000 MULTIPOLYGON (((-75.24449 3…
BERKS 21520 92 0.4275093 0.00000 MULTIPOLYGON (((-75.79751 4…
CUMBERLAND 6796 26 0.3825780 0.00000 MULTIPOLYGON (((-77.53248 4…
LEHIGH 29629 111 0.3746330 0.00000 MULTIPOLYGON (((-75.62235 4…
DELAWARE 41086 115 0.2799007 0.00000 MULTIPOLYGON (((-75.40325 3…
YORK 27860 77 0.2763819 0.00000 MULTIPOLYGON (((-76.83294 4…
DAUPHIN 21553 56 0.2598246 16.66667 MULTIPOLYGON (((-76.89093 4…
BUTLER 20504 46 0.2243465 0.00000 MULTIPOLYGON (((-79.89613 4…
MONROE 17814 38 0.2133154 28.57143 MULTIPOLYGON (((-75.51463 4…

Your interpretation:

It seems that there isn’t that clear of a relationship. Further analysis seems to be needed. Many of the top 10 counties with high OD deaths to total population seems to have 0 underserved tracts in the whole county. But maybe this could be exptected given that counties with a bigger opioid crisis might keep the county better prepped. There doesn’t seem to be a large correlation with population either. The map seems to show that the majority of OD’s seem to be concentrated in counties with large populations, such as Philadelphia. This is also to be expected. Maybe further analyses without the outliers may help.