Lab 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Angie Kwon

Published

March 17, 2026

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences

Part 1: Healthcare Access for Vulnerable Populations

Research Question

Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

Questions to answer: - How many hospitals are in your dataset? 223 - How many census tracts? 3445 - What coordinate reference system is each dataset in? WGS 84 / pseudo mercator, WGS 84, NAD –> standardized to hospital dataset’s CRS which is WGS 84.

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)
acs_vars <- c(
  total_pop = "B01003_001",
  med_HH_income = "B19013_001",
  m_6566 = "B01001_020",
  m_6769 = "B01001_021",
  m_7074 = "B01001_022", 
  m_7579 = "B01001_023", 
  m_8084 = "B01001_024", 
  m_85over = "B01001_025",
  f_6566 = "B01001_044", 
  f_6769 = "B01001_045", 
  f_7074 = "B01001_046", 
  f_7579 = "B01001_047", 
  f_8084 = "B01001_048", 
  f_85over = "B01001_049"
)

demo_data <- get_acs(geography = "tract",
  variables = acs_vars,
  state = "PA", survey = "acs5", year = 2022, output = "wide"
  ) %>%
  mutate(over_65E = m_6566E+m_6769E+m_7074E+m_7579E+m_8084E+m_85overE
         +f_6566E+f_6769E+f_7074E+f_7579E+f_8084E+f_85overE) %>%
  select(-c(m_6566E,m_6769E,m_7074E,m_7579E,m_8084E,m_85overE,
         f_6566E,f_6769E,f_7074E,f_7579E,f_8084E,f_85overE,
         m_6566M,m_6769M,m_7074M,m_7579M,m_8084M,m_85overM,
         f_6566M,f_6769M,f_7074M,f_7579M,f_8084M,f_85overM))
# %>%
#   mutate(NAME = str_extract(NAME, "Census Tract [0-9.]+"))

#wasn't joining because there was one census tract difference --> check what's in demo_data but not 
#census_tracts --> one census with total pop of 0 --> get rid of and then join again
diff <- setdiff(demo_data$GEOID, census_tracts$GEOID)
diff_row <- which(demo_data$GEOID == diff)
demo_data <- demo_data[-diff_row,]

# Join to tract boundaries
# tracts_demodata <- census_tracts %>%
#   left_join(demo_data %>% select(total_popE, total_popM, med_HH_incomeE, med_HH_incomeM, over_65E), 
#             by = "GEOID")
tracts_demodata <- census_tracts %>%
  left_join(demo_data %>% select(-c(NAME)), 
            by = "GEOID")

#how many tracts have missing income data
missing_income_count <- tracts_demodata %>%
  count(is.na(med_HH_incomeE))
missing_income_tracts <- tracts_demodata %>% filter(is.na(med_HH_incomeE))

#calculate mean of all median incomes
all_med_inc <- mean(tracts_demodata$med_HH_incomeE, na.rm = TRUE)

Questions to answer: - What year of ACS data are you using? 2022 - How many tracts have missing income data? 62 - What is the median income across all PA census tracts? $77527.23

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria

#taking out rows with missing median income data
missing_inc_rows <- which(is.na(tracts_demodata$med_HH_incomeE))
avail_inc_data <- tracts_demodata[-missing_inc_rows,]

# 1. chose <60K as low median HH income
# 2. adding column of percentage of elderly population --> chose 17% as sig
# elderly pop threshold
avail_inc_data <- avail_inc_data %>%
  mutate(elderly_per = over_65E/total_popE*100)

#filter out vulnerable tracts
vul_pop <- avail_inc_data %>% filter(med_HH_incomeE < 60000 & elderly_per >= 17)

#percent of PA CT that are vulnerable
PA_per_vul <- nrow(vul_pop)/nrow(census_tracts)*100

Questions to answer: - What income threshold did you choose and why? $60,000 because ____ - What elderly population threshold did you choose and why? 17% because the national proportion for population over 65 is around 16.85% - How many tracts meet your vulnerability criteria? 574 - What percentage of PA census tracts are considered vulnerable by your definition? **16.67%*

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS
st_crs(vul_pop)
# in WGS 84

# Calculate distance from each tract centroid to nearest hospital
vul_tract_centroid <- st_centroid(vul_pop)

nearest_index <- st_nearest_feature(vul_tract_centroid, hospitals)

nearest_dist <- st_distance(vul_tract_centroid, hospitals[nearest_index,],
                            by_element = TRUE)

vul_pop$nearest_hosp <- as.numeric(nearest_dist/1609.344) #make it in km's

# avg distance to hospitals for all vulnerable tracts
avg_hosp_dist <- mean(vul_pop$nearest_hosp, na.rm = TRUE)

# maximum distance
max_dist <- vul_pop %>% slice_max(nearest_hosp, n=1)

# >15 miles
more_15 <- vul_pop %>% filter(nearest_hosp > 15)
nrow(more_15)

Requirements: - Use an appropriate projected coordinate system for Pennsylvania - Calculate distances in miles - Explain why you chose your projection

Questions to answer: - What is the average distance to the nearest hospital for vulnerable tracts? 4.94 miles - What is the maximum distance? 29.06 miles - How many vulnerable tracts are more than 15 miles from the nearest hospital? 34 tracts

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable
underserved <- more_15

#how many
nrow(more_15)

#per_underserved
per_und <- nrow(underserved)/nrow(vul_pop)*100

Questions to answer: - How many tracts are underserved? 34 tracts - What percentage of vulnerable tracts are underserved? 5.92% of vulnerable tracts are underserved - Does this surprise you? Why or why not? This surprises me a little bit, as it was quite low and I assumed tracts with vulnerable populations would be generally underserved in most aspects, as one of the criteria was low income. However, I see that this may happen because despite the low income, counties and local governments might have prepared ahead for the vulnerable populations, knowing they would need the hospitals more.

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

#add underserved status to vul_pop
vul_pop <- vul_pop %>% 
  mutate(underserved = case_when(nearest_hosp > 15 ~ 'Underserved',
                                   TRUE ~ 'Appropriately Served'))
vul_pop <- st_transform(vul_pop, st_crs(hospitals))
#hospital buffer of 15 miles
hospital_service_areas <- hospitals %>%
  st_buffer(dist = 24140)
hospital_service_areas <- st_transform(hospital_service_areas, st_crs(hospitals))
combined_service_area <- hospital_service_areas %>%
  st_union() %>%
  st_make_valid()

#outside hospital service area
outside_hosp <- st_difference(pa_counties, combined_service_area)

# ggplot() +
#   geom_sf(data = pa_counties, fill = "grey", color = "gray") +
#   geom_sf(data = outside_hosp, fill = "lightblue", alpha = 0.4) +
#   geom_sf(data = hospitals, color = "red", size = 2) +
#   labs(
#     title = paste("outside hospital service area"),
#     subtitle = "Red points = Hospitals, Blue areas = 15-mile service zones"
#   ) +
#   theme_void()

#ppl living outside hospital service area
pop_summary <- vul_pop %>%
  st_join(outside_hosp %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    pop_outHosp = sum(total_popE, na.rm = TRUE),
    .groups = "drop"
  )
pop_outside_hosp <- pa_counties %>%
  select(COUNTY_NAM) %>%
  st_drop_geometry() %>%
  left_join(pop_summary, by = "COUNTY_NAM") %>%
  mutate(
    pop_outHosp = coalesce(pop_outHosp, 0)
  )

# #ppl over 65 living outside hospital service area
over65_summary <- vul_pop %>%
  st_join(outside_hosp %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    total_vul_pop = sum(over_65E, na.rm = TRUE),
    .groups = "drop"
  )
over65_outside_hosp <- pa_counties %>%
  select(COUNTY_NAM) %>%
  st_drop_geometry() %>%
  left_join(over65_summary, by = "COUNTY_NAM") %>%
  mutate(
    total_vul_pop = coalesce(total_vul_pop, 0)
  )

# Spatial join tracts to counties
tracts_by_county <- vul_pop %>%
  st_join(pa_counties %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    tot_pop = sum(total_popE),
    n_tracts = n_distinct(GEOID),
    tract_ids = paste(unique(NAMELSAD), collapse = ", "),
    n_underserved = sum(underserved == "Underserved"),
    avg_hosp_dist = mean(nearest_hosp),
  ) %>%
  arrange(desc(n_tracts))

# Aggregate statistics by county

# percent of vul tracts that are underserved
#total
34/574*100
#per county
tracts_by_county <- tracts_by_county %>%
  mutate(per_underserved = n_underserved/n_tracts*100)

# total vul pop in PA
sum(vul_pop$total_popE)

# top 5 counties w highest percentage of underserved vul tracts
highest_per_und_vul <- tracts_by_county %>% slice_max(per_underserved, n=5)
highest_per_und_vul_10 <- tracts_by_county %>% slice_max(per_underserved, n=10)

# counties w the most vul ppl living far from hospitals
tracts_by_county <- tracts_by_county %>% left_join(pop_outside_hosp, by = "COUNTY_NAM")
# tracts_by_county <- tracts_by_county %>% left_join(over65_outside_hosp, by = "COUNTY_NAM")
most_vul <- tracts_by_county %>% arrange(desc(pop_outHosp))

Required county-level statistics: - Number of vulnerable tracts 574 vulnerable tracts - Number of underserved tracts 34 underserved tracts - Percentage of vulnerable tracts that are underserved 5.92% are underserved - Average distance to nearest hospital for vulnerable tracts 4.94 miles - Total vulnerable population in PA 1802220

Questions to answer: - Which 5 counties have the highest percentage of underserved vulnerable tracts? - Which counties have the most vulnerable people living far from hospitals? - Are there any patterns in where underserved counties are located? For high proportion of underserved tracts, it makes sense that most of the counties are smaller ones, with small total number of tracts. For counties with the most vulnerable people living far from hospitals, it makes sense that it would be slightly bigger counties, where people are more spread apart.

kable(highest_per_und_vul[, c(1,3,5,7)], "simple", 
      col.names = c("County Name", "# of Tracts", "# of Underserved Tracts", 
                    "% of Underserved Tracts"), 
      caption = "Which 5 counties have the highest percentage of underserved vulnerable tracts?")

Which 5 counties have the highest percentage of underserved vulnerable tracts?
County Name	# of Tracts	# of Underserved Tracts	% of Underserved Tracts
CHESTER	2	2	100
PERRY	2	2	100
SULLIVAN	4	3	75
CAMERON	5	3	60
WYOMING	2	1	50

kable(most_vul[, c(1,8,2, 3,5,7,6)], "simple", 
      col.names = c("County Name", "Population >15 miles of Hospitals", "Total Population", 
                    "# of Tracts", "# of Underserved Tracts", "% of Underserved Tracts",
                    "Average Distance to Hospitals"), 
      caption = "Which counties have the most vulnerable people living far from hospitals?")

Which counties have the most vulnerable people living far from hospitals?
County Name	Population >15 miles of Hospitals	Total Population	# of Tracts	# of Underserved Tracts	% of Underserved Tracts	Average Distance to Hospitals
CLEARFIELD	43768	68766	19	8	42.105263	12.4081771
POTTER	26155	34726	11	4	36.363636	11.6764763
ARMSTRONG	24002	49547	15	1	6.666667	9.4372737
TIOGA	21444	38118	10	3	30.000000	10.7140760
BRADFORD	17895	38067	10	3	30.000000	11.1379861
CLARION	17556	41135	13	2	15.384615	10.2108888
BEDFORD	16699	46941	13	1	7.692308	9.9544813
ELK	12347	31332	10	4	40.000000	12.4208411
CENTRE	11192	14433	4	1	25.000000	13.1074563
MIFFLIN	11158	30041	10	1	10.000000	9.0377807
LYCOMING	10326	46037	12	1	8.333333	6.3662169
WARREN	10287	35229	11	2	18.181818	7.8595061
CAMERON	9910	11467	5	3	60.000000	15.7720394
CHESTER	9859	9859	2	2	100.000000	21.4548423
HUNTINGDON	9770	39896	11	1	9.090909	10.0597386
MCKEAN	9477	25483	10	1	10.000000	9.7918995
JUNIATA	8600	12340	4	1	25.000000	11.4765289
FOREST	8582	13270	5	2	40.000000	13.1445039
CRAWFORD	8337	42157	13	2	15.384615	6.9826609
FULTON	8061	12883	4	0	0.000000	10.0509533
WYOMING	7984	7984	2	1	50.000000	18.7004949
SULLIVAN	7766	11676	4	3	75.000000	19.8950644
SOMERSET	7631	73998	20	1	5.000000	7.6192410
COLUMBIA	6338	16917	5	1	20.000000	7.3457774
SNYDER	5847	12246	3	1	33.333333	12.3849397
ERIE	5815	48523	14	2	14.285714	4.2480374
PERRY	5800	5800	2	2	100.000000	17.5551908
CLINTON	5724	17194	5	2	40.000000	11.7694914
VENANGO	4684	46359	16	1	6.250000	7.8399110
MERCER	4684	36410	12	0	0.000000	3.6181024
PIKE	4419	11301	5	2	40.000000	14.3310500
WAYNE	4155	27940	11	2	18.181818	9.0944780
LANCASTER	4135	24349	6	1	16.666667	8.4634847
NORTHUMBERLAND	4018	45555	14	1	7.142857	10.3628216
DAUPHIN	4018	21553	6	1	16.666667	5.0769104
BUTLER	3457	20504	6	0	0.000000	6.1528363
MONROE	3222	17814	7	2	28.571429	7.8351757
SUSQUEHANNA	2770	17175	6	0	0.000000	6.2452393
FAYETTE	2138	80604	26	0	0.000000	5.6397156
FRANKLIN	1782	27834	7	1	14.285714	3.7903890
LUZERNE	918	69884	23	1	4.347826	4.4826659
ALLEGHENY	0	240251	89	0	0.000000	2.5217422
PHILADELPHIA	0	214680	54	0	0.000000	1.1988649
WESTMORELAND	0	125411	45	0	0.000000	4.0749669
CAMBRIA	0	70339	25	0	0.000000	5.8635941
LACKAWANNA	0	60528	21	1	4.761905	3.5336646
BEAVER	0	43017	18	0	0.000000	3.6778290
WASHINGTON	0	42123	16	0	0.000000	5.2466567
INDIANA	0	51398	15	1	6.666667	8.5408979
MONTGOMERY	0	64651	15	0	0.000000	1.5956215
DELAWARE	0	41086	14	0	0.000000	1.3205300
JEFFERSON	0	48421	14	3	21.428571	7.1570702
BLAIR	0	37533	13	0	0.000000	5.2396095
SCHUYLKILL	0	28231	10	0	0.000000	4.7552457
LAWRENCE	0	23267	9	0	0.000000	3.7203412
YORK	0	27860	9	0	0.000000	1.8255643
LEHIGH	0	29629	8	0	0.000000	2.5973411
NORTHAMPTON	0	27929	8	0	0.000000	4.3091026
BERKS	0	21520	7	0	0.000000	5.0556084
UNION	0	27687	6	1	16.666667	5.7454115
CARBON	0	17838	5	0	0.000000	5.9278555
LEBANON	0	11420	3	0	0.000000	4.2112314
MONTOUR	0	13033	3	0	0.000000	4.4809895
CUMBERLAND	0	6796	2	0	0.000000	1.6419706
GREENE	0	9495	2	0	0.000000	10.5464881
ADAMS	0	3908	1	0	0.000000	0.1890303

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create and format priority counties table
kable(highest_per_und_vul_10[, c(1,2,3,5,7,6)], "simple", 
      col.names = c("County Name","Total Population", "# of Tracts", "# of Underserved Tracts", 
                    "% of Underserved Tracts", "Average Distance to Hospitals"), 
      caption = "Top 10 priority counties for healthcare investment")

Top 10 priority counties for healthcare investment
County Name	Total Population	# of Tracts	# of Underserved Tracts	% of Underserved Tracts	Average Distance to Hospitals
CHESTER	9859	2	2	100.00000	21.45484
PERRY	5800	2	2	100.00000	17.55519
SULLIVAN	11676	4	3	75.00000	19.89506
CAMERON	11467	5	3	60.00000	15.77204
WYOMING	7984	2	1	50.00000	18.70049
CLEARFIELD	68766	19	8	42.10526	12.40818
ELK	31332	10	4	40.00000	12.42084
CLINTON	17194	5	2	40.00000	11.76949
FOREST	13270	5	2	40.00000	13.14450
PIKE	11301	5	2	40.00000	14.33105

Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Your Analysis

Your Task:

Find and load additional data

# Load your additional dataset
od_deaths <- st_read(here("labs/lab_2/data/Estimated Drug Overdose Deaths CY 2012-Current County Health_20260223.shp"))
ems_stations <- st_read(here("labs/lab_2/data/Fire_and_Emergency_Medical_Service__EMS__Stations_pt.shp"))

od_deaths <- st_transform(od_deaths, st_crs(hospitals))
ems_stations <- st_transform(ems_stations, st_crs(hospitals))

Questions to answer: - What dataset did you choose and why? Estimated Drug Overdose Deaths, because I wanted to examine whether there is a correlative relationship between vulnerable populations and the opioids crisis. - What is the data source and date? PA Open Data, and the data’s year ranges from 2012-2024. - How many features does it contain? 884 data points - What CRS is it in? Did you need to transform it? WGS 84, transformed just in case

Pose a research question

Are there more opioid overdoses in vulnerable areas?

Conduct spatial analysis

# Your spatial analysis

# ems station buffer
ems_buffer <- ems_stations %>%
  st_buffer(dist = 10000)
ems_buffer <- st_transform(ems_buffer, st_crs(hospitals))
combined_ems_buffer <- ems_buffer %>%
  st_union() %>%
  st_make_valid()

# capitalize county names
od_deaths$county <- toupper(od_deaths$county)

#summarize od deaths by county
od_by_county <- od_deaths %>%
  filter(year == 2024) %>%
  st_join(pa_counties %>% select(COUNTY_NAM)) %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarize(
    n_od_2024 = sum(count, na.rm = TRUE)
  )

#join od deaths to counties
tracts_by_county <- tracts_by_county %>% left_join(od_by_county, by = "COUNTY_NAM")

#percentage of vulnerable ppl of total pop & od deaths:vulnerable ppl ratio
tracts_by_county <- tracts_by_county %>%
  mutate(od_vul_ratio = n_od_2024/tot_pop*100)
tracts_by_county$od_vul_ratio[is.infinite(tracts_by_county$od_vul_ratio)] <- 0
tracts_by_county$od_vul_ratio[is.nan(tracts_by_county$od_vul_ratio)] <- 0

Top 10 priority counties for healthcare investment
County Name	Total Vulnerable Population	# of OD in 2024	# of OD to Vulnerable Population Ratio	% Underserved Tracts	Geometry
CHESTER	9859	66	0.6694391	100.00000	MULTIPOLYGON (((-76.1449 39…
PHILADELPHIA	214680	1058	0.4928265	0.00000	MULTIPOLYGON (((-75.24449 3…
BERKS	21520	92	0.4275093	0.00000	MULTIPOLYGON (((-75.79751 4…
CUMBERLAND	6796	26	0.3825780	0.00000	MULTIPOLYGON (((-77.53248 4…
LEHIGH	29629	111	0.3746330	0.00000	MULTIPOLYGON (((-75.62235 4…
DELAWARE	41086	115	0.2799007	0.00000	MULTIPOLYGON (((-75.40325 3…
YORK	27860	77	0.2763819	0.00000	MULTIPOLYGON (((-76.83294 4…
DAUPHIN	21553	56	0.2598246	16.66667	MULTIPOLYGON (((-76.89093 4…
BUTLER	20504	46	0.2243465	0.00000	MULTIPOLYGON (((-79.89613 4…
MONROE	17814	38	0.2133154	28.57143	MULTIPOLYGON (((-75.51463 4…

Your interpretation:

It seems that there isn’t that clear of a relationship. Further analysis seems to be needed. Many of the top 10 counties with high OD deaths to total population seems to have 0 underserved tracts in the whole county. But maybe this could be exptected given that counties with a bigger opioid crisis might keep the county better prepped. There doesn’t seem to be a large correlation with population either. The map seems to show that the majority of OD’s seem to be concentrated in counties with large populations, such as Philadelphia. This is also to be expected. Maybe further analyses without the outliers may help.