| tutorial-id |
none |
131-stops |
| name |
question |
Luke Li |
| email |
question |
lukelex2008@gmail.com |
| introduction-1 |
question |
wisdom justice courage temperance |
| introduction-2 |
question |
> show_file(".gitignore")
stops_files
> |
| introduction-3 |
question |
> show_file("stops.qmd", chunk = "Last")
#| message: false
library(tidyverse)
library(primer.data)
> |
| introduction-4 |
question |
> library(tidyverse)
── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
> |
| introduction-5 |
question |
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| introduction-6 |
question |
difference in potential outcomes |
| introduction-7 |
question |
only one potential outcome can be observed |
| introduction-8 |
question |
arrested |
| introduction-9 |
question |
day time night time |
| introduction-10 |
question |
2, one for mask, one for no mask |
| introduction-11 |
question |
0 - no mask - no arrest, 1 - mask - arrest. Causal effect is difference between no arrest and arrest |
| introduction-12 |
question |
time |
| introduction-13 |
question |
white vs black |
| introduction-14 |
question |
How is the race of a driver related to whether or not they get arrested at a traffic stop? |
| wisdom-1 |
question |
preceptor table examination of data research question |
| wisdom-2 |
question |
data in least rows and columns to find quantity of interest |
| wisdom-3 |
question |
Rows are units, columns for outcomes and covariates |
| wisdom-4 |
question |
individual drivers |
| wisdom-5 |
question |
arrested |
| wisdom-6 |
question |
race |
| wisdom-7 |
question |
none |
| wisdom-8 |
question |
current |
| wisdom-9 |
question |
rows are units of individual drivers stopped, column for outcome of arrested or not, column for covariate of race |
| wisdom-10 |
question |
What is difference in whether or not someone is arrested between white people and black people? |
| wisdom-11 |
question |
Often times, a person's appearance unfortunately factors into how others treat them. Using data from the Stanford Open Policing Project, we use 400,000 traffic stops from New Orleans to find the difference in arrest rate between Black and White drivers adjusting for other variables. |
| justice-1 |
question |
population table, validity, stability, representativeness, unconfoundedness |
| justice-2 |
question |
Consistency between columns from data and columns from preceptor table |
| justice-3 |
question |
Arrested is a very objective outcome. Validity should hold |
| justice-4 |
question |
unit/time combination from population where data and preceptor table are derived from |
| justice-5 |
question |
individual drivers stopped / year |
| justice-6 |
question |
no change over time |
| justice-7 |
question |
conditions that warrant an arrest changed since that time period |
| justice-8 |
question |
data and preceptor table are random samples from same popoulation |
| justice-9 |
question |
traffic stop data only came from reported traffic stops, what if some werent reported |
| justice-10 |
question |
New Orleans is not representative of the whole country |
| justice-11 |
question |
Treatments are randomly assigned and independent of covariates |
| justice-12 |
question |
> library(tidymodels)
── Attaching packages ────────────────────────────── tidymodels 1.3.0 ──
✔ broom 1.0.8 ✔ rsample 1.3.0
✔ dials 1.4.0 ✔ tune 1.3.0
✔ infer 1.0.8 ✔ workflows 1.2.0
✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
✔ parsnip 1.3.2 ✔ yardstick 1.3.2
✔ recipes 1.3.1
── Conflicts ───────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
> |
| justice-13 |
question |
> library(broom)
> |
| justice-14 |
question |
$$
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
$$ |
| justice-15 |
question |
A potential weakness of our model is that the dataset is heavily modified from the actual study, and is reduced down to around 400,000 entries, therefore leaving out nearly 3.1 million entries. The deletion of the entries may have led to a misrepresentation of the population, and this would mess up our final predictions because it would be providing values that don't represent to the area, causing our entire model to be unrealistic. |
| courage-1 |
question |
data generating mechanism |
| courage-2 |
exercise |
linear_reg(engine = "lm") |
| courage-3 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |
| courage-4 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |>
tidy(conf.int = TRUE) |
| courage-5 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ race, data = x) |
| courage-6 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-7 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race, data = x) |>
tidy(conf.int = TRUE) |
| courage-8 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race*zone, data = x) |>
tidy(conf.int = TRUE) |
| courage-9 |
exercise |
fit_stops |
| courage-10 |
question |
> x <- stops |>
+ filter(race %in% c("black", "white")) |>
+ mutate(race = str_to_title(race),
+ sex = str_to_title(sex))
+
+ fit_stops <- linear_reg() |>
+ set_engine("lm") |>
+ fit(arrested ~ sex + race*zone, data = x)
> |
| courage-11 |
question |
> library(easystats)
# Attaching packages: easystats 0.7.4 (red = needs update)
✖ bayestestR 0.16.0 ✖ correlation 0.8.7
✖ datawizard 1.1.0 ✔ effectsize 1.0.1
✖ insight 1.3.0 ✖ modelbased 0.11.2
✖ performance 0.14.0 ✖ parameters 0.26.0
✔ report 0.6.1 ✔ see 0.11.0
Restart the R-Session and update packages with `easystats::easystats_update()`.
> |
| courage-12 |
question |
> check_predictions(extract_fit_engine(fit_stops))
> |
| courage-13 |
question |
$$
\widehat{\text{arrested}} = 0.177
+ 0.0614 \cdot \text{sex}_{\text{Male}}
- 0.0445 \cdot \text{race}_{\text{White}}
+ 0.0146 \cdot \text{zone}_{\text{B}}
+ 0.00610 \cdot \text{zone}_{\text{C}}
+ 0.0781 \cdot \text{zone}_{\text{D}}
+ 0.00190 \cdot \text{zone}_{\text{E}}
- 0.00271 \cdot \text{zone}_{\text{F}}
+ 0.0309 \cdot \text{zone}_{\text{G}}
+ 0.0757 \cdot \text{zone}_{\text{H}}
+ \text{(interaction terms for race and zone)}
$$ |
| courage-14 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race*zone, data = x)
> |
| courage-15 |
question |
> tutorial.helpers::show_file(".gitignore")
stops_files
*_cache
> |
| courage-16 |
exercise |
tidy(fit_stops, conf.int = TRUE) |
| courage-17 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
library(dplyr)
library(gt)
tidy(fit_stops, conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high) |>
mutate(
estimate = round(estimate, 3),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3),
`95% CI` = paste0("[", conf.low, ", ", conf.high, "]")
) |>
select(term, estimate, `95% CI`) |>
gt() |>
tab_header(
title = "Model Coefficients",
subtitle = "Estimates and 95% Confidence Intervals"
) |>
cols_label(
term = "Term",
estimate = "Estimate"
) |>
fmt_number(
columns = estimate,
decimals = 3
) |>
tab_options(
table.font.size = "small",
column_labels.font.weight = "bold"
)
> |
| courage-18 |
question |
We model the log odds of getting arrested at a traffic stop as a logistic function of sex and race. |
| temperance-1 |
question |
Using our data generating mechanism and interpreting results; humility reminds us that it is always a lie. |
| temperance-2 |
question |
Holding all other variables constant, the if the individual is male, then their probability of getting arrested increases by 0.06 |
| temperance-3 |
question |
When comparing white people with other races, white people have a 0.04 lower value for arrested, meaning that they are less likely to get arrested, relative to other races, conditional on the other variables in the model. |
| temperance-4 |
question |
If all other variables are at their base value, then the chance of getting arrested at a traffic stop is 0.18. |
| temperance-5 |
question |
> library(marginaleffects)
> |
| temperance-6 |
question |
What is the difference in chance of getting arrested at a traffic stop between black and white people? |
| temperance-7 |
question |
> predictions(fit_stops)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.179 0.00343 52.2 <0.001 Inf 0.173 0.186
0.142 0.00419 33.8 <0.001 828.0 0.133 0.150
0.250 0.00451 55.5 <0.001 Inf 0.241 0.259
0.142 0.00419 33.8 <0.001 828.0 0.133 0.150
0.232 0.01776 13.1 <0.001 127.6 0.198 0.267
--- 378457 rows omitted. See ?print.marginaleffects ---
0.208 0.00390 53.4 <0.001 Inf 0.201 0.216
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.189 0.00545 34.7 <0.001 874.0 0.179 0.200
Type: numeric |
| temperance-8 |
question |
> plot_predictions(fit_stops, by = "sex")
> |
| temperance-9 |
question |
> plot_predictions(fit_stops, condition = "sex")
> |
| temperance-10 |
question |
> plot_predictions(fit_stops, condition = c("sex", "race"))
> |
| temperance-11 |
question |
library(ggplot2)
library(dplyr)
# Here's the polished plot
plot_predictions(fit_stops, condition = c("sex", "race")) |>
ggplot(aes(x = sex, y = estimate, color = race)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
width = 0.1, size = 0.8) +
scale_color_manual(values = c("Black" = "#F8766D", "White" = "#00BFC4")) +
labs(
title = "Arrest Probability by Sex and Race",
subtitle = "Black individuals—especially men—are arrested at higher rates than White counterparts.",
x = "Sex",
y = "Estimated Probability of Arrest",
color = "Race",
caption = "Source: Model predictions from study dataset"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
plot.caption = element_text(size = 10, face = "italic"),
legend.position = "right",
panel.grid.minor = element_blank()
) |
| temperance-12 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
plot_predictions(fit_stops, condition = c("sex", "race")) |>
ggplot(aes(x = sex, y = estimate, color = race)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error),
width = 0.1, size = 0.8) +
scale_color_manual(values = c("Black" = "#F8766D", "White" = "#00BFC4")) +
labs(
title = "Arrest Probability by Sex and Race",
subtitle = "Black individuals—especially men—are arrested at higher rates than White counterparts.",
x = "Sex",
y = "Estimated Probability of Arrest",
color = "Race",
caption = "Source: Model predictions from study dataset"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
plot.caption = element_text(size = 10, face = "italic"),
legend.position = "right",
panel.grid.minor = element_blank()
)
> |
| temperance-13 |
question |
One quantity of interest and a measure of its uncertainty is the coefficient for having the male sex. When comparing men with women, men have a 0.06 higher value for arrested, meaning that they are more likely to get arrested, relative to women, conditional on the other variables in the model. The 95% confidence interval for this value ranges from 0.059 to 0.064. |
| temperance-14 |
question |
The estimates may be wrong because the |
| temperance-15 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
format: html
author: Luke Li
execute:
echo: false
---
```{r}
#| message: false
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(gt)
library(marginaleffects)
```
```{r}
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race*zone, data = x)
```
```{r}
library(ggplot2)
plot_predictions(fit_stops, condition = c("sex", "race")) +
labs(
title = "Arrest Probability Varies by Sex and Race",
subtitle = "Black males have the highest predicted probability of arrest; disparities exist across both race and gender.",
x = "Sex",
y = "Estimated Probability of Being Arrested",
caption = "Source: Model predictions from fitted logistic regression on stop data"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
plot.caption = element_text(size = 10, face = "italic"),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.title = element_text(size = 12),
legend.text = element_text(size = 11)
) +
scale_color_manual(
values = c("Black" = "#F8766D", "White" = "#00BFC4")
)
```
```{r}
tidy(fit_stops, conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high) |>
mutate(
estimate = round(estimate, 3),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3),
`95% CI` = paste0("[", conf.low, ", ", conf.high, "]")
) |>
select(term, estimate, `95% CI`) |>
gt() |>
tab_header(
title = "Model Coefficients",
subtitle = "Estimates and 95% Confidence Intervals"
) |>
cols_label(
term = "Term",
estimate = "Estimate"
) |>
fmt_number(
columns = estimate,
decimals = 3
) |>
tab_options(
table.font.size = "small",
column_labels.font.weight = "bold"
)
```
$$
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
$$
$$
\widehat{\text{arrested}} = 0.177
+ 0.0614 \cdot \text{sex}_{\text{Male}}
- 0.0445 \cdot \text{race}_{\text{White}}
+ 0.0146 \cdot \text{zone}_{\text{B}}
+ 0.00610 \cdot \text{zone}_{\text{C}}
+ 0.0781 \cdot \text{zone}_{\text{D}}
+ 0.00190 \cdot \text{zone}_{\text{E}}
- 0.00271 \cdot \text{zone}_{\text{F}}
+ 0.0309 \cdot \text{zone}_{\text{G}}
+ 0.0757 \cdot \text{zone}_{\text{H}}
+ \text{(interaction terms for race and zone)}
$$
Often times, a person's appearance unfortunately factors into how others treat them. Using data from the Stanford Open Policing Project, we use 400,000 traffic stops from New Orleans to find the difference in arrest rate between Black and White drivers adjusting for other variables. A potential weakness of our model is that the dataset is heavily modified from the actual study, and is reduced down to around 400,000 entries, therefore leaving out nearly 3.1 million entries. The deletion of the entries may have led to a misrepresentation of the population, and this would mess up our final predictions because it would be providing values that don't represent to the area, causing our entire model to be unrealistic. We modeled the probability of getting arrested at a traffic stop as a linear function of sex and the product of race and zone. One quantity of interest and a measure of its uncertainty is the coefficient for having the male sex. When comparing men with women, men have a 6% higher probability for getting arrested, conditional on the other variables in the model. The 95% confidence interval for this value ranges from 5.9% to 6.4%.Based on all of the errors in our data that we collected during Justice, our original data source to base our model on was already flawed. Additionally, our model can only predict estimates on paper, based on an ideal world. It doesn’t account for many factors of the real world that we, ourselves, can’t even describe.
> |
| temperance-16 |
question |
https://lukearoni.github.io/stops/ |
| temperance-17 |
question |
https://github.com/lukearoni/stops |
| minutes |
question |
70 |