| tutorial-id |
none |
stops |
| name |
question |
Ansh Patel |
| email |
question |
anshbrunch@gmail.com |
| introduction-1 |
question |
Wisdom, justice, courage, temperance |
| the-question-1 |
exercise |
library(tidyverse) |
| the-question-2 |
exercise |
library(primer.data) |
| the-question-3 |
question |
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| the-question-4 |
question |
Because we are studying arrests during traffic stops, the outcome variable should be arrested, which shows whether an arrest was made or not. |
| the-question-5 |
question |
One example is giving a warning instead of a ticket; we could manipulate this by deciding which drivers get a warning. |
| the-question-6 |
question |
One outcome if the driver wears a mask, and one outcome if the driver does not wear a mask.
There are two because for each driver, only one condition can be true at a time (mask or no mask), but we are interested in what would happen under both scenarios to find the causal effect. |
| the-question-7 |
question |
For a single driver, the treatment variable mask can be either mask = 1 (wearing a mask) or mask = 0 (not wearing a mask).
Suppose that if the driver wears a mask, they would not be arrested (outcome = 0), but if they do not wear a mask, they would be arrested (outcome = 1).
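In potential-outcomes notation, this comparison can be sketched as follows (the symbols $Y_i(\cdot)$ and $\tau_i$ are labels assumed here for illustration, not taken from the tutorial):
\[
\tau_i = Y_i(\text{mask} = 1) - Y_i(\text{mask} = 0)
\]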
The causal effect for this driver would be 0 − 1 = −1, meaning wearing a mask reduces the chance of arrest for this driver by one unit. |
| the-question-8 |
question |
One variable in stops that likely has an important connection to arrested is race, since race often shows differences in stop outcomes.
If we wanted another good predictor that isn’t in the data, a useful one could be reason for stop, which might strongly affect the chance of arrest (for example, speeding vs. driving under the influence). |
| the-question-9 |
question |
Two groups that differ by race and might have different average arrest rates are Black drivers and White drivers.
Studies often show that Black drivers have higher average arrest rates compared to White drivers during traffic stops. |
| the-question-10 |
question |
What is the probability that a driver is arrested during a traffic stop, given their race? |
| wisdom-1 |
question |
Wisdom in data science means using good judgment, asking the right questions, checking your work, and making sure results are meaningful and responsible. |
| wisdom-2 |
question |
A Preceptor Table is a clear reference table that defines and explains each variable in your dataset. It shows what each variable represents, its possible values, units (if any), and any special notes needed to interpret it correctly. This helps anyone using the data understand exactly what each column means. |
| wisdom-3 |
question |
A Preceptor Table lists and describes all the key parts of a dataset. It includes the outcomes you are trying to measure or predict, the covariates or predictors you are using, and the units for each variable so it’s clear what is being counted or measured. It often explains what each value means, any coding used, and any special notes needed to interpret the data correctly. |
| wisdom-4 |
question |
> show_file(".gitignore")
*Rproj
> |
| wisdom-5 |
question |
For this problem, the units are individual traffic stops — each row in the dataset represents one stop involving one driver. |
| wisdom-6 |
question |
The outcome variable for this problem is arrested, which shows whether an arrest was made during the traffic stop. |
| wisdom-7 |
question |
Some useful covariates for this problem could include race, sex, age, time of day, reason for the stop, location or zone, whether the driver was speeding or intoxicated, whether there was a warrant, and the driver’s prior record if available. |
| wisdom-8 |
question |
None |
| wisdom-9 |
question |
When the stop is made |
| wisdom-10 |
question |
Difference between 2 outcomes |
| wisdom-11 |
question |
You can only see one outcome |
| wisdom-12 |
question |
It doesn't? |
| wisdom-13 |
question |
The Preceptor Table for this problem lists all the key variables involved in the traffic stop data. It includes the outcome variable, arrested, which indicates whether an arrest was made during the stop. The table also describes the key covariates—race, sex, age, and zone—explaining what each variable represents and their possible values. For example, race and sex are categorical variables showing the driver’s demographic group, age is a numerical variable measured in years, and zone indicates the location of the stop. The table specifies the units for each variable and any coding or special notes needed to interpret the data correctly. |
| wisdom-14 |
question |
> show_file("stops.qmd", start = -5)
```{r}
#| message: false
library(tidyverse)
library(primer.data)
```
> |
| wisdom-15 |
question |
Validity means that the data and the measurements accurately represent what they are supposed to measure. In other words, the information truly reflects the real-world concepts or events we want to study, without systematic errors or mistakes. |
| wisdom-16 |
question |
One reason validity might not hold is if the column for the outcome variable contains errors or inconsistencies, such as arrests being recorded incorrectly or missing. This would mean the column does not accurately reflect the true arrest status. |
| wisdom-17 |
question |
Arrests during traffic stops are an important measure of law enforcement outcomes and can be influenced by various demographic factors. This analysis uses data from nearly 400,000 traffic stops conducted by the New Orleans Police Department between 2011 and 2018 to explore how race relates to the likelihood of arrest. |
| wisdom-18 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "last")
---
title: "Stops"
format: html
author: Ansh Patel
---
```{r}
#| message: false
library(tidyverse)
library(primer.data)
```
Arrests during traffic stops are an important measure of law enforcement outcomes and can be influenced by various demographic factors. This analysis uses data from nearly 400,000 traffic stops conducted by the New Orleans Police Department between 2011 and 2018 to explore how race relates to the likelihood of arrest.
> |
| justice-1 |
question |
Validity, stability, representativeness, unconfoundedness |
| justice-2 |
question |
A Population Table describes the entire group of individuals or units that a study aims to understand or make conclusions about. It lists the key variables and their values as they exist in the full population, not just in the sample or dataset. |
| justice-3 |
question |
Stability means that the relationships between variables stay consistent over time or across different groups. In data science, it assumes that patterns found in one dataset or time period will hold true in others, so models or conclusions remain reliable when applied beyond the original data. |
| justice-4 |
question |
One reason stability might not hold is if policing practices or arrest policies changed over the years covered by the data, causing the relationship between race and arrest rates to shift over time. |
| justice-5 |
question |
Representativeness means that the data we collect accurately reflects the larger population we want to learn about. In other words, the sample should include the same kinds of people and characteristics as the whole population, so conclusions drawn from the data apply broadly and aren’t biased. |
| justice-6 |
question |
One reason representativeness might not hold is if certain types of stops or drivers are less likely to be recorded in the data—for example, if stops involving arrests have more missing information—so the data does not fully reflect the characteristics of the entire population of traffic stops. |
| justice-7 |
question |
One reason representativeness might not hold is if the Preceptor Table does not include all relevant variables or categories present in the population, causing it to misrepresent the true diversity or characteristics of the population. |
| justice-8 |
question |
Unconfoundedness means that all important factors that affect both the treatment and the outcome are measured and included in the analysis. This ensures that any observed relationship between the treatment and outcome is not distorted by hidden variables, allowing for a clearer estimate of the treatment’s true effect. |
| justice-9 |
question |
So far, I have been working with traffic stop data from New Orleans, which includes information on arrests, driver demographics, and stop locations. My goal is to understand how race and other factors relate to the likelihood of arrest. However, a key problem is that missing data, especially for stops that resulted in arrest, may bias the results and limit the validity of my conclusions. |
| courage-1 |
question |
Courage in data analysis means being willing to face difficult or uncertain results honestly, even if they challenge your expectations or desires. It involves asking tough questions, acknowledging mistakes, and making decisions based on the evidence rather than convenience or pressure. Courage also means standing by your findings while being open to revising them when new information arises. |
| courage-2 |
exercise |
library(tidymodels) |
| courage-3 |
exercise |
library(broom) |
| courage-4 |
question |
\[
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
\]
where
\[
Y \sim \text{Bernoulli}(\rho)
\]
and
\[
\rho = P(Y=1)
\] |
| courage-5 |
question |
> tutorial.helpers::show_file("stops.qmd", pattern = "library")
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
> |
| courage-6 |
exercise |
linear_reg(engine = "lm") |
| courage-7 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |
| courage-8 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex, data = x) |>
tidy(conf.int = TRUE) |
| courage-9 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-10 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-11 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race, data = x) |>
tidy(conf.int = TRUE) |
| courage-12 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race*zone, data = x) |>
tidy(conf.int = TRUE) |
| courage-13 |
exercise |
fit_stops |
| courage-15 |
exercise |
library(easystats) |
| courage-17 |
exercise |
check_predictions(extract_fit_engine(fit_stops)) |
| courage-18 |
question |
\[
\hat{Y} = \beta_0
+ \beta_1 \, \text{sim}_1 + \beta_2 \, \text{sim}_2 + \beta_3 \, \text{sim}_3 + \beta_4 \, \text{sim}_4 + \beta_5 \, \text{sim}_5
\]
\[
+ \beta_6 \, \text{sim}_6 + \beta_7 \, \text{sim}_7 + \beta_8 \, \text{sim}_8 + \beta_9 \, \text{sim}_9 + \beta_{10} \, \text{sim}_{10}
\]
\[
+ \cdots
\]
\[
+ \beta_{46} \, \text{sim}_{46} + \beta_{47} \, \text{sim}_{47} + \beta_{48} \, \text{sim}_{48} + \beta_{49} \, \text{sim}_{49} + \beta_{50} \, \text{sim}_{50}
\] |
| courage-19 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -8)
\]
\[
+ \beta_{46} \, \text{sim}_{46} + \beta_{47} \, \text{sim}_{47} + \beta_{48} \, \text{sim}_{48} + \beta_{49} \, \text{sim}_{49} + \beta_{50} \, \text{sim}_{50}
\]
```{r}
#| cache: true
```
> |
| courage-20 |
question |
> tutorial.helpers::show_file(".gitignore")
*Rproj
*_cache
> |
| courage-21 |
exercise |
tidy(fit_stops, conf.int = TRUE) |
| courage-22 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -10)
gt() %>%
tab_header(title = "Model Coefficients with Confidence Intervals") %>%
fmt_number(columns = c("estimate", "conf.low", "conf.high"), decimals = 2) %>%
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "Lower 95% CI",
conf.high = "Upper 95% CI"
)
```
> |
| courage-23 |
question |
I model the probability of arrest, a binary outcome indicating whether an arrest was made or not, as a logistic function of race, sex, age, and zone. |
| temperance-1 |
question |
Temperance in data science means practicing self-control and balance when analyzing data. It involves avoiding overconfidence, resisting the urge to over-interpret results, and being careful not to overcomplicate models or jump to conclusions without sufficient evidence. Temperance helps ensure that findings are reliable, honest, and responsibly communicated. |
| temperance-2 |
question |
The estimate of 0.06 for sexMale means that, holding other variables constant, being male is associated with an increase of about 0.06 (6 percentage points) in the predicted probability of arrest compared to being female. |
| temperance-3 |
question |
The estimate of -0.04 for raceWhite means that, all else being equal, identifying as White is associated with a decrease of about 0.04 (4 percentage points) in the predicted probability of arrest compared to the reference race group. |
| temperance-4 |
question |
The intercept estimate of 0.18 represents the predicted probability of arrest when all other variables are at their reference levels or zero. |
| temperance-5 |
exercise |
library(marginaleffects) |
| temperance-6 |
question |
We are investigating arrests during traffic stops and how different factors relate to the likelihood of being arrested. The specific question we are trying to answer is how race and other characteristics like sex and location influence the probability of arrest. |
| temperance-7 |
exercise |
plot_predictions(fit_stops, condition = c("sex", "race")) |
| temperance-8 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) |> as_tibble() |>
group_by(zone, sex) |>
mutate(sort_order = estimate[race == "Black"]) |>
ungroup() |>
mutate(zone = reorder_within(zone, sort_order, sex)) |>
ggplot(aes(x = zone,
color = race)) +
geom_errorbar(aes(ymin = conf.low,
ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~ sex, scales = "free_x") +
scale_x_reordered() +
theme(axis.text.x = element_text(size = 8)) +
scale_y_continuous(labels = percent_format()) |
| temperance-9 |
question |
#| echo: false
library(marginaleffects)
library(dplyr)
library(ggplot2)
library(scales)
# Fit the model
fit_stops <- glm(arrested ~ race, data = stops, family = binomial)
# Get predictions grouped by race
preds <- plot_predictions(
fit_stops,
condition = "race",
draw = FALSE
) |>
as_tibble()
# Plot
ggplot(preds, aes(x = race, y = estimate, color = race)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Predicted Probability of Arrest by Race",
x = "Race",
y = "Predicted Probability"
) +
theme_minimal() +
theme(legend.position = "none") |
| temperance-10 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -8)
labs(
title = "Predicted Probability of Arrest by Race",
x = "Race",
y = "Predicted Probability"
) +
theme_minimal() +
theme(legend.position = "none")
```
> |
| temperance-11 |
question |
For example, we estimate that the predicted probability of arrest for White drivers is about 4 percentage points lower than for Black drivers, with a 95% confidence interval ranging from about 3 to 6 percentage points. |
| temperance-12 |
question |
Our estimates for the difference in arrest probabilities might be inaccurate because there could be unmeasured factors, like the reason for the stop or prior driving record, that are related to both race and the likelihood of arrest. If these confounding variables are not included, our model may overstate or understate the true difference. Because of this, the true difference might be smaller: for example, the actual gap could be closer to 2 percentage points, with a wider confidence interval ranging from 0 to 4 percentage points. |
| temperance-13 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
format: html
author: Ansh Patel
---
```{r}
#| message: false
#| echo: false
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(marginaleffects)
```
Arrests during traffic stops are an important measure of law enforcement outcomes and can be influenced by various demographic factors. This analysis uses data from nearly 400,000 traffic stops conducted by the New Orleans Police Department between 2011 and 2018 to explore how race relates to the likelihood of arrest. So far, I have been working with traffic stop data from New Orleans, which includes information on arrests, driver demographics, and stop locations. My goal is to understand how race and other factors relate to the likelihood of arrest. However, a key problem is that missing data, especially for stops that resulted in arrest, may bias the results and limit the validity of my conclusions. I model the probability of arrest, a binary outcome indicating whether an arrest was made or not, as a logistic function of race, sex, age, and zone.
```{r}
#| label: eda
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
```
\[
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
\]
where
\[
Y \sim \text{Bernoulli}(\rho)
\]
and
\[
\rho = P(Y=1)
\]
\[
\hat{Y} = \beta_0
+ \beta_1 \, \text{sim}_1 + \beta_2 \, \text{sim}_2 + \beta_3 \, \text{sim}_3 + \beta_4 \, \text{sim}_4 + \beta_5 \, \text{sim}_5
\]
\[
+ \beta_6 \, \text{sim}_6 + \beta_7 \, \text{sim}_7 + \beta_8 \, \text{sim}_8 + \beta_9 \, \text{sim}_9 + \beta_{10} \, \text{sim}_{10}
\]
\[
+ \cdots
\]
\[
+ \beta_{46} \, \text{sim}_{46} + \beta_{47} \, \text{sim}_{47} + \beta_{48} \, \text{sim}_{48} + \beta_{49} \, \text{sim}_{49} + \beta_{50} \, \text{sim}_{50}
\]
```{r}
#| cache: true
```
```{r}
#| echo: false
fit <- lm(arrested ~ race + sex + age + zone, data = stops)
tidy_results <- tidy(fit, conf.int = TRUE)
library(gt)
tidy_results %>%
select(term, estimate, conf.low, conf.high) %>%
gt() %>%
tab_header(title = "Model Coefficients with Confidence Intervals") %>%
fmt_number(columns = c("estimate", "conf.low", "conf.high"), decimals = 2) %>%
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "Lower 95% CI",
conf.high = "Upper 95% CI"
)
```
```{r}
#| echo: false
library(marginaleffects)
library(dplyr)
library(ggplot2)
library(scales)
# Fit the model
fit_stops <- glm(arrested ~ race, data = stops, family = binomial)
# Get predictions grouped by race
preds <- plot_predictions(
fit_stops,
condition = "race",
draw = FALSE
) |>
as_tibble()
# Plot
ggplot(preds, aes(x = race, y = estimate, color = race)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Predicted Probability of Arrest by Race",
x = "Race",
y = "Predicted Probability"
) +
theme_minimal() +
theme(legend.position = "none")
```
> |
| temperance-14 |
question |
https://brunchmaster-ap.github.io/stops/ |
| temperance-15 |
question |
https://github.com/brunchmaster-ap/stops |
| minutes |
question |
75 |