id submission_type answer
tutorial-id none 131-stops
name question Neelam Arshad
email question aneelam888@gmail.com
introduction-1 question Wisdom, Justice, Courage and Temperance.
introduction-2 question > show_file(".gitignore") Error in `show_file()`: ! could not find function "show_file" > library(tutorial.helpers) > show_file(".gitignore") stops_files >
introduction-3 question > show_file("stops.qmd", chunk = "Last") #| message: false library(tidyverse) library(primer.data) >
introduction-4 question > library(tidyverse) ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.2 ✔ tibble 3.3.0 ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ✔ purrr 1.1.0 ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package to force all conflicts to become errors Warning message: package ‘purrr’ was built under R version 4.5.1 >
introduction-5 question stops {primer.data} R Documentation New Orleans Traffic Stops Data Description This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. Usage stops Format A tibble with about 400,000 observations and 7 variables: date date variable indicating the date of the stop time time variable indicating the time of the stop zone character variable indicating the zone of the officer conducting the stop race character variable indicating the race of the driver sex character variable indicating the sex of the driver age integer variable indicating the age of the driver arrested 0/1 variable indicating whether an arrest was made Details The dataset includes information about the date, time, and location of each stop, as well as demographic details about the driver and the outcomes of the stop. The data covers traffic stops from July 1, 2011 to July 18, 2018. Any records with missing values were deleted. This might cause some issues because stops which resulted in an arrest were 4 times more likely to feature a missing value for 'age'. Author(s) Sanaka Dash Source https://openpolicing.stanford.edu/data/ [Package primer.data version 0.7.2.9011 Index]
introduction-6 question A causal effect is the difference between two potential outcomes.
introduction-7 question We can observe only one potential outcome for a situation.
introduction-8 question We can use arrested as our outcome variable.
introduction-9 question A face mask can serve as an imaginary treatment variable. It is binary: 1 = the person wears a mask; 0 = the person does not wear a mask. This variable is manipulable in theory, since a person can choose to wear or not wear a mask. In a causal model, we might use it to estimate whether wearing a mask influences the likelihood of arrest.
introduction-10 question For each driver, there are two potential outcomes: (1) the outcome if the person wears a mask (mask = 1), for example not arrested; (2) the outcome if the person does not wear a mask (mask = 0), for example arrested.
introduction-11 question Consider one person (one unit), coding arrested as 1 and not arrested as 0. If the person wears a mask (mask = 1), we guess the potential outcome is not arrested (0). If the person does not wear a mask (mask = 0), we guess the potential outcome is arrested (1). The causal effect of wearing a mask for this person is the outcome with the mask minus the outcome without it: 0 − 1 = −1. In other words, wearing a mask changes this individual's outcome from arrested to not arrested.
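The causal-effect arithmetic in this answer can be sketched in R. This is a minimal illustration, assuming the coding arrested = 1 and not arrested = 0. The names `y_mask` and `y_no_mask` are hypothetical, and the values are the guessed potential outcomes from the answer, not observed data.

```r
# Hypothetical potential outcomes for one driver (guesses, not observed data).
# Coding assumption: 1 = arrested, 0 = not arrested.
y_mask    <- 0  # outcome if the driver wears a mask (treatment)
y_no_mask <- 1  # outcome if the driver does not wear a mask (control)

# Causal effect = outcome under treatment minus outcome under control.
causal_effect <- y_mask - y_no_mask
print(causal_effect)
```

With real data only one of the two potential outcomes is ever observed for a given unit, so this subtraction cannot be carried out row by row; it has to be estimated.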
introduction-12 question Age
introduction-13 question We can compare the following two groups: Group 1: Black individuals Group 2: White individuals These groups differ in their race value, and based on historical data and social context, they may also show different average arrest rates. For example: Black individuals might have a higher average probability of arrest due to systemic bias in policing. White individuals might have a lower average probability of arrest, even under similar conditions.
introduction-14 question How much more likely are Black drivers to be arrested during a traffic stop than White drivers?
wisdom-1 question Components of wisdom are the preceptor table and data.
wisdom-2 question A Preceptor Table is the smallest possible table, in rows and columns, that we would need in order to answer our question.
wisdom-3 question The components of a Preceptor Table are: rows, which are the units; outcome columns, with at least two potential outcomes when the problem is causal; covariate columns, which are required to answer our question; and, if the problem is causal, a treatment column. A predictive problem does not have any treatments.
wisdom-4 question Drivers / Individuals
wisdom-5 question Arrested
wisdom-6 question The covariates that might be useful are race and age.
wisdom-7 question Since this is a predictive model, there are no treatments.
wisdom-8 question The Preceptor Table refers to the moment of the traffic stop, 2015.
wisdom-9 question The Preceptor Table would include: Units: Each individual stop Outcome: Whether the driver was arrested Covariates: Race, age, gender, zone.
wisdom-10 question Are Black drivers more likely to be arrested than White drivers?
wisdom-11 question Racial disparities in law enforcement remain a major concern in modern society, especially when it comes to outcomes like arrests during traffic stops. Using a dataset of traffic stops sourced from the Open Policing project, specifically derived from their New Orleans dataset, we examine whether Black drivers are arrested at higher rates than White drivers, even after accounting for age, gender, and zone.
justice-1 question The components of Justice are the Population Table and four assumptions: validity, stability, representativeness, and unconfoundedness.
justice-2 question Validity is the consistency, or the lack thereof, between the columns of our data set and the corresponding columns in our Preceptor Table.
justice-3 question The column for "arrested" in the dataset may not perfectly reflect the true legal status of each stop, especially if arrest procedures vary by officer or zone. Similarly, the column labeled "race" might be based on officer perception rather than self-reported race, which could compromise validity.
justice-4 question A Population Table has one row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.
justice-5 question Each row in the Population Table represents a unique driver who was pulled over at a specific point in time (e.g., date and hour). So, a unit/time combination might be: Driver A at 10:45 PM on May 7th, 2010.
justice-6 question Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.
justice-7 question The relationship between race and arrest might not be stable over time due to changes in department policy, public scrutiny, or training procedures. For instance, an increase in public attention to racial profiling may lead to fewer arrests of Black drivers over time, even if other factors remain constant.
justice-8 question Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.
justice-9 question Unconfoundedness is an assumption that applies only to causal models. It means that treatment assignment is independent of the potential outcomes, as it is when treatment is assigned at random.
justice-10 question The dataset includes only a subset of all traffic stops: those that were properly recorded and shared. If some zones or times of day are underreported (e.g., due to technology failure or officer discretion), the sample may not be representative of all stops in the population.
justice-11 question Unconfoundedness is an assumption that applies only to causal models. It means that treatment assignment is independent of the potential outcomes, as it is when treatment is assigned at random.
justice-12 question > library(tidymodels) ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidymodels 1.3.0 ── ✔ broom 1.0.8 ✔ rsample 1.3.0 ✔ dials 1.4.0 ✔ tune 1.3.0 ✔ infer 1.0.8 ✔ workflows 1.2.0 ✔ modeldata 1.4.0 ✔ workflowsets 1.1.1 ✔ parsnip 1.3.2 ✔ yardstick 1.3.2 ✔ recipes 1.3.1 ── Conflicts ───────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ✖ scales::discard() masks purrr::discard() ✖ dplyr::filter() masks stats::filter() ✖ recipes::fixed() masks stringr::fixed() ✖ dplyr::lag() masks stats::lag() ✖ yardstick::spec() masks readr::spec() ✖ recipes::step() masks stats::step() • Learn how to get started at https://www.tidymodels.org/start/ >
justice-13 question > library(broom) >
justice-14 question $$ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} $$ with $$ Y \sim \text{Bernoulli}(\rho) \quad \text{where} \quad \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} $$
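The formula in this answer can be evaluated directly. The following is a minimal sketch in base R, assuming made-up coefficients and covariate values chosen only for illustration. The names `inv_logit`, `beta`, and `x_vals` are hypothetical and nothing here is estimated from the stops data.

```r
# Inverse-logit (logistic) function: maps any real number into (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients and covariate values, for illustration only.
beta   <- c(-1.5, 0.3, -0.2)  # beta_0, beta_1, beta_2
x_vals <- c(1, 1, 0)          # the leading 1 multiplies the intercept

eta <- sum(beta * x_vals)     # linear predictor: beta_0 + beta_1*X_1 + beta_2*X_2
rho <- inv_logit(eta)         # P(Y = 1), the Bernoulli parameter

# Base R's plogis() computes the same quantity.
stopifnot(isTRUE(all.equal(rho, plogis(eta))))
print(rho)
```

Whatever value the linear predictor takes, `rho` stays strictly between 0 and 1, which is what makes it usable as a probability.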
justice-15 question However, a potential weakness in our model is that the data may not be fully representative of all stops across the city or times, which could bias our estimates.
courage-1 question The component of Courage is the data generating mechanism.
courage-2 exercise linear_reg(engine = "lm")
courage-3 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x)
courage-4 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x) |> tidy(conf.int = TRUE)
courage-5 exercise linear_reg(engine = "lm") |> fit(arrested ~ race, data = x)
courage-6 exercise linear_reg(engine = "lm") |> fit(arrested ~ race, data = x) |> tidy(conf.int = TRUE)
courage-7 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race, data = x) |> tidy(conf.int = TRUE)
courage-8 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race*zone, data = x) |> tidy(conf.int = TRUE)
courage-9 exercise fit_stops
courage-10 question > fit_stops <- linear_reg() |> + set_engine("lm") |> + fit(arrested ~ sex + race*zone, data = x) > x <- stops |> + filter(race %in% c("black", "white")) |> + mutate(race = str_to_title(race), + sex = str_to_title(sex)) + + fit_stops <- linear_reg() |> + set_engine("lm") |> + fit(arrested ~ sex + race*zone, data = x) >
courage-11 question > library(easystats) # Attaching packages: easystats 0.7.4 (red = needs update) ✖ bayestestR 0.16.0 ✖ correlation 0.8.7 ✖ datawizard 1.1.0 ✔ effectsize 1.0.1 ✖ insight 1.3.0 ✖ modelbased 0.11.2 ✖ performance 0.14.0 ✖ parameters 0.26.0 ✔ report 0.6.1 ✔ see 0.11.0 Restart the R-Session and update packages with `easystats::easystats_update()`. >
courage-12 question > check_predictions(extract_fit_engine(fit_stops)) >
courage-13 question $$ \widehat{\text{arrested}} = 0.177 + 0.0614 \cdot \text{sex}_{\text{Male}} - 0.0445 \cdot \text{race}_{\text{White}} + 0.0146 \cdot \text{zone}_B + 0.0061 \cdot \text{zone}_C + 0.0781 \cdot \text{zone}_D + 0.0019 \cdot \text{zone}_E - 0.0027 \cdot \text{zone}_F + 0.0309 \cdot \text{zone}_G + 0.0757 \cdot \text{zone}_H + \text{(interaction terms between race and zone)} $$
courage-14 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race*zone, data = x) >
courage-15 question > tutorial.helpers::show_file(".gitignore") stops_files *_cache >
courage-16 exercise tidy(fit_stops, conf.int = TRUE)
courage-17 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") #| label: table_fit_stops #| cache: true library(dplyr) library(knitr) tidy(fit_stops, conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |> slice(1:10) |> # Only showing first 10 terms; adjust or remove as needed mutate(across(where(is.numeric), ~round(., 3))) |> kable( caption = "Logistic Regression Estimates for Arrest Probability (Source: Traffic stops dataset)", col.names = c("Variable", "Estimate", "95% CI (Lower)", "95% CI (Upper)") ) >
courage-18 question We model whether a driver is arrested during a traffic stop (a binary outcome: arrested or not arrested) as a linear function of the driver’s sex, race, and the zone in which the stop occurred, including an interaction between race and zone. This allows us to estimate how these covariates are associated with the probability of an arrest.
temperance-1 question Temperance is about how we use the data generating mechanism to answer our question.
temperance-2 question Being male is associated with a 6 percentage point higher probability of being arrested during a traffic stop, compared to being female, holding other variables constant.
temperance-3 question Being White is associated with a roughly 4 percentage point lower probability of being arrested compared to being Black (the baseline race category), for stops in the reference zone and holding sex constant.
temperance-4 question The intercept of 0.18 represents the estimated probability of arrest for someone in the baseline category: a Black female driver in zone A.
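The way intercepts and coefficients are read in these answers can be checked on a toy example. This sketch uses simulated data (not the `primer.data` stops data) and base R's `lm()` with a single categorical predictor. The data frame `toy` and the arrest probabilities 0.18 and 0.25 are assumptions made up for illustration.

```r
# Simulated (hypothetical) stops: NOT the primer.data stops data.
set.seed(1)
n   <- 1000
toy <- data.frame(sex = factor(sample(c("Female", "Male"), n, replace = TRUE)))
p   <- ifelse(toy$sex == "Male", 0.25, 0.18)  # assumed arrest probabilities
toy$arrested <- rbinom(n, size = 1, prob = p)

# Linear model with a 0/1 outcome and one categorical predictor.
fit <- lm(arrested ~ sex, data = toy)

# Group proportions computed directly from the data.
prop <- tapply(toy$arrested, toy$sex, mean)

# Intercept = proportion arrested in the baseline level ("Female");
# sexMale coefficient = difference in proportions, Male minus Female.
print(coef(fit))
print(prop["Female"])
print(prop["Male"] - prop["Female"])
```

In a model with several factors and an interaction, the intercept plays the same role for the joint baseline of all factors, and each main-effect coefficient is a difference relative to that baseline.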
temperance-5 question > library(marginaleffects) Please cite the software developers who make your work possible. One package: citation("package_name") All project packages: softbib::softbib() >
temperance-6 question How does the probability of being arrested during a traffic stop vary by sex, race, and location (zone), and how do these factors contribute to disparities in policing?
temperance-7 question > predictions(fit_stops) Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % 0.179 0.00343 52.2 <0.001 Inf 0.173 0.186 0.142 0.00419 33.8 <0.001 828.0 0.133 0.150 0.250 0.00451 55.5 <0.001 Inf 0.241 0.259 0.142 0.00419 33.8 <0.001 828.0 0.133 0.150 0.232 0.01776 13.1 <0.001 127.6 0.198 0.267 --- 378457 rows omitted. See ?print.marginaleffects --- 0.208 0.00390 53.4 <0.001 Inf 0.201 0.216 0.270 0.00377 71.5 <0.001 Inf 0.262 0.277 0.270 0.00377 71.5 <0.001 Inf 0.262 0.277 0.270 0.00377 71.5 <0.001 Inf 0.262 0.277 0.189 0.00545 34.7 <0.001 874.0 0.179 0.200 Type: numeric >
temperance-8 question > plot_predictions(fit_stops, by = "sex") >
temperance-9 question > plot_predictions(fit_stops, condition = "sex") >
temperance-10 question > plot_predictions(fit_stops, condition = c("sex", "race")) > plot_predictions(fit_stops, condition = c("sex", "race"), draw = FALSE) rowid estimate std.error statistic p.value s.value conf.low conf.high df arrested zone sex race 1 1 0.2553898 0.002763715 92.40814 0 Inf 0.2499730 0.2608066 Inf 0 D Female Black 2 2 0.2402690 0.003309070 72.60922 0 Inf 0.2337834 0.2467547 Inf 0 D Female White 3 3 0.3168358 0.002589462 122.35583 0 Inf 0.3117606 0.3219111 Inf 0 D Male Black 4 4 0.3017151 0.003143758 95.97272 0 Inf 0.2955534 0.3078767 Inf 0 D Male White >
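As a rough consistency check, the zone D predictions for Black drivers printed in this answer can be rebuilt by hand from the rounded coefficients in the fitted equation (courage-13). This sketch uses only numbers that appear in this document. The variable names are hypothetical, and small gaps are expected because the intercept is reported to three decimals.

```r
# Rounded coefficients copied from the fitted equation (courage-13).
intercept <- 0.177
sex_male  <- 0.0614
zone_d    <- 0.0781

# Black is the baseline race, so a Black driver in zone D needs no race term
# and no race-by-zone interaction term.
female_black_d <- intercept + zone_d
male_black_d   <- intercept + zone_d + sex_male

# Estimates reported by plot_predictions(..., draw = FALSE) in this answer.
reported <- c(female_black = 0.2553898, male_black = 0.3168358)

# Differences between the hand calculation and the reported estimates.
print(round(c(female_black_d, male_black_d) - reported, 4))
```

The White rows cannot be rebuilt the same way from the equation as written, because the race-by-zone interaction coefficients are not listed there.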
temperance-11 question library(ggplot2) pred_data <- plot_predictions(fit_stops, condition = c("sex", "race", "zone"), draw = FALSE) ggplot(pred_data, aes(x = race, y = estimate, fill = sex)) + geom_col(position = position_dodge()) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), position = position_dodge(0.9), width = 0.2) + facet_wrap(~ zone) + labs(title = "Predicted Probabilities of Arrest by Race, Sex, and Zone", subtitle = "Black males consistently have the highest predicted arrest probabilities across zones", caption = "Source: Police stop data, model-estimated probabilities using logistic regression", y = "Predicted Probability of Arrest", x = "Race") + theme_minimal()
temperance-12 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") library(ggplot2) pred_data <- plot_predictions(fit_stops, condition = c("sex", "race", "zone"), draw = FALSE) ggplot(pred_data, aes(x = race, y = estimate, fill = sex)) + geom_col(position = position_dodge()) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), position = position_dodge(0.9), width = 0.2) + facet_wrap(~ zone) + labs(title = "Predicted Probabilities of Arrest by Race, Sex, and Zone", subtitle = "Black males consistently have the highest predicted arrest probabilities across zones", caption = "Source: Police stop data, model-estimated probabilities using logistic regression", y = "Predicted Probability of Arrest", x = "Race") + theme_minimal() >
temperance-13 question The model suggests that, all else equal, being stopped in Zone D is associated with a 7.8 percentage point increase in the probability of arrest compared to the reference zone (Zone A), with a 95% confidence interval from 7.0 to 8.6 percentage points.
temperance-14 question The estimates for the quantities of interest, such as the probability of arrest for different race and sex groups, may be wrong due to model assumptions that don’t fully reflect reality. For example, unmeasured variables like the reason for the stop, officer behavior, or neighborhood-specific crime rates could bias the results. Additionally, our model assumes that the relationships between variables are linear and additive, which may oversimplify complex social dynamics. The uncertainty may also be underestimated if the confidence intervals rely on idealized assumptions, such as independent observations or correct model specification. If systematic biases exist, for example, over-policing in certain zones, the estimated probability of arrest for Black males in Zone D (31.7%, 95% CI: [31.2%, 32.2%]) might be inflated. A more cautious alternative might be to widen the confidence interval to reflect possible model misspecification, e.g., [30.5%, 33.0%].
temperance-15 question > tutorial.helpers::show_file("stops.qmd") --- title: "Stops" author: "Neelam Arshad" format: html execute: echo: false warning: false --- ```{r} #| message: false library(tidyverse) library(primer.data) library(tidymodels) library(broom) library(marginaleffects) x <- stops |> filter(race %in% c("black", "white")) |> mutate(race = str_to_title(race), sex = str_to_title(sex)) ``` $$ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} $$ with $$ Y \sim \text{Bernoulli}(\rho) where \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} $$ $$ \widehat{\text{arrested}} = 0.177 + 0.0614 \cdot \text{sex}_{\text{Male}} - 0.0445 \cdot \text{race}_{\text{White}} + 0.0146 \cdot \text{zone}_B + 0.0061 \cdot \text{zone}_C + 0.0781 \cdot \text{zone}_D + 0.0019 \cdot \text{zone}_E - 0.0027 \cdot \text{zone}_F + 0.0309 \cdot \text{zone}_G + 0.0757 \cdot \text{zone}_H + \text{(interaction terms between race and zone)} $$ ```{r} #| cache: true fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race*zone, data = x) ``` ```{r} #| label: table_fit_stops #| cache: true library(dplyr) library(knitr) tidy(fit_stops, conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |> slice(1:10) |> # Only showing first 10 terms; adjust or remove as needed mutate(across(where(is.numeric), ~round(., 3))) |> kable( caption = "Logistic Regression Estimates for Arrest Probability (Source: Traffic stops dataset)", col.names = c("Variable", "Estimate", "95% CI (Lower)", "95% CI (Upper)") ) ``` ```{r} library(ggplot2) pred_data <- plot_predictions(fit_stops, condition = c("sex", "race", "zone"), draw = FALSE) ggplot(pred_data, aes(x = race, y = estimate, fill = sex)) + geom_col(position = position_dodge()) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), position = position_dodge(0.9), width = 0.2) + facet_wrap(~ zone) + labs(title = "Predicted Probabilities of Arrest by Race, 
Sex, and Zone", subtitle = "Black males consistently have the highest predicted arrest probabilities across zones", caption = "Source: Police stop data, model-estimated probabilities using logistic regression", y = "Predicted Probability of Arrest", x = "Race") + theme_minimal() ``` Racial disparities in law enforcement remain a major concern in modern society, especially when it comes to outcomes like arrests during traffic stops. Using a dataset of traffic stops sourced from the Open Policing project, specifically derived from their New Orleans dataset, we examine whether Black drivers are arrested at higher rates than White drivers, even after accounting for age, gender, and zone. However, a potential weakness in our model is that the data may not be fully representative of all stops across the city or times, which could bias our estimates. Our data may come from biased officers, who may target certain groups of individuals. We model the likelihood of a driver being arrested during a traffic stop (a binary outcome: arrested or not arrested) as a logistic function of the driver’s sex, race, and the zone in which the stop occurred. This allows us to estimate how these covariates are associated with the probability of an arrest. The model suggests that, all else equal, being stopped in Zone D is associated with a 7.8 percentage point increase in the probability of arrest compared to the reference zone, with a 95% confidence interval from 7.0% to 8.6%. >
temperance-16 question https://neelamarshad.github.io/stops/
temperance-17 question https://github.com/NeelamArshad/stops
minutes question 240