id submission_type answer
tutorial-id none 131-stops
name question Jishnu Veerapaneni
email question jishnuvee2035@gmail.com
introduction-1 question Wisdom, Justice, Courage, and Temperance
introduction-2 question > show_file(".gitignore") stops_files >
introduction-3 question > show_file("stops.qmd", chunk = "Last") #| message: false library(tidyverse) library(primer.data) >
introduction-4 question > library(tidyverse) ── Attaching core tidyverse packages ──────────────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.2 ✔ tibble 3.3.0 ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ✔ purrr 1.1.0 ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package to force all conflicts to become errors Warning messages: 1: package ‘purrr’ was built under R version 4.5.1 2: package ‘dplyr’ was built under R version 4.5.1 >
introduction-5 question This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department.
introduction-6 question A causal effect is the difference between 2 potential outcomes.
introduction-7 question The fundamental problem of causal inference is that you are only able to observe 1 potential outcome at a time.
introduction-8 question Number of arrests
introduction-9 question The imaginary variable can be compliance. We can manpulate this variable by assigning 0 as not being compliant and 1 as being compliant.
introduction-10 question The potential outcomes for each arrest are 2, 1 potential outcome for wearing the mask, and another potential outcome for not wearing the mask.
introduction-11 question The 2 different values for the varialbe mask can be 1 for wearing a mask and 0 for not wearing a mask, the potential outcome for wearing a mask can be arrested and the potential outcome for not wearing a mask is NOT arrested. The causal effect for a unit would be not arrested - arrested.
introduction-12 question Sex
introduction-13 question Two different groups of people could be caucasians and african-americans and the average value for arrested for caucasians might be lower than the average value of arrested for african-americans.
introduction-14 question Do black people get arrested more often than white people?
wisdom-1 question Creating a preceptor table and examining the data.
wisdom-2 question A Preceptor table is a table with the least number of rows and columns necessary to answer our question with no missing data.
wisdom-3 question The key components are the units which are the rows of the table consisting of individual drivers, the outcomes is the arrest variable, and the covariate is race which are other variables/columns necessary to answer our initially proposed question. There is also an ID_driver column.
wisdom-4 question The units for this problem are the individual drivers
wisdom-5 question Arrested
wisdom-6 question Race
wisdom-7 question There are no treatments because this is a predictive model
wisdom-8 question The moment presently studying the data/right now
wisdom-9 question The Preceptor table has the units consisting of the rows of the table consisting of individual drivers, the outcomes is the arrest variable, and the covariate is race which are other variables/columns necessary to answer our initially proposed question. There is also an ID_driver column.
wisdom-10 question Are Black drivers more likely to be arrested than White drivers?
wisdom-11 question We examine traffic stops across the United States and how race may vary in the arrest rate to improve police accountability and transparency. Using data from the Open Policing Project by Standford University, we attempt to see what the difference in arrest rate between Black and White drivers are while adjusting for other covariates.
justice-1 question Creating a population table and considering the role of the assumptions we make of validity, stability, representativeness, and unconfoudedness in our problem.
justice-2 question Validity is the consistency or lack of consistency among the columns in the Preceptor table and the data.
justice-3 question Validity might not hold for the race variable/column as the definition of race might vary for a lot of people and some people might be mixed or might not exactly know their race, resulting in a discrepancy between the preceptor and data table columns.
justice-4 question A Population table consists of unit/time combinations. Both the data and preceptor table are drawn from the population table.
justice-5 question Individual pulled over driver / year
justice-6 question The assumption of stability in data science is the consistency or lack thereof that the columns in the Population Table is the same for all the other rows consisting of the data, the Preceptor Table, and the larger population from which both are drawn.
justice-7 question The assumption of stability might not hold if due to the passage of time new apps or techonology came out that decreases risks of people being pulled over.
justice-8 question The assumption of representativeness consists of the relationships in the population table. The 1st relationship is the Preceptor table and other rows, and second is the data and other rows.
justice-9 question The assumption of representativeness might not be true involving the relationship between data and population if the data isn't representative of the popuation and did not employ a random sampling method.
justice-10 question The preceptor table and population relationship assumption might not be true because if the new laws or norms get more emphazized to handle certain races with more "care" in a certain area or zone compared to other areas, it wouldn't be representative of the population.
justice-11 question The assumption of unconfoundedness is the assumption that the potential outcome does not dictate the treatment being assigned.
justice-12 question library(tidymodels) ── Attaching packages ────────────────────────────────────────────── tidymodels 1.3.0 ── ✔ broom 1.0.8 ✔ recipes 1.3.1 ✔ dials 1.4.0 ✔ rsample 1.3.0 ✔ dplyr 1.1.4 ✔ tibble 3.3.0 ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1 ✔ infer 1.0.8 ✔ tune 1.3.0 ✔ modeldata 1.4.0 ✔ workflows 1.2.0 ✔ parsnip 1.3.2 ✔ workflowsets 1.1.1 ✔ purrr 1.1.0 ✔ yardstick 1.3.2 ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ── ✖ purrr::discard() masks scales::discard() ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ✖ recipes::step() masks stats::step() • Learn how to get started at https://www.tidymodels.org/start/ Warning messages: 1: package ‘dplyr’ was built under R version 4.5.1 2: package ‘purrr’ was built under R version 4.5.1 >
justice-13 question > library(broom) >
justice-14 question $$ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} $$ with \( Y \sim \text{Bernoulli}(\rho) \) where \( \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} \)
justice-15 question A potential weakness in our model is that the dataset removed almost 3.1 million entries from the real data, reducing it to only about 400,000 entries, this deletion of the entries may lead to a misrepresentation of the population, as the possibility exists that the remaining/current data is only from select areas with selected and specific conditions present, and could be from biased officers who are more likely to arrest drivers compared to other officers in the zone.
courage-1 question Creating a data generating mechanism
courage-2 exercise linear_reg(engine = "lm")
courage-3 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x)
courage-4 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x) |> tidy(conf.int = TRUE)
courage-5 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x)
courage-6 exercise linear_reg(engine = "lm") |> fit(arrested ~ race, data = x) |> tidy(conf.int = TRUE)
courage-7 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race, data = x) |> tidy(conf.int = TRUE)
courage-8 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race*zone, data = x) |> tidy(conf.int = TRUE)
courage-9 exercise fit_stops
courage-10 question > x <- stops |> + filter(race %in% c("black", "white")) |> + mutate(race = str_to_title(race), + sex = str_to_title(sex)) + + fit_stops <- linear_reg() |> + set_engine("lm") |> + fit(arrested ~ sex + race*zone, data = x) >
courage-11 question > library(easystats) # Attaching packages: easystats 0.7.5 ✔ bayestestR 0.16.1 ✔ correlation 0.8.8 ✔ datawizard 1.2.0 ✔ effectsize 1.0.1 ✔ insight 1.3.1 ✔ modelbased 0.12.0 ✔ performance 0.15.0 ✔ parameters 0.27.0 ✔ report 0.6.1 ✔ see 0.11.0 Warning message: package ‘easystats’ was built under R version 4.5.1 >
courage-12 question > check_predictions(extract_fit_engine(fit_stops)) >
courage-13 question $$ \widehat{\text{arrested}} = 0.177 + 0.0614 \cdot \text{sex}_{\text{Male}} - 0.0445 \cdot \text{race}_{\text{White}} + 0.0146 \cdot \text{zone}_{\text{B}} + 0.00610 \cdot \text{zone}_{\text{C}} + 0.0781 \cdot \text{zone}_{\text{D}} + 0.00190 \cdot \text{zone}_{\text{E}} - 0.00271 \cdot \text{zone}_{\text{F}} + 0.0309 \cdot \text{zone}_{\text{G}} + 0.0757 \cdot \text{zone}_{\text{H}} + \text{(interaction terms for race and zone)} $$
courage-14 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") #| cache: true x <- stops |> filter(race %in% c("black", "white")) |> mutate( race = str_to_title(race), sex = str_to_title(sex), arrested = factor(arrested, levels = c(0, 1)) # Convert to factor ) fit_stops <- logistic_reg() |> fit(arrested ~ sex + race + zone + sex:race, data = x) >
courage-15 question > tutorial.helpers::show_file(".gitignore") stops_files *_cache >
courage-16 exercise tidy(fit_stops, conf.int = TRUE)
courage-17 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") # Get tidy results with confidence intervals tidy_table <- tidy(fit_stops, conf.int = TRUE) %>% select(term, estimate, conf.low, conf.high) # Make a nice-looking table with gt tidy_table %>% gt() %>% tab_header( title = "Model Estimates with Confidence Intervals", subtitle = "Logistic regression on arrest probability" ) %>% fmt_number( columns = c(estimate, conf.low, conf.high), decimals = 3 ) %>% cols_label( term = "Variable", estimate = "Estimate", conf.low = "Lower CI", conf.high = "Upper CI" ) >
courage-18 question We model the probability of being arrested during a traffic stop (arrested vs. not arrested) as a logistic function of driver sex, race, and the zone of the stop, including interactions between race and zone.
temperance-1 question Temperance is using the data generating mechanism created in courage to answer our initially proposed question.
temperance-2 question Males typically get pulled over 0.06 more than the average of 0.177 of any other gender.
temperance-3 question White people are 0.04 less likely to get arrested than any other race
temperance-4 question The average chance of getting pulled over for a black woman
temperance-5 question > library(marginaleffects) Please cite the software developers who make your work possible. One package: citation("package_name") All project packages: softbib::softbib() >
temperance-6 question What is the difference in arrest rate between Black and White drivers adjusting for other variables?
temperance-7 question > predictions(fit_stops$fit) Estimate Pr(>|z|) S 2.5 % 97.5 % 0.182 <0.001 Inf 0.177 0.187 0.146 <0.001 Inf 0.141 0.152 0.245 <0.001 Inf 0.236 0.254 0.146 <0.001 Inf 0.141 0.152 0.257 <0.001 529.5 0.243 0.272 --- 378457 rows omitted. See ?print.marginaleffects --- 0.209 <0.001 Inf 0.203 0.215 0.275 <0.001 Inf 0.269 0.282 0.275 <0.001 Inf 0.269 0.282 0.275 <0.001 Inf 0.269 0.282 0.185 <0.001 Inf 0.176 0.194 Type: invlink(link) > The
temperance-8 question > plot_predictions(fit_stops$fit, by = "sex") >
temperance-9 question > plot_predictions(fit_stops$fit, condition = "sex") >
temperance-10 question > plot_predictions(fit_stops$fit, condition = c("sex", "race")) >
temperance-11 question plot_predictions(fit_stops$fit, condition = c("sex", "race")) + labs( title = "Predicted Arrest Probabilities by Sex and Race", subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes", caption = "Source: Open Policing Project, Stanford University", x = "Driver Sex", y = "Predicted Probability of Arrest" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), axis.title = element_text(face = "bold"), legend.title = element_blank() )
temperance-12 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") plot_predictions(fit_stops$fit, condition = c("sex", "race")) + labs( title = "Predicted Arrest Probabilities by Sex and Race", subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes", caption = "Source: Open Policing Project, Stanford University", x = "Driver Sex", y = "Predicted Probability of Arrest" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), axis.title = element_text(face = "bold"), legend.title = element_blank() )
temperance-13 question The predicted probability of arrest for Black male drivers is about 0.33 (95% CI: 0.32–0.34), which is roughly 6 percentage points higher than that for White male drivers in similar zones.
temperance-14 question The estimates may be inaccurate because the dataset excludes over three million stops, which could bias results if the removed cases differ systematically by race or sex. A more conservative alternative might place the probability for Black male drivers closer to 0.30 with a wider 95% CI of 0.28–0.33 to reflect this uncertainty.
temperance-15 question > tutorial.helpers::show_file("stops.qmd") --- title: "Stops" format: html author : "Jishnu Veerapaneni" execute: echo: false --- ```{r} #| message: false #| warning: false library(tidyverse) library(primer.data) library(tidymodels) library(broom) library(easystats) library(dplyr) library(broom) library(gt) # for pretty tables library(marginaleffects) ``` ```{r} #| cache: true x <- stops |> filter(race %in% c("black", "white")) |> mutate( race = str_to_title(race), sex = str_to_title(sex), arrested = factor(arrested, levels = c(0, 1)) # Convert to factor ) fit_stops <- logistic_reg() |> fit(arrested ~ sex + race + zone + sex:race, data = x) ``` ```{r} plot_predictions(fit_stops$fit, condition = c("sex", "race")) + labs( title = "Predicted Arrest Probabilities by Sex and Race", subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes", caption = "Source: Open Policing Project, Stanford University", x = "Driver Sex", y = "Predicted Probability of Arrest" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), axis.title = element_text(face = "bold"), legend.title = element_blank() ) ``` We examine traffic stops across the United States and how race may vary in the arrest rate to improve police accountability and transparency. Using data from the Open Policing Project by Standford University, we attempt to see what the difference in arrest rate between Black and White drivers are while adjusting for other covariates. A potential weakness in our model is that the dataset removed almost 3.1 million entries from the real data, reducing it to only about 400,000 entries, this deletion of the entries may lead to a misrepresentation of the population, as the possibility exists that the remaining/current data is only from select areas with selected and specific conditions present, and could be from biased officers who are more likely to arrest drivers compared to other officers in the zone. We model the probability of being arrested during a traffic stop (arrested vs. not arrested) as a logistic function of driver sex, race, and the zone of the stop, including interactions between race and zone. The predicted probability of arrest for Black male drivers is about 0.33 (95% CI: 0.32–0.34), which is nearly 6 percentage points higher than that for White male drivers in similar zones. This is my data generating mechanism : $$ \widehat{\text{arrested}} = 0.177 + 0.0614 \cdot \text{sex}_{\text{Male}} - 0.0445 \cdot \text{race}_{\text{White}} + 0.0146 \cdot \text{zone}_{\text{B}} + 0.00610 \cdot \text{zone}_{\text{C}} + 0.0781 \cdot \text{zone}_{\text{D}} + 0.00190 \cdot \text{zone}_{\text{E}} - 0.00271 \cdot \text{zone}_{\text{F}} + 0.0309 \cdot \text{zone}_{\text{G}} + 0.0757 \cdot \text{zone}_{\text{H}} + \text{(interaction terms for race and zone)} $$ ```{r} # Get tidy results with confidence intervals tidy_table <- tidy(fit_stops$fit, conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) # Make a nice-looking table with gt tidy_table %>% gt() %>% tab_header( title = "Model Estimates with Confidence Intervals", subtitle = "Logistic regression on arrest probability" ) %>% fmt_number( columns = c(estimate, conf.low, conf.high), decimals = 3 ) %>% cols_label( term = "Variable", estimate = "Estimate", conf.low = "Lower CI", conf.high = "Upper CI" ) ``` >
temperance-16 question https://jishnuvee.github.io/stops/
temperance-17 question https://github.com/jishnuvee/stops
minutes question 120