Motivation

We will recreate a study published in the American Journal of Public Health (AJPH) that set out “to examine whether stricter firearm legislation is associated with rates of fatal police shootings.”

The bulk of this code will consist of collecting, cleaning and manipulating data in order to control for potential covariates. We will also run a Poisson regression and tests of statistical significance.

The libraries used in this study are dplyr, rvest, ggplot2, zoo, openintro, readxl, data.table, mediation, MASS, ggrepel, and kableExtra. To run this code, please ensure you have these packages installed.

The learning objectives include data cleaning and manipulation, problem solving, scraping data from the web, cleaning text, merging data frames, Poisson regression and significance testing.

What is the data?

This study uses data from many different sources.

  1. The Brady Campaign State Scorecard 2015: numerical scores for firearm legislation in each state.
  2. The Counted: Persons killed by police in the US. The Counted project started because “the US government has no comprehensive record of the number of people killed by law enforcement.”
  3. US Census: Population characteristics by state.
  4. FBI’s Uniform Crime Report.
  5. Unemployment rate data.
  6. US Census 2010 Land Area.
  7. Education data for 2010 via the US Census education table.
  8. “Household firearm ownership rates, represented as the percentage of firearm suicides to all suicides.” Downloaded from the CDC’s Web-Based Injury Statistics Query and Reporting System.

Data import

Since this is the first R code chunk, we load all the necessary libraries.
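For reference, a setup chunk that loads the packages listed above might look like this:

```r
# load the packages used throughout this analysis
library(dplyr)
library(rvest)
library(ggplot2)
library(zoo)
library(openintro)
library(readxl)
library(data.table)
library(mediation)
library(MASS)
library(ggrepel)
library(kableExtra)
```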

We import the Census population characteristics data, Brady Campaign Scorecard (2015), “The Counted” data for 2015 & 2016, FBI’s crime report for 2015, land area by state from the Census, and suicide rate data. These files were downloaded from their respective websites listed in the above section “What is the data?”
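A rough sketch of the imports is below; the file names and paths are placeholders for the downloads described in “What is the data?”, not the exact files used here.

```r
# file names below are placeholders; adjust paths/sheets to match your downloads
census          <- read.csv("data/sc-est2017-alldata6.csv")              # Census population characteristics
brady           <- read_excel("data/Brady-State-Scorecard-2015.xlsx")    # Brady Campaign scorecard
counted15       <- read.csv("data/the-counted-2015.csv")                 # The Counted, 2015
counted16       <- read.csv("data/the-counted-2016.csv")                 # The Counted, 2016
crime           <- read_excel("data/fbi_table_5_crime_2015.xls")         # FBI Uniform Crime Report
landSqMi        <- read_excel("data/LND01.xls")                          # Census land area
suicide_all     <- read.csv("data/suicide_all.csv")                      # all suicides (CDC WISQARS)
suicide_firearm <- read.csv("data/suicide_firearm.csv")                  # firearm suicides (CDC WISQARS)
```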

##   SUMLEV REGION DIVISION STATE    NAME
## 1     40      3        6     1 Alabama
## 2     40      3        6     1 Alabama
## 3     40      3        6     1 Alabama
## 4     40      3        6     1 Alabama
## 5     40      3        6     1 Alabama
## 6     40      3        6     1 Alabama
## # A tibble: 6 x 5
##   `States can receive a m… `Category Points` `Sub Category P… Points AL   
##   <chr>                                <dbl>            <dbl>  <dbl> <chr>
## 1 TOTAL STATE POINTS                      NA               NA     NA -18  
## 2 CATEGORY 1:  KEEPING GU…                50               NA     NA <NA> 
## 3 BACKGROUND CHECKS TO ST…                NA               25     NA AL   
## 4 Background Checks on Al…                NA               NA     25 <NA> 
## 5 Background Checks on Al…                NA               NA     20 <NA> 
## 6 Background Checks on Al…                NA               NA      5 <NA>
##   gender          raceethnicity state    classification
## 1   Male                  Black    GA  Death in custody
## 2   Male                  White    OR           Gunshot
## 3   Male                  White    HI Struck by vehicle
## 4   Male        Hispanic/Latino    KS           Gunshot
## 5   Male Asian/Pacific Islander    WA           Gunshot
## 6   Male                  White    CA           Gunshot
##                 lawenforcementagency              armed
## 1    Chatham County Sheriff's Office                 No
## 2 Washington County Sheriff's Office            Firearm
## 3            Kauai Police Department                 No
## 4          Wichita Police Department                 No
## 5      Mason County Sheriff's Office            Firearm
## 6    San Francisco Police Department Non-lethal firearm
## # A tibble: 6 x 5
##   State  Area                X__1          Population     `Violent\ncrime…
##   <chr>  <chr>               <chr>         <chr>                     <dbl>
## 1 ALABA… Metropolitan Stati… <NA>          3708033                      NA
## 2 <NA>   <NA>                Area actuall… 0.97099999999…            18122
## 3 <NA>   <NA>                Estimated to… 1                         18500
## 4 <NA>   Cities outside met… <NA>          522241                       NA
## 5 <NA>   <NA>                Area actuall… 0.97399999999…             3178
## 6 <NA>   <NA>                Estimated to… 1                          3240
## # A tibble: 6 x 5
##   Areaname      STCOU LND010190F LND010190D LND010190N1
##   <chr>         <chr>      <dbl>      <dbl> <chr>      
## 1 UNITED STATES 00000          0   3787425. 0000       
## 2 ALABAMA       01000          0     52423. 0000       
## 3 Autauga, AL   01001          0       604. 0000       
## 4 Baldwin, AL   01003          0      2027. 0000       
## 5 Barbour, AL   01005          0       905. 0000       
## 6 Bibb, AL      01007          0       626. 0000
##          Sex      Race      State Ethnicity Age.Group
## 1 Both Sexes All Races    Alabama      Both  All Ages
## 2 Both Sexes All Races     Alaska      Both  All Ages
## 3 Both Sexes All Races    Arizona      Both  All Ages
## 4 Both Sexes All Races   Arkansas      Both  All Ages
## 5 Both Sexes All Races California      Both  All Ages
## 6 Both Sexes All Races   Colorado      Both  All Ages

Data wrangling

1. Census Data

1.1 Percent Male, White, Black, Hispanic

The following is taken from the document sc-est2017-alldata6.pdf:

  • The key for SEX is as follows:
    • 0 = Total
    • 1 = Male
    • 2 = Female
  • The key for ORIGIN is as follows:
    • 0 = Total
    • 1 = Not Hispanic
    • 2 = Hispanic
  • The key for RACE is as follows:
    • 1 = White Alone
    • 2 = Black or African American Alone
    • 3 = American Indian and Alaska Native Alone
    • 4 = Asian Alone
    • 5 = Native Hawaiian and Other Pacific Islander Alone
    • 6 = Two or more races

Here we use the dplyr package to filter, group_by, and summarize the data in order to calculate the necessary statistics. For each state, we sum the rows of the column POPESTIMATE2015, since we are looking at the year 2015. Setting ORIGIN or SEX equal to 0 ensures we don’t double count, since 0 is the key for the combined total (Hispanic plus non-Hispanic residents, and male plus female residents). We group by state since all data in this study should be at the state level.
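A minimal sketch of this step (assuming the Census file was read in as census) might be:

```r
# percent white and black: totals only (ORIGIN == 0, SEX == 0) so nothing is double counted
census_stats <- census %>%
  filter(ORIGIN == 0, SEX == 0) %>%
  group_by(NAME) %>%
  summarize(total_pop = sum(POPESTIMATE2015),
            white     = 100 * sum(POPESTIMATE2015[RACE == 1]) / total_pop,
            black     = 100 * sum(POPESTIMATE2015[RACE == 2]) / total_pop)

# percent Hispanic (ORIGIN == 2) and percent male (SEX == 1), computed the same way
hispanic <- census %>%
  filter(SEX == 0) %>%
  group_by(NAME) %>%
  summarize(hispanic = 100 * sum(POPESTIMATE2015[ORIGIN == 2]) / sum(POPESTIMATE2015[ORIGIN == 0]))

male <- census %>%
  filter(ORIGIN == 0) %>%
  group_by(NAME) %>%
  summarize(male = 100 * sum(POPESTIMATE2015[SEX == 1]) / sum(POPESTIMATE2015[SEX == 0]))

census_stats <- census_stats %>%
  left_join(hispanic, by = "NAME") %>%
  left_join(male, by = "NAME") %>%
  mutate(NAME = tolower(NAME))
```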

## # A tibble: 6 x 6
##   NAME       white black hispanic  male total_pop
##   <chr>      <dbl> <dbl>    <dbl> <dbl>     <int>
## 1 alabama     69.5 26.7      4.13  48.5   4850858
## 2 alaska      66.5  3.67     6.82  52.4    737979
## 3 arizona     83.5  4.80    30.9   49.7   6802262
## 4 arkansas    79.6 15.7      7.18  49.1   2975626
## 5 california  73.0  6.49    38.7   49.7  39032444
## 6 colorado    87.6  4.47    21.3   50.3   5440445

1.2 Median Age

First we need to get the total counts for each age (per state), since the data are divided up into all the other categories. We set ORIGIN and SEX to “Total” so we don’t have repeats in the data, then group by state and age, since these are the two factors we need to keep separated. The column sum_ages is the total number of residents of each age in each specified state.
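A sketch of the age totals, again assuming the Census data frame is called census:

```r
age_totals <- census %>%
  filter(ORIGIN == 0, SEX == 0) %>%   # totals only, to avoid repeats
  group_by(NAME, AGE) %>%
  summarize(sum_ages = sum(POPESTIMATE2015))
```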

## # A tibble: 6 x 3
## # Groups:   NAME [1]
##   NAME      AGE sum_ages
##   <chr>   <int>    <int>
## 1 Alabama     0    59080
## 2 Alabama     1    58738
## 3 Alabama     2    57957
## 4 Alabama     3    58800
## 5 Alabama     4    59329
## 6 Alabama     5    59610

We begin to see that finding the median is tricky, since we don’t have a flat list of ages (i.e., [22, 55, 4, 27, …, 35]) for each state, in which case we could probably use a built-in R function. Instead we have the number of people at each age from 0 to 85+ in each state. For example, the first row of data above says that in Alabama (in 2015), there were 59080 residents who were 0 years old.

To solve this problem we first transform the data frame so that each state is a column name with the sum_ages data underneath. In other words, we want to reshape our (51×86) by 3 data frame into an 86 by 51 data frame, with one row per age and one column per state. We remove the age column since we can use the row index instead (index minus 1, since R data frames are indexed from 1 and the ages start at 0).
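One way to do this reshaping is with data.table’s dcast; a sketch, using the age_totals object from the step above:

```r
age_wide <- dcast(as.data.table(age_totals), AGE ~ NAME, value.var = "sum_ages")
age_wide[, AGE := NULL]   # drop AGE; use (row index - 1) as the age instead
```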

##    Alabama Alaska Arizona Arkansas California
## 1:   59080  11253   86653    38453     500834
## 2:   58738  11109   86758    38005     499070
## 3:   57957  11009   86713    37711     499614
## 4:   58800  10756   86914    38381     498536
## 5:   59329  10895   87624    38443     510026
## 6:   59610  10537   87234    38582     498754

Now that we’ve made the data easier to work with, we need to find a way to get the median. One method is to take the cumulative sum of each column and then divide all the rows by the last row in each respective column, calculating a percentile/quantile for each age. Then, we find the age in each state where the cumulative population share exceeds the 50th percentile (the median!) for the first time.
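A sketch of the cumulative-sum approach, followed by attaching the resulting median ages to our running statistics:

```r
# cumulative population share by age, one column per state
age_percentiles <- apply(age_wide, 2, function(x) cumsum(x) / sum(x))

# first age at which the cumulative share reaches 50% (subtract 1 because ages start at 0)
median_age <- apply(age_percentiles, 2, function(x) which(x >= 0.5)[1] - 1)

census_stats$age <- median_age[match(census_stats$NAME, tolower(names(median_age)))]
```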

##      Alabama     Alaska    Arizona   Arkansas California   Colorado
## 1 0.01217929 0.01524840 0.01273885 0.01292266 0.01283122 0.01217217
## 2 0.02428807 0.03030168 0.02549314 0.02569476 0.02561725 0.02440058
## 3 0.03623586 0.04521944 0.03824081 0.03836806 0.03841722 0.03655841
## 4 0.04835742 0.05979438 0.05101803 0.05126652 0.05118957 0.04888552
## 5 0.06058804 0.07455768 0.06389963 0.06418582 0.06425629 0.06146593
## 6 0.07287659 0.08883586 0.07672389 0.07715183 0.07703422 0.07429999
## # A tibble: 6 x 7
##   NAME       white black hispanic  male total_pop   age
##   <chr>      <dbl> <dbl>    <dbl> <dbl>     <int> <dbl>
## 1 alabama     69.5 26.7      4.13  48.5   4850858    38
## 2 alaska      66.5  3.67     6.82  52.4    737979    33
## 3 arizona     83.5  4.80    30.9   49.7   6802262    37
## 4 arkansas    79.6 15.7      7.18  49.1   2975626    37
## 5 california  73.0  6.49    38.7   49.7  39032444    36
## 6 colorado    87.6  4.47    21.3   50.3   5440445    36

2. Violent Crime

First we print the names of the columns to see that there are some extraneous characters not visible in the dataframe; we therefore use column indices to select columns instead of the complicated names. We also print a specific row of the violent crime data to observe the X__1 group we are looking for: Rate per 100,000 inhabitants (per the study).
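A quick way to inspect this (the row index below is only illustrative):

```r
names(crime)          # several column names contain embedded "\n" characters
crime[5, c(1, 3, 5)]  # peek at a row to locate the "Rate per 100,000 inhabitants" entries
```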

##  [1] "State"                                   
##  [2] "Area"                                    
##  [3] "X__1"                                    
##  [4] "Population"                              
##  [5] "Violent\ncrime1"                         
##  [6] "Murder and \nnonnegligent \nmanslaughter"
##  [7] "Rape\n(revised\ndefinition)2"            
##  [8] "Rape\n(legacy\ndefinition)3"             
##  [9] "Robbery"                                 
## [10] "Aggravated \nassault"                    
## [11] "Property \ncrime"                        
## [12] "Burglary"                                
## [13] "Larceny-\ntheft"                         
## [14] "Motor \nvehicle \ntheft"
## # A tibble: 1 x 3
##   State X__1                         `Violent\ncrime1`
##   <chr> <chr>                                    <dbl>
## 1 <NA>  Rate per 100,000 inhabitants              472.

We need to get all rows where X__1 == Rate per 100,000 inhabitants. However, as we can see above, the value of State in these rows is <NA>, so we first fill that value with the state name listed in a previous row. Then we can select the rows where X__1 == Rate per 100,000 inhabitants. After that, we no longer need the column X__1, so we can remove it.
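A sketch using zoo::na.locf to carry the state names forward:

```r
crime_rates <- crime %>%
  mutate(State = zoo::na.locf(State)) %>%               # fill missing state names downward
  filter(X__1 == "Rate per 100,000 inhabitants") %>%
  select(State, violent_crime = 5)                      # column 5 holds the violent crime rate
```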

## # A tibble: 1 x 3
##   State   X__1                         `Violent\ncrime1`
##   <chr>   <chr>                                    <dbl>
## 1 ALABAMA Rate per 100,000 inhabitants              472.
## # A tibble: 6 x 2
##   State      violent_crime
##   <chr>              <dbl>
## 1 ALABAMA             472.
## 2 ALASKA              730.
## 3 ARIZONA             410.
## 4 ARKANSAS            521.
## 5 CALIFORNIA          426.
## 6 COLORADO            321

Looking at our whole dataframe we can see some of the state names have numbers in them (an example is printed below). This will make it hard to merge data together, so we should clean this up. We also make the state names all lowercase for ease of merging.
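A sketch of the cleanup:

```r
# strip footnote digits from the state names and lowercase them for merging
crime_rates$State <- tolower(gsub("[0-9]+", "", crime_rates$State))
```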

## [1] "MAINE6"
## [1] "maine"

We start accumulating our data into one dataframe, p_df.
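For example:

```r
# combine the census statistics and violent crime rates by state name
p_df <- merge(census_stats, crime_rates, by.x = "NAME", by.y = "State")
```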

##         NAME    white     black  hispanic     male total_pop age
## 1    alabama 69.50197 26.745949  4.129434 48.46650   4850858  38
## 2     alaska 66.51368  3.667991  6.821197 52.36978    737979  33
## 3    arizona 83.52295  4.797801 30.873010 49.71595   6802262  37
## 4   arkansas 79.57623 15.663427  7.180439 49.12855   2975626  37
## 5 california 72.97555  6.491044 38.727129 49.67815  39032444  36
## 6   colorado 87.64728  4.469781 21.266735 50.25140   5440445  36
##   violent_crime
## 1         472.4
## 2         730.2
## 3         410.2
## 4         521.3
## 5         426.3
## 6         321.0

3. Brady Scores

The AJPH study groups the scores into 7 categories. The study removed all weightings of the different laws in favor of a “1 law 1 point” system, since the weightings were “somewhat arbitrary.”

For the purpose of practice and simplification, we just keep the first line (“total state points”) of the Brady Scorecard as given; we note that this is where our analysis differs from the study.

We need to transform the data frame so that we have a column of state names and a column of the corresponding total scores.

The scores are characters from the data import, so we change them to numeric.
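A base-R sketch of both steps (assuming the scorecard was read in as brady):

```r
total_row <- brady[1, -(2:4)]     # the "TOTAL STATE POINTS" row, minus the points columns
brady_scores <- data.frame(
  state            = names(total_row)[-1],
  brady_scores     = as.numeric(unlist(total_row[-1])),
  stringsAsFactors = FALSE
)
```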

## # A tibble: 6 x 54
##   `States can rec… `Category Point… `Sub Category P… Points AL    AK   
##   <chr>                       <dbl>            <dbl>  <dbl> <chr> <chr>
## 1 TOTAL STATE POI…               NA               NA     NA -18   -30  
## 2 CATEGORY 1:  KE…               50               NA     NA <NA>  <NA> 
## 3 BACKGROUND CHEC…               NA               25     NA AL    AK   
## 4 Background Chec…               NA               NA     25 <NA>  <NA> 
## 5 Background Chec…               NA               NA     20 <NA>  <NA> 
## 6 Background Chec…               NA               NA      5 <NA>  <NA> 
## # ... with 48 more variables: AR <chr>, AZ <chr>, CA <chr>, CO <chr>,
## #   CT <chr>, DE <chr>, FL <chr>, GA <chr>, HI <chr>, ID <chr>, IL <chr>,
## #   IN <chr>, IA <chr>, KS <chr>, KY <chr>, LA <chr>, MA <chr>, MD <chr>,
## #   ME <chr>, MI <chr>, MN <chr>, MO <chr>, MT <chr>, MS <chr>, NC <chr>,
## #   ND <chr>, NE <chr>, NH <chr>, NJ <chr>, NM <chr>, NV <chr>, NY <chr>,
## #   OK <chr>, OH <chr>, OR <chr>, PA <chr>, RI <chr>, SC <chr>, SD <chr>,
## #   TN <chr>, TX <chr>, UT <chr>, VA <chr>, VT <chr>, WA <chr>, WI <chr>,
## #   WV <chr>, WY <chr>
## # A tibble: 1 x 54
##   Law   `Category Point… `Sub Category P… Points AL    AK    AR    AZ   
##   <chr>            <dbl>            <dbl>  <dbl> <chr> <chr> <chr> <chr>
## 1 TOTA…               NA               NA     NA -18   -30   -24   -39  
## # ... with 46 more variables: CA <chr>, CO <chr>, CT <chr>, DE <chr>,
## #   FL <chr>, GA <chr>, HI <chr>, ID <chr>, IL <chr>, IN <chr>, IA <chr>,
## #   KS <chr>, KY <chr>, LA <chr>, MA <chr>, MD <chr>, ME <chr>, MI <chr>,
## #   MN <chr>, MO <chr>, MT <chr>, MS <chr>, NC <chr>, ND <chr>, NE <chr>,
## #   NH <chr>, NJ <chr>, NM <chr>, NV <chr>, NY <chr>, OK <chr>, OH <chr>,
## #   OR <chr>, PA <chr>, RI <chr>, SC <chr>, SD <chr>, TN <chr>, TX <chr>,
## #   UT <chr>, VA <chr>, VT <chr>, WA <chr>, WI <chr>, WV <chr>, WY <chr>
## # A tibble: 6 x 2
##   state brady_scores
##   <chr>        <dbl>
## 1 AL             -18
## 2 AK             -30
## 3 AR             -24
## 4 AZ             -39
## 5 CA              76
## 6 CO              22

This dataframe’s list of states is in two-letter state abbreviations. Since this is not consistent with the previous data files, we should convert it to make merging possible. We use the openintro library to do the simple conversion.
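A sketch using openintro’s abbr2state():

```r
brady_scores$state <- tolower(abbr2state(brady_scores$state))
p_df <- merge(p_df, brady_scores, by.x = "NAME", by.y = "state")
```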

## # A tibble: 6 x 2
##   state      brady_scores
##   <chr>             <dbl>
## 1 alabama             -18
## 2 alaska              -30
## 3 arkansas            -24
## 4 arizona             -39
## 5 california           76
## 6 colorado             22
##         NAME    white     black  hispanic     male total_pop age
## 1    alabama 69.50197 26.745949  4.129434 48.46650   4850858  38
## 2     alaska 66.51368  3.667991  6.821197 52.36978    737979  33
## 3    arizona 83.52295  4.797801 30.873010 49.71595   6802262  37
## 4   arkansas 79.57623 15.663427  7.180439 49.12855   2975626  37
## 5 california 72.97555  6.491044 38.727129 49.67815  39032444  36
## 6   colorado 87.64728  4.469781 21.266735 50.25140   5440445  36
##   violent_crime brady_scores
## 1         472.4          -18
## 2         730.2          -30
## 3         410.2          -39
## 4         521.3          -24
## 5         426.3           76
## 6         321.0           22

4. The Counted Fatal Shootings

First we simply concatenate the two years of data.
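For example:

```r
# stack the 2015 and 2016 files (assumed to share the same columns)
counted <- rbind(counted15, counted16)
```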

##   gender          raceethnicity state    classification
## 1   Male                  Black    GA  Death in custody
## 2   Male                  White    OR           Gunshot
## 3   Male                  White    HI Struck by vehicle
## 4   Male        Hispanic/Latino    KS           Gunshot
## 5   Male Asian/Pacific Islander    WA           Gunshot
## 6   Male                  White    CA           Gunshot
##                 lawenforcementagency              armed
## 1    Chatham County Sheriff's Office                 No
## 2 Washington County Sheriff's Office            Firearm
## 3            Kauai Police Department                 No
## 4          Wichita Police Department                 No
## 5      Mason County Sheriff's Office            Firearm
## 6    San Francisco Police Department Non-lethal firearm

Again, this data frame’s list of states is in two-letter state abbreviations.

In the study, researchers “calculated descriptive statistics for the proportion of victims that were male, armed, and non-White.” We repeat this here with dplyr; tally() is used to count the number of observations in each group.

We also calculate the annualized rate per 1,000,000 residents. Note that the tally is the total shootings per state over 2 years.
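A sketch of the per-state tallies and rates follows; the gunshot_filtered column shown in the output applies an additional victim filter that is not reproduced here.

```r
counted_stats <- counted %>%
  mutate(state = tolower(abbr2state(state))) %>%
  filter(classification == "Gunshot") %>%
  group_by(state) %>%
  summarize(gunshot_tally = n()) %>%
  left_join(select(p_df, NAME, total_pop), by = c("state" = "NAME")) %>%
  mutate(gunshot_rate = gunshot_tally / 2 / total_pop * 1e6) %>%   # annualized rate per 1,000,000
  select(-total_pop)
```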

## # A tibble: 6 x 4
##   state      gunshot_tally gunshot_filtered gunshot_rate
##   <chr>              <int>            <int>        <dbl>
## 1 alabama               43               37         4.43
## 2 alaska                12               11         8.13
## 3 arizona               92               76         6.76
## 4 arkansas              23               20         3.86
## 5 california           341              274         4.37
## 6 colorado              60               56         5.51

5. Unemployment Data

This data is available online from the BLS, but there is no easy download of the table. It is also difficult to simply copy and paste, since it doesn’t hold its table format. Thus we set up a web scraper to get the data.

To do this, we must use the rvest package. To view the HTML of a webpage, right-click and select “View page source.”

Then we get the values from each column of the data table. html_nodes selects page elements matching a CSS selector. The sub0 class returns the state names, and the datavalue class corresponds to the values of both columns, Unemployment Rank and Rate. Since there is no differentiation of the two columns and the “datavalues” alternate, we select alternating subsets of this vector to form the two separate columns “rank” and “rate.”
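A sketch of the scraper; the BLS URL is an assumption, and the two alternating subsets may need to be swapped depending on which column the page lists first:

```r
page   <- read_html("https://www.bls.gov/web/laus/laumstrk.htm")   # URL is an assumption
states <- page %>% html_nodes(".sub0") %>% html_text()
values <- page %>% html_nodes(".datavalue") %>% html_text()

unemployment <- data.frame(
  state             = states,
  unemployment_rank = as.numeric(values[seq(1, length(values), by = 2)]),
  unemployment_rate = as.numeric(values[seq(2, length(values), by = 2)]),
  stringsAsFactors  = FALSE
)
```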

##        state unemployment_rate unemployment_rank
## 1    Alabama               6.1                42
## 2     Alaska               6.5                47
## 3    Arizona               6.1                42
## 4   Arkansas               5.0                24
## 5 California               6.2                44
## 6   Colorado               3.9                10

6. Population Density 2015

To find population density manually, we take the total 2015 population from the Census data and import land area, then divide. We calculate this ourselves because “good” data for state population in 2015 was not available in a downloadable format and/or was not easy to scrape.

We select LND110210D by looking at the table and comparing values on other sites (such as the census or Wikipedia) to find the correct column. This column corresponds to land area in square miles.

## # A tibble: 6 x 2
##   Areaname      LND110210D
##   <chr>              <dbl>
## 1 UNITED STATES   3531905.
## 2 ALABAMA           50645.
## 3 Autauga, AL         594.
## 4 Baldwin, AL        1590.
## 5 Barbour, AL         885.
## 6 Bibb, AL            623.

Since landSqMi gives us the area for each county in addition to the states, we merge on the state names to obtain only the area for each state.

We convert the state names to be all lowercase since the row values must match to merge successfully.
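A sketch of the merge and density calculation:

```r
landSqMi <- landSqMi %>%
  select(NAME = Areaname, LND110210D) %>%
  mutate(NAME = tolower(NAME))

p_df <- p_df %>%
  left_join(landSqMi, by = "NAME") %>%       # only the state rows match p_df's NAME values
  mutate(density = total_pop / LND110210D) %>%
  select(-LND110210D)
```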

##         NAME    density
## 1    alabama  95.780954
## 2     alaska   1.293246
## 3    arizona  59.882188
## 4   arkansas  57.184559
## 5 california 250.562585
## 6   colorado  52.492723

8. Firearm Ownership

We simply calculate firearm ownership as the percentage of suicides committed with a firearm (firearm suicides divided by all suicides).
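A sketch, assuming suicide_all and suicide_firearm each contain one row per state with a Deaths column (the exact column names are assumptions):

```r
ownership_df <- data.frame(
  NAME      = tolower(suicide_all$State),
  ownership = 100 * suicide_firearm$Deaths / suicide_all$Deaths
)
p_df <- merge(p_df, ownership_df, by = "NAME")
```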

##         NAME ownership
## 1    alabama  70.09103
## 2     alaska  59.89848
## 3    arizona  57.40086
## 4   arkansas  59.45230
## 5 california  37.27692
## 6   colorado  51.03936

Exploratory data analysis

1. Poisson fit

First let’s determine if the fatal police shooting data is Poisson distributed. This is possible because the shootings are discrete events over a fixed interval of time.

We take the mean of the fatal shootings to get our parameter \(\lambda\). \(\lambda\) is the expected value and variance of fatal police shootings in each state for the years 2015-16 if they are Poisson distributed.

## [1] 40.36

Now that we’ve calculated the expected number of shootings, we can use Pearson’s chi-squared test (goodness of fit) to see if the data does indeed follow a Poisson distribution. Our test statistic is \[\chi^2 = \sum_{i=1}^{n} \frac{(O_{i} - E_{i})^{2}}{E_{i}}\] where \(O_{i}\) denotes the observed value in state \(i\) and \(E_{i}\) denotes the expected value in state \(i\) if the data is Poisson distributed.
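A sketch of the calculation (using one degree of freedom fewer than the number of states):

```r
observed <- p_df$gunshot_tally
expected <- mean(observed)                            # lambda, the expected count under Poisson
chi_sq   <- sum((observed - expected)^2 / expected)
chi_sq
pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)
```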

## [1] 3581.108
## [1] 0

Our resulting test statistic is large so our p-value is very small (so small that it rounds to 0), indicating that we should reject our null hypothesis that the data is Poisson distributed.

Let’s repeat this analysis using R’s built-in glm function, for the null model m0, the simplest model with the Brady scores as the only explanatory variable m1, and the full model mv with multiple variables. We can simply square and sum the Pearson residuals, \[r_{i} = \frac{O_{i} - E_{i}}{\sqrt{E_{i}}}\] to get Pearson’s chi-squared statistic as before.

First we fit the null model as we did previously.
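A sketch of the null fit; the intercept-only Poisson model’s fitted values equal the mean count, so its squared Pearson residuals reproduce the hand-computed statistic:

```r
m0 <- glm(gunshot_tally ~ 1, family = "poisson", data = p_df)
pearson_m0 <- sum(residuals(m0, type = "pearson")^2)
pearson_m0
pchisq(pearson_m0, df = m0$df.residual, lower.tail = FALSE)
```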

## [1] 3581.108
## [1] 0

These results reflect our previous results.

Next we fit the simplest model, using the brady scores as our explanatory variable \(X\), followed by the multivariate model with statewise sociodemographic characteristics included.
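A sketch of the two fits; m1 matches the model summarized later, while the covariate list for mv is an assumption here:

```r
m1 <- glm(gunshot_tally ~ brady_scores + offset(log(total_pop)),
          family = "poisson", data = p_df)
sum(residuals(m1, type = "pearson")^2)

mv <- glm(gunshot_tally ~ brady_scores + age + violent_crime + ownership +
            offset(log(total_pop)),
          family = "poisson", data = p_df)
sum(residuals(mv, type = "pearson")^2)
```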

## [1] 427.6651
## [1] 2.301701e-62
## [1] 186.6729
## [1] 3.048676e-18

These results show that we still reject our null hypothesis that the data follows a Poisson distribution, even when incorporating other variables. However we note that our test statistic gets smaller as we add more variables to the model.

We plot a histogram of the count data compared to a line graph of the Poisson distribution with the same \(\lambda\) (average number of shootings) to visualize how the data deviates from Poisson.
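A sketch of the comparison plot:

```r
lambda  <- mean(p_df$gunshot_tally)
pois_df <- data.frame(count = 0:max(p_df$gunshot_tally))
pois_df$expected <- dpois(pois_df$count, lambda) * nrow(p_df) * 10   # scale density to a binwidth of 10

ggplot(p_df, aes(x = gunshot_tally)) +
  geom_histogram(binwidth = 10) +
  geom_line(data = pois_df, aes(x = count, y = expected), color = "red") +
  labs(x = "Fatal police shootings per state, 2015-16", y = "Number of states")
```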

We observe that the data is quite skewed and does not fall tightly under the curve which may be contributing to the high values of our Pearson’s chi-squared test statistic.

Next we create a qqplot to again see if the data deviates from a Poisson distribution. A “good” qqplot would look like a relatively straight diagonal line of data points, showing that the data matches the distribution.
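A sketch of the Q-Q plot against theoretical Poisson quantiles:

```r
qqplot(qpois(ppoints(nrow(p_df)), lambda = mean(p_df$gunshot_tally)),
       p_df$gunshot_tally,
       xlab = "Theoretical Poisson quantiles", ylab = "Observed shooting counts")
abline(0, 1)
```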

Let us consider using a negative binomial model to account for overdispersion, which occurs when the variance of the data is larger than the mean. Poisson models assume that the mean equals the variance, hence our model summary states “Dispersion parameter for poisson family taken to be 1.”

## 
## Call:
## glm(formula = gunshot_tally ~ brady_scores + offset(log(total_pop)), 
##     family = "poisson", data = p_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9878  -1.8314  -0.2442   0.8135   9.3783  
## 
## Coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  -1.193e+01  2.333e-02 -511.215  < 2e-16 ***
## brady_scores -3.643e-03  6.263e-04   -5.816 6.01e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 430.97  on 49  degrees of freedom
## Residual deviance: 396.07  on 48  degrees of freedom
## AIC: 650.68
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm.nb(formula = gunshot_tally ~ brady_scores + offset(log(total_pop)), 
##     data = p_df, init.theta = 6.593063775, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6937  -0.8066  -0.2271   0.3542   2.5874  
## 
## Coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  -11.888623   0.063627 -186.849  < 2e-16 ***
## brady_scores  -0.008368   0.001989   -4.207 2.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(6.5931) family taken to be 1)
## 
##     Null deviance: 67.922  on 49  degrees of freedom
## Residual deviance: 49.465  on 48  degrees of freedom
## AIC: 388.16
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  6.59 
##           Std. Err.:  1.63 
## 
##  2 x log-likelihood:  -382.156

We observe that the standard errors increase for the negative binomial.

We perform the equivalent Likelihood Ratio Test comparing the Poisson model to the negative binomial model to see if relaxing the mean = variance assumption gives a better fit.
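A sketch of the test, calling the negative binomial fit m1_nb (a name assumed here):

```r
lrt_stat <- as.numeric(2 * (logLik(m1_nb) - logLik(m1)))   # 1 df for the extra dispersion parameter
p_val    <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
paste0("LRT test statistic = ", round(lrt_stat, 2), ", df = 1, p-value = ", round(p_val, 2))
```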

## [1] "LRT test statistic = 264.52, df = 1, p-value = 0"

Our test statistic is large and our p-value small (rounds to 0), indicating that we should reject the null hypothesis and conclude that a negative binomial model is better than Poisson.

We can also look at a quasi-Poisson model, which retains the Poisson distribution assumption but removes the mean = variance assumption.

## 
## Call:
## glm(formula = gunshot_tally ~ brady_scores + offset(log(total_pop)), 
##     family = quasipoisson(link = "log"), data = p_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9878  -1.8314  -0.2442   0.8135   9.3783  
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  -11.925886   0.069634 -171.264   <2e-16 ***
## brady_scores  -0.003643   0.001870   -1.949   0.0572 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 8.909899)
## 
##     Null deviance: 430.97  on 49  degrees of freedom
## Residual deviance: 396.07  on 48  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

We see in these results that our Brady coefficient is no longer significant at the 0.05 level, and note that the dispersion parameter of 8.909899 is far greater than 1.

Thus we continue our analysis using the negative binomial regression model.

2. Selection of variables

In the study, “We employed a 2-step approach used in previous firearm research to select covariates for the adjusted regression models. First, we identified variables correlated (Spearman’s r) with the outcome at 0.300 or greater. Then, to address concerns with collinearity, we excluded those covariates that were highly correlated with other covariates.”

Thus we calculate Spearman’s \(\rho\) for the fatal police shooting rate in each state with all other relevant variables.
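A sketch of the loop, assuming all of the wrangled variables (including the shooting rate) have been merged into p_df:

```r
covariates <- c("white", "black", "hispanic", "male", "age", "violent_crime",
                "unemployment_rate", "density", "ownership")
for (v in covariates) {
  r <- cor(p_df[[v]], p_df$gunshot_rate, method = "spearman")
  print(paste(v, "correlation with fatal shooting rate:", round(r, 3)))
}
```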

## [1] "white correlation with fatal shooting rate: -0.035"
## [1] "black correlation with fatal shooting rate: -0.183"
## [1] "hispanic correlation with fatal shooting rate: 0.116"
## [1] "male correlation with fatal shooting rate: 0.421"
## [1] "age correlation with fatal shooting rate: -0.432"
## [1] "violent_crime correlation with fatal shooting rate: 0.445"
## [1] "unemployment_rate correlation with fatal shooting rate: 0.241"
## [1] "density correlation with fatal shooting rate: -0.58"
## [1] "ownership correlation with fatal shooting rate: 0.547"

We see that the data for percent of white residents, percent of black residents, percent of Hispanic residents, and unemployment rate do not have a correlation with the outcome stronger than 0.3 in absolute value. We leave these variables out of the rest of our analysis.

Now let’s look at potential covariates by finding the correlation between each of the remaining variables.

## [1] "male correlation with age: -0.497"
## [1] "male correlation with density: -0.745"
## [1] "age correlation with male: -0.497"
## [1] "age correlation with density: 0.505"
## [1] "density correlation with male: -0.745"
## [1] "density correlation with age: 0.505"
## [1] "density correlation with ownership: -0.423"
## [1] "ownership correlation with density: -0.423"

It seems as though median age & percentage of male residents, percentage of male residents & population density, median age & population density, and gun ownership & population density are all highly correlated. Thus, we keep age and gun ownership to be consistent with the study.

The study also kept the violent crime rate data, which we see has a high correlation with the fatal police shooting rate and does not have a high correlation with the other variables so we will include this in our multivariate model as well.

Data analysis

1. Negative Binomial regression

For both Negative Binomial regression and Poisson regression, our outcome must be a discrete count. However, we want to incorporate year and population into our fatal police shooting data to find the annualized rate per 1 million residents.

The Negative Binomial and Poisson regression model with rate \(t\) have the same regression equation, \[\ln{\frac{\mu}{t}} = \beta_{0} + \beta_{1}X \implies \mu = te^{\beta_{0}}e^{\beta_{1}X}\] where \(\mu\) is the mean of \(Y\), \(X\) is our given predictor variable (in our simple model m1 case, the Brady scores) and \(\beta_{0}\) and \(\beta_{1}\) are our parameters to be estimated.

To do this in R, as we have included in the models in the Exploratory data analysis section, adding + offset(log(total_pop)) includes our rate \(t\).

We plot the results of our negative binomial regression on the reduced model m1.

Next we analyze the multivariate negative binomial model.

## 
## Call:
## glm.nb(formula = gunshot_tally ~ brady_scores + age + violent_crime + 
##     ownership + offset(log(total_pop)), data = p_df, init.theta = 10.11418313, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9237  -0.7183  -0.2150   0.4537   1.8964  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.142e+01  1.140e+00 -10.020  < 2e-16 ***
## brady_scores  -2.873e-03  3.541e-03  -0.811 0.417106    
## age           -4.620e-02  2.568e-02  -1.799 0.072036 .  
## violent_crime  1.452e-03  4.293e-04   3.383 0.000718 ***
## ownership      1.314e-02  9.678e-03   1.358 0.174454    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(10.1142) family taken to be 1)
## 
##     Null deviance: 92.474  on 49  degrees of freedom
## Residual deviance: 47.613  on 45  degrees of freedom
## AIC: 376.81
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  10.11 
##           Std. Err.:  2.69 
## 
##  2 x log-likelihood:  -364.812

2. Absolute Rate Differences

We group the states into 4 quartiles of firearm legislative strength and calculate absolute rates, i.e., the mean annualized rate of fatal police shootings per 1,000,000 residents. Then we find absolute rate differences between the first group of states (with the fewest/weakest firearm laws) and each of the other 3 groups. Finally we calculate point estimates and confidence intervals for the incidence rate ratios (IRR), where we again compare the 1st group of states to the other 3 groups.

We calculate everything from scratch: the mean rates and shooting counts, the rate differences and ratios, the standard error of log(IRR), and lastly the confidence intervals.

Note that the standard error is calculated from the count data, not the rate data. Suppose \(a\) is the mean number of shootings in the first quartile and \(b\) is the mean number of shootings in the second, third, or fourth quartile. Then the standard error is calculated by \[SE = \sqrt{\frac{1}{a} + \frac{1}{b}}\]

On the natural log scale, the lower and upper bounds of the 95% confidence interval are found with \[log(IRR) \pm [1.96\times{SE}]\] which we exponentiate to get the final interval for the IRR.
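A sketch of the unadjusted calculation (variable names are assumptions, and quartile 1 is taken as the lowest Brady scores, i.e., the weakest laws):

```r
quartile_stats <- p_df %>%
  mutate(brady_quartile = ntile(brady_scores, 4)) %>%    # quartile 1 = weakest firearm laws
  group_by(brady_quartile) %>%
  summarize(mean_rate  = mean(gunshot_rate),
            mean_count = mean(gunshot_tally))

ref <- quartile_stats[1, ]   # reference group: quartile 1
quartile_stats <- quartile_stats %>%
  mutate(rate_diff = ref$mean_rate - mean_rate,
         irr       = mean_rate / ref$mean_rate,
         se        = sqrt(1 / ref$mean_count + 1 / mean_count),
         ci_lower  = exp(log(irr) - 1.96 * se),
         ci_upper  = exp(log(irr) + 1.96 * se))
```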

Change in Overall Fatal Police Shootings by State-Level Firearm Legislative Strength Quartile: United States, January 1, 2015–December 31, 2016 (Unadjusted)

  Firearm Legislative       Absolute    Absolute Rate   Incidence     CI:      CI:
  Strength Quartile         Rate (SD)   Difference      Rate Ratio    Lower    Upper
  1                         4.534       0.000           1.000         0.000    0.000
  2                         4.491       0.043           0.991         0.619    1.585
  3                         2.776       1.759           0.612         0.392    0.955
  4                         2.058       2.476           0.454         0.298    0.692

Next we look at the incidence rate ratios when including our age, gun ownership, and violent crime variables.

## [1] 1.0000000 0.9477462 0.8369995 0.5099855
## [1] 1.0000000 0.9580370 0.9664771 0.8778255

Interestingly, when we add gun ownership the incidence rate ratios move much closer to 1, indicating that the Brady scores have less of an effect on fatal police shooting rates once gun ownership rates are included.

Note that another way to calculate incidence rate ratios is to simply exponentiate a model’s coefficients and confidence intervals. To do this, one would have to make each quartile an indicator variable \(X_{1},\ldots,X_{4}\).

3. Sobel-Goodman mediation test

We just observed that adding gun ownership rates had a large effect on our incidence rate ratios. Because of this, we will consider performing a Sobel test to see if gun ownership is a mediating factor. To understand this relationship, a helpful graphic (from Wikipedia) illustrates the setup:

[Mediation diagram: X → M → Y, with a direct path X → Y] Here X is the Brady scores and other variables, Y is the fatal shooting rate, and M is the mediating factor, gun ownership.

To complete this test we load the mediation package and fit two models: one with the mediating factor ownership as the response variable, and another with ownership included among the predictor variables and gunshot_tally as our response, as before.

More information on this R package can be viewed here.

Negative binomial models are not permitted mediator model types for this R package so we use Poisson instead.
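A sketch of the two fits and the call to mediate(); the exact model specifications are assumptions:

```r
# mediator model: gun ownership as the response
model_m <- lm(ownership ~ brady_scores, data = p_df)

# outcome model: Poisson (negative binomial is not accepted by mediate())
model_y <- glm(gunshot_tally ~ brady_scores + ownership + offset(log(total_pop)),
               family = "poisson", data = p_df)

med_out <- mediate(model_m, model_y,
                   treat = "brady_scores", mediator = "ownership", sims = 100)
summary(med_out)
```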

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                           Estimate 95% CI Lower 95% CI Upper p-value
## ACME (control)           -7.50e-08    -1.96e-07     2.81e-08    0.22
## ACME (treated)           -7.63e-08    -2.02e-07     2.76e-08    0.22
## ADE (control)             5.03e-08    -1.32e-07     2.68e-07    0.60
## ADE (treated)             4.90e-08    -1.33e-07     2.61e-07    0.60
## Total Effect             -2.59e-08    -1.02e-07     5.12e-08    0.36
## Prop. Mediated (control)  1.13e+00    -9.26e+00     1.12e+02    0.58
## Prop. Mediated (treated)  1.13e+00    -9.53e+00     1.13e+02    0.58
## ACME (average)           -7.56e-08    -1.99e-07     2.78e-08    0.22
## ADE (average)             4.97e-08    -1.32e-07     2.65e-07    0.60
## Prop. Mediated (average)  1.13e+00    -9.39e+00     1.13e+02    0.58
## 
## Sample Size Used: 50 
## 
## 
## Simulations: 100
## [1] 1.317641e-09

“The average causal mediation effects (ACME) and the average direct effects (ADE) represent the population averages of these causal mediation and direct effects.”

With this output, one can perform statistical tests on the differences between the “control” and “treated” groups.

Summary of results

Our analysis deviates slightly from the AJPH study. We found that fatal police shootings in 2015 & 2016 were not Poisson distributed, after applying Pearson’s chi-squared test to multiple models. By relaxing the mean = variance assumption and comparing Negative Binomial and quasi-Poisson models, we concluded that a Negative Binomial model is a better fit for this dataset.

To select the appropriate variables for the multivariate model, we calculated Spearman’s \(\rho\) correlation coefficients between all variables and the outcome Y, as well as between variables, to control for potential covariates. Our results and selection closely mirrored those of the AJPH study.

Our Brady score coefficient is negative and statistically significant. Because the model is log-linear, it is interpreted multiplicatively: each additional Brady legislative point is associated with a roughly 0.8% lower expected rate of fatal police shootings (exp(m_nb$coefficients[2]) ≈ 0.992).

However, more analysis should be completed to determine whether this decrease is driven by the increase in legislation or by the decrease in gun ownership. When adding gun ownership into both our Negative Binomial model and our calculations of absolute rate differences, we saw less of an effect from the Brady scores. Sobel mediation tests are one tool that can be used to examine this relationship.