Violent and Property Crime Rate Prediction in the U.S.: A Machine-learning Approach

Author

Ziwen Lu, Xuyan Xiu, Doris Yan

Introduction

Crime is a significant concern in public policy since crime directly affects individual and community well-being. Crime also has a huge impact on the economy as it increases the costs of public safety measures and deters investment. The estimated annual cost of crime is $4.71 - $5.67 trillion (Anderson 2021). The two major types of crimes in the United States are violent crime (homicide, rape, robbery, and aggravated assault) and property crime (burglary, larceny, and auto theft). According to the Federal Bureau of Investigation, there were 5,049,721 property crimes and 809,381 violent crimes reported in the United States in 2022. Therefore, we believe that it is important to understand the factors that predict crime and lend insights to policies that reduce crime and increase social stability.

We are interested in learning about urban planning factors that predict crime rate. Goin, Rudolph, and Ahern (2018) studied the predictors of firearm violence in California with a machine-learning approach. The covariates they use include ACS 5-year estimates of ZCTA-level community characteristics, alcohol outlet density, and climate. With a similar machine-learning methodology to study crimes, we will expand their scope by (1) covering a wider variety of crimes that include both property and violent crimes, (2) using county-level data from the entire U.S. (3) focusing on the socioeconomic status factors as well as the urban planning factors that predict crime.

Crime and covariates datasets are available on Inter-university Consortium for Political and Social Research (ICPSR). All of the datasets are posted by National Neighborhood Data Archive (NaNDA) on ICPSR for storage. The covariates we choose include socioeconomic status and demographic characteristics (Clarke et al., n.d.), school counts (Kim et al., n.d.), street connectivity (Ailshire et al., n.d.), the number of polluting sites (Finlay et al., n.d.), and land cover (e.g., low-, medium-, or high-density development, forest, wetland, open water) (Melendez et al., n.d.).

Previous research support our choice of covariates. There is significant decline in crime following school closure in Philadelphia. The reduction in crime, mainly driven by the reduction in violent crime, is concentrated in places where high schools are closed (Steinberg, Ukert, and MacDonald 2019). Residential blocks with more schools have a significantly higher number of felonious and motor vehicle thefts (i.e., property crimes) (Murray and Swatt 2013). In terms of street connectivity and land, there is significant relationship between traffic congestion and domestic violence. Extreme traffic increases the incidence of domestic violence by 9% in Los Angeles (Beland and Brent 2018). Environmental factors likewise have an effect on crime. Increased pollution has a positive effect on violent crimes in the United States. A 10% reduction in daily PM2.5 and ozone could save $1.4 billion in crime costs per year (Burkhardt et al. 2019). In contrast, the amount and access to green space in the urban environment has a mitigating impact on violent crime (Shepley et al. 2019). The results are important for crime prevention.

Method

Data Retrieval

To make our analysis reproducible, we use the function icpsr_download to download data. The downloaded data is stored in a sub-directory called data in the current directory. The process is repeated every time someone tries to run the code. For the analysis, data retrieval only needs to be done once.

Data Cleaning

There are a few steps in data cleaning.

  1. The unit of analysis in the crime data is county, identified by the 5-digit FIPS county code, whereas the unit of analysis for all predictors are is sub-county area, identified by 11-digit Census Tract FIPS code. Thus, for the socioeconomic status and urban planning predictors, we first sum up the sub-county area with the same first 5-digit FIPS code together. Then, we merge the socioeconomic status and urban planning data with the crime data by the shared 5-digit FIPS code.
  2. We remove duplicate variables. For example, crime and socioeconomic data all have variables for population. However, since we are predicting crime, it would be more appropriate to use the population variable in the crime data, so we drop the population variable in the socioeconomic data. We also intent to drop the value with more than 50% missing values, but after exploring the data, we found no variable has more than 50% missing values.
  3. We think it would be more appropriate to measure crime rate instead of the number of crime in each county because a county with more population would have more crimes. Measuring crime rate thus eliminates population as a strong predictor. Therefore, we generate a new variable for crime rate, which is the number of crime per 100,000 people. We use crime rate to map the average crime rate from 2002 to 2014.
  4. The 2010 data shows that violent crime rate ranges from 9 to 2361, and property crime rate ranges from 31 to 8853. Given the large variation in crime rate, we performed a log transformation for crime rate to reduce the variability and skewness of data.
  5. After transforming census tract to county for the predictors, we filter by year. Predictive data is from 2010. Implementation data is from 2016. School counts, pollution, and land connectivity are all annual measures. We use ACS 5-year estimate of socioeconomic indicators from 2008-2012 on the modeling year 2010, ACS 5-year estimate of socioeconomic indicators from 2013-2017 on the implementation year 2016. We use street connectivity in 2010 on the modeling year 2010, and we use street connectivity in 2020 in the modeling year 2016, assuming street connectivity does not change significantly from 2016 to 2020. In total, we have 106 predictor variables.

Models

We use regularized regression models including ridge and LASSO regression, XGBoost for ensembling, tree-based models, and Multivariate Adaptive Regression Splines (MARS). We split crime rate data into training and testing set. After selecting the model with the lowest RMSE on the testing set from 2010 predictive crime data, we use the best model to predict the 2016 implementation crime data. Finally, we compare our predictions in 2016 with the actual violent and property crime in 2016 and use RMSE to determine the robustness of our models.

For all five models, we replace missing values in all predictor variables with the median of the respective variables using step_impute_median and remove the predictor variables with zero variance using step_nzv. We only normalize our regularized regression models (Ridge and LASSO) using step_normalize because they are more sensitive to the scale of predictor variables.

Metric Selection

Because we want to make predictions and see how well our models perform, we decided to use RMSE as our metric.

Implementation

We compare our predicted crime rate at 2016 with the actual crime rate data from 2016 (United States Department Of Justice. Federal Bureau Of Investigation, n.d.).

Result

Important Variables

Based on the result from our random forest model, most of the important variables for predicting violent and property crime come from the socioeconomic data and few are from the land cover data.

For violent crime, the top three important variables are proportion of people non-Hispanic Black, Disadvantage 1, and proportion of families with Income less than 15K.

For property crime, the top three important variables are Disadvantage1, Proportion of female-headed families with kids, and Disadvantage2.

Disadvantage1 is the mean of proportion of people Non-Hispanic Black, proportion female-headed families with kids, proportion of households with public assistance income, proportion people with income past 12 months below poverty level, and proportion 16+ civilian labor force unemployed.

Disadvantage2 is the mean of proportion female-headed families with kids, proportion of households with public assistance income, proportion people with income past 12 months below poverty level, and proportion 16+ civilian labor force unemployed.

Model Performance

For violent crime prediction, XG Boost has the lowest RMSE. Thus we use this model to predict the violent crime rate in 2016. Our final model fore predicting violent crime rate in 2016 has a RMSE of 0.593.

For property crime prediction, Random Forest has the lowest RMSE. Thus we use this model to predict the property crime rate in 2016. Our final model fore predicting property crime rate in 2016 has a RMSE of 0.459. The unit of RMSE is logarithm of crime rate per 100,000 people.

Discussion

It makes sense that social economic and demographic indicators mostly characterize the outlook of both violent and property crimes. The important variables of both types of crime fall into the economic disadvantage category. Therefore, it is important to target crime prevention effort to areas that are poverty-stricken. Meanwhile, the local governments need to create programs that stimulate racial and ethnic-inclusive economic growth, such as employment program to reduce unemployment rate. Our model allows us to understand the community characteristics of violent crime rate and property crime rate.

Our models are robust to predict the unseen data. Applying our model of violent crime rate on the implementation data in 2016 generates RMSE of 0.593, which means the average difference between our model’s predicted values and the actual values is 0.593. Similarly, for property crime rate, the average difference between our model’s predicted values and the actual values is 0.459. Meanwhile, it is critical to note that the unit of RMSE is logarithm of crime rate per 100,000 people, not crime rate.

Limitation

One of the limitations to our approach is that we assume use street connectivity in 2020 on our implementation year 2016, as we assume street connectivity does not vary significantly between 2016 and 2020.

Additionally, it is uncertain that our covariates of choice outline the entirety of crime rate prediction, because crime is a complicated social issue that implicates many other factors. For example, aside from community characteristics, Goin, Rudolph, and Ahern (2018) also include the number the alcohol outlets and climate. It is possible that with more predictors, our models can have better performance.

The unit of RMSE is logarithm of crime rate per 100,000 people. For convenience we log transform crime rate, but to get a more informative result for policymakers, it may be better to transform log back to crime rate so that it is easier to interpret.

References

Ailshire, Jennifer, Robert Melendez, Megan Chenoweth, and Lindsay Gypin. n.d. “National Neighborhood Data Archive (NaNDA): Street Connectivity by Census Tract and ZIP Code Tabulation Area, United States, 2010 and 2020: Version 2.” https://doi.org/10.3886/ICPSR38580.V2.
Anderson, David A. 2021. “The Aggregate Cost of Crime in the United States.” The Journal of Law and Economics 64 (4): 857–85. https://doi.org/10.1086/715713.
Beland, Louis-Philippe, and Daniel A. Brent. 2018. “Traffic and Crime.” Journal of Public Economics 160 (April): 96–116. https://doi.org/10.1016/j.jpubeco.2018.03.002.
Burkhardt, Jesse, Jude Bayham, Ander Wilson, Ellison Carter, Jesse D. Berman, Katelyn O’Dell, Bonne Ford, Emily V. Fischer, and Jeffrey R. Pierce. 2019. “The Effect of Pollution on Crime: Evidence from Data on Particulate Matter and Ozone.” Journal of Environmental Economics and Management 98 (November): 102267. https://doi.org/10.1016/j.jeem.2019.102267.
Clarke, Philippa, Grace Noppert, Robert Melendez, Megan Chenoweth, and Lindsay Gypin. n.d. “National Neighborhood Data Archive (NaNDA): Socioeconomic Status and Demographic Characteristics of Census Tracts and ZIP Code Tabulation Areas, United States, 2000-2020: Version 3.” https://doi.org/10.3886/ICPSR38528.V3.
Finlay, Jessica M., Robert Melendez, Longrong Pan, Michael Esposito, Anam Khan, Mao Li, Iris Gomez-Lopez, et al. n.d. “National Neighborhood Data Archive (NaNDA): Polluting Sites by Census Tract and ZIP Code Tabulation Area, United States, 1987-2021: Version 2.” https://doi.org/10.3886/ICPSR38597.V2.
Goin, Dana E., Kara E. Rudolph, and Jennifer Ahern. 2018. “Predictors of Firearm Violence in Urban Communities: A Machine-Learning Approach.” Health & Place 51 (May): 61–67. https://doi.org/10.1016/j.healthplace.2018.02.013.
Kim, Min Hee, Mao Li, Dominique Sylvers, Michael Esposito, Iris Gomez-Lopez, Philippa Clarke, and Megan Chenoweth. n.d. “National Neighborhood Data Archive (NaNDA): School District Characteristics and School Counts by Census Tract, ZIP Code Tabulation Area, and School District, 2000-2018: Version 1.” https://doi.org/10.3886/ICPSR38569.V1.
Melendez, Robert, Mao Li, Anam Khan, Iris Gomez-Lopez, Philippa Clarke, and Megan Chenoweth. n.d. “National Neighborhood Data Archive (NaNDA): Land Cover by Census Tract and ZIP Code Tabulation Area, United States, 2001-2016: Version 1.” https://doi.org/10.3886/ICPSR38598.V1.
Murray, Rebecca K., and Marc L. Swatt. 2013. “Disaggregating the Relationship Between Schools and Crime: A Spatial Analysis.” Crime & Delinquency 59 (2): 163–90. https://doi.org/10.1177/0011128709348438.
Shepley, Mardelle, Naomi Sachs, Hessam Sadatsafavi, Christine Fournier, and Kati Peditto. 2019. “The Impact of Green Space on Violent Crime in Urban Environments: An Evidence Synthesis.” International Journal of Environmental Research and Public Health 16 (24): 5119. https://doi.org/10.3390/ijerph16245119.
Steinberg, Matthew P., Benjamin Ukert, and John M. MacDonald. 2019. “Schools as Places of Crime? Evidence from Closing Chronically Underperforming Schools.” Regional Science and Urban Economics 77 (July): 125–40. https://doi.org/10.1016/j.regsciurbeco.2019.04.001.
United States Department Of Justice. Federal Bureau Of Investigation. n.d. “Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, United States, 2016: Version 3.” https://doi.org/10.3886/ICPSR37059.V3.