Volume 28, Issue 5 pp. 1377-1399
RESEARCH ARTICLE
Open Access

Predicting and analyzing crime—Environmental design relationship via GIS-based machine learning approach

Gamze Bediroglu

Corresponding Author

Architecture and City Planning Department, Kilis 7 Aralık University, Kilis, Turkey

Correspondence

Gamze Bediroglu, Architecture and City Planning Department, Kilis 7 Aralık University, Kilis 79000, Turkey.

Husniye Ebru Colak

Department of Geomatics Engineering, Karadeniz Technical University, Trabzon, Turkey

First published: 05 June 2024
Citations: 3

Abstract

Understanding the correlation between burglary and urban environmental characteristics is crucial for explaining why crime events occur. Machine learning (ML) models combined with a geographic information system (GIS) can establish mathematical relationships between crime and the factors associated with it. The main objective of this research is to analyze and predict burglary crime events by applying ML-based GIS models for Trabzon, Turkey. Random forest regression (RFR) and support vector regression (SVR) were implemented to predict crime. Correlation between crime and urban physical environmental metrics was used in the prediction model. According to the results of the analysis, the R² value was 0.78 with the RFR algorithm and 0.71 with the SVR algorithm. Building height, floor area ratio, building density, and street intersection density are the four most important variables, and each is positively associated with the burglary crime rate. Conversely, the variable with the lowest effect on burglary crime is the ratio of park area to residential area.

1 INTRODUCTION

Crime is a significant threat and one of the biggest problems in many countries. It complicates law enforcement and crime reduction activities and disrupts living conditions and social trust. To reduce the risk of victimization and to create a more secure environment and a better quality of life, it is necessary to analyze the factors related to crime and find effective methods of crime prevention. Crime analysis supports crime prevention by reducing wasted time and resources. Successful crime prevention efforts promote a safer community by enhancing the perception of safety and the attitudes and behaviors that help people feel safe (Kubilay, 2009). However, many factors affect criminal events, including psychological, social, physical, and other environmental factors. Crime analysis is the qualitative and quantitative study of crime and law enforcement information in combination with sociodemographic and spatial factors to apprehend criminals, prevent crime, reduce disturbance, and evaluate organizational procedures (Boba, 2001).

In order to develop strategies for crime prevention, the spatial and temporal patterns of crime should be defined. In this context, it is necessary to determine where past crime events cluster and how they are distributed spatially by type and time. Crime analyses should therefore be conducted using geographic information system (GIS)-based methods. Spatial data analysis through GIS is becoming more popular in crime analysis (Butorac & Marinovic, 2017), and GIS can be used as a decision-support system to find better solutions to reduce crime (Achu & Rose, 2016). The development of affordable GIS and the increasing technological developments within policing (such as the digitization and geocoding of crime records) have allowed researchers to exploit the wealth of data collected by police authorities and map crime (Henrico et al., 2022). However, crime analysis requires accurate and comprehensive criminal records. In addition, this is a big data problem: useful information about criminal events must be extracted from large data sets through data mining.

Crime prediction and criminal identification are major challenges for law enforcement and intelligence-gathering organizations because a tremendous amount of crime data exists and it grows very fast. For these reasons, there is a need for technology that speeds up case-solving (Sri et al., 2020) and for an approach that combines quick resolution of criminal events with a good prediction model. Artificial intelligence methods for inference, decision-making, optimization, and prediction offer a better approach to crime prediction. In particular, machine learning (ML) algorithms may be used to discover and generalize non-trivial relationships between geographical information and other factors or quantities, turning them into reusable predictive models (Cichosz, 2020).

In the literature, studies generally focus on crime prediction and comparing the performances of ML algorithms used in crime prediction. This study provides a methodological framework to analyze the relationship between urban environmental factors affecting crime incidents and crime rates. The application of the ML algorithm provides significant advantages in determining to what extent each factor affects crime.

The objective of this research is to apply appropriate machine learning algorithms to crime data to predict burglary crimes. We focus on the environmental and physical factors of crime, which include its spatial variables. Two ML models, random forest regression (RFR) and support vector regression (SVR), were implemented to predict crime through the existing correlations between crime and urban physical environmental metrics. The targeted outputs were tested in a study area, and the relationship between crime and environmental design was determined with an ML-based GIS application for the analysis and prediction of crime events.

2 LITERATURE REVIEW

2.1 Machine learning in criminology

ML is a tool for turning information into knowledge, and ML techniques automatically find valuable underlying patterns within complex data that would otherwise be difficult to discover (Edwards, 2018).

ML for crime analysis includes data collection, classification, pattern identification, prediction, and data visualization. Traditional data mining techniques—association analysis, classification and prediction, cluster analysis, and outlier analysis—identify patterns in structured data, while more recent techniques identify patterns from both structured and unstructured data (Chen et al., 2004; Kim et al., 2018).

Several researchers have addressed problems related to crime control. The variety of statistical methods and ML algorithms used for crime prediction depends on the problem to be solved, the data (distribution, multicollinearity, noise, etc.) and the expected results (regression, classification, causality of crime, etc.) (Matijosaitiene, Zhao, et al., 2019). The prediction accuracy of the ML algorithm depends on the characteristics selected and the dataset used as a reference. In addition, each algorithm has its own advantages and disadvantages in terms of complexity, accuracy, and training time, and can provide different results from a single data set.

2.2 Machine learning algorithms

In this section, information about some ML studies and algorithms used in crime prediction is given.

Ahishakiye et al. (2017) discuss the development of a prototype crime prediction model that predicts violent crimes using the decision tree algorithm. In their experiments, the decision tree algorithm predicted crime data with an accuracy of 94%. Kim et al. (2018) used K-nearest neighbors (KNN) and boosted decision trees as ML predictive models for crime prediction. They used all crimes in Vancouver since 2003 and two different approaches. In the first approach, all categorical variables are converted into binary (0/1) variables. In the second approach, categorical variables are converted into numerical variables with unique IDs. KNN's accuracy was 40.1% for approach 1 and 39.9% for approach 2, while the boosted decision tree's accuracy was 41.9% for approach 1 and 43.2% for approach 2. Alves et al. (2018) used random forest regression to predict crime and quantify the influence of urban indicators on homicides. They used homicide data between 2001 and 2010 and selected 10 urban indicators (child labor, elderly population, female population, gross domestic product, illiteracy, family income, male population, population, sanitation, and unemployment) as predictor variables of crime. Their approach achieved up to 97% accuracy in crime prediction. The results revealed that unemployment and illiteracy were the most important variables in describing homicides in Brazilian cities, and the study also ranked the urban indicators by their importance in crime prediction. Marchant et al. (2018) applied ML techniques to data on particular offenses, such as domestic violence-related assaults, burglaries, and motor vehicle theft, for the period 2009–2013 in the state of New South Wales (NSW), Australia. A fully probabilistic algorithm based on ML techniques was used to implement a Bayesian approach. 
It is argued that this fully probabilistic approach improves prediction by measuring uncertainties more accurately and will benefit policy makers and police organizations seeking to prevent and control crime. The approach aims to model the dependency between offense data and environmental factors, such as demographic characteristics and spatial location. McClendon and Meghanathan (2015) implemented linear regression, additive regression, and decision stump algorithms to demonstrate how effective and accurate the ML algorithms used in data mining can be at predicting violent crime patterns. Linear regression performed best among the three algorithms, and the decision stump performed worst. The relatively poor performance of the decision stump algorithm can be attributed to the randomness factor; decision trees have more rigid branches and only produce accurate results if the test set follows the pattern modeled. On the other hand, the linear regression algorithm can handle randomness in the test samples to a certain degree (without too much prediction error). In another study, Angelov et al. (2020) focused on how different types of offenses affect residential property sale prices in an urban county in Washington and on which ML algorithms are more effective in predicting sale values. One data source contains the physical attributes of a property, such as the square footage, quality, and the year built and/or remodeled, and another contains the crime data (assault, burglary, traffic, drug, fraud, homicide, theft, vandalism, etc.) from July 2018 to July 2019. They built models with three algorithms: decision trees, artificial neural networks, and random forests. Their study showed that random forest models produced the lowest error values. 
In addition, using the information gained from the random forest model, the features from the most important to the least significant were identified for the prediction. They concluded that crime is an important factor in predicting the selling price of residential properties.

According to the studies above, many different algorithms have been used in crime studies in the literature. In general, it has been observed that decision tree-structured algorithms are used more in crime studies and have high performance. However, studies examining the relationship between burglary and environmental factors and interpreting the results from a theoretical perspective are limited.

2.3 Spatial factors of crime

Crimes do not occur equally in all places or in the same way; crime is concentrated in some places more than in others. Criminal events are most likely to occur where the activity space of offenders overlaps with the activity space of potential victims or targets (Brantingham & Brantingham, 1991). Crime is also affected by the characteristics of the physical environment, which include the spatial variables of crime. Therefore, it is very important to determine the relationship of crimes with land use and environmental design in order to reduce crime events.

Various theories in environmental criminology have been proposed to identify environmental factors that may affect criminal events and to prevent criminal behavior, including Defensible Space Theory (Newman, 1972), Crime Prevention through Environmental Design (Jeffery, 1972), Routine Activity Theory (Cohen & Felson, 1979), Situational Crime Prevention Theory (Clarke, 1980), Space Syntax (Hillier & Hanson, 1984), and Crime Pattern Theory (Brantingham & Brantingham, 1991). The focus of many of these theoretical frameworks lies in identifying which variables make certain places more prone to crime (Breetzke & Pearson, 2015).

In this study, one of the most important approaches is Crime Prevention through Environmental Design (CPTED). According to the CPTED guidelines of New Zealand's Ministry of Justice (2005), seven qualities characterize well-designed and safe places: (1) access through safe movements and connections; (2) monitoring and visibility lines; (3) layout, with a clear and logical orientation; (4) activity mix; (5) sense of ownership through caring for the place; (6) well-designed environments; and (7) physical protection by means of an active security measure (Kamal & Suk, 2018).

CPTED is applied in architectural and urban planning to eliminate criminal opportunities through a comprehensive analysis of three main elements that lead to crime: motivated criminals, vulnerable victims, and environmental opportunities (Kang, 2013). There are two important components in the approach to crime prevention through environmental design. These are urban strategies and environmental design attributes.

Planning interventions contribute to the decline in crime rates (Newman, 1973; Schneider & Kitchen, 2007) and to the reduction of the fear of crime (Kubilay, 2009; Shaftoe, 2004). At the planning scale, well-designed urban land-use strategies can lead to a reduction in criminal events.

Land use is discussed as a factor that can affect the opportunity for crime (Hirschfield, 2008; Ludin et al., 2013; Sypion-Dutkowska & Leitner, 2017). Moreover, crime prevention through a planning approach focuses on mixed land use and diversity of land use (Jacobs, 1961; Jeffery, 1972; Matijosaitiene, Zhao, et al., 2019; Newman, 1973; Sohn, 2016a). According to the mixed-use principle, combining residential uses with commercial uses makes neighborhoods safer. Urban activities promoted by the diversity of land use can enhance natural monitoring, discouraging criminal activities (Cozens, 2008; Jacobs, 1961; Sohn, 2016a; Subbaiyan & Tadepalli, 2012).

Jacobs (1961) holds that streets with pedestrian and vehicular traffic, shops and cafes open at night, and residents living in apartments facing the street are safer. Newman (1973) holds that recreational areas such as parks should be located next to residential areas. In addition, Stankevice et al. (2013) found that the inclusion of specialized areas and greenery in dense residential areas contributes to crime prevention on the streets.

At the design scale, environmental features relate to the configuration of physical environments to prevent crime. Environmental design attributes include site street design, visibility/scrutiny/sightlines, attractiveness, territorial/entry definition, and finding help (Kubilay, 2009). These attributes have been used as crime prevention factors in many studies. They include building height (Chang, 2009; Moon et al., 2014; Yavuzer, 2013), building position and its connection to the street (Chang, 2009; Lin, 2010; Moon et al., 2014), street density (Chowdhury, 2014; Hillier & Sahbaz, 2009; Kang et al., 2014; Sohn, 2016b), street width (Kang et al., 2014; Moon et al., 2014), street pattern (Chang, 2009; Chowdhury, 2014; Kamal & Suk, 2018; Kang, 2013; Kubilay, 2009; Matijosaitiene, McDowald, et al., 2019; Sakip & Mustafa, 2019), lighting (Chowdhury, 2014; Kamal & Suk, 2018; Kang, 2013), closed-circuit television (CCTV) (Ditton et al., 1999; Kang, 2013; Lin, 2010; Moon et al., 2014), and landscape design (Chowdhury, 2014; Donovan & Prestemon, 2012; Kuo & Sullivan, 2001; Lin, 2010).

3 MATERIALS AND METHODS

3.1 Study area

To achieve the objective of this study, it was planned to carry out the application in a test region for which the province of Trabzon was selected. The study area is Trabzon city in the eastern Black Sea Region in Turkey (Figure 1).

FIGURE 1. Study area, Trabzon, Turkey.

Trabzon is a city on the eastern Black Sea coast, located between 38°30′–40°30′ east longitude and 40°30′–41°30′ north latitude. According to data from the Turkish Statistical Institute, Trabzon had a population of 811,901 in 2020 and covers an area of 4685 km² (URL-1, n.d.).

Trabzon was chosen because a complete, up-to-date, and accurate spatial database could be created there to analyze the relationship between criminal events and the entry criteria. A comprehensive spatial data set is not available for much of Türkiye. Trabzon was considered a better alternative despite some minor data problems.

3.2 Methodology

In this study, a crime prediction model was created by investigating the relationship between the physical environmental factors affecting crime and the rate of burglary. Since crime events occur in specific locations and are a type of behavior influenced by physical environmental characteristics, the physical environmental characteristics and design of cities are very important in crime prediction. ML focuses on learning from data and improves its performance with experience. In ML, algorithms are trained to find patterns and correlations in large data sets and to make the best predictions based on that analysis. Two ML algorithms, RFR and SVR, were used. These methods are detailed below under the heading “Methods of learning of used machines.” The data were divided into two categories: crime event data are the dependent variable, and physical environmental data are the independent variables. The variables for the ML model were measured using GIS software and its spatial analysis extensions. GIS is a decision support system that allows spatial analysis and visualization of data. For variable measurement, the study area was divided into 200 × 200 m grid cells, and a total of 812 cells were used. The ML model was implemented with the Python Scikit-learn library. The 200 × 200 m grid size was chosen because, during GIS-based model design and ML performance testing, square cells (100 × 100, 200 × 200, 300 × 300, … 600 × 600 m) and rectangular cells of different dimensions (e.g., 200 × 400 m) were tested repeatedly, and the 200 × 200 m grid offered the best performance in this study. If the grid is too small, incidents concentrate in only a few cells, while a larger grid reduces the spatial resolution (Rummens & Hardyns, 2021; Zhang et al., 2022).
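The gridding step described above can be illustrated with a minimal sketch. It assumes a rectangular extent in projected (metre) coordinates; the function name `build_grid` and the example extent are hypothetical and are not taken from the study, which performed this step in GIS software.

```python
import math

def build_grid(xmin, ymin, xmax, ymax, cell_size=200.0):
    """Tessellate a bounding box into square cells of side `cell_size` metres.

    Returns a list of (col, row, x0, y0, x1, y1) tuples, one per cell.
    """
    n_cols = math.ceil((xmax - xmin) / cell_size)
    n_rows = math.ceil((ymax - ymin) / cell_size)
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = xmin + col * cell_size
            y0 = ymin + row * cell_size
            cells.append((col, row, x0, y0, x0 + cell_size, y0 + cell_size))
    return cells

# Hypothetical 1000 m x 600 m extent, not the real study area.
grid_200 = build_grid(0.0, 0.0, 1000.0, 600.0, cell_size=200.0)

# Comparing candidate cell sizes shows how the number of cells changes.
cells_per_size = {
    size: len(build_grid(0.0, 0.0, 1000.0, 600.0, cell_size=size))
    for size in (100.0, 200.0, 300.0)
}
```

A smaller cell size yields many more cells, each holding fewer incidents, which is the trade-off the paragraph above describes.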

This study differs from previous studies on similar topics in the following ways. It uses many more parameters than existing studies. While the model was being built, the sub-parameters of each parameter were tested many times with different measurements and data models, and the values that gave the best performance and accuracy were used. Raw data were collected and processed because no ready-made GIS data sets were available for evaluating the factors. On the other hand, some spatial parameters intended for the study could not be used because their data were not stored digitally or were not up to date.

The study methodology consists of (1) data collection and preparation, (2) preprocessing of data and modeling, (3) predictive modeling of crime using two ML models and assessment of the impact of variables on crime, and (4) model validation and prediction for the whole study area. Detailed descriptions are presented as follows.

3.3 Data collection and preparation

3.3.1 Crime data

The main data source for this study was the crime data reports provided by the Trabzon Police Department. The reports were obtained in raw format. Crime data are not open data in Turkey because of security concerns and the privacy of personal data. The data were obtained from the Police Department with special authorization for use in scientific studies and in a way that protects the confidentiality of personal information. In this context, the qualitative information and location information of the crime are important. The data contained crime records with information on crime type, date, location of each crime, and the age and sex of the offender. The original crime data included 20,034 crime events that occurred in the Ortahisar District of Trabzon over the past 5 years. According to the recorded events, the most frequent type of crime is violent crime, which accounts for 43% of all crime events. Burglary is the second most frequent type, accounting for 13% of total crimes. This study focused on burglary, and a total of 2236 burglary crime events were extracted from the original crime data. Burglary was selected because its rate is quite high compared with other crime types and has increased continuously for 5 years. In addition, burglaries are not randomly distributed; they are affected by place and by features of the physical environment.

Recorded crime data were in Excel format, and most of the records do not include coordinates (they include street information instead); this was resolved by geocoding the street information in the GIS environment. Prior to geocoding, the crime data were cleaned of duplicate records, incorrect addresses, and street name errors. As a result, 98% of the originally recorded crimes were successfully geocoded. Crime data are continuous (collected 24/7) over a 5-year interval, whereas the GIS data set is static and was collected at certain times. Updated satellite images of the study area showed that there were no dominant environmental changes.

3.3.2 Physical environmental data

Many factors affect criminal events, including psychological, social, physical, and other environmental factors. Physical environmental data thought to influence the occurrence of crime events were obtained from the cadastral institution, the municipal institution, and the Karadeniz Technical University GISLab Research and Development Laboratory. Since the plans and the current situation did not match in some regions, satellite images were checked and the deficiencies were corrected. Most of the data were provided in CAD format and converted to GIS format. After conversion, the data were edited, detailed, and standardized, and the required additional data were entered into the system. Data in different coordinate systems were transformed to a common coordinate system.

These data sets were the main data sources for the GIS analysis in the model and were used as independent variables in the analysis. Detailed information on the attributes of the GIS data set is summarized in Table 1.

TABLE 1. GIS datasets for environmental components.
Datasets       | Data types    | Attributes
Parcels        | Polygon shape | Parcel area
Buildings      | Polygon shape | Height, dominant use
Roads          | Polygon shape | Length, intersection of the street segments
Land use types | Polygon shape | Distance and location

Some environmental factors that affect the occurrence of crime (bus stops, building position and street connection, obstacles, landscape design elements, signs, etc.) could not be included because their data are difficult to obtain, they are too micro-scale for this study, or they are not held in any organization's database. This is a limitation of our study in terms of data supply.

3.4 Variables used in this study

3.4.1 Burglary crime count (dependent variables)

The study focuses on the burglary analysis. Burglary crime count is calculated using a spatial unit of analysis of the same grid and is used as the dependent variable of our model. The total number of burglary offenses in the investigation area is 2236 in 5 years. In this study, burglary crime count was defined as the total number of burglary crimes that occurred in a grid (grid size is 200 × 200 m). This process was applied to each grid in the entire study area through GIS vector data analysis techniques. Figure 2 shows the spatial distribution of burglary crimes and in which regions such crimes are more or less intense. Figure 2a shows the distribution of burglary crime points and Figure 2b shows the distribution of total crime count in each grid in the study area.
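As a minimal sketch of the per-grid counting described above, the snippet below bins point events into 200 × 200 m cells. It assumes projected (metre) coordinates and a known grid origin; the function names and the sample points are hypothetical, and the study itself used GIS vector analysis tools for this step.

```python
from collections import Counter

def cell_index(x, y, xmin, ymin, cell_size=200.0):
    """Return the (col, row) index of the grid cell containing point (x, y)."""
    return (int((x - xmin) // cell_size), int((y - ymin) // cell_size))

def count_per_cell(points, xmin, ymin, cell_size=200.0):
    """Count how many points fall in each grid cell."""
    return Counter(cell_index(x, y, xmin, ymin, cell_size) for x, y in points)

# Hypothetical geocoded burglary locations (metres), not real data.
events = [(50.0, 50.0), (150.0, 199.0), (250.0, 10.0), (410.0, 390.0)]
burglary_count = count_per_cell(events, xmin=0.0, ymin=0.0)
```

Cells without any event are simply absent from the counter and read as zero, which matches a dependent variable defined as a count per grid.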

FIGURE 2. GIS processing to calculate crime count: (a) distribution of burglary crime events and (b) distribution of total crime count in each grid in the study area.

3.4.2 Physical environmental factors (independent variables)

After reviewing the literature on the physical environmental factors of crime and considering data availability, 20 variables based on four main components were selected as independent variables. The first component is the environmental design attribute (Section 3.4.2.1), which consists of building density, building height, floor area ratio, street density, and street design; the second is urban environmental planning (Section 3.4.2.2), which is made up of land use types; the third is mixed-use (Section 3.4.2.3); and the fourth is land-use diversity (Section 3.4.2.4). These variables may affect burglary crime count positively or negatively. The predictor variable data used for the crime prediction model are shown in Figure 3 for a part of the study area. The following sections present how each variable was calculated for the study model and discuss how these variables affect crime and how environmental principles relate to crime.

FIGURE 3. Predictor variable data considered for the crime prediction model.

3.4.2.1 Environmental design attribute

3.4.2.1.1 Building density

This measure was calculated by dividing the total number of buildings in each grid by the area of the grid. In areas with high building density, space control is rather difficult because many people share the same place, and it is difficult to know each occupant and to determine who belongs to the area and who is an outsider (Kubilay, 2009). According to Newman (1973), the more people share an area, the less sense of responsibility they have for it. On the other hand, the larger number of people that comes with high building density increases natural surveillance and can reduce crime rates.

3.4.2.1.2 Building height

This measure was calculated by dividing the total number of building floors in each grid by the area of that grid. According to Newman's “Defensible Space” theory, building height has an adverse effect on burglary (Newman, 1972). He found a direct connection between building height and the occurrence of crime: burglaries occur at higher rates in high-rise buildings than in their lower-rise counterparts (Schneider & Kitchen, 2007; Yavuzer, 2013). As a result of his research, Newman (1972) set the maximum number of building floors at 5, while Alexander (1977) set it at 4. On the other hand, a building's height is closely related to visibility. According to Newman's theory, good visibility is generally positive because it enables natural surveillance from surrounding spaces. However, Chang's (2009) study showed different results. In examining the correlation between visibility and burglary, he found that the burglary rate was highest for buildings with very good visibility (42.9%), followed by those with poor visibility (24.9%), and lowest for those with average visibility (3.6%).
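The building density and building height measures defined in Sections 3.4.2.1.1 and 3.4.2.1.2 are both simple per-grid ratios. The following is a minimal sketch under the assumption that each grid cell is a full 200 × 200 m square and that the floor count of every building in the cell is known; the function names and sample values are hypothetical.

```python
GRID_AREA_M2 = 200.0 * 200.0  # area of one 200 x 200 m grid cell

def building_density(n_buildings, grid_area=GRID_AREA_M2):
    """Number of buildings in a grid cell divided by the cell area."""
    return n_buildings / grid_area

def building_height_measure(floors_per_building, grid_area=GRID_AREA_M2):
    """Total number of building floors in a grid cell divided by the cell area."""
    return sum(floors_per_building) / grid_area

# Hypothetical cell with four buildings of 3, 5, 4, and 8 floors.
floors = [3, 5, 4, 8]
density = building_density(len(floors))
height = building_height_measure(floors)
```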

3.4.2.1.3 Floor area ratio

Floor area ratio is the maximum allowed construction volume for a planned parcel area. It is obtained by dividing the total floor area that can be built on the parcel (which depends on the number of floors) by the size of the same parcel. The grid-level measure was calculated by dividing the average floor area ratio in each grid by the area of that grid. The floor area ratio affects crime rates. Moon et al. (2014) found a positive relationship between the occurrence of crime and floor area ratio and argued that a safer city could be promoted by improving the urban physical environment.
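The two-step calculation described above (a ratio per parcel, then an average per grid divided by the grid area) can be sketched as follows. The data layout, function names, and sample parcels are assumptions made for illustration; returning 0.0 for a grid without parcels is also an assumption, since the text does not say how such cells were handled.

```python
def parcel_far(total_floor_area, parcel_area):
    """Floor area ratio of one parcel: buildable floor area / parcel area."""
    if parcel_area <= 0:
        raise ValueError("parcel_area must be positive")
    return total_floor_area / parcel_area

def grid_far_measure(parcels, grid_area=40000.0):
    """Average parcel FAR in a grid cell divided by the cell area.

    `parcels` is a list of (total_floor_area, parcel_area) pairs in m^2.
    An empty grid returns 0.0 (assumption).
    """
    if not parcels:
        return 0.0
    mean_far = sum(parcel_far(f, a) for f, a in parcels) / len(parcels)
    return mean_far / grid_area

# Hypothetical parcels: FAR 2.0 and FAR 1.0, so the mean FAR is 1.5.
sample_parcels = [(1200.0, 600.0), (900.0, 900.0)]
far_measure = grid_far_measure(sample_parcels)
```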

3.4.2.1.4 Street density

This measure was calculated by dividing the total length of streets in each grid by the area of that grid. Improved street networks are expected to enhance natural surveillance (Jacobs, 1961; Johnson & Bowers, 2010), but they adversely affect access control because they increase the permeability of a neighborhood (Brantingham & Brantingham, 1993; Newman, 1972; Sohn, 2016a). On dense street networks, pedestrian and vehicular traffic provides a safe environment for people. Hill and Blears (2004) describe this as follows: the absence of vehicular traffic leads to reduced surveillance and increased crime rates, and makes pedestrians feel lonely and insecure, especially after dark.
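A minimal sketch of the street density calculation is given below. It assumes that street centrelines are available as polylines in projected (metre) coordinates and have already been clipped to the grid cell; the function names and coordinates are hypothetical.

```python
import math

def polyline_length(coords):
    """Length of a polyline given as a list of (x, y) vertices in metres."""
    return sum(math.dist(p, q) for p, q in zip(coords, coords[1:]))

def street_density(streets, grid_area=40000.0):
    """Total street length inside a grid cell divided by the cell area."""
    return sum(polyline_length(s) for s in streets) / grid_area

# Hypothetical streets inside one 200 x 200 m cell: 200 m and 250 m long.
streets = [
    [(0.0, 0.0), (200.0, 0.0)],
    [(100.0, 0.0), (100.0, 150.0), (200.0, 150.0)],
]
density = street_density(streets)
```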

3.4.2.1.5 Street design

This measure also examines the effect of street density on burglary crime. It was calculated by dividing the total number of street intersections in each grid by the area of that grid. A high number of street intersections and turns increases the connections from one street to another, making it easier for criminals to escape. Therefore, higher connectivity may weaken security because it increases the number of escape routes available to offenders (Brantingham & Brantingham, 1993). Beavon (1984) argues that street connectivity has more impact on crimes committed by people who learn areas by motor vehicle rather than on foot (Kubilay, 2009). However, pedestrian activity increases with greater street connectivity (Cervero et al., 2009; Saelens et al., 2003), which may improve the opportunity for natural surveillance and activity support (Sohn, 2016a). Another approach in this regard is Space Syntax, which calculates the accessibility of each street segment from all other street segments within a spatial system (Hillier & Hanson, 1984). More integrated streets, which are more accessible from other streets, are likely to attract more pedestrians, while less integrated streets cannot be reached as easily (requiring many turns) and may attract fewer pedestrians (Koohsari et al., 2016; Kostakos, 2010; Peponis et al., 1997).
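One way to count intersections for this measure is sketched below. It assumes that the street network has been split into segments that share exact endpoint coordinates, and it treats a node as an intersection when at least three segment ends meet there; both the threshold and the function names are assumptions, not the procedure reported in the study.

```python
from collections import Counter

def count_intersections(segments, min_degree=3):
    """Count nodes where at least `min_degree` segment ends meet.

    `segments` is a list of ((x0, y0), (x1, y1)) pairs with shared endpoints.
    """
    degree = Counter()
    for start, end in segments:
        degree[start] += 1
        degree[end] += 1
    return sum(1 for d in degree.values() if d >= min_degree)

def intersection_density(segments, grid_area=40000.0):
    """Number of intersections in a grid cell divided by the cell area."""
    return count_intersections(segments) / grid_area

# Hypothetical four-way crossing at (100, 100) inside one grid cell.
segments = [
    ((0, 100), (100, 100)),
    ((100, 100), (200, 100)),
    ((100, 0), (100, 100)),
    ((100, 100), (100, 200)),
]
n_intersections = count_intersections(segments)
```

Dead ends have degree 1 and simple bends have degree 2, so neither is counted under this threshold.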

3.4.2.2 Urban environmental planning

3.4.2.2.1 Land use types

The shortest distance from the midpoint of each grid to each land use area was measured. This measurement was made separately for each type of land use. Some types of land use may reduce crime and others may increase it. Planning without examining the effects of land use can therefore increase crime rates, and inappropriate land use provides offenders with an opportunity to commit a crime. In this study, 20 land use types that are most commonly used in crime studies were extracted for the analysis. These include security force buildings, school buildings, health buildings, military buildings, religious buildings, industrial buildings, public buildings, hotel buildings, parks, social facilities, sports facilities, and gas stations.
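The nearest-distance measurement described above can be sketched as follows. For simplicity, the sketch represents each land use feature by a single point, whereas the study's land use data are polygons; the dictionary layout, function name, and coordinates are hypothetical.

```python
import math

def nearest_distance(centroid, feature_points):
    """Shortest straight-line distance from a grid midpoint to any feature."""
    return min(math.dist(centroid, p) for p in feature_points)

# Hypothetical features per land use type (metres), simplified to points.
land_use = {
    "school": [(500.0, 100.0), (100.0, 400.0)],
    "park": [(130.0, 140.0)],
}
centroid = (100.0, 100.0)  # midpoint of the cell spanning 0-200 m in x and y

# One distance variable per land use type for this grid cell.
distances = {
    use: nearest_distance(centroid, points) for use, points in land_use.items()
}
```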

3.4.2.3 Mixed-use

This measure was calculated by dividing the total number of buildings with both commercial and residential use by the total number of buildings with only residential use in each grid. According to Jacobs (1961), streets that have both residential and commercial use 24 h a day are safe streets, and she asserted that mixed use can generate street activity, promoting the social control benefits of “eyes on the street.” The advocates of mixed-use neighborhoods claim that combining commercial and residential uses can reduce crime by increasing surveillance opportunities, fostering social interaction, and promoting a sense of community and social control (Cozens, 2008; Sohn, 2016b). In contrast, others argue that the mix of commercial and residential uses creates gaps in territoriality, and that the greater sense of anonymity, combined with residents' shrunken territory of responsibility under an increased land-use mix, escalates the risk of crime (Browning et al., 2010).

3.4.2.4 Land-use diversity

3.4.2.4.1 Ratio of commercial area to residential area

This measure was calculated by dividing the total commercial use parcel area by the total residential use parcel area in each grid. This variable is an indicator of land-use diversity. An increase in the proportion of commercial areas to residential areas generates more street activity and increases social control through natural surveillance. Activity support and natural surveillance create a safer street environment. On the other hand, the impact may vary depending on the type of commercial activity (shops, restaurants, or offices/factories) or the time of day (e.g., day/night). Browning et al. (2010) showed that, when the ratio of commercial use to residential use increases beyond a certain threshold, land-use diversity can reduce murder and heavy assault, but not robbery.

3.4.2.4.2 Ratio of parks area to residential area

This measure was calculated by dividing the total park use parcel area by the total residential use parcel area in each grid. This variable is an indicator of land-use diversity. According to Jacobs (1961), parks should be designed as a part of their surrounding environment and should be used by the people. As a result, parks increase support for activities. Newman (1973) argues that recreational areas such as parks should also stand alongside residential projects, because the activities in the parks may then be under natural surveillance by the inhabitants, depending on the crime and the time of day.
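The three ratio variables (mixed-use, commercial to residential area, and park to residential area) share the same form, so they can be sketched together. The field names and values below are hypothetical, and returning None for a grid without residential use is an assumption; the source does not say how such grids were handled.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


# Hypothetical totals for one grid cell (areas in square metres).
cell = {
    "mixed_buildings": 6,          # buildings with commercial and residential use
    "residential_buildings": 24,   # buildings with residential use only
    "commercial_area": 1500.0,
    "residential_area": 6000.0,
    "park_area": 300.0,
}

mixed_use = safe_ratio(cell["mixed_buildings"], cell["residential_buildings"])
commercial_ratio = safe_ratio(cell["commercial_area"], cell["residential_area"])
park_ratio = safe_ratio(cell["park_area"], cell["residential_area"])
print(mixed_use, commercial_ratio, park_ratio)
```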

Figure 3 shows how environmental parameters are distributed in the study area. Figure 3a shows the data used in the GIS analysis to measure building density and building height; Figure 3b shows the data used to measure street density and street design; Figure 3c shows the data used to measure the distance to the closest land use for each land use type; Figure 3d shows the data used to measure mixed-use, the ratio of commercial area to residential area, and the ratio of parks area to residential area.

3.5 The machine learning methods used

The ML model considered in our study is based on supervised learning techniques, given that labeled training data were available. Supervised learning takes two forms, namely classification and regression. Our study is a regression problem: unlike classification, which assigns a class, regression predicts a continuous quantity. Regression models based on ML can handle all the above-mentioned issues and are more suitable for the analysis of large complex data sets (Alves et al., 2018; Breiman, 2001).

Two ML models, random forest (RF) regression and SVR, were used to predict crime through existing correlations between crime and urban environmental factors (independent variables). Linking the variables to the ML models depends on the careful creation of sub-factors; buffer distances, attribute types, and values are crucial at this point. The practical relations between the variables are then established directly by the ML algorithms during the ML analysis stage. The ML models (RF and SVR) were built using the normalized values of the following variables: building density, building height, floor area ratio, street density, street design, distance to land use types (security force buildings, school buildings, health buildings, military buildings, religious buildings, industrial buildings, public buildings, hotel buildings, parks, social facilities, sports facilities, and gas stations), mixed-use, ratio of commercial area to residential area, and ratio of park area to residential area.

Environmental factors such as land use types and environmental design attributes may help to improve the accuracy of crime prediction, but this type of data is not available for all locations. Therefore, these two models were trained and their performances were compared in crime prediction modeling.

3.5.1 Random forest regression

RF is a popular and powerful ML algorithm that can handle both classification and regression problems. RF is trained with bagging (bootstrap aggregating), which is an ensemble learning technique. Ensemble learning is an ML paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. Bagging, which often considers homogeneous weak learners, trains them independently of each other in parallel and combines them following some kind of deterministic averaging process. The bagging approach aims to reduce variance and to produce an ensemble model that is more robust than the individual models composing it. Boosting, which also often considers homogeneous weak learners, trains them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy (Rocca, 2019).
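The bagging idea can be sketched in a few lines: each tree is fitted on a bootstrap sample (drawn with replacement) and the predictions are averaged. This is an illustrative sketch on synthetic data, not the authors' implementation; the helper name bagged_predict, the data, and the number of trees are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for normalized environmental variables.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 4 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.2, size=200)


def bagged_predict(X_train, y_train, X_new, n_trees=50, seed=0):
    """Average the predictions of trees fitted on bootstrap samples."""
    sampler = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_trees):
        idx = sampler.integers(0, n, size=n)  # sample rows with replacement
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_new))
    return np.mean(predictions, axis=0)  # deterministic averaging step


y_hat = bagged_predict(X[:150], y[:150], X[150:])
print(y_hat[:5])
```

A random forest adds one more ingredient to this scheme: each split considers only a random subset of the features.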

An RF consists of a set of decision trees, and each tree consists of internal nodes and leaves. At each internal node, a selected feature is used to decide how to divide the data set into two separate sets with similar responses within each. The features for internal nodes are selected with some criterion, which for classification tasks can be Gini impurity and for regression is variance reduction (Płoński, 2020). For each feature, one can measure how much it decreases the impurity of a split (the feature with the highest decrease is selected for the internal node) and collect its average impurity decrease. The average over all trees in the forest is the measure of the feature importance. The main advantage of this method is its calculation speed. Tree-based approaches are nonparametric, and they can handle missing values automatically.
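The variance reduction criterion for a regression split can be illustrated as follows. This is a minimal sketch with a hypothetical helper and toy data: the gain of a split is the variance of the parent node minus the size-weighted variance of the two child nodes.

```python
import numpy as np


def variance_reduction(y, left_mask):
    """Parent variance minus the size-weighted variance of the two children."""
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted


# Toy node: the response jumps when the feature x crosses 0.5.
x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([1.0, 1.0, 5.0, 5.0])

gain_good = variance_reduction(y, x <= 0.5)   # separates the two groups
gain_poor = variance_reduction(y, x <= 0.15)  # isolates a single point
print(gain_good, gain_poor)
```

The split with the larger gain would be chosen for the node, and summing such gains per feature is what underlies the impurity-based importance described above.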

RF constructs a group of decision trees in the framework of the random subspace method for efficient modeling. This random selection of attributes can reduce prediction bias or overfitting by excluding attributes that may be highly correlated with each other. Random forests produce a prediction by calculating the mean of the predicted values of these decision trees (Angelov et al., 2020).

Unlike usual linear regression models, the RF is invariant under scaling and various other transformations of the feature values. It is also robust to the inclusion of irrelevant features and produces very accurate predictions. These properties make the RF algorithm especially suitable for the prediction of crimes, given the multicollinearity and nonlinearities present in urban data (Alves et al., 2018; Hastie et al., 2013).

3.5.2 Support vector regression (SVR)

Support vector machine (SVM) is a supervised learning model for classification and regression, and SVR is its regression variant. SVR supports linear and nonlinear regression through the respective kernel functions. Commonly used kernels are the linear kernel, the polynomial kernel, and the radial basis function (RBF), or Gaussian, kernel.

The objective of the SVR algorithm is to find the hyperplane in an n-dimensional space that fits the maximum number of points within a margin of tolerance. The hyperplane is then used to predict the continuous output. The data points on either side of the hyperplane that are closest to it are called support vectors. These influence the position and orientation of the hyperplane and thus help build the SVR (Raj, 2020).

Although less popular than SVM, SVR has been proven to be an effective tool in real-value function estimation. As a supervised learning approach, SVR trains using a symmetrical loss function, which equally penalizes high and low misestimates (Awad & Khanna, 2015).

SVR is a robust model for small training data sets and high-dimensional problems (Alwee et al., 2013; Ding, 2012). It also gives us the flexibility to define how much error is acceptable in the model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data. Unlike ordinary least squares (OLS), the objective function of SVR minimizes the coefficients (more specifically, the l2-norm of the coefficient vector) rather than the squared error (Sharp, 2020).
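The contrast with OLS can be made concrete with the standard textbook form of linear ε-insensitive SVR, given here as a sketch rather than as the exact problem solved by the authors' software. The symbols w (coefficient vector), b (intercept), ε (tolerated error), C (penalty weight), m (number of training samples), and the slack variables ξ_i and ξ_i^* are notation introduced for this illustration.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad
  & \frac{1}{2}\lVert w \rVert_2^{2} + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - w^{\top} x_i - b \le \varepsilon + \xi_i, \\
  & w^{\top} x_i + b - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, m.
\end{aligned}
```

Errors smaller than ε cost nothing, and larger errors are penalized linearly and symmetrically, which matches the symmetrical loss described above.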

SVR model parameters must be set correctly, as they affect regression accuracy: inadequate parameter values can lead to overfitting or underfitting (Alwee et al., 2013; Wu & Jin, 2011). SVR also allows us to calculate the direction and size of the effect of each independent variable on the dependent variable.

3.6 Preprocessing the data

For crime and spatial data to be used as inputs in ML algorithms, the data must first be systematically arranged with a GIS program. Entries with outliers or missing values are not suitable for building a model with ML algorithms, because they increase the model error. In our data set, some variables contained outliers, and these were cleaned. Since the independent variables have different ranges, normalization was also performed. The goal of normalization is to change the values of numerical columns in the dataset to a common scale without distorting differences in the ranges of values (Jaitley, 2018). Min–max normalization was used to rescale the data to values between 0 and 1. Lastly, data shuffling was implemented. Data shuffling changes the order of the data and is crucial for ML algorithms: it reduces variance, so that the model remains general and overfits less (Ratul, 2020).

These processes are generally performed to standardize the data and make the data meaningful. Thus, these processing steps help improve the accuracy of our model. In the model, the process of cleaning outliers and data shuffling was done on the Excel file where the data were recorded. For data normalization, Python library scikit-learn was used.
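The normalization and shuffling steps can be sketched with scikit-learn as follows. The toy matrix is hypothetical, and shuffling is done here in Python for compactness, whereas the text above states that the authors shuffled the data in the Excel file.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import shuffle

# Hypothetical raw data: two variables with very different ranges.
X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 200.0]])
y = np.array([1, 5, 3])

# Min-max normalization: (x - min) / (max - min), column by column.
X_scaled = MinMaxScaler().fit_transform(X)

# Shuffle rows of X and y together so that pairs stay aligned.
X_shuffled, y_shuffled = shuffle(X_scaled, y, random_state=42)
print(X_scaled)
```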

3.7 Model building and validation

After preprocessing the data, two ML models, RFR and SVR, were implemented to predict crime through the existing correlations between crime and urban environmental metrics. RF is a decision tree-based ensemble method that is less prone to overfitting and reduces variance (Chrysafis et al., 2017; Ullah et al., 2022). SVR was chosen because it is a powerful technique that provides effect coefficients for the variables. To implement the RFR and SVR models, we used the Python library scikit-learn.

The k-fold cross-validation (k = 10) method was applied to split the dataset. The training data were used to build the model, and the test data were then used to check for overfitting.
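A 10-fold cross-validation run can be sketched with scikit-learn as below. The synthetic data, the random seeds, and the number of trees are assumptions made for illustration and do not reproduce the study's data set or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the normalized variables and crime counts.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

# k = 10: each fold serves once as test data, the other nine as training data.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
# scikit-learn reports negated MSE so that larger is always better.
mse_scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error")

print(r2_scores.mean(), r2_scores.std(), mse_scores.mean())
```

The standard deviation of the fold scores gives a sense of how stable the model is across different train and test partitions.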

Overfitting and underfitting are common problems that arise when using ML algorithms. They appear when estimating the best trade-off that minimizes the bias and variance errors (Alves et al., 2018; James et al., 2014). Overfitting occurs when a model matches the training data almost perfectly but does poorly on validation and other new data. It can be detected when the error on the testing or validation dataset is much larger than the error on the training dataset. The opposite of overfitting is underfitting: when a model fails to capture important distinctions and patterns in the data, it performs poorly even on the training data (URL-2, n.d.).

Both mean square error (MSE) and R2 metrics were used to evaluate model performance and to determine which model is best for crime prediction. MSE measures the average squared difference between the observed values and the values predicted by the model (URL-3, n.d.). The smaller the MSE, the more powerful the model is. R2 is a statistical measure of how close the data are to the fitted regression line, and it is the percentage of the response variable variation that is explained by the model. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. For a linear least-squares fit, R2 lies between 0% and 100%: 0% indicates that the model does not explain the variability of the response data around its mean; 100% indicates that the model explains all the variability of the response data around its mean (Dass, 2015). The higher the R2, the better the model.
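Both metrics can be written out directly from their usual definitions. The sketch below uses hypothetical observed and predicted values; the function names are illustrative.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted crime counts for four grid cells.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

error = mse(y_true, y_pred)
score = r2(y_true, y_pred)
print(error, score)
```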

Model tuning was applied to obtain the best MSE and R2 values for the established ML models. Each ML algorithm has hyperparameters for model optimization (e.g., max_depth, max_features, min_samples_split, n_estimators for RFR). In this process, candidate hyperparameter values of the RFR and SVR algorithms were analyzed, and the most appropriate value for each parameter was determined. With these parameter values, the prediction model was re-adjusted and the performance of the algorithms increased.
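One common way to carry out such tuning is an exhaustive grid search with cross-validation. The sketch below uses scikit-learn's GridSearchCV as an assumption; the source does not state which search procedure was used, and the candidate values and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.5, size=120)

# Hypothetical candidate values for three RFR hyperparameters.
param_grid = {
    "max_depth": [3, 5],
    "max_features": [2, 4],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # lower MSE means a higher score
)
search.fit(X, y)
best_params = search.best_params_
print(best_params)
```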

According to Table 2, it seems that the RFR model has higher performance than the SVR model. However, there is no significant performance difference between the two algorithms. When the standard deviations obtained as a result of the k-fold cross-validation method are examined, the standard deviation value obtained from the RFR model result is smaller.

TABLE 2. Performance results of RFR and SVR algorithms.
| Algorithm | MSE | R2 | Standard deviation from k-fold |
| --- | --- | --- | --- |
| Random forest regression (RFR) | 0.93 | 0.78 | 6.83 |
| Support vector regression (SVR) | 1.01 | 0.71 | 7.46 |

4 RESULTS

4.1 Predicting crime with the random forest regressor

The random forest model has several main parameters that affect performance, such as max_depth, max_features, min_samples_split, and n_estimators. The best parameters determined for our model are as follows:
RF_model = RandomForestRegressor(random_state=42, max_depth=5, max_features=2, min_samples_split=2, n_estimators=200)

The model was tuned with these parameters, which increased its performance. For our data, the best accuracy (on average) was achieved with these parameters. The prediction error measured by the MSE was 0.93 and R2 was 0.78.

MSE reflects the error arising from the data and the chosen method; a higher MSE generally means larger errors. Because MSE must be judged against the magnitude of the data, and our scaled data range between 1 and 16, these MSE values are acceptable. R2 measures how much of the variability in the dependent variable is explained by the model. According to the R2 value obtained, the independent variables used in the model together explain 78% of the variability in the dependent variable, which is a good rate. By these criteria, the MSE and R2 values are acceptable.

4.1.1 Importance of independent variables

The effect of the variables is important for a better understanding of crime. For the RFR algorithm, we used the physical environmental variables to calculate the importance of each variable in describing the number of burglary crimes; some of them affect burglary more than others.

The importance ranking of the variables was obtained by averaging the importance of each variable over all the trees in the model, as implemented in the Python library scikit-learn.
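A sketch of how such a ranking can be read from a fitted scikit-learn forest is shown below. It uses the hyperparameters reported for the RFR model, but the data are synthetic and the four feature names are a hypothetical subset, so the resulting ranking does not reproduce Figure 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with four hypothetical variables.
rng = np.random.default_rng(0)
feature_names = ["building_height", "floor_area_ratio",
                 "street_density", "park_ratio"]
X = rng.uniform(0, 1, size=(300, 4))
y = 8 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.3, size=300)

rf_model = RandomForestRegressor(
    random_state=42, max_depth=5, max_features=2,
    min_samples_split=2, n_estimators=200,
)
rf_model.fit(X, y)

# feature_importances_ holds the impurity decrease per feature,
# averaged over all trees and normalized to sum to 1.
ranking = sorted(zip(feature_names, rf_model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```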

Figure 4 shows the importance ranking of the variables for the prediction results.

FIGURE 4. Importance ranking of physical environmental variables to describe the burglary crime.

Figure 4 shows that floor area ratio is the most important variable to describe the crime, followed by street density, building height, and distance from the security forces building. The next most important physical environment indicator is mixed-use areas. The least important variable for predicting crime is the ratio of parks to residential areas.

4.2 Predicting crime with the support vector regressor

To build a model with SVR, we used a linear kernel. The resulting model needed some regularization to enhance its performance. In the SVR algorithm, the model can be tuned through the C parameter in Python's scikit-learn library. The C parameter tells the SVR optimization how strongly to penalize training errors, that is, deviations larger than the tolerated error (Patel, 2017). In our model, the best value of C was determined as 3, and the model was tuned accordingly. When the tuned model was checked against the test data, performance had improved. The MSE produced by the model was 1.01 and R2 was 0.71.

An advantage of the SVR algorithm with a linear kernel is that it allows us to calculate the direction (positive or negative) and the size of the effect of each independent variable on the dependent variable. Positive values represent a positive relationship between dependent and independent variables, while negative values represent a negative relationship between these variables.
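With a linear kernel, scikit-learn exposes these effect values through the coef_ attribute of a fitted SVR. The sketch below uses the reported setting C = 3 on synthetic data; the feature names and the data-generating coefficients are hypothetical, so the printed values are unrelated to Table 3.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in data: one positive, one negative, one irrelevant variable.
rng = np.random.default_rng(0)
feature_names = ["building_height", "commercial_ratio", "park_ratio"]
X = rng.uniform(0, 1, size=(200, 3))
y = 6 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

svr_model = SVR(kernel="linear", C=3)
svr_model.fit(X, y)

# coef_ is only available for the linear kernel; its shape is (1, n_features).
coefficients = dict(zip(feature_names, svr_model.coef_.ravel()))
for name, value in coefficients.items():
    print(f"{name}: {value:+.3f}")
```

Because the inputs share a common 0 to 1 scale, the magnitudes of the coefficients can be compared across variables.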

For our model, the effects of the physical environment variables on the burglary crime rate were measured as coefficients. The resulting coefficients are given in Table 3.

TABLE 3. Measured coefficient values of physical environmental variables on burglary crime events.
| Independent variables | Coefficient |
| --- | --- |
| Building height | 6.52177451 |
| Floor area ratio | 2.86206532 |
| Building density | 1.87590613 |
| Ratio of commercial to residential area | −1.60230311 |
| Street intersection density | 1.48300128 |
| Mixed-use | 1.23465870 |
| Security forces buildings | −0.636670054 |
| Military buildings | 0.497082929 |
| Hotel buildings | −0.328899666 |
| Street density | −0.263450615 |
| Public buildings | 0.256535669 |
| Sport facilities | 0.210838250 |
| Health buildings | −0.205440678 |
| Religious buildings | 0.174368778 |
| Gas station | 0.109408105 |
| School buildings | 0.0289340759 |
| Social facilities | 0.0241812297 |
| Parks | 0.0222498595 |
| Industrial buildings | −0.0178987229 |
| Ratio of park to residential area | 0.00604665226 |

According to Table 3, the variable with the greatest effect on the crime rate is building height, which represents the number of building floors. This relationship is positive: as building height increases, burglary crime rates also increase. Our result supports Newman's theory. According to Newman, both height and size affect the crime rate, but building height is more important than project size (building density) (Newman, 1973). He explains that as long as building height remains low, one can still maintain high density (size) and not encounter higher crime rates (p. 28). This explanation implies that building height affects the crime rate more than building density does. As seen in Table 3, the coefficient of building height is 6.52177451, while the coefficient of building density is 1.87590613. Floor area ratio has the second highest coefficient and shows a positive relationship with burglary crime. This result supports earlier work in this field (Moon et al., 2014).

The ratio of commercial area to residential area, which has the fourth largest coefficient in absolute value (−1.60230311), is negatively related to the crime rate: if the ratio of commercial area to residential area increases, the crime rate decreases. In addition, the burglary crime rate was negatively related to street density and, among the land use types, to the distance to the closest security force building, health building, industrial building, and hotel building. These negative relationships are consistent with the study of Sohn (2016b), which revealed that the distance to the nearest police station and street density were significant predictors of residential crime density and had negative relationships.

Street intersection density and mixed-use have the fifth and sixth largest coefficients, and both are positively related to the crime rate. When there are more street turning points, the probability of burglary is higher, because the street design offers easy access and escape routes for burglary offenders. Previous studies also confirmed that street segments and street turning points contribute to crime (Sakip & Mustafa, 2019).

In the environmental criminology literature, there are different opinions about the effect of mixed use on crime events. Some researchers argue that mixed-use will increase the risk of crime (Browning et al., 2010), while others argue that mixed-use provides a safer environment (Bowers & Hirschfield, 1999; Wilcox et al., 2004). According to our result, mixed use, which is created by combining residential uses and commercial uses, has an increasing effect on burglary crime rates.

Among the independent variables used in the study, the variable with the lowest effect on burglary crime is the ratio of park to residential area. When the distances to the closest land use types are examined, their coefficient values are seen to be lower than those of the other variables.

The effect coefficients of environmental factors can give ideas about how cities should be designed. These coefficients will help institutions plan future cities within a safe-city framework.

4.3 Visualizing prediction results

The input data values of the spatial factors affecting the crime events, prepared in the GIS environment, and the effect coefficients obtained as a result of the SVR algorithm were multiplied. Thus, a spatial crime prediction map in grid format based on artificial intelligence was obtained. Figure 5 shows the estimated number of burglary crimes per grid obtained as a result of ML crime prediction analysis.
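The multiplication step can be sketched as a weighted sum per grid cell. The three coefficients below are taken from Table 3, but the cell values are hypothetical, and summing the products into a single score per cell (without an intercept) is an assumption about how the map values were combined.

```python
import numpy as np

# Coefficients from Table 3: building height, floor area ratio, building density.
coef = np.array([6.52177451, 2.86206532, 1.87590613])

# Hypothetical normalized inputs (0 to 1) for three grid cells.
X_cells = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.0, 1.0],
])

# Multiply each input by its coefficient and sum across variables per cell.
scores = X_cells @ coef
print(scores)
```

Each score would then be assigned to its grid cell to draw the prediction map.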

FIGURE 5. Result map of grid values to predict burglary crimes.

4.4 Prediction accuracy index

Prediction accuracy index (PAI) is defined as the percent of crime in the forecasted hot spots divided by the percent of the geographic area forecasted to be a hot spot (Chainey et al., 2008; Drawve & Wooditch, 2019). The PAI is calculated by dividing the hit rate percentage by the area percentage and the PAI equation used is given below (Chainey et al., 2008).
$$\mathrm{PAI} = \frac{\frac{n}{N} \times 100}{\frac{a}{A} \times 100} = \frac{\text{Hit rate}}{\text{Area percentage}}$$
where n is the number of crimes in areas where crimes are predicted to occur; N is the number of crimes in the study area; a is the area of the areas where crimes are predicted to occur; and A is the area of the study area.
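The PAI definition above translates directly into a small function. The numbers in the example are hypothetical and are not taken from the study: 40 of 100 crimes fall inside hot spots that cover 5 of 100 area units.

```python
def prediction_accuracy_index(n, N, a, A):
    """PAI = hit rate (%) divided by area percentage (%)."""
    hit_rate = n / N * 100        # share of crimes inside predicted hot spots
    area_percentage = a / A * 100  # share of the study area flagged as hot spot
    return hit_rate / area_percentage


pai = prediction_accuracy_index(n=40, N=100, a=5.0, A=100.0)
print(pai)
```

A PAI above 1 means that the hot spots capture a larger share of crime than their share of the area.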

Reducing the area to be more representative of where crime could occur will likely change the location of the identified hot spots and, subsequently, the PAI estimate (Drawve & Wooditch, 2019). The greater the number of future crime events in a hotspot area that is small relative to the whole study area, the higher the PAI value (Chainey et al., 2008).

One of the common techniques to identify crime hotspots is kernel density estimation (KDE). In this study, the PAI value was calculated using the KDE technique and the SVR algorithm estimations. Table 4 presents the average PAI value and the standard deviation of the average PAI value for the burglary crime.

TABLE 4. Result value of prediction accuracy index for the burglary crime.
| Datasets | Average PAI | Std. deviation of average PAI |
| --- | --- | --- |
| Predicted values | 9.57 | 4.41 |
| Original values | 8.63 | 3.85 |

5 CONCLUSIONS

Current machine learning-based crime prediction models often cannot determine to what extent each factor contributes to crime, which may reduce confidence in their effectiveness. This study addresses this gap: in addition to applying traditional ML models in the crime prediction and prevention process, it examined physical predictive variables of the environment to construct the predictive model of crime.

Crime data from downtown Trabzon, Turkey, covering the past five years were used in two ML algorithms, SVR and RFR, and their performance in crime prediction modeling was compared. The results indicated that the SVR model performed worse than the RFR model. However, there is no big performance difference between the two algorithms. Of the two regression algorithms, SVR has a functional role in the study, as it directly gives the crime effect coefficients. Since the accuracy of the SVR algorithm depends on the input parameter values, inappropriate parameters directly reduce model performance. Therefore, input data quality and an appropriate data structure are very important for the SVR algorithm. On the other hand, the RFR algorithm can automatically process missing values in the input parameters and is robust against outliers.

This study provided empirical evidence that certain characteristics of the physical environment can contribute to reducing burglary crime. Results show that the building height is the most important variable affecting the crime rate, followed by the ratio of floor area, building density, the ratio between commercial and residential areas, street design, and mixed use. Additionally, the burglary crime rate is negatively related to the density of the streets, the ratio of the commercial area to the residential area and the distance to the closest security force building, health buildings, industrial buildings, and hotel buildings.

The understanding gained from the study includes the notion that well-designed environments are less attractive to potential burglars. In line with this, the study provides relevant institutions with information on how urban planning and environmental design should be carried out to combat burglary more effectively. It thus offers ideas on the factors to which local governments should pay attention in urban planning for peaceful and safe cities.

The use of ML in crime prediction is important because ML algorithms can give quick and reliable results in the decision-making process. Applying ML to identify environmental factors that correlate with crime also helps to determine how important each factor is for describing a crime. In addition, the ML method, one of the new-generation technologies, achieved more effective results than classical statistical methods. Through the results of our study, law enforcement agencies or policymakers can create better strategies and accurate actions for reducing crime.

Results from model analyses can help law enforcement agencies produce data-driven policies and create targeted crime prevention strategies. They can also contribute to a basic vision of what local governments, such as municipalities, should pay attention to in terms of security in urban planning. While this study attempted to examine all available data for predictive modeling of crime based on environmental factors, the set of environmental factors related to crime occurrence can be extended, and other important factors can be combined with them.

This study has some potential limitations, which can be explored in future studies. For example, the data set does not include sociodemographic and economic characteristics of the places where crime occurred. Future studies that include these factors could provide interesting insights into crime prediction. It would also be interesting to examine how the effects of factors vary with different crime types. In addition, the application of other advanced ML models may be investigated in forthcoming studies.

6 ACKNOWLEDGEMENTS

I would like to thank the Trabzon Police Department for their support in obtaining crime data.

7 CONFLICT OF INTEREST STATEMENT

The authors declare that there is no conflict of interest.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from Trabzon Provincial Police Department. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from the author(s) with the permission of Trabzon Provincial Police Department.
