Predicting the crime type using ML for Chicago crime records



Mahesh Matele



1. Introduction


1.1 Background


Crime in Chicago has been tracked by the Chicago Police Department's Bureau of Records since the beginning of the 20th century. The city's overall crime rate, especially the violent crime rate, is higher than the US average. Chicago was responsible for nearly half of 2016's increase in homicides in the US, though the nation's crime rates remain near historic lows.
The reasons for the higher numbers in Chicago remain unclear. An article in The Atlantic detailed how researchers and analysts had come to no real consensus on the cause for the violence.
As part of our research, we explore the data to look for possible reasons behind the increase in crime in Chicago.

For more information, please follow this link.

1.2 Problem:

Crime reporting accuracy

In 2014 and 2015, Chicago Magazine and The Economist conducted investigations into the CompStat data reporting of crime statistics for the city and reported irregularities. In addition, an audit conducted by Chicago's Office of the Inspector General found significant problems in the accuracy of CPD's crime data.
According to Chicago Magazine, superiors often pressure officers to under-report crime. An unnamed police source quoted in the magazine says there are "a million tiny ways to do it," such as misclassifying and downgrading offenses, counting multiple incidents as single events, and discouraging residents from reporting crime. The police department has responded that their statistics are generally accurate and that the discrepancies can be explained by differences in the Uniform Crime Reporting used by the FBI and CompStat.

The problem has to be narrowed down: crime needs to be tackled in the smaller regions of Chicago that see the most incidents.

In this report we identify the regions that consistently contribute to the crime rate and, ultimately, predict the type of crime in a particular ward based on the dataset.

1.3 Interest:
The research will be of interest to stakeholders who would like to open any kind of investigation in a given category of crimes.

2. Data acquisition and cleaning

2.1 Data sources

·       Features for the dataset were taken from the City of Chicago link below.

·       The data comes as 4 comma-separated values (CSV) flat files, namely:
o   Chicago_Crimes_2001_to_2004.csv
    contains approx. 2 million records
o   Chicago_Crimes_2005_to_2007.csv
    contains approx. 2 million records
o   Chicago_Crimes_2008_to_2011.csv
    contains approx. 2.5 million records
o   Chicago_Crimes_2012_to_2017.csv
    contains approx. 1.5 million records

·       Geopy - For getting the co-ordinates of different locations.

2.2 Data cleaning

Data is downloaded from sources including:
·       Geopy
·       The City of Chicago's official crime data site.

The [Primary Type] feature, which represents the type of crime, does not contain any nulls.

Below is a representation of nulls present in the features.




Below is the seaborn heatmap representation of the nulls found in the above table.

This is very resource-intensive, so I will not execute it for a larger number of records.





So, while the research was focused on the [Primary Type] feature, rows with nulls in other columns were not removed, as that would also have removed records carrying potentially useful information about crime types.

Once the research on the [Primary Type] feature was completed, the nulls were removed from the dataset to give an accurate picture of the existing data samples.




3.1 Feature selection

The Chicago crime dataset contains 23 features, as listed below:





4.1 Exploratory Analysis

With the limited processing power available on premises, I plotted a word cloud of the most frequently occurring crimes in Chicago, which looks as below:




You can clearly see that THEFT, BATTERY, CRIMINAL DAMAGE and HOMICIDE (a major crime, ranking 4th) are among the top-ranking crimes in Chicago.

The same can be verified using a horizontal bar plot as below:





The above representation gives us the top crimes. Next, I wanted to explore whether arrests were made for these crimes.
Accordingly, below is a horizontal bar graph of the crimes and the arrests made for them:






So, here you can see that even though THEFT accounts for the largest share of crimes, arrests are made in fewer than 1 in 10 cases: 7.74% to be precise.

This is a concern, since only 7.74% of thefts lead to an arrest.


On the other hand, NARCOTICS shows a remarkable arrest rate.

The ratio of arrests to total narcotics crimes is nearly 100% (99.40%).

HOMICIDE is another serious offense, and its arrest percentage is just 63.01%, which is again a matter of serious concern.
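The arrest percentages quoted above are simply arrests divided by total cases, per crime type. A minimal sketch of that calculation follows; the records are invented and the helper name `arrest_percentage` is hypothetical, so the numbers it prints are not the ones from the dataset.

```python
from collections import Counter

# Invented (primary_type, arrest) pairs standing in for dataset rows.
records = [
    ("THEFT", False), ("THEFT", False), ("THEFT", True), ("THEFT", False),
    ("NARCOTICS", True), ("NARCOTICS", True),
    ("HOMICIDE", True), ("HOMICIDE", False),
]


def arrest_percentage(rows):
    """Return {crime type: percentage of records that show an arrest}."""
    totals = Counter(crime for crime, _ in rows)
    arrested = Counter(crime for crime, arrest in rows if arrest)
    return {crime: 100.0 * arrested[crime] / totals[crime] for crime in totals}


rates = arrest_percentage(records)
# Print the crime types with the lowest arrest percentage first.
for crime, pct in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{crime}: {pct:.2f}% of cases show an arrest")
```

The same function works for any grouping key, for example the location description instead of the crime type.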

For further analysis, we will concentrate on HOMICIDE as it’s a serious offense.

So, to start off with, where were these HOMICIDE crimes committed?

Below is a horizontal bar graph of the HOMICIDE crimes and the locations where they were committed.

It also shows the number of arrests made for the crimes committed at each location.

 


So, we can clearly see that the maximum number of homicides happened in the STREET and the fewest in the STAIRWELL, where the arrest percentage is 100%.

The other information that the horizontal bar graph gives is the arrests made for homicides committed at these locations.


The STREET location has the highest number of homicides but an arrest percentage of only 64%, whereas homicides committed in an APARTMENT have an 80.72% arrest percentage.

The HOUSE location shows a 90% arrest percentage.

So, we can see that an arrest is more likely if the homicide is committed indoors or in enclosed premises.

Some examples are in the below screenshot:



For our further analysis of HOMICIDE, we will consider the top crime locations where the arrest percentages are the lowest: STREET and AUTO, at 64% and 48% respectively.



What are the location coordinates of the streets and autos where these homicides were committed?



Below is the map representation of the location of these crimes.
The green circles are locations where the crime was committed and an arrest was made; the red circles, on the other hand, are locations where the crime was committed and no arrest was made.









The above representation is interactive, and you can click on the circles to get the location information.
For presentation purposes, I have clicked on one of the green circles, which indicates that an arrest was made for a homicide committed there.
The address here is:

041XX N LECLAIRE AVE, HARWOOD HEIGHTS, COOK COUNTY

But the above representation holds so much information that it can be a bit overwhelming, so I have grouped the crimes by location, which gives a clearer picture.





So, from the above, these are the top 5 locations where HOMICIDES were committed and arrests were also made:
    Cicero
    Whiting
    Calumet Park
    Oak Park
    Chicago

The above ranking also shows that, of the total homicides committed in these locations, those with an arrest outnumber those without.

So, we can say that when it comes to homicides, the Cicero police investigation squad is doing a good job, whereas others need to match up to that bar.

Now let us investigate at what time of the day these homicides were committed.

For simplicity, we have grouped the hours of the day, as per the 24-hour clock, into the following sections:

If the hour is 1, 2, 3 or 4, then it is 'Deep Night'
If the hour is 5, 6 or 7, then it is 'Early morning'
If the hour is 8, 9, 10, 11 or 12, then it is 'Morning'
If the hour is 13, 14, 15 or 16, then it is 'After Noon'
If the hour is 17, 18 or 19, then it is 'Evening'
If the hour is 20, 21, 22, 23 or 24, then it is 'Night'
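The grouping above can be written as a small lookup function. This is a sketch rather than the original notebook code: the function name `time_of_day` is hypothetical, and it assumes hours come from a Python datetime (0 to 23), so the "24" in the list is treated as hour 0 (midnight). The timestamp format in the usage lines is also an assumption about how the date column is stored.

```python
from datetime import datetime


def time_of_day(hour):
    """Map an hour of the day (0-23) to the section names used in the text."""
    if hour in (1, 2, 3, 4):
        return "Deep Night"
    if hour in (5, 6, 7):
        return "Early morning"
    if hour in (8, 9, 10, 11, 12):
        return "Morning"
    if hour in (13, 14, 15, 16):
        return "After Noon"
    if hour in (17, 18, 19):
        return "Evening"
    if hour in (20, 21, 22, 23, 0):
        # Assumption: datetime hours run 0-23, so the "24" in the list
        # is treated here as hour 0 (midnight).
        return "Night"
    raise ValueError(f"hour must be in 0..23, got {hour}")


# Usage: parse a timestamp (format assumed for illustration) and bucket it.
stamp = datetime.strptime("01/15/2003 02:30:00 AM", "%m/%d/%Y %I:%M:%S %p")
print(time_of_day(stamp.hour))
```

Raising an error for out-of-range hours makes it obvious if a timestamp was parsed incorrectly, instead of silently dropping the record from the chart.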

Below is a bar chart representation of the crimes happening at different times of the day.




As can be seen, most of the homicides happened in the ‘Deep Night’, i.e. between 1 and 4 AM.

Surprisingly, morning (hours 8 to 12) ranks 2nd for homicides committed, ahead of night (hours 20 to 24), which ranks 3rd.

So, what is the percentage of homicides committed in the morning hours in the street?



18.6% of the homicides committed happen in the streets in broad daylight, between 8 AM and 12 noon.
Arrests were made for only 9.9 percentage points of this 18.6%, which is roughly half (9.9 / 18.6 ≈ 53%), so the offenders in the remaining half of these cases are still on the streets.

So, previously we saw that most of the HOMICIDES are committed on the street and in autos, and now we see that the second rank goes to those committed in the morning.

So, many homicides are committed in broad daylight in the morning!


So, the question arises: which is the safest hour of the day in Chicago, when the probability of any crime happening is the lowest?



As seen above, 6 AM, at dawn, is the safest hour of the day, when the probability of any crime happening is the lowest. This gives us a behavioral hint that thieves are not early risers.

Now, looking at crimes in general, if we select the top 6 crimes by count across all the crimes committed in Chicago, we can see below how they are distributed over the day.



The above chart shows that THEFT is the most common crime, followed by CRIMINAL DAMAGE.
Both of them show a big spike in cases between 10 AM and 2 PM.
THEFT also shows a spike at around 9 AM.

Is the same gang active between 10 AM and 2 PM and committing these crimes, or are there simply more gullible people out at that time?



The above chart shows the crimes committed as per the locations.
Most of the crimes committed are in RESIDENCES followed by APARTMENT and then STREET.

Again, the spike is seen in the morning and afternoon.
Somehow, in the afternoon, as the crimes committed in the RESIDENCE decrease, those in the STREET increase. That is an inverse relationship, and maybe it is the same gang that was committing crimes in the RESIDENCES and then moved to the STREETS.

Next, we will see which week of the month sees more crimes.



As observed above, week 1 sees more crimes than any other week, across all the months of the year.

Maybe this is because people get their salaries and have more money at the beginning of the month.

Consequently, we would expect more crimes in the first 7 days, which make up the first week of the month.
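Treating days 1 to 7 as the first week, the week of the month can be derived directly from the day number. The sketch below shows one way to do it; the helper name `week_of_month` and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date


def week_of_month(d):
    """Days 1-7 -> week 1, 8-14 -> week 2, ..., 29-31 -> week 5."""
    return (d.day - 1) // 7 + 1


# Invented crime dates, purely for illustration.
crime_dates = [
    date(2003, 1, 1), date(2003, 1, 5), date(2003, 1, 7),
    date(2003, 1, 8), date(2003, 2, 15), date(2003, 3, 31),
]

# Number of crimes per week of the month, pooled over all months.
per_week = Counter(week_of_month(d) for d in crime_dates)
for week in sorted(per_week):
    print(f"week {week}: {per_week[week]} crimes")
```

Note that with this definition week 5 only covers days 29 to 31, so it is shorter than the other weeks and its raw counts are not directly comparable.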


This can be seen in the above chart. From the 1st to the 5th day of the month, the crime rate is at its highest.

From a year’s perspective, Chicago doesn’t seem to have a happy new year, as the number of crimes rises sharply in January, the first month of the year, as seen below:

Seen across the years from 2001 to 2004, 2001 had a very high rise, which continued into 2002, then steadily declined in 2003, followed by a very low rise in 2004.

Let us look at the same statistics, but split into crimes that showed an arrest and crimes that did not.

Week-wise
As we saw earlier, the first week has the highest number of crimes, but the arrests made do not keep pace with the crimes, which affects the safety index of the city.


Within a week, Monday starts with a high rate of crimes, which gradually decreases so that the week ends on a low.


The arrests for these crimes show a huge gap: on Monday, just ¼ (25%) of the crimes have arrests.
Somehow, this doesn’t give a good picture of the crime control squad.

From the daily arrests perspective, as we saw earlier, the first few days of the month show a high number of crimes. The gap between crimes and arrests also seems to narrow as the days of the month pass.

But nowhere do we see the number of arrests showing a positive rise, which is a matter of concern.


A similar picture emerges across the months of the year.

Again, when the ratio of crimes to arrests is viewed across the 4 years from 2001 to 2004, it seems to simply track the number of crimes committed, whereas it should be independent of, or smaller than, the number of crimes committed.

Maybe it’s just my observation, but the crime control squad seems to have worked only up to a threshold of 25% arrests.

5. Predictive Modeling

As part of predictive modeling, we will be predicting the [PRIMARY TYPE] or the type of crime that can happen.

5.1. Prepare Data

5.1.1 Data Cleaning

As part of data cleaning, we removed any records containing nulls to get a clean prediction.

Below are the columns with the number of nulls:
There are different methods with which you can repopulate nulls, but as these were places and coordinates we could not fill them with means, as that would have given wrong predictions on the test data.

5.1.2 Data Transforms


As part of the data transformations, we converted all the string values to numerical values, using a dictionary to store a numeric equivalent for each unique string.

After getting the numerical equivalents of the strings, we replaced them in the data frame.

This way, the values can be mapped back to the original strings once the prediction is done.

Binary columns holding true or false were simply converted to the integer datatype.


At the end, I removed the features which wouldn’t have helped with the prediction and was left with the below:
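The dictionary-based encoding described above might look like the following sketch. The helper name `build_encoding` and the sample values are hypothetical; the point is only that each unique string gets its own integer and that the mapping can be inverted after prediction.

```python
def build_encoding(values):
    """Assign a distinct integer to every unique string, in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Invented column values for illustration.
primary_types = ["THEFT", "BATTERY", "THEFT", "HOMICIDE", "BATTERY"]
arrests = [False, True, False, True, True]

# Forward mapping: string -> integer code.
encoding = build_encoding(primary_types)
encoded = [encoding[value] for value in primary_types]

# Reverse mapping: integer code -> original string, used after prediction.
decoding = {code: label for label, code in encoding.items()}
decoded = [decoding[code] for code in encoded]

# Boolean columns only need a cast to integers.
arrest_ints = [int(flag) for flag in arrests]

print(encoding)
print(encoded)
print(arrest_ints)
```

One caveat worth keeping in mind: integer codes impose an arbitrary order on categories, which tree-based models tolerate better than distance-based or linear ones.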

5.2 Feature Selection

From the feature selection point of view, I first checked the correlation between the features.
Secondly, I used the SelectKBest algorithm to rank the features.
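The two steps can be sketched with scikit-learn on synthetic data. The score function is not named in this post, so `f_classif` here is an assumption, as are the feature names and the generated data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature tied to the class label, two pure-noise features.
rng = np.random.default_rng(0)
n_samples = 300
y = rng.integers(0, 3, size=n_samples)
informative = y + rng.normal(scale=0.3, size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)
X = np.column_stack([noise_a, informative, noise_b])
feature_names = ["noise_a", "informative", "noise_b"]

# Step 1: pairwise correlation between the features (columns of X).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Step 2: score every feature against the target and keep the best k.
# f_classif (ANOVA F-value) is an assumed choice of score function.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
ranking = sorted(
    zip(feature_names, selector.scores_), key=lambda pair: pair[1], reverse=True
)
for name, score in ranking:
    print(f"{name}: F = {score:.1f}")
```

The correlation matrix helps spot redundant features, while the SelectKBest scores rank each feature by how well it separates the classes on its own.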
5.3 Evaluate Algorithms

As this is a classification problem, we used the classification algorithms to find out the best prediction solution:

Below is a box plot representation of the above algorithms:

As can be seen, CART is performing very well, with almost 99% accuracy.


But could any other algorithm perform better than, or on par with, CART?

To find out, we applied StandardScaler to all the fields and then evaluated the algorithms again. Below are the results.
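A comparison of this kind can be sketched as below, using the six algorithms named in this section. The data is synthetic, and the 5-fold cross-validation and hyperparameters are assumptions, so the scores it prints say nothing about the crime dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded crime features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    # The scaler lives inside the pipeline, so it is fitted on the training
    # folds only and never sees the held-out fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Scaling typically matters for distance-based and linear models, while a decision tree splits on thresholds and is largely unaffected by it, which fits the pattern reported here.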


We could see an improvement in accuracy for:
·       Logistic Regression
·       Linear Discriminant Analysis
·       KNeighbors Classifier
·       Naïve Bayes
·       Support Vector Classifier

But the best performer was still CART, with 99% accuracy:

5.4 Finalize Model

5.4.1 Predictions on the validation dataset

As we are using CART, we have to find the best maximum depth for the decision tree, using the entropy criterion.

So, I checked the accuracy for different values of max_depth.
Below is the line plot for the same.


As can be seen, the best accuracy is reached from a max_depth of 6 onwards.

So, I trained the model with max_depth set to 6 and criterion set to entropy.

Testing the model with these settings gave an accuracy score of 98.42%.
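The depth search can be sketched as a simple loop over candidate max_depth values. The data here is synthetic and the 80/20 split and depth range are assumptions, so the best depth it reports will not necessarily be 6.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared crime features (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one entropy-based tree per depth and record validation accuracy.
depth_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    tree.fit(X_train, y_train)
    depth_scores[depth] = accuracy_score(y_val, tree.predict(X_val))

# max() returns the smallest depth among ties, i.e. the simplest best tree.
best_depth = max(depth_scores, key=depth_scores.get)
print(f"best max_depth: {best_depth}, accuracy: {depth_scores[best_depth]:.4f}")
```

When several depths reach the same accuracy, picking the smallest one keeps the tree simpler and less prone to overfitting.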

Below is a confusion matrix representation of the same:
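For readers who want to build such a matrix themselves, here is a minimal sketch with scikit-learn. The true and predicted labels are invented, so the counts are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted crime types, for illustration only.
labels = ["BATTERY", "HOMICIDE", "THEFT"]
y_true = ["THEFT", "THEFT", "BATTERY", "HOMICIDE", "BATTERY", "THEFT"]
y_pred = ["THEFT", "BATTERY", "BATTERY", "HOMICIDE", "BATTERY", "THEFT"]

# Rows are the true classes, columns the predicted classes, in `labels` order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)

# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy = matrix.trace() / matrix.sum()
print(f"accuracy: {accuracy:.4f}")
```

Off-diagonal cells show which crime types get confused with each other, which a single accuracy figure hides.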

6. Conclusion
From the above analysis, we were able to conclude the following:
  • THEFT, BATTERY, CRIMINAL DAMAGE and HOMICIDE (a major crime, ranking 4th) are among the top-ranking crimes in Chicago.
  • THEFT accounts for the largest share of crimes, yet arrests are made in fewer than 1 in 10 cases: 7.74% to be precise. This is a concern, since only 7.74% of thefts lead to an arrest.
  • NARCOTICS, on the other hand, shows a remarkable arrest rate. The ratio of arrests to total narcotics crimes is nearly 100% (99.40%).
  • HOMICIDE is another serious offense, and its arrest percentage is just 63.01%, which is again a matter of serious concern.
  • The maximum number of homicides happened in the STREET and the fewest in the STAIRWELL, where the arrest percentage is 100%.
  • The STREET location has the highest number of homicides but an arrest percentage of only 64%, whereas homicides committed in an APARTMENT have an 80.72% arrest percentage.
  • The HOUSE location shows a 90% arrest percentage.
  • So, an arrest is more likely if the homicide is committed indoors or in enclosed premises.
  • For HOMICIDE, the top crime locations where the arrest percentages are the lowest are STREET and AUTO, at 64% and 48% respectively.
  • Top 5 locations where HOMICIDES were committed and arrests were also made:
1. Cicero
2. Whiting
3. Calumet Park
4. Oak Park
5. Chicago
So, we can say that when it comes to homicides, the Cicero police investigation squad is doing a good job, whereas others need to match up to that bar.
  • Most of the homicides happened in the ‘Deep Night’, i.e. between 1 and 4 AM.
  • Surprisingly, morning (hours 8 to 12) ranks 2nd for homicides committed, ahead of night (hours 20 to 24), which ranks 3rd.
  • 18.6% of the homicides committed happen in the streets in broad daylight, between 8 AM and 12 noon. Arrests were made for only 9.9 percentage points of this 18.6%, which is roughly half (9.9 / 18.6 ≈ 53%), so the offenders in the remaining half of these cases are still on the streets.
  • Many homicides are committed in broad daylight in the morning!
  • 6 AM, at dawn, is the safest hour of the day, when the probability of any crime happening is the lowest. This gives us a behavioral hint that thieves are not early risers.
  • THEFT is the most common crime, followed by CRIMINAL DAMAGE.
  • Both of them show a big spike in cases between 10 AM and 2 PM.
  • THEFT also shows a spike at around 9 AM.
  • Is the same gang active between 10 AM and 2 PM and committing these crimes, or are there simply more gullible people out at that time?
  • Most of the crimes committed are in RESIDENCES followed by APARTMENT and then STREET.
  • Somehow, in the afternoon, as the crimes committed in the RESIDENCE decrease, those in the STREET increase. That is an inverse relationship, and maybe it is the same gang that was committing crimes in the RESIDENCES and then moved to the STREETS.
  • Week 1 sees more crimes than any other week, across all the months of the year. Maybe this is because people get their salaries and have more money at the beginning of the month.
  • From the 1st to the 5th day of the month, the crime rate is at its highest.
  • From a year’s perspective, Chicago doesn’t seem to have a happy new year, as the number of crimes rises sharply in January, the first month of the year.
  • Seen across the years from 2001 to 2004, 2001 had a very high rise, which continued to increase into 2002, then steadily declined in 2003, followed by a very low rise in 2004.
  • Week-wise, the first week has the highest number of crimes, but the arrests made do not keep pace with the crimes, which affects the safety index of the city.
  • Within a week, Monday starts with a high rate of crimes, which gradually decreases so that the week ends on a low.
  • The arrests for these crimes show a huge gap: on Monday, just ¼ (25%) of the crimes have arrests.
  • Somehow, this doesn’t give a good picture of the crime control squad.
  • From the daily arrests perspective, the first few days of the month show a high number of crimes. The gap between crimes and arrests also seems to narrow as the days pass.
  • When the ratio of crimes to arrests is viewed across the 4 years from 2001 to 2004, the gap between crimes with and without arrests seems to track the total number of crimes committed, whereas it should be independent of, or smaller than, the number of crimes committed.
  • Maybe it’s just my observation, but the crime control squad seems to have worked only up to a threshold of 25% arrests.
