ML for Coursera capstone project


Predicting where to open a Thai restaurant in the town lands of Dublin county, Ireland.

Mahesh Matele

26th June 2020

1. Introduction

1.1 Background

Dublin is the most populous city in Ireland. The list of town lands referenced below gives a general idea of the town lands spread across Dublin county.
These town lands have many entertainment venues, and the successful ones generate an immense amount of business.
One of the major leisure activities for the people of Dublin is visiting an eatery, and exotic dishes in particular are very much in demand.
As part of this research we concentrate on the following town lands of Dublin county, taken from that list:
Athgoe, Ballsbridge, Ballygall, Beggars Bush, Clonskeagh, Merrion Gates, Poppintree, Priorswood, Ringsend, Roebuck, Sandymount.

The idea of opening a Thai restaurant was chosen after looking at the different kinds of venues available in the Dublin town lands. The same approach can be applied to any other venue category; for this analysis we concentrate on Thai restaurants.

1.2 Problem
To open a restaurant we need a good idea of the categories of food venues already available in these town lands.
Opening a Thai restaurant in a populous place that already has other Thai restaurants will only divide the profit and increase the competition; standing out against that competition would then require additional investment.
Therefore, we have to find an appropriate place for opening a Thai restaurant.

1.3 Interest
The research will be of interest to stakeholders who would like to open a venue in a given restaurant category.

2. Data acquisition and cleaning

2.1 Data sources

·       Wikimedia, to get the base list of town lands used for the analysis.
This is the data source used for fetching the information about the town lands of Dublin county.

·       Foursquare API, used to:
o   fetch the venues
o   explore the areas to fetch the venues in the surrounding areas
o   get the list of Thai restaurants in the Dublin county town lands
·       Geopy, for getting the coordinates of the different locations.

2.2 Data cleaning

Data is downloaded or scraped from multiple sources, including:
Geopy, Foursquare API, Wikimedia.

As the data was fetched from an API, it was received in JSON format rather than as a simple flat file such as a CSV.
This data is updated from time to time, so the data received may differ between runs, and each run of the code may generate different prediction results.
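
Going from nested JSON to flat rows can be sketched as follows. This is a minimal illustration, not the code used for this post: the sample payload, the venue names and the field names (`response`, `groups`, `items`, `venue`, `categories`, `location`) are assumptions and should be checked against the response the API actually returns.

```python
import json

# Hypothetical sample payload. The nesting and field names are assumptions
# made for illustration; verify them against the real API response.
SAMPLE = json.loads("""
{"response": {"groups": [{"items": [
  {"venue": {"name": "Example Thai", "categories": [{"name": "Thai Restaurant"}],
             "location": {"lat": 53.33, "lng": -6.23}}},
  {"venue": {"name": "Example Cafe", "categories": [{"name": "Cafe"}],
             "location": {"lat": 53.34, "lng": -6.22}}}
]}]}}
""")


def flatten_venues(payload, town_land):
    """Turn the nested venue JSON into a list of flat row dictionaries."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            # Fall back to an empty category when none is listed.
            categories = venue.get("categories") or [{}]
            location = venue.get("location", {})
            rows.append({
                "town_land": town_land,
                "venue": venue.get("name"),
                "category": categories[0].get("name"),
                "lat": location.get("lat"),
                "lng": location.get("lng"),
            })
    return rows


rows = flatten_venues(SAMPLE, "Ballsbridge")
for row in rows:
    print(row)
```

Each row can then be written to a CSV or loaded into a data frame like any other flat record.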

There were some issues with the data fetched from Wikimedia. Some of the town land names had the county name attached to them, whereas others did not, as pictured below.

The names were cleansed into a standard format by clipping the county name, as below:
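
A cleanup of this kind might look like the sketch below. The exact form in which the county name appeared in the scraped data is not shown in this post, so the suffix pattern (a trailing ", County Dublin", ", Co. Dublin" or ", Dublin") is an assumption, and `strip_county` is a hypothetical helper name.

```python
import re

# Assumption: the county appears as a trailing qualifier such as
# ", County Dublin" or ", Dublin". The real strings may differ.
COUNTY_SUFFIX = re.compile(r",\s*(?:County\s+|Co\.\s*)?Dublin\s*$", re.IGNORECASE)


def strip_county(name):
    """Remove a trailing county qualifier and surrounding whitespace."""
    return COUNTY_SUFFIX.sub("", name).strip()


raw_names = ["Ballsbridge, County Dublin", "Ringsend", "Roebuck, Dublin "]
clean_names = [strip_county(n) for n in raw_names]
print(clean_names)
```

Names without the suffix pass through unchanged, so the function can be applied to the whole column.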



After the data is obtained from the above sources, it is enriched at every step with coordinates and other attributes of its venues.

Data quality is verified with a null check. Null values are replaced with the mean of their column, so that the filled values do not inflate the column's standard deviation.
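
Mean imputation can be sketched in plain Python as below. The rows and column names are made up for illustration, and `fill_missing_with_mean` is a hypothetical helper; a data frame library would normally do the same in one call.

```python
from statistics import mean


def fill_missing_with_mean(rows, columns):
    """Return copies of rows where None in the given columns becomes the column mean."""
    means = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        means[col] = mean(present) if present else None
    filled = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows stay untouched
        for col in columns:
            if new_row[col] is None:
                new_row[col] = means[col]
        filled.append(new_row)
    return filled


# Hypothetical sample rows with two missing coordinates.
rows = [
    {"town_land": "A", "lat": 53.30, "lng": -6.20},
    {"town_land": "B", "lat": None, "lng": -6.30},
    {"town_land": "C", "lat": 53.40, "lng": None},
]
filled = fill_missing_with_mean(rows, ["lat", "lng"])
print(filled)
```

Because every gap receives the column mean, the mean itself is unchanged and no new extreme values are introduced.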
 
After the data has been sliced in several ways for verification, all the categories are one-hot encoded to convert them into numbers, as below:
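
One-hot encoding turns each category into its own 0/1 column. The sketch below shows the idea on a made-up list of venue categories; `one_hot` is a hypothetical helper written for illustration.

```python
def one_hot(values):
    """Return (sorted category list, one 0/1 vector per input value)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column of this value's category
        encoded.append(row)
    return categories, encoded


venue_categories = ["Cafe", "Pub", "Thai Restaurant", "Cafe"]
columns, encoded = one_hot(venue_categories)
print(columns)
for row in encoded:
    print(row)
```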
 

3.1 Feature selection

After the data cleansing, the data set had 370 samples and was limited to 35 attributes.
The surrounding venues were limited to just 30 categories.
The samples thus generated were limited to the town lands in Dublin county and the categories of their venues.
Since the objective of this exercise is clear, we chose the town land in which a Thai restaurant business could be opened as the target variable.

Below is the histogram of all the numerical attributes once the data was available for processing. No linear or polynomial relationship between the attributes was apparent.

The data obtained from the web scraping is used to create a word cloud, which depicts the top venue categories in the town lands, as below:


As the word cloud clearly shows, cafes, restaurants and pubs lead in terms of density, while the Thai venues (underlined in green) appear in a very small font, signifying their lesser presence around the town lands.

3.2 Clustering similar town lands and plotting them

Below, all 10 town lands of Dublin are plotted on a folium map.



For each town land, the mean presence of Thai restaurants among its 30 surrounding venues is calculated and plotted on the folium map to show how intense the Thai restaurant presence is in the area, as below:
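
The per-town-land mean described here is the share of a town land's venues that fall in the Thai restaurant category. A minimal sketch, using made-up town land names and venues and a hypothetical helper `category_share`:

```python
from collections import defaultdict


def category_share(rows, category):
    """Map each town land to the fraction of its venues that belong to `category`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for town_land, venue_category in rows:
        totals[town_land] += 1
        if venue_category == category:
            hits[town_land] += 1
    return {town: hits[town] / totals[town] for town in totals}


# Hypothetical (town land, venue category) pairs.
rows = [
    ("Town A", "Cafe"), ("Town A", "Thai Restaurant"),
    ("Town A", "Pub"), ("Town A", "Thai Restaurant"),
    ("Town B", "Cafe"), ("Town B", "Pub"),
]
share = category_share(rows, "Thai Restaurant")
print(share)
```

This gives the same number as averaging the one-hot "Thai Restaurant" column within each town land.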

 




Once all the town lands are mapped to their respective locations, they are clustered by similarity using the K-Means clustering algorithm, which is an unsupervised machine learning algorithm.

As a first step, an arbitrary K of 4 is used to cluster the similar town lands, as below:
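
To make the clustering step concrete, here is a dependency-free sketch of K-Means (Lloyd's algorithm) with K = 4. It is not the code behind the map in this post: the feature vectors are invented venue-category shares, and `k_means` is a hypothetical helper; in practice a library implementation would typically be used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    return labels, centroids


# Hypothetical features per town land: (cafe share, pub share, Thai share).
points = [
    (0.6, 0.4, 0.0), (0.5, 0.5, 0.0),
    (0.2, 0.2, 0.6), (0.1, 0.3, 0.6),
    (0.4, 0.3, 0.3), (0.3, 0.4, 0.3),
    (0.9, 0.1, 0.0), (0.8, 0.1, 0.1),
]
labels, centroids = k_means(points, k=4)
print(labels)
```

The result depends on the random starting centroids, so different seeds can produce different groupings; library implementations usually rerun the algorithm several times and keep the best result.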




From the above clustering we see that cluster-1 (red rings) has no existing Thai restaurants, and we expect the prediction to be a subset of this list of venues.

Cluster-4 (yellow rings) has the highest presence of Thai restaurants.

Cluster-2 and cluster-3 (green and blue rings), located in the less outlying parts of Dublin county, have a moderate presence of Thai restaurants.


4. Predictive Modeling

Predictive modeling is done using the K-nearest neighbors (KNN) algorithm, a supervised ML algorithm that can be used for both classification and regression problems. In industry, however, it is mainly used for classification.
KNN uses ‘feature similarity’ to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.

In our example, the data points are the 30 surrounding venues of the town lands. The algorithm uses these data points to determine how similar the town lands are.
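
The idea of assigning a value by feature similarity can be sketched as a small KNN classifier. The two numeric features, the labels and the helper name `knn_predict` are all assumptions made for illustration, not the data or code used in this post.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k training points closest to query."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training set: two numeric features per town land and the
# venue category assumed to suit it.
train_x = [(0.10, 0.80), (0.20, 0.90), (0.15, 0.85),
           (0.90, 0.10), (0.80, 0.20), (0.85, 0.15)]
train_y = ["Thai Restaurant"] * 3 + ["Cafe"] * 3

print(knn_predict(train_x, train_y, (0.20, 0.80), k=3))
print(knn_predict(train_x, train_y, (0.90, 0.20), k=3))
```

A query point is labelled like the training points it sits closest to, which is why the features need to be numeric and on comparable scales.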

We determined the K value using the elbow method, as shown in the graph below:
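
One common way to obtain such a curve is to compute an error rate for each candidate K and look for the point where it stops improving. The sketch below does this with leave-one-out evaluation on made-up data; the evaluation scheme, the data and the helper names are assumptions, and the graph in this post may have been produced differently.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def leave_one_out_error(xs, ys, k):
    """Fraction of points misclassified when each is predicted from all the others."""
    errors = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) != ys[i]:
            errors += 1
    return errors / len(xs)


# Hypothetical data: two well-separated groups of three points each.
xs = [(0.10, 0.80), (0.20, 0.90), (0.15, 0.85),
      (0.90, 0.10), (0.80, 0.20), (0.85, 0.15)]
ys = ["Thai Restaurant"] * 3 + ["Cafe"] * 3

error_by_k = {k: leave_one_out_error(xs, ys, k) for k in range(1, 6)}
for k, err in error_by_k.items():
    print(k, err)
```

Plotting the error against K gives the curve from which a value is read off; on this toy data small K values classify every point correctly, while K = 5 lets the other group outvote each point's own neighbours.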

Taking K as 8, I created the training and test data.
As all the values must be numerical, the categories were assigned numerical values, as shown below:


Finally, a prediction is made for every town land of the business that would be beneficial there, including the Thai restaurant, as shown below:




Below, the Thai restaurant predictions are plotted on the folium map:


5. Conclusion

From the above analysis we can infer that cluster-1 (red rings) has no existing Thai restaurants, and we expect the prediction to be a subset of this list of venues.

Cluster-4 (yellow rings) has the highest presence of Thai restaurants.

Cluster-2 and cluster-3 (green and blue rings), located in the less outlying parts of Dublin county, have a moderate presence of Thai restaurants.

This analysis presents a great opportunity for entrepreneurs to tap into the unused potential of pockets of Dublin county by opening Thai restaurants.

It is also evident that cluster-4 has very high competition when it comes to opening a Thai restaurant, so investors should avoid investing in this area.

Investors with unique selling propositions that can stand out from the moderate competition in cluster-3 and cluster-2, and who can take moderate risk, can attract the customers already visiting the localities of these clusters because of the existing Thai restaurants.

