ML for Coursera capstone project
Predicting where to open a Thai restaurant in the town lands
of Dublin county, Ireland.
Mahesh Matele
26th June 2020
1. Introduction
1.1 Background
Dublin is the most populous city in Ireland. Below is a list of town lands
spread across Dublin county, which gives a general idea of each of them.
These town lands have many entertainment venues,
which generate a large amount of business when they succeed.
One of the major leisure activities for the people of
Dublin is eating out, and exotic cuisines are very much in
demand.
In this research we concentrate on the following
town lands of Dublin county, taken from the list linked above:
Athgoe, Ballsbridge, Ballygall, Beggars Bush, Clonskeagh, Merrion Gates,
Poppintree, Priorswood, Ringsend, Roebuck, Sandymount.
The idea of opening a Thai restaurant was chosen
after looking at the different kinds of venues available in the Dublin
town lands. The same approach can be tried with any other venue category;
for this analysis we concentrate on Thai restaurants.
1.2 Problem
To open a restaurant we need a good idea
of the categories of food venues already available in these town lands.
Opening a Thai restaurant in a populous place where
there are already other Thai restaurants will only divide the profit and
increase the competition, and additional monetary
investment will then be required to keep up with that competition.
Therefore, we have to find an appropriate place
for opening a Thai restaurant.
1.3 Interest
The research will be of interest to stakeholders who
would like to open a venue in any given restaurant category.
2. Data acquisition and cleaning
2.1 Data sources
· Wikimedia: the data source for the base list of town lands of Dublin county used for the analysis.
· Foursquare API, used to:
o fetch the venues
o explore each area to fetch the venues in its surroundings
o get the list of Thai restaurants in the Dublin county town lands
· Geopy: used to get the coordinates of the different locations.
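As a rough illustration of the Foursquare step, the sketch below builds a request URL and flattens a response into rows. It assumes the legacy v2 `venues/explore` endpoint and its usual `response.groups[].items[].venue` layout; the helper names, the radius, the version date, the credentials and the sample payload are all hypothetical and are not taken from the original notebook.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 "explore" call.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=500, limit=30, version="20200601"):
    """Build the query URL for venues around one town land (parameters are assumptions)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,  # the report works with 30 surrounding venues
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(payload):
    """Flatten an explore-style JSON payload into (name, category, lat, lng) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append((
                venue.get("name"),
                categories[0].get("name"),
                venue["location"]["lat"],
                venue["location"]["lng"],
            ))
    return rows


# Hypothetical payload with a single venue, shaped like an explore response.
sample = {
    "response": {"groups": [{"items": [{"venue": {
        "name": "Example Thai Kitchen",
        "categories": [{"name": "Thai Restaurant"}],
        "location": {"lat": 53.33, "lng": -6.23},
    }}]}]}
}

url = build_explore_url(53.33, -6.23, "MY_ID", "MY_SECRET")
venues = extract_venues(sample)
```

Keeping the parsing separate from the HTTP call makes it possible to check the flattening logic against a saved JSON response without calling the API.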
2.2 Data cleaning
The data is downloaded or scraped from multiple sources:
Geopy, the Foursquare API and Wikimedia.
Because the data was retrieved from an API, it
arrived in JSON format rather than as a simple flat file such as a CSV
file.
The source data is updated from time to time, so the data
received may differ between runs, and every run of the code may generate different
results for the prediction.
There were some issues with the data fetched
from Wikimedia. Some of the town lands had the county name
attached to them, whereas others did not. The pictorial
representation of this is as below.
This was cleansed into a standard
format by clipping the county name, as below:
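A minimal sketch of this kind of cleanup is given here. The exact form of the county suffix in the scraped names is an assumption (", County Dublin" or ", Dublin"), and `clip_county` and the sample names are illustrative only.

```python
import re

# Hypothetical raw names; the real suffix format in the scraped list may differ.
raw_names = ["Athgoe, County Dublin", "Ballsbridge", "Ringsend, Dublin", "Sandymount"]


def clip_county(name):
    """Drop a trailing ', County Dublin' or ', Dublin' suffix and trim whitespace."""
    return re.sub(r",\s*(?:County\s+)?Dublin\s*$", "", name, flags=re.IGNORECASE).strip()


clean_names = [clip_county(n) for n in raw_names]
```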
After the data is sourced from the above sources, it
is enriched at every step with coordinates and other attributes of its
venues.
Data quality is checked with a null check, and
null values are replaced with the mean of their column, so that the
imputed values do not shift the column mean or inflate its standard deviation.
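The null check and mean replacement could look like the following pandas sketch. The column names and values are made up for illustration and are not the report's actual data.

```python
import pandas as pd

# Hypothetical frame with missing coordinates.
df = pd.DataFrame({
    "Town land": ["Ballsbridge", "Ringsend", "Roebuck"],
    "Latitude": [53.33, None, 53.31],
    "Longitude": [-6.23, -6.22, None],
})

# Null check: number of missing values per column before imputation.
missing_before = df.isna().sum()

# Replace nulls in numeric columns with the mean of that column.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
```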
Among the many ways the data is sliced for
verification, all the categories are one-hot encoded to
convert them into numbers, as below:
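One common way to do this in pandas is `pd.get_dummies`. The sketch below uses hypothetical venue rows and column names, and is only meant to show the shape of the result.

```python
import pandas as pd

# Hypothetical venue rows: one row per venue found near a town land.
venues = pd.DataFrame({
    "Town land": ["Ballsbridge", "Ballsbridge", "Ringsend", "Ringsend"],
    "Venue Category": ["Cafe", "Thai Restaurant", "Pub", "Cafe"],
})

# One indicator column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the town land back in front so rows can later be grouped by it.
onehot.insert(0, "Town land", venues["Town land"])
```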
3.1 Feature selection
After data cleansing, there were 370
samples, limited to 35 attributes.
The surrounding venues were limited to just 30
categories.
The samples were restricted to the town
lands in Dublin county and the categories of their venues.
Since the objective of this exercise is clear,
we chose as the target variable the town land in which a Thai restaurant
business can be opened.
Below is the histogram of all the numerical attributes
once the data was available for processing. No
linear or polynomial relationship between the attributes was apparent.
The data obtained from web
scraping is used to create a word cloud depicting the top venue categories
in the town lands, as below:
As the word cloud shows,
cafes, restaurants and pubs lead in density, while
the Thai venues (underlined in green) appear in a very small font, signifying their
lesser presence around the town lands.
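A word cloud sizes each term by how often it occurs, so the comparison above comes down to category counts. The sketch below computes such counts with the standard library; the category list is invented for illustration.

```python
from collections import Counter

# Hypothetical venue categories collected across the town lands.
categories = [
    "Cafe", "Pub", "Cafe", "Restaurant", "Pub", "Cafe", "Thai Restaurant",
]

# Frequency of each category; a larger count means a larger font in a word cloud.
counts = Counter(categories)
top = counts.most_common(3)
```

A mapping like `counts` is the kind of input a word-cloud library can render directly.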
3.2 Clustering similar town lands and plotting them
Below, all the 10 town lands of Dublin
are mapped on the folium map.
For each town land, the mean frequency of Thai restaurants among its 30
surrounding venues is calculated and again mapped
on the folium map to show how intense their presence is in the area, as below:
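The per-town-land mean can be computed by grouping one-hot rows and averaging, as in this sketch. The frame is hypothetical, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical one-hot rows: one row per venue, 1 marks the venue's category.
onehot = pd.DataFrame({
    "Town land": ["Ballsbridge", "Ballsbridge",
                  "Ringsend", "Ringsend", "Ringsend", "Ringsend"],
    "Cafe": [1, 0, 0, 1, 1, 0],
    "Thai Restaurant": [0, 1, 0, 0, 0, 1],
})

# The mean of an indicator column is the share of venues in that category.
freq = onehot.groupby("Town land").mean()
thai_share = freq["Thai Restaurant"]
```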
Once all the town lands are mapped to their
respective locations, they are clustered by similarity
using k-means clustering, which is an unsupervised machine learning
algorithm.
To begin with, an arbitrary K of 4 is used to
cluster the similar town lands, as below:
From the above clustering it is seen that cluster-1 (in
red rings) has no existing Thai restaurants, and we expect the prediction
to be a subset of this list of venues.
The cluster with the highest presence of Thai restaurants
is cluster-4, with yellow rings.
Those with a moderate presence of Thai restaurants are
cluster-2 and cluster-3, with green and blue rings, located in the not-so-outlying
parts of Dublin county.
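The clustering step described in this section can be sketched with scikit-learn's `KMeans`. The single feature used here (share of Thai restaurants per town land) and its values are assumptions made for illustration; the original analysis may have clustered on more columns.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature: share of Thai restaurants among each town land's venues.
X = np.array([[0.00], [0.00], [0.03], [0.04], [0.10], [0.11], [0.20], [0.21]])

# K = 4 as in the text; random_state is fixed only to make the run repeatable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Town lands with the same label fall in the same cluster.
cluster_of_first = labels[0]
```

The numeric cluster labels are arbitrary, so the cluster with no Thai restaurants has to be identified by inspecting its members rather than by its number.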
4. Predictive Modeling
Predictive modeling is done using the K-nearest neighbors
(KNN) algorithm, a supervised ML algorithm that can be used
for both classification and regression problems. In industry, however, it
is mainly used for classification.
KNN uses ‘feature
similarity’ to predict the values of new data points: a
new data point is assigned a value based on how closely it matches the
points in the training set.
In our example, the data points are the 30
surrounding venues of the town lands. Based on these data points, the algorithm
determines the similarity of the town lands.
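To make the idea of feature similarity concrete, here is a from-scratch sketch of KNN classification using Euclidean distance and a majority vote. The two features and the labels are hypothetical, and the function `knn_predict` is illustrative rather than the code used for the report.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features per town land: (share of cafes, share of Thai restaurants).
train_X = [(0.5, 0.0), (0.4, 0.1), (0.1, 0.3), (0.0, 0.4)]
train_y = ["Cafe", "Cafe", "Thai Restaurant", "Thai Restaurant"]

prediction = knn_predict(train_X, train_y, query=(0.05, 0.35), k=3)
```

Two of the three nearest neighbours of the query carry the "Thai Restaurant" label, so that label wins the vote.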
We determined the value of K using the elbow
plot shown below:
So, taking K as 8, I created the training and test
data.
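An elbow plot for KNN is usually drawn from the test error at each candidate K. The sketch below computes those error rates with scikit-learn on synthetic data; the metric, the range of K and the data are all assumptions, since the report does not state how its plot was produced.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the town land features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),
    rng.normal(2.5, 1.0, size=(60, 2)),
])
y = np.array([0] * 60 + [1] * 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Test error for each candidate K; plotting these values gives the elbow curve.
error_rates = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates[k] = float(np.mean(model.predict(X_test) != y_test))

# The K with the lowest test error (the smallest such K if several tie).
best_k = min(error_rates, key=error_rates.get)
```

On real data the chosen K is read off the plot where the error stops improving noticeably, which need not coincide with the strict minimum.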
As all the values must be numerical, the categories
were each assigned a numerical value, as shown below:
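One simple way to assign a number to each category is `pd.factorize`, sketched here on a hypothetical column of venue categories. Whether the report used this function or another encoder is not stated.

```python
import pandas as pd

# Hypothetical most-common venue category for four town lands.
top_venue = pd.Series(["Pub", "Cafe", "Thai Restaurant", "Cafe"])

# sort=True numbers the categories in alphabetical order.
codes, uniques = pd.factorize(top_venue, sort=True)

# Lookup table to translate predicted codes back into category names.
code_to_name = dict(enumerate(uniques))
```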
Ultimately, a prediction is made for every town land of which business would be
beneficial there, including the Thai restaurant, as
shown below:
Below is the prediction for Thai restaurants mapped on
the folium map:
5. Conclusion
From the above analysis we can infer that cluster-1 (in
red rings) has no existing Thai restaurants, and we expect the prediction
to be a subset of this list of venues.
The cluster with the highest presence of Thai restaurants
is cluster-4, with yellow rings.
Those with a moderate presence of Thai restaurants are cluster-2 and cluster-3, with
green and blue rings, located in the not-so-outlying parts of Dublin county.
This analysis presents a great opportunity for
entrepreneurs to tap into the unutilized potential of pockets of Dublin
county by opening Thai restaurants.
It is also evident that cluster-4 has very high
competition when it comes to opening a Thai restaurant, hence investors
should avoid investing in this area.
Investors with unique selling propositions that can
stand out from the moderate competition in cluster-3 and cluster-2 can take a
moderate risk and attract the customers already visiting the localities of these
clusters because of the existing Thai restaurants.