What is around each of the top 100 world-class universities?
Introduction
This is the introductory section of the final assignment of the IBM Applied Data Science Capstone course on Coursera. Students are asked to be as creative as they want in using the Foursquare API and a clustering algorithm to explore or compare neighborhoods or cities of their choice. Based on those instructions, I decided to explore and cluster what is near each of the top 100 world-class universities.
In addition to this blog post, you can also read the report version here. And you can find the full source code of this article in my GitHub repo here.
Background
Quacquarelli Symonds (QS) is the world’s leading provider of services, analytics, and insight to the global higher education sector, whose mission is to enable motivated people anywhere in the world to fulfil their potential through educational achievement, international mobility, and career development[1]. It releases world university rankings annually. These rankings are based on a selective, transparent, and comprehensive methodology, so they have become a reference for the higher education world. Many prospective students around the world even use these rankings to choose their destination university for undergraduate and postgraduate study.
Problem Description
However, since QS’s methodology is based on education-centric aspects only, it is sometimes not easy for prospective students to decide which higher education institution has a suitable neighborhood for their college life. Although QS publishes a ranking of the best student cities, that ranking sorts cities, not the universities themselves. QS therefore offers no recommendation regarding the neighborhoods of, for example, the world’s top 100 universities. Hence, in this article, we’re going to explore the neighborhoods of those top 100 and cluster them in order to gain insight into which institutions have similar neighborhoods.
Aim and Objectives
This article aims to explore and cluster the neighborhoods around each of the top 100 higher education institutions in the world based on their venue categories. To meet that aim, the objectives of this article are as follows:
- To collect the list of the world’s top 100 higher education institutions from the QS website
- To collect the geospatial coordinates of each higher education institution
- To collect venues data in each higher education institution
- To explore venues data in each higher education institution
- To visualize neighborhoods in each higher education institution
- To cluster neighborhoods in each higher education institution
- To analyze and discuss the generated clusters
Target Audience
We expect at least the following audiences to benefit from this article:
- Those who will continue their higher education or would like to work at a higher education institution but have not yet decided which one to choose
- Those who would like to know more about the world’s top 100 universities
- Those who would like to know the neighborhood similarity among universities
- Those who would like to gain another perspective from universities
- etc.
Structure
This article is arranged as follows:
- Introduction
- Data Sources
- Methodology
- Data Preparation
- Exploratory Data Analysis
- Data Preprocessing
- Modelling
- Analysis and Discussions
- Conclusions
Data Sources
To satisfy the aim and objectives, we need at least the following data:
- The names of each higher education institution
We can collect these names from the QS website.
- The city and country names where each higher education institution is based
We can collect these names from the QS website as well.
- The geo location of each higher education institution
We can use the Geocoder library to get the latitude and longitude of each institution.
- The venues of each higher education institution
We can use the Foursquare API to collect venue names. We will use the venues data around each higher education institution to cluster the institutions. Afterwards, we can tell which institution’s neighborhood is similar to the neighborhood of another institution.
Methodology
In this section, we’re going to define the methodology used in this project. To run the project, we need at least the following libraries:
- pandas
- numpy
- html5lib
- re
- requests
- json
- bs4
- selenium
- matplotlib
- folium
- geocoder
Fig. 3 below shows the workflow of this project.
- “Collect all needed” is the step that gathers all data needed for this project. This step is explained further in the Data Preparation section.
- “Data cleaning” is the step that processes the data so that any undesired data, e.g. missing values and unwanted structure, is minimized. This step is explained further in Data Cleaning section.
- “Exploratory data analysis” is the step that explores the characteristics of the data we’ve collected. This step is explained further in the Exploratory Data Analysis section.
- “Data Preprocessing” is the step that transforms the data so that it fits the model more appropriately. This step is explained further in the Data Preprocessing section.
- “Clustering” is the step where we apply the K-Means clustering algorithm. This algorithm, despite being simple, performs well and is popular. The k parameter we will use for this algorithm is 7. This step is explained further in the Modelling section.
- “Analysis” is the step that analyzes and discusses what we’ve done up to modelling. This step is explained further in Analysis and Discussions.
Data Preparation
As mentioned in the previous section, we have four types of data to be acquired. In this section, we’re going to describe those four.
- The names of each higher education institution
We could use BeautifulSoup and Selenium to scrape the names of each higher education institution. The target URL is the QS web page below.
From the URL above, we need to scrape the table of QS World University Rankings® 2020 as shown in Fig. 3.
- The city and country names where each higher education institution is based
As with the institution names, we could use BeautifulSoup and Selenium to scrape the city and country. Since the institution names link to pages that hold the city and country, we could scrape everything in a single pass. As seen in Fig. 3, we can click the name of each institution. When clicked, we are directed to a new web page as shown in Fig. 4. The name of the institution is marked in the blue box, whereas the address of the institution is in the green box. From that address, we can derive the city and country of the institution.
As you can imagine, there are 100 clickable objects we have to click and scrape. As we don’t want to click them all manually, we could have Selenium open the 100 web pages one by one and scrape them using BeautifulSoup. After implementing all the steps, we get data that looks something like Fig. 5 below.
- Geospatial coordinates
After we’ve acquired the data from the QS website, the next step is to obtain the geospatial coordinates of each institution. We could use the Geocoder library to get the latitude and longitude. Its documentation can be found below.
From the Python Geocoder library, we can use ArcGIS as the data provider. ArcGIS is a capable source of geospatial coordinates. We could pass each value of the Address column of the data table (as in Fig. 5) as the argument to the geocoder.arcgis() function, and then read the .latlng attribute of the result to get the coordinates of each institution. After we’ve got the coordinates, our data would look like Fig. 6 below.
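The lookup described above could be sketched roughly as follows. This is a minimal sketch, not the code from the original notebook: the helper names arcgis_latlng and geocode_addresses are hypothetical, and the lookup function is passed in as a parameter so that the loop can be tried without network access.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

LatLng = Optional[Tuple[float, float]]


def arcgis_latlng(address: str) -> LatLng:
    """Look up one address with the ArcGIS provider of the geocoder library."""
    import geocoder  # third-party; imported lazily so the sketch loads without it

    result = geocoder.arcgis(address)
    # .latlng is an attribute holding [lat, lng]; it is empty when nothing is found
    return tuple(result.latlng) if result.latlng else None


def geocode_addresses(
    addresses: Iterable[str],
    lookup: Callable[[str], LatLng] = arcgis_latlng,
) -> Dict[str, LatLng]:
    """Return {address: (lat, lng) or None} for every address (hypothetical helper)."""
    return {address: lookup(address) for address in addresses}
```

With the table from Fig. 5 loaded in a DataFrame, something like geocode_addresses(df['Address']) would then return the coordinates keyed by address. Addresses that cannot be resolved come back as None, so they can be inspected and retried by hand.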
- Venues of each institution
Once the coordinates are acquired, we can use them to obtain the venues near each institution through the Foursquare API. To do so, we need several parameters, i.e. client_ID, client_secret, limit, version, and radius. client_ID and client_secret are obtained after creating a Foursquare developer account. Both are our credentials for requests to the Foursquare API. Meanwhile, limit, version, and radius control the data we’d like to receive.
limit is the maximum number of venues we’d like to receive. version is a date in YYYYMMDD format that tells Foursquare which API version we expect. radius, as its name suggests, is the search radius in meters around the given coordinates. Venues inside that radius are collected, up to limit. Then, using them all, we define a URL as shown below.
CLIENT_ID = 'your-foursquare-id'  # your Foursquare ID
CLIENT_SECRET = 'your-foursquare-secret'  # your Foursquare Secret
LIMIT = 100  # limit only 100 venues
VERSION = '20191201'  # Foursquare API version
radius = 15000  # 15 km radius

# latitude and longitude are the coordinates of one institution
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}' \
    .format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
We send a request to this URL and receive JSON-formatted results. From that JSON data, we only extract each venue’s name, latitude and longitude, and category. After we’ve got the venues for each institution, our data table would look like Fig. 7 below.
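The extraction step could look roughly like the sketch below. It assumes the response follows the usual layout of the v2 venues/explore endpoint (response, then groups, then items, each with a venue object). The function name extract_venues, the output column names, and the sample payload are all made up for illustration.

```python
def extract_venues(results: dict) -> list:
    """Flatten an explore-style response into one dict per venue (hypothetical helper)."""
    groups = results.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            # keep only the first (primary) category, if any
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows


# Made-up payload that mimics the assumed response layout
sample = {
    "response": {
        "groups": [{
            "items": [{
                "venue": {
                    "name": "Example Cafe",
                    "location": {"lat": 42.36, "lng": -71.09},
                    "categories": [{"name": "Coffee Shop"}],
                }
            }]
        }]
    }
}

rows = extract_venues(sample)
```

In practice, the dict passed to extract_venues would come from something like requests.get(url).json(), and the rows of all institutions would be combined into a single table.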
Exploratory Data Analysis
So, we have gathered all data we need. Before we move on to the next section, we should explore the data and see what they look like. In this section, we’re going to perform exploratory data analysis (EDA). It is split into two subsections, i.e. chart-based exploration and map-based exploration.
Chart-based exploration
After collecting the venue categories of each institution, we have 9652 venues in total, with 441 unique venue categories. Based on these numbers, we can plot the data. However, since there are too many categories to plot in regular charts, e.g. a bar chart or scatter plot, we can use word clouds instead, considering that the data we’d like to plot are strings. Fig. 8 below shows what we’ve got for the venue categories.
Venues like Park, Coffee Shop, Hotel, Cafe, etc., dominate the word cloud. From this, we can infer that most higher education institutions are surrounded mainly by those types of venues.
To make the word cloud more interesting, we can use a word cloud mask. We’re going to use the world map as the mask. To do so, first we can download a picture from here. Next, using the picture as the mask, our previous word cloud changes as shown in Fig. 9 below.
As mentioned earlier, the amount of data makes it difficult to plot directly, so we need to group the data to make it easier to plot. The best candidate for the grouping key is the Country column.
After grouping by country, we find 20 unique countries. Now, we can create a bubble plot with three dimensions, i.e. the country, the number of institutions in each country, and the number of venues in each country. Fig. 10 below shows the bubble plot for this data.
We can see that the US dominates both the number of institutions and the number of venues, followed by the UK in second position. Most countries, however, have between 1 and 5 institutions and hundreds of venues.
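The grouping behind such a bubble plot could be sketched as follows. The toy table and its column names (Institution, Country, Venue Category) are assumptions made for illustration and not the real dataset.

```python
import pandas as pd

# Toy venue table: one row per venue (made-up values)
venues = pd.DataFrame({
    "Institution": ["A", "A", "B", "C", "C", "C"],
    "Country": ["US", "US", "US", "UK", "UK", "UK"],
    "Venue Category": ["Park", "Hotel", "Park", "Pub", "Pub", "Park"],
})

# One row per country: number of distinct institutions and number of venues
per_country = (
    venues.groupby("Country")
    .agg(
        institutions=("Institution", "nunique"),
        venues=("Venue Category", "count"),
    )
    .reset_index()
)
```

The two aggregated columns can then be mapped to the position and the bubble size of the plot, for example with matplotlib’s scatter function.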
Map-based exploration
As we’ve obtained the geospatial coordinates, we can use the Folium library to show them on a map. Fig. 11 below shows the map of all institutions.
Since we also have the country data, we can create a choropleth map using Folium as well. First of all, we group the data by country. Then, to create the choropleth map, we need the boundaries of each country. Fortunately, such boundaries have been provided here. Fig. 12 shows the choropleth map of all institutions. Its legend indicates the density of institutions in each country. The countries in black are missing values, as we have no data for those countries.
Data Preprocessing
In the next section, i.e. Modelling, we will cluster all institutions based on their venues. However, the venues in our table (as seen in Fig. 7) are not numerical values (except the coordinates), while the clustering algorithm only accepts numerical data. We therefore have to preprocess our data to satisfy this requirement. We’re going to use the one-hot encoding technique to solve this.
We’re going to transform all venue categories into 0 and 1 forms. Since we have a large number of unique venue categories (i.e. 441 unique categories), the one-hot encoded dataset has a large number of columns as well (i.e. 9652 rows and 442 columns, one column per category plus one non-numerical column). The gif image in Fig. 13 below confirms this.
This dataset, except for the last column, which is not numerical, will be fed into the K-Means clustering algorithm. This will be explained in the next section.
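As a small illustration of one-hot encoding, the sketch below uses pandas.get_dummies on a toy table. The column names and values are assumptions made for illustration. The real table has thousands of rows and hundreds of categories.

```python
import pandas as pd

# Toy venue table: one row per venue (made-up values)
venues = pd.DataFrame({
    "Institution": ["A", "A", "B"],
    "Venue Category": ["Park", "Hotel", "Park"],
})

# One column per category, holding 1 where the venue has that category, else 0
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Keep the institution name as the last, non-numerical column
onehot["Institution"] = venues["Institution"]
```

Each row keeps exactly one 1 among the category columns, so the number of category columns equals the number of unique categories.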
Modelling
In this section, we’re going to cluster all institutions with the K-Means algorithm, using the one-hot encoded venue categories as input. The Scikit-learn library helps us here. We’re going to define 7 clusters, i.e. K=7.
The clustering algorithm returns the cluster for each input. In our case, since we’ve declared K=7, we will have 7 clusters, i.e. cluster 0, cluster 1, cluster 2, and all the way to cluster 6. Since the labels are counted from 0, the last cluster is cluster 6 instead of cluster 7. The outputs of K-Means are placed in the Cluster Labels column in our dataset, as seen in Fig. 14 below.
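A minimal sketch of this step with Scikit-learn is shown below. It assumes that the one-hot rows are first averaged per institution, so that each institution becomes one input row of category frequencies. That is a common choice, but it is an assumption here. The toy data has only four institutions, so it uses 2 clusters instead of the 7 used in this article.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (made-up): A and B are mostly near parks, C and D mostly near pubs
rows = (
    [("A", "Park")] * 2
    + [("B", "Park")] * 3 + [("B", "Pub")]
    + [("C", "Pub")] * 2
    + [("D", "Pub")] * 3 + [("D", "Park")]
)
venues = pd.DataFrame(rows, columns=["Institution", "Venue Category"])

# One-hot encode, then average per institution to get category frequencies
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Institution", venues["Institution"])
grouped = onehot.groupby("Institution").mean()

# n_clusters=2 only because the toy data is tiny; the article uses 7
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(grouped)
labels = pd.Series(kmeans.labels_, index=grouped.index, name="Cluster Labels")
```

The label numbers themselves carry no meaning. Only the fact that two institutions share a label matters.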
Analysis and Discussions
At this point, we know the city, country, and coordinates of each institution, we have more than 400 venue categories, and we have grouped all institutions into 7 clusters. Now we’re going to analyze and discuss them.
The dataset with the clustering outputs is shown in Fig. 14. We can combine it with the venue categories data shown in Fig. 13. First of all, using the dataset of Fig. 13, we group all rows by institution name, which shrinks the dataset. However, this only reduces the number of rows. The dataset is still extremely wide. Therefore, we rank the venue categories of each institution and keep, for example, only the top 10. To do so, we count how frequently each venue category appears around each institution. In other words, the 10 venue categories that appear most frequently around an institution become its top 10 venue categories.
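The ranking step could be sketched as follows. The function name top_venues and the output column names are hypothetical. The input is assumed to be a table with one row per institution and one column per venue category, holding the frequency of each category.

```python
import pandas as pd


def top_venues(grouped: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent venue categories per institution (hypothetical helper)."""
    n = min(n, grouped.shape[1])  # cannot rank more categories than exist
    ranked = {
        name: row.sort_values(ascending=False).index[:n].tolist()
        for name, row in grouped.iterrows()
    }
    columns = [f"Most Common Venue #{i + 1}" for i in range(n)]
    return pd.DataFrame.from_dict(ranked, orient="index", columns=columns)


# Toy frequency table (made-up values): rows are institutions
grouped = pd.DataFrame(
    {"Park": [0.6, 0.1], "Pub": [0.3, 0.7], "Hotel": [0.1, 0.2]},
    index=["A", "B"],
)
top2 = top_venues(grouped, n=2)
```

Calling top_venues with n=10 on the full frequency table would give the ten columns of top venue categories per institution.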
After performing the steps above, our dataset would look like the gif image in Fig. 15 below.
Subsequently, we can merge this dataset with the dataset shown in Fig. 14. After doing so, our dataset will look like Fig. 16 below. This combined dataset is our final dataset.
Using the final dataset, we can once again create a map, this time colored by cluster. As a reminder, we have 7 clusters. Using Folium again, our map would look like Fig. 17 below, with one color per cluster. Briefly, we can see that the institutions in the US are dominated by the orange cluster. Furthermore, Fig. 18 indicates the color of each cluster and the number of institutions in each cluster.
As mentioned earlier, the orange cluster (cluster 6), which is mostly in the US, is the largest cluster with 29 institutions. Cluster 0 (red) and cluster 4 (light green) are placed second and third, respectively. Meanwhile, the cluster with the fewest members is cluster 3.
Now, let’s view the members of each cluster.
Cluster 0
The members of Cluster 0 can be seen in the figure below. There are 19 members in this cluster. The first most common venue in most institutions is either Hotel or Park, and a similar pattern holds for the second most common venue. This cluster consists of institutions from many different countries.
Cluster 1
The members of Cluster 1 can be seen in the figure below. There are 10 members in this cluster. The first most common venue in almost all institutions is Pub, except for Bristol, where Park comes first and Pub second. Uniquely, this cluster has institutions from the UK only.
Cluster 2
The members of Cluster 2 can be seen in the figure below. There are 12 members in this cluster. The first most common venue in almost all institutions is Café, except for Canberra, where Park comes first and Café second. This cluster consists of institutions from many different countries.
Cluster 3
The members of Cluster 3 can be seen in the figure below. There are 4 members in this cluster. The first and second most common venues are similar across all institutions: they are either Hotel or Coffee Shop. All institutions in this cluster are in China.
Cluster 4
The members of Cluster 4 can be seen in the figure below. There are 17 members in this cluster. The first most common venue in most institutions is Park. From the second most common venue onwards, however, there is no clear pattern. This cluster is dominated by institutions from Europe.
Cluster 5
The members of Cluster 5 can be seen in the figure below. There are 8 members in this cluster. Although the first most common venue varies from institution to institution, we can see that they have one similarity, i.e. venues with an East Asian theme. This is consistent with the fact that all institutions here are from Japan and Korea only.
Cluster 6
The members of Cluster 6 can be seen in the figure below. There are 29 members in this cluster, which makes it the largest cluster. It is difficult to find patterns here, since the top venues show no obvious regularity. In addition, this cluster is geographically diverse and spans various countries.
Conclusions
This section concludes this final report. There are several conclusions we have here.
- We have collected the names of the world’s top 100 higher education institutions.
- We have collected the geospatial coordinates of each higher education institution.
- We have collected venues data in each higher education institution.
- We have explored the neighborhoods around each of the top 100 higher education institutions in the world.
- We have visualized the neighborhoods in each higher education institution.
- We have clustered the neighborhoods in each higher education institution.
- We have analyzed and discussed the generated clusters.
- We have determined k=7 as the number of clusters.
- We have found that each cluster has its own characteristics.
Future Works
We have accomplished a lot, but this work still has several limitations. Therefore, as future work, we could perform the following tasks:
- To use institutions data from Times Higher Education
- To explore more features from Foursquare API
- To increase the number of institutions, instead of 100
- To try other numbers of K and examine the optimum K value
- To try other clustering algorithms and compare them all