K-means clustering; Transport data clustering analysis.
- Karthik Jamalpur
- Nov 28, 2021
- 3 min read
Updated: Dec 8, 2021

Aim:
Clustering analysis is introduced to identify different traffic states, such as under-saturated, saturated, and over-saturated flow. K-means clustering from sklearn is applied to the traffic data, which also reveals the speed fluctuation on a road segment.
Clustering for aggregated data
Clustering is the task of dividing data points into a few groups such that points in the same group are more similar to each other than to points in other groups. It is an unsupervised machine learning method for recognizing and grouping similar data points in larger datasets without reference to a specific outcome. Clustering is usually used to organize data into structures that are more easily understood and manipulated.
There are several types of clustering methods; in this project we use K-means. K-means clustering is a type of unsupervised learning, used when you have unlabelled data. The algorithm works iteratively to assign each data point to one of K groups based on the features provided.
The K-means algorithm works as follows:
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
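3. Keep iterating until there is no change to the centroids: in each iteration, assign every data point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it.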
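The three steps above can be sketched in plain NumPy. This is only an illustrative sketch, not the sklearn implementation used in this project; the function name `kmeans`, its parameters, and the synthetic two-column data (standing in for speed and std) are assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: shuffle the row indices and take the first k rows as centroids,
    # which is the same as picking k data points without replacement.
    centroids = points[rng.permutation(len(points))[:k]].astype(float)

    for _ in range(max_iter):
        # Squared distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its points; keep the old
        # centroid if a cluster happens to be empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 3: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


# Synthetic stand-in data: two well-separated groups of (speed, std) values.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal([30.0, 5.0], 2.0, size=(50, 2)),
    rng.normal([90.0, 10.0], 2.0, size=(50, 2)),
])
centroids, labels = kmeans(data, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` does the same job with better initialization and several restarts, so a hand-written loop like this is mainly useful for understanding the steps.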
K-Means Advantages:
1) With a large number of variables, K-Means is usually computationally faster than hierarchical clustering, provided the value of k is small.
2) K-Means generates tighter clusters than hierarchical clustering, especially if the clusters are globular.

Fig.1 Raw data with outliers

Fig.2 Raw data after removal of outliers
For the clustering, a chunk of 500k rows of FCD data (rows with column values) is taken; the results of the clustering are shown below.

Fig.3 Speed vs std vs counts Fig.4 Speed vs std
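As a rough illustration of how features such as speed, std and counts could be derived from one chunk of data, here is a small pandas sketch. It assumes the aggregation is done per link, and the column names `link_id` and `speed`, the tiny in-memory file, and the chunk size of 5 are all hypothetical; the real FCD file layout is not shown in this post.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for an FCD file; column names are hypothetical.
raw = io.StringIO(
    "link_id,speed\n"
    "1,48\n1,52\n1,50\n"
    "2,100\n2,110\n2,90\n"
)

# Read the file in chunks (500k rows in the project, 5 here to keep the
# demo small) and take only the first chunk.
chunk = next(pd.read_csv(raw, chunksize=5))

# One row per link: mean speed, speed standard deviation, and record count.
agg = chunk.groupby("link_id")["speed"].agg(["mean", "std", "count"])
print(agg)
```

The resulting `mean`, `std` and `count` columns are the kind of aggregated values that can then be passed to the clustering step.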
Elbow method
The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of squares (WSS)] is minimized. The total WSS measures the compactness of the clustering and should be as small as possible. The Elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not substantially reduce the total WSS.
In Figure 46, judging from the scatterplot, the number of clusters was initially assumed to be 3.
The optimal number of clusters can be determined as follows:
1. Compute clustering algorithm (e.g., K-means) for different values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
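The four steps above could look roughly like this with sklearn. The data here is synthetic (two groups of speed and std values made up for the example), not the FCD data, and the variable names are assumptions; `inertia_` is the attribute where sklearn's `KMeans` stores the total within-cluster sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (mean speed, std) features; NOT the FCD data.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal([40.0, 12.0], 3.0, size=(200, 2)),
    rng.normal([95.0, 6.0], 3.0, size=(200, 2)),
])

# Steps 1 and 2: run K-means for k = 1..10 and record the total WSS.
ks = list(range(1, 11))
wss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    wss.append(model.inertia_)

# Steps 3 and 4: inspect WSS against k and look for the bend.
for k, value in zip(ks, wss):
    print(k, round(value, 1))
```

Plotting `wss` against `ks` (for example with matplotlib) gives the elbow curve. For data with two clear groups like this, the largest drop is expected between k = 1 and k = 2, with only small improvements afterwards.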

In the speed vs std graph, k (the number of clusters) was initially chosen as 3 (an assumption). The Elbow method is then used to find a better-justified number of clusters. Observing the elbow graph helps to find the number of clusters in the data: the curve falls sharply and the bend ends at the 2nd point, after which the rate of decrease is slow. Taking more than 2 clusters therefore adds little. So, 2 is chosen as the number of clusters and the clustering graph is plotted below.

Fig.4 Speed vs std
It shows two different traffic states: one cluster with a mean speed of 50 km/h (oversaturation) and another cluster with speeds greater than 50 km/h (saturation). For instance, in the FCD data file, the observed speed on link 123007 is 107 km/h on 2nd February at 01:10:00, so it falls in the second cluster. In parallel, the time of each individual record is also grouped according to its cluster. This is an enhancement to the prediction for better guidance.
Here every link shows its traffic state (oversaturated, undersaturated, saturated or free flow) at a particular time. This clustering system is helpful for route guidance and traffic control.
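Assigning a single observation to one of the two clusters can be sketched with sklearn as follows. The training data is synthetic, and the std value of 5 km/h attached to the 107 km/h observation is an assumption made for the example (the post only gives the speed); cluster ids from K-means are arbitrary, so the faster cluster is identified from the centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (speed in km/h, speed std) rows standing in for the FCD features.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal([35.0, 12.0], 4.0, size=(200, 2)),   # slower group
    rng.normal([100.0, 6.0], 4.0, size=(200, 2)),   # faster group
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Cluster ids are arbitrary, so find which one has the higher mean speed.
fast_cluster = int(np.argmax(model.cluster_centers_[:, 0]))

# Hypothetical observation: 107 km/h with an assumed std of 5 km/h.
observation = np.array([[107.0, 5.0]])
label = int(model.predict(observation)[0])
print("faster cluster" if label == fast_cluster else "slower cluster")
```

The same `predict` call can be applied to every record so that each timestamp carries the label of its cluster.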