I was looking around for an algorithm that would give me an outline of points in a dataset. Normally it would be a good idea to cluster the points and calculate the convex hull of the points in each cluster. The main problem with this approach is that most clustering algorithms require that you define a set number of clusters you want to create for your data points. But in you are working with points that are not evenly distributed or are bunched into natural clusters with a different count than what you specify then your clusters don't give a realistic picture of your data.

The solution is obviously to cluster the points in a cluster near their own location and create new clusters if the existing clusters are irrelevant for the current point, because it is too far away. In this post I am going to show how I modified the k-means clustering algorithm to create clusters at my points. Once the points were clustered it was a simple case of calculating the convex hull of each cluster.

The basic k-means algorithm clusters n observations into k clusters with each point belonging to the cluster with the nearest mean location, where n and k are defined before the clustering. The modified version here uses the same approach except it adds a requirement that a cluster cannot be further away than d, so that the algorithm can be said to cluster n observations into x clusters, where a cluster mean is not further than d from an observation, where only n and d are defined.

Where the k-means algorithm uses a defined k amount of clusters this approach starts by using the location of the first observation as the location of the first cluster. While looping through the dataset the location of the current observation is used to find the first nearest cluster that is not further from the observation than the max distance d. The observation is then assigned to this cluster. When reaching the end of the data set all clusters with no points assigned to them are removed and for the rest the mean location of the cluster is recalculated. The process is then repeated except that from the second iteration and on the existing clusters are used as initial clusters. This does not mean that a new cluster cannot be created because a new mean is beyond the max distance d from a given observation, but it will become increasingly unlikely.

The speed up the process I used the Task Parallel Library so the observations are clustered in parallel. This may cause an extra iteration or two because some clusters may be created in parallel and therefore be close to each other. This is the reason that empty clusters are removed at the end of each iteration. The iterations continue until the new cluster means are the same as the previous iteration.

The clustering is performed using the following code to get the initial center assignments.

private static IDictionary> GetCenterAssignments( IEnumerable points, IEnumerable centers, double maxDistance, Units distanceUnit) { var centerAssignments = new ConcurrentDictionary>( centers .Where(x => x != null) .Distinct() .Select(x => new KeyValuePair>(x, new ConcurrentBag()))); Parallel.ForEach( points, point => { var closestCenter = (from centerPoint in centerAssignments.Keys let distance = MapCalculations.Distance(centerPoint, point, distanceUnit) where distance < maxDistance select centerPoint) .FirstOrDefault(); if (closestCenter != null && centerAssignments.ContainsKey(closestCenter)) { centerAssignments[closestCenter].Add(point); } else { centerAssignments.TryAdd(point, new ConcurrentBag { point }); } }); return centerAssignments.Where(x => x.Value.Count > 0).ToDictionary(x => x.Key, x => (IList)x.Value.ToList()); }

The center assignments are then moved to the new mean location for the contained coordinates.

private static IList GetNewCenters(IDictionary> centerAssignments) { var newCenters = new List(); Parallel.ForEach( centerAssignments.Keys, new Action( center => { var averageLng = centerAssignments[center].Average(x => x.Longitude); var averageLat = centerAssignments[center].Average(x => x.Latitude); var newCenter = new Coordinates(averageLat, averageLng); newCenters.Add(newCenter); })); return newCenters; }

As mentioned above the process is repeated until the new mean centers for the clusters are equal to the previous ones. This means that the points have been placed in the nearest cluster and the clusters are no longer moving.

Note, the code above uses the ICoordinates interface, which basically exposes Latitude and Longitude.

To make sure there are no cross thread issues when running the clustering tasks in parallel I make use of the new thread safe collections from .NET 4, so this code won’t run in .NET 3.5.