Tuesday, March 11, 2014

kMeans in F#

This post is about the  KMeans , and a simple implementation in F#.

A quick definition of the problem I can give is:
given a set of points, and an integer n, divide the set in n partitions in a way that minimise the "within-cluster sum of squares (WCSS)".

By literature, the algorithm is fast but suboptimal, i.e. it converges to a local minimum.
It takes an initial guess of centroids, then calculate the clusters associated to each centroid guess, and then recalculate a new centroid of that clusters, and so... on until there is no difference between the set of clusters and the ones of the next iteration.

You should run it different times, and with different n, and then take the best result (according to the objective function minimising the "within cluster sum of squares").

For simplicity, I assumed only bidimensional points, but real problems are more complex and so they have more features that need to be represented by more dimensions.
Anyway this is the definition of a point:

type Point = {X: float; Y: float}

I describe here functions useful for the final algorithm.

Measure of a distance between two points:

let quadraticDistance point1 point2 = 
    Math.Pow((point1.X-point2.X),2.0)+ Math.Pow((point1.Y-point2.Y),2.0)


Any point has to be associated to a cluster represented by a centroid, when that centroids is the closest one:

let closestCentroid point centroids  =
    let precondition = match centroids with | [] -> failwith "Centroid list should be not empty" | _ -> true
    List.fold (fun acc item -> if ((quadraticDistance item point) < (quadraticDistance acc point)) then item else acc) (List.head centroids) centroids


(never mind the "precondition" check: it just creates an exception with a description, avoiding having a more obscure error of calling List.head on an empty list)

Given a segment, I need to find its real centroid:

let centroid segment = 
    let checkPrecondition = match segment with |[] -> failwith "precondition error, list of points expected to be not empty" | _ -> true
    let Xs = List.sumBy (fun item -> item.X) segment
    let Ys = List.sumBy (fun  item -> item.Y) segment
    let centroid = {X=(Xs/(float (List.length segment)));Y=(Ys/(float (List.length segment)))}
    centroid


(Again: the "precondition" make sure that if there is an error condition from empty segment we get the specific message, instead of a "division by zero")

Getting centroids for more segments all in once is easy by using the map function:

let centroids segments =
  List.map (fun x -> centroid x) segments



Now I have almost all the elements to build up the algorithm, and so I can try to implement it, and this time I use mutable types in the loop (so that I "reuse" the same variables, instead of virtually creating new ones each iteration).

I keep up to date the centroids relative to the current iteration and the relative cluster, and the centroids relative to the next iteration. If the cluster do not change, then we stop. Looks like list comparison works as long as they are reordered.


let iterateFindingFinalCentroids points initialCentroids =
    let mutable currentCentroids = initialCentroids
    let mutable segments = segmentPointsByClosestCentroid points currentCentroids
    let mutable nextCentroids = centroids segments 
    let mutable nextSegments = segmentPointsByClosestCentroid points nextCentroids
    while ((List.sort nextSegments) <> (List.sort segments)) do
        currentCentroids 
<- nextCentroids
        segments 
<- nextSegments
        nextCentroids 
<- centroids nextSegments
        nextSegments 
<- segmentPointsByClosestCentroid points nextCentroids
    nextCentroids



One missing part here is a function that separate all the points in different clusters on the base of their proximity to the given centroids.

I found convenient doing it in two steps. Tagging each point with its closest centroid, and then aggregating the points tagged with the same centroid.

A tagged point type can be useful, and perhaps better than a simple pair, because is more expressive:

type TaggedPoints = {Point: Point; AssignedCentroid: Point}


let tagPointsWithClosestCentroid centroids segment =
    List.fold (fun acc item -> {Point=item; AssignedCentroid = (closestCentroid item centroids)}::acc) [] segment


To put the points in different clusters:

let segmentPointsByClosestCentroid points centroids =
    let taggedPoints = tagPointsWithClosestCentroid centroids points
    List.fold (fun acc item -> (List.filter (fun p -> p.AssignedCentroid = item) taggedPoints |> List.map (fun taggedPoint -> taggedPoint.Point))::acc) [] centroids
   

(A next step could be making the tagPointsWithClosestCentroid internal to this last one).

For description of functions used here, see the msdn (for example this is the entry about how to use List.fold: http://msdn.microsoft.com/en-us/library/ee353894.aspx).

I played a little bit with artificial data: I generated artificially four clusters given by bidimensional Gaussian, so expecting that the centroid found will be the central/average (or just 'μ') of that distributions, and actually this is what happened (except when I included a subdle "bug" in the algorithm).

For that experiment I wrote some code in unit tests.

The Normal distributions have the 'μ'  (their centers) equals to (1,1), (1,8), (8,8), (8,1) respectively and variances the (σ^2)'s all equals to 2.

The algorithm actually converge to values close to the 'μ' of the random normal distributions used to generate the points.

I'll show some code for generating cluster from Normal distributions. This array contains the four sources of random data:

 let rnd = System.Random()
            let generators = [|
              {XGauss=new Normal(1.0,2.0);YGauss= new Normal(1.0,2.0)};   
              {XGauss=new Normal(
1.0,2.0);YGauss= new Normal(8.0,2.0)};
              {XGauss=new Normal(8.0,2.0);YGauss= new Normal(8.0,2.0)};
              {XGauss=new Normal(8.0,2.0);YGauss= new Normal(1.0,2.0)}
            |]



To experiment if the algorithm is affected by choosing different initial points the strategy is to take four generators:

            let sampleFirsts = [
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[1].XGauss.Sample();Y=generators.[1].YGauss.Sample()};
                {X=generators.[2].XGauss.Sample();Y=generators.[2].YGauss.Sample()};
                {X=generators.[3].XGauss.Sample();Y=generators.[3].YGauss.Sample()}
            ]
              

I can change the configuration of the sources simply changing their index. For example in the following case the initial centroids are taken all from the first source:

            let sampleFirsts = [
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()}


To sample on thousand of points from the different sources (and the first four according to the idea of having initial centroids, so using the "SampleFirsts"):

            let sample = sampleFirsts @ (List.map (fun x -> {X=x.XGauss.Sample();Y=x.YGauss.Sample()}) (List.map (fun x -> generators.[x % 4 ]) [0..995]))


            let initialCentroids = [sample.[0]; sample.[1]; sample.[2];sample.[3]]

            let finalCentroids = iterateFindingFinalCentroids sample initialCentroids



Actually the code returns the centroids (and not the segments), and neither the "variability" (dispersion) of the segments, but they can be calculated later:

            let segment0 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[0then item::acc else acc) [] sample
            let segment1 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[1then item::acc else acc) [] sample
            let segment2 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[2then item::acc else acc) [] sample
            let segment3 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[3then item::acc else acc) [] sample

            let quadraticDist0 = sumOfQuadraticDistances finalCentroids.[0] segment0
            let quadraticDist1 = sumOfQuadraticDistances finalCentroids.[1] segment1
            let quadraticDist2 = sumOfQuadraticDistances finalCentroids.[2] segment2
            let quadraticDist3 = sumOfQuadraticDistances finalCentroids.[3] segment3




The result is that the algorithm is not affected by the different combinations of sources for the initial centroids (that correspond to the first four points of the sample), still converging to the four expected points, i.e. the center of the generators of the Normal (Guassian) bidimensional distributions.


No comments: