ML.NET: plotting K-means clustering results?

  • Question

  • I'm new to ML, and I'm experimenting with ML.NET in an unsupervised clustering scenario. My starting data are fewer than 30 records with 5 features each in a TSV file, e.g.:

    Label   S1   S2   S3   S4   S5
    alpha   0.274167987321712   0.483359746434231   0.0855784469096672   0.297939778129952   0.0332805071315372
    beta   0.378208470054279   0.405409549510871   0.162317151706584   0.292342604802355   0.0551994848048085
    ...

    I started from the iris tutorial at https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/iris-clustering, a sample of K-means clustering. In my case, I have 5 features for each record, and I want 3 clusters. Once the model is created, I'd like to use it to add the clustering data to each record in a copy of the original file, so I can examine them and plot scatter graphs.

    I started with this training code (where `MyModel` is the POCO class representing a record, with properties for S1-S5):

    MLContext mlContext = new MLContext(seed: 0);
    IDataView dataView = mlContext.Data.LoadFromTextFile<MyModel>
        (dataPath, hasHeader: true, separatorChar: '\t');
    
    const string featuresColumnName = "Features";
    EstimatorChain<ClusteringPredictionTransformer<KMeansModelParameters>>
        pipeline = mlContext.Transforms
        .Concatenate(featuresColumnName, "S1", "S2", "S3", "S4", "S5")
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName,
        numberOfClusters: 3));
    
    TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>
        model = pipeline.Fit(dataView);
    
    using (FileStream fileStream = new FileStream(modelPath,
        FileMode.Create, FileAccess.Write, FileShare.Write))
    {
        mlContext.Model.Save(model, dataView.Schema, fileStream);
    }
    

    Then, I load the saved model, read every record from the original data, and get its cluster ID. This sounds a bit convoluted, and Scikit-like solutions would probably be easier, but my goal here is self-learning: I want to inspect the results before playing with them. The results should be saved in a new file, together with the centroid coordinates and the point coordinates.
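
    The loading step described above could look roughly like the following. This is only a sketch under assumptions: the column indexes in `MyModel` are guessed from the TSV header shown earlier, `MyPrediction` mirrors the prediction class of the iris tutorial (`PredictedLabel` and `Score` columns), and `ModelLoader` / `LoadPredictor` are hypothetical names, not part of ML.NET.

    ```csharp
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // One row of the TSV file. Column order is assumed from the header:
    // Label, S1, S2, S3, S4, S5.
    public class MyModel
    {
        [LoadColumn(0)] public string Label { get; set; }
        [LoadColumn(1)] public float S1 { get; set; }
        [LoadColumn(2)] public float S2 { get; set; }
        [LoadColumn(3)] public float S3 { get; set; }
        [LoadColumn(4)] public float S4 { get; set; }
        [LoadColumn(5)] public float S5 { get; set; }
    }

    // Output of the K-means model for one record (same shape as in the iris tutorial).
    public class MyPrediction
    {
        // ID of the cluster assigned to the record.
        [ColumnName("PredictedLabel")]
        public uint PredictedClusterId;

        // One value per cluster: the distance from the record to that cluster's centroid.
        [ColumnName("Score")]
        public float[] Distances;
    }

    // Hypothetical helper: loads the saved model and builds a predictor for single records.
    public static class ModelLoader
    {
        public static PredictionEngine<MyModel, MyPrediction> LoadPredictor(
            MLContext mlContext, string modelPath, out ITransformer model)
        {
            using (FileStream stream = new FileStream(modelPath,
                FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                // inputSchema describes the columns the model was trained on; unused here.
                model = mlContext.Model.Load(stream, out DataViewSchema inputSchema);
            }
            return mlContext.Model.CreatePredictionEngine<MyModel, MyPrediction>(model);
        }
    }
    ```

    With this, `predictor.Predict(record)` returns a `MyPrediction` for each record, and `model` stays available for extracting the centroids.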

    AFAIK, this API is not transparent enough to give easy access to the centroids; I found only this rather old post, and its code no longer compiles. I used it instead as a hint to recover the data via reflection, but this is a hack.

    Also, I'm not sure about the details of the data provided by the framework. I can see that there are 3 centroid vectors (named `cx` `cy` `cz` in the sample code), each with 5 elements (the 5 features, in their concatenated input order, I presume, i.e. from S1 to S5); also, each prediction provides 3 distances (`dx` `dy` `dz`). With these assumptions, I could assign a cluster ID to each record like this:

    // for each record in the original data
    foreach (MyModel record in csvReader.GetRecords<MyModel>())
    {
        // get its cluster ID
        MyPrediction prediction = predictor.Predict(record);
    
        // get the centroids just once, as of course they are the same
        // for all the records referring their distances to them
        if (cx == null)
        {
            // get centroids (via reflection...):
            // https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/Clustering/KMeansWithOptions.cs#L49
            // https://social.msdn.microsoft.com/Forums/azure/en-US/c09171c0-d9c8-4426-83a9-36ed72a32fe7/kmeans-output-centroids-and-cluster-size?forum=MachineLearning
            VBuffer<float>[] centroids = default;
            var last = ((TransformerChain<ITransformer>)model)
                .LastTransformer;
            KMeansModelParameters kparams = (KMeansModelParameters)
                last.GetType().GetProperty("Model").GetValue(last);
            kparams.GetClusterCentroids(ref centroids, out int k);
            cx = centroids[0].GetValues().ToArray();
            cy = centroids[1].GetValues().ToArray();
            cz = centroids[2].GetValues().ToArray();
        }
    
        float dx = prediction.Distances[0];
        float dy = prediction.Distances[1];
        float dz = prediction.Distances[2];
        // ... calculate and save full details for the record ...
    }
    

    If these assumptions are correct, I suppose I can get all the details about each record position in the following way:

    - `dx`, `dy`, `dz` are the distances.

    - `cx[0]` `cy[0]` `cz[0]` + the distances (`dx`, `dy`, and `dz` respectively) should be the position of the S1 point; `cx[1]` `cy[1]` `cz[1]` + the distances should be the position of S2; and so forth up to S5 (`cx[4]` etc.).

    In this case, I could plot these data in a 3D scatter graph. Yet, given that the primary purpose of this API seems to be providing a prediction for a single record, rather than the full details of the clustering results, I'm not sure about these assumptions; maybe I'm just getting things wrong. Could anyone point me in the right direction?
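
    One way to test these assumptions is to recompute the distances by hand and compare them with `dx`, `dy`, `dz`. The sketch below is plain C# and assumes that each centroid is a point in the same 5-feature space as the records, and that each entry of `Distances` refers to one centroid. Whether the framework reports plain or squared Euclidean distances is not confirmed here, so both should be compared. `ClusterCheck` and its methods are hypothetical names.

    ```csharp
    using System;

    public static class ClusterCheck
    {
        // Squared Euclidean distance between a record's features and one centroid.
        public static double SquaredDistance(float[] point, float[] centroid)
        {
            if (point.Length != centroid.Length)
                throw new ArgumentException("Point and centroid must have the same length.");

            double sum = 0;
            for (int i = 0; i < point.Length; i++)
            {
                double diff = point[i] - centroid[i];
                sum += diff * diff;
            }
            return sum;
        }

        // Squared distances from one record to each centroid, in centroid order.
        public static double[] DistancesTo(float[] point, params float[][] centroids)
        {
            double[] result = new double[centroids.Length];
            for (int i = 0; i < centroids.Length; i++)
                result[i] = SquaredDistance(point, centroids[i]);
            return result;
        }

        // 0-based index of the smallest distance, i.e. the nearest centroid.
        public static int NearestIndex(double[] distances)
        {
            int best = 0;
            for (int i = 1; i < distances.Length; i++)
            {
                if (distances[i] < distances[best]) best = i;
            }
            return best;
        }
    }
    ```

    For a record, this would be called as `ClusterCheck.DistancesTo(new[] { record.S1, record.S2, record.S3, record.S4, record.S5 }, cx, cy, cz)`. If the results (or their square roots) match `dx`, `dy`, `dz`, then each distance is a single number saying how far the record lies from one centroid, and the record's own position is just its S1 to S5 values. Plotting in 2D or 3D would then mean picking two or three features, or projecting the 5 features down, rather than adding distances to centroid coordinates. The cluster ID returned by the model appears to be 1-based, so it would correspond to `NearestIndex(...) + 1`; this is an assumption worth verifying on a few records.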

    Saturday, August 10, 2019 10:03 PM

All replies

  • Hi Naftis,

    Thank you for reaching out. I am sorry, but this forum is not the right place for ML.NET questions. Please check the following resources:

    Gitter forum for ML.NET: https://gitter.im/dotnet/mlnet?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

    GitHub forum for .NET framework, you can open a new issue there: https://github.com/dotnet/docs/issues?page=3&q=is%3Aissue+is%3Aopen

    GitHub forum for ML.NET samples, where you can open a new issue and also find different samples: https://github.com/dotnet/machinelearning-samples

    I hope you can find what you need. Thank you for your understanding.

    Regards,

    Yutong

    Monday, August 12, 2019 5:06 PM
    Moderator