General discussion: How to rearrange clusters? Manually or programmatically?

  • Thursday, 2 July 2009 08:31, Guennadiy Vanine
     
    ideo.
    And having watched it, I am trying to configure the DM models in order to use predictions (using the Microsoft Clustering algorithm).

    While creating the model, I see this warning or "help" message:
    "Input data will be randomly split into two sets, a training set and a testing set,
    based on the percentage of data for testing and maximum number of cases in testing data set you provide.
    The training set is used to create the mining model. The testing set is used to check model accuracy."


    This is very nice!
    Is there any way to switch off the randomness and split it manually?

    I would also like to know whether it is possible to define how clusters are created, manually or programmatically,
    or to rearrange existing clusters.

    PS
    Added later.
    I cannot stay silent on this.
    My client saw the Excel 2007 add-in "Exception highlighting" video.
    Having watched and listened to it, he insists that the Microsoft Clustering algorithm arranges clusters according to probabilities,
    i.e. that it creates clusters made up of the exceptions (anomalies or outliers).
    And he wants to have such clusters...

    So, is it possible to satisfy such a wish,
    that is, to move exceptions into separate cluster(s)?

    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин

All replies

  • Thursday, 2 July 2009 09:14, Allan Mitchell (MVP)
     
    Hi

    To create your own testing and training sets, you can use SSIS (or any other method you choose) to split the original dataset in two. The premise still holds, though, that both sets of data should be representative of the whole.
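
    As a rough illustration of a manual, non-random split (this is plain Python, not SSIS; the function names, the `CustomerKey` field and the sample rows are all hypothetical), each case can be assigned to a set based on a hash of its key. The same case then always lands in the same set, however often the split is repeated:

```python
import hashlib


def assign_partition(case_key, test_percent=30):
    """Return 'test' or 'train' for a case, based only on its key."""
    digest = hashlib.md5(str(case_key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "test" if bucket < test_percent else "train"


def split_cases(rows, key_field, test_percent=30):
    """Split rows into (train, test) lists without any randomness."""
    train, test = [], []
    for row in rows:
        if assign_partition(row[key_field], test_percent) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test


# Hypothetical sample data: 1000 cases keyed by CustomerKey.
rows = [{"CustomerKey": k, "Age": 20 + k % 50} for k in range(1, 1001)]
train, test = split_cases(rows, "CustomerKey", test_percent=30)
print(len(train), len(test))
```

    A hash-based split only approximates the requested percentage, and it does not by itself guarantee that both sets are representative of the whole, so the distributions of the important columns are still worth checking.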

    From this page: http://technet.microsoft.com/en-us/library/ms131977.aspx

    Using the wizard, you will by default get a 70/30 split. You could change that to 100/0.

    Using DMX, you have to specify WITH HOLDOUT (<option>) manually.
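
    To make the WITH HOLDOUT option more concrete, here is a small Python sketch that only assembles the text of a DMX statement. The structure name, the columns and the helper `holdout_clause` are hypothetical, and the clause forms (`n PERCENT`, `n CASES`, `REPEATABLE(seed)`) are written as I recall them from the DMX documentation for CREATE MINING STRUCTURE, so please check them against your server version:

```python
def holdout_clause(percent=None, max_cases=None, seed=None):
    """Build a DMX holdout clause as text (hypothetical helper)."""
    parts = []
    if percent is not None:
        parts.append(f"{percent} PERCENT")
    if max_cases is not None:
        parts.append(f"{max_cases} CASES")
    if not parts:
        raise ValueError("specify percent, max_cases, or both")
    clause = "WITH HOLDOUT (" + " OR ".join(parts) + ")"
    if seed is not None:
        # Assumed meaning: a fixed seed so the partition can be reproduced.
        clause += f" REPEATABLE({seed})"
    return clause


# Hypothetical structure and columns, for illustration only.
statement = (
    "CREATE MINING STRUCTURE [Customers] (\n"
    "    [CustomerKey] LONG KEY,\n"
    "    [Age] LONG CONTINUOUS\n"
    ")\n" + holdout_clause(percent=30, max_cases=1000, seed=42)
)
print(statement)
```

    The generated text would still have to be sent to Analysis Services with whatever client you normally use; nothing here connects to a server.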


    You can programmatically (API and DMX) specify the algorithm parameter values.


  • Sunday, 12 July 2009 15:46, Vladimir Cupal
     
    Regarding the second part of the question and the PS: although you can control the behaviour of the Microsoft Clustering algorithm to some extent, there are limitations, and you are probably close to them now. As far as I know, the Microsoft Clustering algorithm does not let you define exactly how clusters are created (for example, their exact centers) or how the final results are stored in the node structure. To build a clustering model entirely to your wishes, I would recommend writing your own clustering plug-in algorithm. Writing the code for your own algorithm may complicate things at first, but you will then be completely in charge of all the issues you mentioned.
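
    To give an idea of the kind of logic a custom algorithm could apply, here is a minimal Python sketch. It is not the Microsoft Clustering algorithm and not the plug-in API; the function name, the centers, the points and the distance threshold are all made up for illustration. Cluster centers are fixed by hand, each case goes to its nearest center, and any case farther than a chosen distance from every center is moved into a separate "exceptions" cluster:

```python
import math


def assign_with_exceptions(points, centers, max_distance):
    """Assign points to manually defined centers; far points become exceptions."""
    clusters = {i: [] for i in range(len(centers))}
    clusters["exceptions"] = []
    for point in points:
        distances = [math.dist(point, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        if distances[nearest] > max_distance:
            clusters["exceptions"].append(point)  # too far from every center
        else:
            clusters[nearest].append(point)
    return clusters


# Hypothetical, manually chosen centers and sample cases.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.5, 10.1), (0.1, -0.3), (5.0, 5.0), (30.0, -4.0)]
clusters = assign_with_exceptions(points, centers, max_distance=2.0)
print(clusters)
```

    This sketch uses plain Euclidean distance rather than probabilities, so it only mimics the idea of separating outliers; a probability-based version would need a model of each cluster's distribution.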

    Best regards
    Vladimir Cupal