How to control the proportion of predicted examples


  • 04 April 2012 16:12
     
     


    I need help understanding how to control the proportion of predicted
    examples in a binomial classification problem. I have a database with a
    binomial class label: either 1 or 0. My training set has 20000 records, of
    which around 2600 have the class label 1; all the others are class 0. My
    test set has around 5000 records, but only around 100 of them belong to
    class 1. However, when I run predictions with different algorithms
    (logistic regression, linear regression, Support Vector Machine, Neural
    Net…), I find it very difficult to control the proportion of records that
    are predicted as class 1. Ideally, I would like to control the mining
    process so that only around 3% - 5% of the whole dataset is predicted as
    class 1, while around 95% of the records stay classified as class 0. Any
    ideas? Many thanks.

    stephen


All Replies

  • 04 April 2012 18:32
    Answerer
     
     Answer

    What tool are you using? Microsoft Data Mining does not include an implementation of Support Vector Machine.

    In Microsoft Data Mining, when you run a query, you can use the PredictProbability function. More details are here: http://msdn.microsoft.com/en-us/library/ms131988.aspx

    With that function you can predict the probability of a record belonging to class 0 and then predict one class or the other depending on the probability value, for example by comparing it against a cutoff that you choose. (If you had more than 2 states, you could use the PredictHistogram function to decide which value to predict.)
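
To illustrate the idea outside any particular tool, here is a minimal Python sketch, assuming the predicted class-1 probabilities have already been exported from the model as a plain list. The function names `threshold_for_rate` and `classify` and the sample scores are hypothetical, not part of Microsoft Data Mining or RapidMiner. It picks the cutoff by rank, so that roughly the requested share of records ends up in class 1.

```python
def threshold_for_rate(probabilities, target_rate):
    """Return a cutoff such that about target_rate of the records score at or above it."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")
    ranked = sorted(probabilities, reverse=True)
    # Number of records we want to label as class 1 (at least one).
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]


def classify(probabilities, threshold):
    """Label a record 1 when its class-1 probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical class-1 probabilities for 100 records: 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
cutoff = threshold_for_rate(scores, 0.04)  # aim for about 4% predicted as class 1
labels = classify(scores, cutoff)
print(cutoff, sum(labels), len(labels))
```

If several records tie exactly at the cutoff, slightly more than the target share will be labelled 1. The cutoff is best chosen on held-out data and then reused when scoring new records.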

    Note that when one of the states has a very low probability, the model generated by default might not be good. To get a more accurate model, you might train it on oversampled data (where both states have similar probability). If you use the Data Mining Add-ins for Excel, there is an easy-to-use wizard that lets you perform oversampling.
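
For readers working outside the Excel add-in, random oversampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `oversample_minority` is hypothetical, the records are synthetic tuples sized like the training set in the question (about 2600 of 20000 in class 1), and the wizard mentioned above may balance the data differently.

```python
import random
from collections import Counter


def oversample_minority(records, label_of, seed=0):
    """Duplicate randomly chosen records of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=largest - len(group)))
    rng.shuffle(balanced)
    return balanced


# Synthetic (record_id, label) pairs: 2600 of class 1, 17400 of class 0.
training = [(i, 1) for i in range(2600)] + [(i, 0) for i in range(2600, 20000)]
balanced = oversample_minority(training, label_of=lambda record: record[1])
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])
```

Only the training data should be oversampled, never the test set. A model trained on balanced data tends to report inflated class-1 probabilities on the real, imbalanced data, so a fixed cutoff such as 0.5 will usually predict far more than 3% - 5% of records as class 1.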


    Tatyana Yakushev [PredixionSoftware.com]

  • 05 April 2012 9:02
     
     

    Hi Tatyana,

       Many thanks for the suggestions, I will dig into PredictProbability a bit more. I am actually using both RapidMiner and Microsoft Data Mining to build models with different algorithms; for SVM I used RapidMiner.

    stephen