Wednesday, 04 April 2012 16:12
Help needed to understand how to control the proportion of
predicted examples in binomial classes: I have a dataset with a binary class label:
either 1 or 0. My training set has 20000 records, of which
around 2600 have class label 1; the rest are all class 0. I also have
a test set with around 5000 records, but only around 100 of them belong to
class 1. However, when I run predictions with different algorithms
(logistic regression, linear regression, Support Vector Machine, neural net…), I
find it very difficult to control the proportion of predicted examples, in
this case the records predicted as class 1. Ideally I would like to
control the mining process so that only around 3%-5% of the whole dataset
is predicted as class 1 and around 95% of records are classified as class 0. Any
ideas? Many thanks.
Wednesday, 04 April 2012 18:32 Answerer
What tool are you using? Microsoft Data Mining does not have an implementation of Support Vector Machine.
In Microsoft Data Mining, when you run a query, you can use the PredictProbability function. More details here: http://msdn.microsoft.com/en-us/library/ms131988.aspx
With that function you can predict the probability of a record belonging to class 0 and then predict one class or the other depending on the probability value. (If you had more than 2 states, you could use the PredictHistogram function to decide what value to predict.)
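One way to turn predicted probabilities into a fixed share of class 1 predictions is to rank the records by their probability of class 1 and label only the top few percent. The Python sketch below illustrates the idea outside of DMX; the function name `label_top_fraction`, the 4% target, and the randomly generated scores are assumptions for illustration. In practice the scores would be the values returned by something like PredictProbability.

```python
import random


def label_top_fraction(probabilities, fraction):
    """Label as class 1 the `fraction` of records with the highest
    predicted probability of class 1; label everything else class 0."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_positive = round(len(probabilities) * fraction)
    # Rank record indices from most to least likely to be class 1.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: probabilities[i],
        reverse=True,
    )
    labels = [0] * len(probabilities)
    for i in ranked[:n_positive]:
        labels[i] = 1
    return labels


# Hypothetical scores standing in for model output on a 5000-record test set.
random.seed(0)
scores = [random.random() ** 3 for _ in range(5000)]

labels = label_top_fraction(scores, 0.04)
print(sum(labels), "of", len(labels), "records predicted as class 1")
```

Because the cut-off is chosen by rank rather than by a fixed probability value, the share of class 1 predictions stays at the requested level regardless of how the model's probabilities are scaled.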
Note that when one of the states has very low probability, the model generated by default might not be good. To get a more accurate model, you might train it on oversampled data (where both states have similar probability). If you use the Data Mining Add-ins for Excel, there is an easy-to-use wizard that allows you to perform oversampling.
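For readers working outside the Excel wizard, the oversampling idea can be sketched in Python as follows. This is plain random oversampling with replacement, which is one common approach and not necessarily what the wizard does internally; the function name `oversample_minority` and the toy records are assumptions for illustration. The class counts mirror the training set described in the question.

```python
import random


def oversample_minority(records, labels, seed=0):
    """Return a copy of the data in which every class has as many
    records as the largest class, by resampling with replacement."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    target = max(len(group) for group in by_class.values())

    out_records, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies at random until this class reaches the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for record in group + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels


# Toy training set: 20000 records, 2600 of them in class 1.
train_labels = [1] * 2600 + [0] * 17400
train_records = list(range(len(train_labels)))

balanced_records, balanced_labels = oversample_minority(train_records, train_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Only the training data should be resampled this way; the test set keeps its original class proportions. A model trained on balanced data will tend to report higher class 1 probabilities than the true base rate, so a rank-based or tuned cut-off is usually still needed when scoring.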
Tatyana Yakushev [PredixionSoftware.com]
- Marked as answer by Tatyana Yakushev (Editor), Thursday, 05 April 2012 17:19
Thursday, 05 April 2012 09:02
Many thanks for the suggestions, I will dig into PredictProbability a bit more. I am actually using both RapidMiner and Microsoft Data Mining to run different algorithms; for SVM I used RapidMiner.