Data mining with a small valid sample, please help!
-
Saturday, January 28, 2012 8:07 PM
I have a dataset of around 10K records. All of them have detailed information, but only 400 have a response. I want to predict the behavior of the remaining records (10K - 400) based on the 400 records that have a response. What kind of model should I use? How should I do the data cleansing? Will the results be very unreliable when the valid records used for training make up only 4% of the data? I'd really appreciate any response! I've struggled for a week and still can't figure this out.
All Replies
-
Sunday, January 29, 2012 5:53 AM Answerer
What are you trying to predict? Is it a discrete column (e.g., Yes or No)? If not, what is it?
If it is discrete, then try the following algorithms: decision trees, logistic regression, neural networks.
It is impossible to say whether the results will be reliable without knowing your data and trying to create mining models.
I would train multiple models on 70% of the 400 rows and measure accuracy on the remaining 30% of the 400 rows.
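The 70/30 hold-out comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the 400 labeled rows are replaced by synthetic data, and two toy classifiers (`MajorityClass` and `DecisionStump`, both hypothetical names written for this example) stand in for the real decision tree, logistic regression, and neural network models you would build in your mining tool. The split is stratified so that both parts keep a similar Y/N ratio, which matters more when the sample is small.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping each class's share similar in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


class MajorityClass:
    """Baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class DecisionStump:
    """Toy one-level decision tree: a single threshold on a single numeric feature."""

    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                left = [label for row, label in zip(X, y) if row[j] <= t]
                right = [label for row, label in zip(X, y) if row[j] > t]
                if not left or not right:
                    continue
                left_label = Counter(left).most_common(1)[0][0]
                right_label = Counter(right).most_common(1)[0][0]
                correct = left.count(left_label) + right.count(right_label)
                if best is None or correct > best[0]:
                    best = (correct, j, t, left_label, right_label)
        if best is None:
            raise ValueError("no usable split found")
        _, self.feature_, self.threshold_, self.left_, self.right_ = best
        return self

    def predict(self, X):
        return [self.left_ if row[self.feature_] <= self.threshold_ else self.right_
                for row in X]


def accuracy(model, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    predictions = model.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Synthetic stand-in for the 400 labeled rows (assumption: two numeric inputs).
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(400)]
y = ["Y" if row[0] + rng.gauss(0, 0.1) > 0.5 else "N" for row in X]

# Hold out 30% of the labeled rows; train only on the other 70%.
train_idx, test_idx = stratified_split(y, test_fraction=0.3, seed=1)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

models = {"majority baseline": MajorityClass(), "decision stump": DecisionStump()}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy(model, X_test, y_test)
    print(f"{name}: hold-out accuracy = {results[name]:.3f}")
```

Comparing each candidate against the majority-class baseline on the held-out rows gives a rough sense of whether a model has learned anything beyond the class proportions. With only about 120 test rows the accuracy estimate is noisy, so repeating the split with different seeds may give a better picture.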
Tatyana Yakushev [PredixionSoftware.com]
Marked as answer by Tatyana Yakushev (Editor), Tuesday, January 31, 2012 12:03 AM
-
Monday, January 30, 2012 4:09 PM
Thank you so much for your input! It is a discrete prediction column. What I did was put all 400 records (Y) and some of the non-response records from the other 10K (N) into the model (training and testing). That way, some extra information about the N type would be discovered. But it seems I shouldn't include the 'N' records, since those are the ones I want to predict with the model. I'll post the results from your method here later. Thanks again!

