January 28, 2012, 8:07 PM
I have a dataset of around 10K records. All of them have detailed attribute information, but only 400 of them have a response. I want to predict the behavior of the rest of the data (10K minus 400) based on the 400 records that have a response. What kind of model should I use? How should I do the data cleansing? Will the results be very unreliable when the valid records available for training are only 4% of the data? I'd really appreciate any response! I've struggled for a week and still can't figure the problem out.
January 29, 2012, 5:53 AM Answerer
What are you trying to predict? Is it a discrete column (e.g. Yes or No)? If not, what is it?
If it is discrete, then try the following algorithms: decision trees, logistic regression, neural networks.
It is impossible to say whether the results will be reliable without knowing your data and trying to create mining models.
I would train multiple models on 70% of the 400 rows and measure accuracy on the remaining 30% of the 400 rows.
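As a rough sketch of that 70/30 holdout check, assuming plain Python with only the standard library: the data below is synthetic stand-in data, not the poster's, and the function names, the two features, and the hand-rolled logistic regression are all illustrative assumptions. It compares one model against a majority-class baseline on the held-out 30% and prints an approximate 95% interval, which gives a feel for how much an accuracy measured on only about 120 rows can wobble.

```python
import math
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) keeping roughly the same class balance."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression; w[0] is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1 if sigmoid(score) >= 0.5 else 0

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def approx_95_interval(acc, n):
    """Normal-approximation interval for an accuracy measured on n rows."""
    half = 1.96 * math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

# Synthetic stand-in for the 400 labeled rows (assumption: 2 numeric features).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [1 if x[0] + x[1] + rng.gauss(0, 0.5) > 0 else 0 for x in X]

train_idx, test_idx = stratified_split(y, test_fraction=0.3, seed=0)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Baseline: always predict the most common class seen in training.
majority = 1 if sum(y_train) * 2 >= len(y_train) else 0
baseline_acc = accuracy([majority] * len(y_test), y_test)

# Candidate model: logistic regression trained on the 70% only.
w = fit_logistic(X_train, y_train)
logit_acc = accuracy([predict(w, xi) for xi in X_test], y_test)

print("held-out rows:", len(y_test))
print("baseline accuracy:", round(baseline_acc, 3))
print("logistic accuracy:", round(logit_acc, 3),
      "approx. 95% interval:", approx_95_interval(logit_acc, len(y_test)))
```

The same split can be reused for each candidate algorithm so the accuracies are comparable. A model that does not clearly beat the baseline on the held-out rows is unlikely to be useful on the remaining unlabeled records.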
Tatyana Yakushev [PredixionSoftware.com]
- Marked as answer by Tatyana Yakushev (Editor), January 31, 2012, 12:03 AM
January 30, 2012, 4:09 PM
Thank you so much for your input! It is a discrete prediction column. What I did was put all 400 response records (Y) and some of the non-response records from the rest of the 10K (N) into the model (training and testing). That way, some extra information about the N type would be discovered. But it seems I shouldn't have included the 'N' records, since those are the ones I want to predict based on the model. I'll post the results with your method here later. Thanks again!