# Data mining with a small valid sample, please help!

### Question

• I have a dataset of around 10K records. All of them have detailed information, but only 400 have a response. I want to predict the behavior of the remaining records (10K - 400) based on the 400 records that have a response. What kind of model should I use? How should I do the data cleansing? Will the results be very unreliable when the valid records used for training are only 4% of the data? I'd really appreciate any response! I've struggled for a week and still can't figure this out.
January 28, 2012 20:07

### Answer

• What are you trying to predict? Is it a discrete column (e.g. Yes or No)? If not, what is it?

If it is discrete, then try the following algorithms: decision trees, logistic regression, neural networks.

It is impossible to say whether the results will be reliable without knowing your data and trying to create mining models.

I would train multiple models on 70% of the 400 rows and measure accuracy on the remaining 30% of the 400 rows.

Tatyana Yakushev [PredixionSoftware.com]
January 29, 2012 5:53
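The 70/30 evaluation suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper names (`split_70_30`, `fit_majority`, `fit_stump`, `accuracy`) are made up for this example, and the two models are simple stand-ins (a majority-class baseline and a depth-1 decision tree) rather than the full algorithms named in the answer.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle the labeled rows and split them 70% train / 30% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Baseline: always predict the most common label in the training rows."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_stump(train):
    """Depth-1 decision tree: keep the (feature, threshold, orientation)
    with the highest accuracy on the training rows."""
    best = None
    for f in range(len(train[0][0])):
        for threshold in sorted({x[f] for x, _ in train}):
            for above, below in (("Y", "N"), ("N", "Y")):
                correct = sum(
                    (above if x[f] >= threshold else below) == y
                    for x, y in train
                )
                if best is None or correct > best[0]:
                    best = (correct, f, threshold, above, below)
    _, f, threshold, above, below = best
    return lambda x: above if x[f] >= threshold else below

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Synthetic stand-in for the 400 labeled rows (an assumption, not real data):
# the label depends on the first feature, with about 10% of labels flipped.
rng = random.Random(42)
labeled = []
for _ in range(400):
    x = (rng.random(), rng.random())
    y = "Y" if x[0] > 0.5 else "N"
    if rng.random() < 0.1:
        y = "N" if y == "Y" else "Y"
    labeled.append((x, y))

train, test = split_70_30(labeled)
candidates = {"majority": fit_majority, "stump": fit_stump}
# Each model is fitted on the 280 training rows only and scored on the 120 held-out rows.
results = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
print(results)
```

With only 120 held-out rows, the accuracy estimate is itself noisy, so it may be worth repeating the split with several seeds before preferring one model over another.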

### All replies

• Thank you so much for your input! It is a discrete prediction column. What I did was put all 400 records (Y) and some of the non-response records from the other 10K (N) into the model (training and testing). That way some extra information about the N type would be discovered. But it seems I shouldn't include the 'N' records, since I want to predict them based on the model. I'll post the results of your method here later. Thanks again!

January 30, 2012 16:09
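The point raised in this reply, keeping records without a response out of training and only scoring them afterwards, might look like the sketch below. The record layout, the field names (`features`, `response`), and the use of a 1-nearest-neighbor rule are all assumptions chosen to keep the example short; they are not taken from the thread. It also assumes the labeled records contain both Y and N responses.

```python
import math

# Hypothetical records: "response" is None where no response was observed.
records = [
    {"id": 1, "features": (0.9, 0.2), "response": "Y"},
    {"id": 2, "features": (0.8, 0.4), "response": "Y"},
    {"id": 3, "features": (0.1, 0.7), "response": "N"},
    {"id": 4, "features": (0.2, 0.9), "response": "N"},
    {"id": 5, "features": (0.85, 0.3), "response": None},
    {"id": 6, "features": (0.15, 0.8), "response": None},
]

# Only rows with an observed response are used for training.
labeled = [r for r in records if r["response"] is not None]
# Rows without a response are never given a made-up label; they are only scored.
unlabeled = [r for r in records if r["response"] is None]

def predict_1nn(features, training_rows):
    """Return the response of the closest labeled row (Euclidean distance)."""
    nearest = min(training_rows, key=lambda r: math.dist(features, r["features"]))
    return nearest["response"]

predictions = {r["id"]: predict_1nn(r["features"], labeled) for r in unlabeled}
print(predictions)
```

Labeling the non-response rows as N before training would teach the model that "no response recorded" means N, which is exactly the quantity it is supposed to predict.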