Data mining with a small valid sample, please help!

    Question

  • I have a dataset of around 10K records. All of them have detailed info, but only 400 have a response. I want to predict the behavior of the remaining records (10K minus 400) based on the 400 records that have a response. What kind of model should I use? How should I do the data cleansing? Will the results be unreliable when the valid records used for training are only 4% of the data? I'd really appreciate any response! I've struggled for a week and still can't figure this out.
    Saturday, January 28, 2012 8:07 PM

Answers

  • What are you trying to predict? Is it a discrete column (e.g. Yes or No)? If not, what is it?

    If it is discrete, try the following algorithms: decision trees, logistic regression, neural networks.

    It is impossible to say whether the results will be reliable without knowing your data and trying to build mining models.

    I would train multiple models on 70% of the 400 rows and measure accuracy on the remaining 30% of the 400 rows.


    Tatyana Yakushev [PredixionSoftware.com]
    Sunday, January 29, 2012 5:53 AM
    Answerer
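
The 70/30 holdout suggested above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the answerer's actual setup: the 400 rows are synthetic placeholders, the two "models" (a majority-class baseline and a one-split decision stump) stand in for the decision trees, logistic regression, or neural networks you would really try, and every function name here is made up for the example. The rough interval at the end is the usual normal approximation and only hints at how noisy an accuracy measured on about 120 rows can be.

```python
import math
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping class shares similar in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return train_idx, test_idx

def fit_majority(X, y):
    """Baseline model: always predict the most common training label."""
    top = max(set(y), key=y.count)
    return lambda row: top

def fit_stump(X, y):
    """One-split decision tree: best single feature/threshold on the training rows."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for above, below in (("Y", "N"), ("N", "Y")):
                hits = sum((above if row[f] > t else below) == label
                           for row, label in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, f, t, above, below)
    _, f, t, above, below = best
    return lambda row: above if row[f] > t else below

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def rough_interval(acc, n):
    """Approximate 95% interval for an accuracy measured on n rows."""
    half = 1.96 * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

# Synthetic stand-in for the 400 labeled rows (NOT real data).
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(400)]
y = ["Y" if row[0] + rng.gauss(0, 0.1) > 0.5 else "N" for row in X]

train_idx, test_idx = stratified_split(y, test_fraction=0.3, seed=0)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

results = {}
for name, fit in (("majority baseline", fit_majority),
                  ("decision stump", fit_stump)):
    model = fit(X_train, y_train)
    acc = accuracy(model, X_test, y_test)
    results[name] = acc
    low, high = rough_interval(acc, len(y_test))
    print(f"{name}: accuracy={acc:.3f} (roughly {low:.3f} to {high:.3f})")
```

Comparing each candidate against the majority baseline on the same held-out rows is a cheap sanity check: a model that cannot beat "always predict the most common class" is unlikely to be useful on the remaining 9,600 rows.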

All replies

  • Thank you so much for your input! It is a discrete prediction column. What I did was put all 400 response records (Y) and some of the non-response records from the rest of the 10K (N) into the model (training and testing), so that some extra info for the N type would be discovered. But it seems I shouldn't include the 'N' records, since those are the ones I want to predict based on the model. I'll post the results from your method here later. Thanks again!

    Monday, January 30, 2012 4:09 PM
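
The point about not mixing the unlabeled rows into training could be sketched like this. It is only an illustration under assumptions: the field names `id`, `score`, and `response` are hypothetical, the handful of records are made up, the 400 responded rows are assumed to contain both outcomes, and the nearest-class-mean rule is a placeholder for whatever model is actually chosen.

```python
# Hypothetical records: `response` is None where no outcome was recorded.
records = [
    {"id": 1, "score": 0.9, "response": "Y"},
    {"id": 2, "score": 0.8, "response": "Y"},
    {"id": 3, "score": 0.2, "response": "N"},
    {"id": 4, "score": 0.3, "response": "N"},
    {"id": 5, "score": 0.7, "response": None},
    {"id": 6, "score": 0.1, "response": None},
]

# Rows with a known outcome are used for training and testing;
# rows without one are only ever scored, never given a made-up label.
labeled = [r for r in records if r["response"] is not None]
unlabeled = [r for r in records if r["response"] is None]

def train(rows):
    """Placeholder model: predict the class whose mean `score` is closest."""
    sums, counts = {}, {}
    for r in rows:
        sums[r["response"]] = sums.get(r["response"], 0.0) + r["score"]
        counts[r["response"]] = counts.get(r["response"], 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda r: min(means, key=lambda label: abs(r["score"] - means[label]))

model = train(labeled)
predictions = {r["id"]: model(r) for r in unlabeled}
print(predictions)
```

Keeping the two groups separate means the accuracy measured on the labeled rows is not distorted by rows whose true outcome is unknown.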