Data mining with a small valid sample, please help!


  • Saturday, January 28, 2012 20:07
     
     
    I have a dataset of around 10K records. All of them have detailed information, but only 400 have a response. I want to predict the behavior of the remaining records (10K minus 400) based on the 400 records that have one. What kind of model should I use? How should I do the data cleansing? Will the results be unreliable when the valid records used for training are only 4% of the data? I'd really appreciate any response! I've struggled for a week and still can't figure the problem out.

All replies

  • Sunday, January 29, 2012 5:53
    Answerer
     
     Answered

    What are you trying to predict? Is it a discrete column (e.g., Yes or No)? If not, what is it?

    If it is discrete, try the following algorithms: decision trees, logistic regression, neural networks.

    It is impossible to say whether the results will be reliable without knowing your data and actually building mining models.

    I would train multiple models on 70% of the 400 rows and measure accuracy on the remaining 30% of the 400 rows.
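
The 70/30 holdout idea can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the tooling used in this thread: the 400 rows are synthetic, and the helper names (`stratified_split`, `MajorityClass`, `OneRule`, `holdout_accuracy`, `accuracy_margin`) are hypothetical. The two toy classifiers stand in for real algorithms such as decision trees or logistic regression. The margin uses the normal approximation for a proportion, which gives a rough sense of how much an accuracy measured on about 120 rows can be trusted.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping each class's share similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


class MajorityClass:
    """Baseline: always predict the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class OneRule:
    """Toy model: predict the majority label seen for each value of one column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y):
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[self.column]][label] += 1
        self.rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        self.default = Counter(y).most_common(1)[0][0]  # for unseen values
        return self

    def predict(self, X):
        return [self.rule.get(row[self.column], self.default) for row in X]


def holdout_accuracy(model, X, y, train_idx, test_idx):
    """Fit on the training rows only, then score on the held-out rows."""
    model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    preds = model.predict([X[i] for i in test_idx])
    hits = sum(p == y[i] for p, i in zip(preds, test_idx))
    return hits / len(test_idx)


def accuracy_margin(acc, n, z=1.96):
    """Approximate 95% half-width for an accuracy measured on n rows."""
    return z * (acc * (1 - acc) / n) ** 0.5


# Synthetic stand-in for the 400 labeled rows (assumption: made-up data).
rng = random.Random(42)
X, y = [], []
for _ in range(400):
    a = rng.choice(["low", "high"])   # informative column
    b = rng.choice(["x", "y", "z"])   # noise column
    p_yes = 0.8 if a == "high" else 0.2
    X.append((a, b))
    y.append("Yes" if rng.random() < p_yes else "No")

train_idx, test_idx = stratified_split(y, test_fraction=0.3, seed=1)
models = {
    "majority baseline": MajorityClass(),
    "one-rule on column 0": OneRule(0),
    "one-rule on column 1": OneRule(1),
}
results = {}
for name, model in models.items():
    acc = holdout_accuracy(model, X, y, train_idx, test_idx)
    results[name] = acc
    margin = accuracy_margin(acc, len(test_idx))
    print(f"{name}: accuracy {acc:.3f} +/- {margin:.3f}")
```

Comparing every model against the majority baseline shows whether it learned anything beyond the class proportions. With about 120 held-out rows, a measured accuracy of 80% carries a margin of roughly 7 percentage points either way, so small differences between models should not be over-interpreted.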


    Tatyana Yakushev [PredixionSoftware.com]
  • Monday, January 30, 2012 16:09
     
     

    Thank you so much for your input! It is a discrete prediction column. What I did was put all 400 response records (Y) and some of the non-response records from the rest of the 10K (N) into the model (for training and testing). That way some extra information about the N type would be discovered. But it seems I shouldn't have included the 'N' records, since those are the ones I want to predict based on the model. I'll post the results with your method here later. Thanks again!