Classification with Decision Trees & Neural Network

  • Thursday, July 12, 2012 2:11 PM
     
     

    Hi everyone,

    I have some questions about classification with the Microsoft Neural Network and Microsoft Decision Trees algorithms (and maybe Naive Bayes). My problem is the following:

    I want to classify data from a database into categories, but the categories differ from one database to another. Of course, I want the classification to be as accurate as possible in every database ;-). The structure (i.e. the input attributes for the data mining models) is the same in every database; only the row values differ. For example, one database has 3 categories and a second one has 5.

    A second aspect is that the databases may contain data that cannot be assigned clearly to any of the existing categories. This leads to the following question:

    My idea is not to classify such data, because their PredictProbability value is comparatively low. Is checking the PredictProbability value a possible (or better, a commonly used) solution? Is there another approach for keeping only the "good" classifications?


    My second idea is to add another category for such data, for example one named Miscellaneous or Leftovers. But so far I haven't seen a classification model being trained with "uncategorisable" data; I've only ever seen data with unique, clear categories. In my case, I'm thinking of something like an "else" path.

    I think that the categories and the input attributes (and their values) are very important for a good classification. Especially with decision trees, wrong classifications can occur when the data or the categories aren't good. Can someone give me some general tips about the structure of input attributes for categorisation (distribution, values, ...)?

    I hope someone can help me :-)

    Thanks in advance

All replies

  • Thursday, July 12, 2012 4:43 PM
    Answerer
     
     

    First, if you are trying to predict 2 types of categories, you just need to create 2 models.

    It is common practice to look at PredictProbability. By default, the algorithm returns the prediction with the highest probability. Very often this is not good enough. A typical example is fraud identification: if something has a 20% probability of being fraud, you most likely want to investigate the case.

    I don't understand your reason for training a model with an artificial category (Miscellaneous, Leftovers).
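
    The fraud example can be sketched in a few lines of Python. This is only an illustration and not DMX or Analysis Services code: the function predict_label, the class names and the 0.20 threshold are assumptions made up for this sketch, and the probabilities stand in for what PredictProbability would return per category.

```python
def predict_label(probabilities, flag_class=None, flag_threshold=None):
    """Return the most probable class, unless a watched class reaches its own threshold."""
    if flag_class is not None and probabilities.get(flag_class, 0.0) >= flag_threshold:
        return flag_class
    # Default behaviour: pick the class with the highest probability.
    return max(probabilities, key=probabilities.get)

# Hypothetical case: 20% probability of fraud, 80% of being legitimate.
case = {"fraud": 0.20, "legitimate": 0.80}

default_label = predict_label(case)
flagged_label = predict_label(case, flag_class="fraud", flag_threshold=0.20)
print(default_label, flagged_label)
```

    With the default rule the case is labelled legitimate; with the class-specific threshold it is flagged for investigation although fraud is not the most probable class.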

    P.S. What is your native language?


    Tatyana Yakushev [PredixionSoftware.com]

  • Friday, July 13, 2012 7:42 AM
     
     

    Hi Tatyana, thanks for your fast reply.

    Yes, for 2 types of categories I need two models. But the databases have the same structure (input columns). In my case I have something like this:

    Input attributes: e.g. in one database there are two categories:

    Weight    Type    Height    Width    Category
      100        1         1,5         2,0            A
      200        2         2,3         1,0            B

    And in another database there can be three categories:
     300         5         1,0         4,0            1
     400         6         0,5         2,0            2
     500         7         1,0         3,0            3

    Now, my input attributes are still the same; only their values are different and there are other categories. So I only have to re-train the data mining model, don't I?

    My thinking behind this "artificial" category is the following: Let's assume that the database can contain rows that can't be categorised clearly. For example, let's look again at the data with the two categories A & B, and assume there is now the following input row:

    Weight = 500, Type = 3, Height = 3,0, Width = 3,0     <-- This is an input I can't categorise clearly.

    Now my idea: PredictProbability is comparatively low, so I can exclude this classification. Or, alternatively, I add a third category for such rows. Did you understand my (maybe woolly) thoughts :-)?
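
    The first idea (excluding classifications with a low probability) could look like the following Python sketch. It is not Analysis Services code; classify_or_reject, the label "Unclassified" and the 0.6 cut-off are hypothetical choices for illustration, and the probabilities are invented. Note also that a model is not guaranteed to report a low probability for an unusual row, so the cut-off should be checked against test data.

```python
def classify_or_reject(probabilities, min_probability=0.6, reject_label="Unclassified"):
    """Return (label, probability); reject when the best class is not convincing enough."""
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] < min_probability:
        return reject_label, probabilities[best]
    return best, probabilities[best]

# Invented per-category probabilities for two rows.
clear_row = {"A": 0.92, "B": 0.08}
unclear_row = {"A": 0.55, "B": 0.45}

print(classify_or_reject(clear_row))
print(classify_or_reject(unclear_row))
```

    The rejected rows play the role of the "else" path without a separate category having to be trained.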

    Quote: "...If something has 20% of being fraud...." <--- Is this a case where PredictProbability = 0.8...?

    PS: My native language is German. Is my post written that poorly :-)?

  • Tuesday, July 17, 2012 6:08 AM
     
     

    Hi everyone,

    Is there something incomprehensible in my previous post? I hope somebody can give me some feedback.

    Thanks.

  • Tuesday, July 17, 2012 11:15 AM
     
     

    Hi,

    As Tatyana wrote, you can create different mining models for data from different databases.

    You can create one input data set from different databases, but you have to know whether the data describe the same experiment. For example, if in database "A" the data are divided into three categories and in database "B" into five, I want to ask why. Did you want to divide the data in database "A" into five categories, but there were no examples in the data matching the fourth or fifth category, or do you have different methods of division (a different experiment)?

    Mixing data from different experiments is not a good idea.

    "Artificial category": I understand your description in this way. You have some data to which you have assigned a category:

    Weight    Type    Height    Width    Category
      100        1         1,5         2,0            A
      200        2         2,3         1,0            B 

    and you have some data to which you cannot assign any category (for some reason depending on your experiment):

    Weight = 500, Type = 3, Height = 3,0, Width = 3,0

    So, you can use the data with an assigned category to create a data mining model. Then you can use the created model to predict the category for the data to which you couldn't assign a category.

    Regards,

    gc 

  • Tuesday, July 17, 2012 11:43 AM
     
     

    Hi,

    Hmm... I have one dataset in database X and another in database Y. The databases have the same attributes and their data are disjoint (no row exists in both databases). Only the "user-defined" categories differ. In other words: In database X there's a "rougher" classification (with maybe only two categories) and in database Y there's a "smoother" classification (with 3 categories). The attributes are the same!

    My thinking: I have one mining structure and one (or more) mining model(s) describing all these attributes (inputs), and the category attribute is the predictable column. Of course, I have to train the model again when the input values change, but in general this should be possible, shouldn't it? I hope it's clear now. I can't describe it any better... :-)

    Quote:
    "So, you can use the data with an assigned category to create a data mining model. Then you can use the created model to predict the category for the data to which you couldn't assign a category." <--- OK, and the result is: I'll get an assigned category for this third data row, and the PredictProbability value indicates how "clearly" it fits into the category, doesn't it?

    Regards and thanks

  • Wednesday, July 18, 2012 5:26 AM
     
     Answered

    Hi,

    In other words: In database X there's a "rougher" classification (with maybe only two categories) and in database Y there's a "smoother" classification (with 3 categories). <--- So, you have two different experiments.

    Data mining models create "rules". In other words, they divide the space of attribute values into sectors to find patterns (cases of the same category). You shouldn't use rules found by one model (in the example, from data with two categories) to categorize data from the second data set (with more than two categories), because you will never predict the third category, and so on...

    "...." <--- OK, and the result is: I'll get an assigned category for this third data row, and the PredictProbability value indicates how "clearly" it fits into the category, doesn't it?

    Yes. Also, you can divide the data with assigned categories into two data sets: learning data and test data. If you train a data mining model on the learning data, you can test it on the test data. In the test data you already know the correct category for each case, so you can see how accurate your data mining model is: compare the real category with the predicted one, and you will see how often the model predicts the category correctly.
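
    A small Python sketch illustrates why a model can only return categories it has seen during training. It uses a nearest-centroid classifier as a simple stand-in (an assumption for this illustration; it is neither a decision tree nor a neural network) and the example rows from this thread, written with decimal points. The names train_centroids and predict are hypothetical.

```python
def train_centroids(rows):
    """rows: list of (features, label). Returns the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((c - f) ** 2 for c, f in zip(centroids[label], features))
    return min(centroids, key=distance)

# Columns: Weight, Type, Height, Width
database_x = [((100, 1, 1.5, 2.0), "A"), ((200, 2, 2.3, 1.0), "B")]
database_y = [((300, 5, 1.0, 4.0), "1"), ((400, 6, 0.5, 2.0), "2"), ((500, 7, 1.0, 3.0), "3")]

model_x = train_centroids(database_x)   # knows only A and B
model_y = train_centroids(database_y)   # knows 1, 2 and 3

row_from_y = (500, 7, 1.0, 3.0)
print(predict(model_x, row_from_y), predict(model_y, row_from_y))
```

    The model trained on database X must answer A or B for every row, whatever the row looks like, which is why each database needs its own trained model.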
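
    The split into learning and test data can be sketched in Python as follows. The data, the 30% test share and the toy threshold "model" are invented for illustration; in Analysis Services the holdout would be configured on the mining structure rather than coded by hand.

```python
import random

def split_holdout(rows, test_fraction=0.3, seed=42):
    """Shuffle the labelled rows and return (training rows, test rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold(train_rows):
    """Toy 'model': the midpoint between the mean weight of A and of B."""
    mean = {}
    for label in ("A", "B"):
        weights = [w for w, lab in train_rows if lab == label]
        mean[label] = sum(weights) / len(weights)
    return (mean["A"] + mean["B"]) / 2

def accuracy(threshold, test_rows):
    """Share of test rows whose predicted category equals the known one."""
    hits = sum(1 for w, lab in test_rows if ("A" if w < threshold else "B") == lab)
    return hits / len(test_rows)

# Invented data: 20 rows of (weight, category), 10 per category.
rows = [(w, "A" if w < 150 else "B") for w in range(100, 200, 5)]
train_rows, test_rows = split_holdout(rows)
threshold = train_threshold(train_rows)
test_accuracy = accuracy(threshold, test_rows)
print(f"{len(train_rows)} training rows, {len(test_rows)} test rows, accuracy {test_accuracy:.2f}")
```

    The accuracy is measured only on rows the model has not seen during training, so it gives a fairer picture than re-scoring the learning data.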

    Regards,

    gc 

    • Marked as answer by The-Spiky Thursday, July 19, 2012 7:48 AM
  • Wednesday, July 18, 2012 6:44 AM
     
     

    I won't use rules found in one database to categorize data from the second. I'll only use the rules on their own database. Of course, for the second database I must train the model again to use those (second) rules for prediction / classification.

    Finally: Thanks, all, for your fast responses. You've helped clear up my confusion. But one last question :-):

    Are there some general tips & tricks about the structure of input attributes for categorisation (distribution, values, ...)?

    Regards

  • Wednesday, July 18, 2012 7:04 AM
     
     

    It is very hard to answer in a few sentences. Creating a data mining model is the final step, and I would say it is about 30%-40% of the work. First you have to prepare the data: clean the data, identify outliers, convert data from continuous to discrete or vice versa (sometimes this is necessary for the model and sometimes such operations give you more accurate models; it depends on the data), change the scale of the data, and generate computed attributes (values calculated for each case from other values).

    Creativity is the key.

    In data mining analysis you want to find patterns that hold for most of your analyzed cases. This is the main difference between data mining analysis and statistical analysis: you don't have to worry about theoretical assumptions if your model works well enough to help you in the decision process.
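
    Three of these preparation steps (rescaling, discretizing and computing a new attribute) can be sketched in Python. The band boundaries, the band names and the computed attribute Area are assumptions chosen for illustration; the rows reuse the example values from this thread.

```python
def min_max_scale(values):
    """Rescale values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, boundaries, labels):
    """Map a continuous value to the first band whose upper boundary it stays below."""
    for boundary, label in zip(boundaries, labels):
        if value < boundary:
            return label
    return labels[-1]

rows = [
    {"Weight": 100, "Height": 1.5, "Width": 2.0},
    {"Weight": 200, "Height": 2.3, "Width": 1.0},
    {"Weight": 500, "Height": 3.0, "Width": 3.0},
]

scaled = min_max_scale([row["Weight"] for row in rows])
for row, scaled_weight in zip(rows, scaled):
    row["WeightScaled"] = scaled_weight                      # change of scale
    row["WeightBand"] = discretize(row["Weight"], [150, 300],
                                   ["light", "medium", "heavy"])  # continuous to discrete
    row["Area"] = row["Height"] * row["Width"]               # computed attribute

for row in rows:
    print(row)
```

    Whether such derived columns actually improve a model depends on the data, so each variant should be compared on test data.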

    Regards,

    gc  

  • Thursday, July 19, 2012 7:48 AM
     
     
    Thanks very much for your information. What do you think about "stratified sampling" for classification? I think that, especially with classification algorithms like Naive Bayes, stratified sampling of the data is better for classification, isn't it? What about Neural Networks or Decision Trees? Should the rows for categories A & B each occur, for example, 500 times?
  • Saturday, July 21, 2012 1:16 AM
    Answerer
     
     Answered
    It really depends on the problem you are trying to solve.
    Sometimes you need to do stratified sampling to get a good model (e.g. in fraud prediction, when low-probability events are much more important).
    In other cases, you really need to preserve the data distribution.
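
    One way to read the first case is to draw the same number of rows per category, so that a rare category such as fraud is not drowned out. The Python sketch below does this on invented data; sample_per_class and the 95/5 split are assumptions for illustration only. A model trained on such a balanced sample no longer sees the original distribution, so its probabilities are not the real-world frequencies.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(rows, n_per_class, seed=0):
    """Draw up to n_per_class rows from every category, without replacement."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append((features, label))
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(n_per_class, len(group))))
    return sample

# Invented data: 95 legitimate rows and only 5 fraud rows.
rows = [((i,), "legitimate") for i in range(95)] + [((i,), "fraud") for i in range(5)]

balanced = sample_per_class(rows, n_per_class=5)
counts = Counter(label for _, label in balanced)
print(counts)
```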

    Tatyana Yakushev [PredixionSoftware.com]

    • Marked as answer by The-Spiky Tuesday, July 24, 2012 11:38 AM
  • Tuesday, July 24, 2012 11:46 AM
     
     

    Thanks a lot for your information.

    But can someone say something about the Microsoft Neural Network algorithm, for example some information about the learning rate, the training cycles, ...? Can I get such information in the "Microsoft Generic Content Tree Viewer", or are these parameters internal?

  • Tuesday, July 24, 2012 4:46 PM
    Answerer
     
     

    The learning rate is not displayed anywhere.

    The Microsoft Generic Content Tree Viewer only shows the content of the created model. It does not show how the model was trained.


    Tatyana Yakushev [PredixionSoftware.com]

  • Wednesday, July 25, 2012 6:52 AM
     
     
    Ah OK, thanks. Is the initial learning rate always the same value, or does the algorithm set the value randomly? Can somebody say something about this? As far as I know (I read it somewhere), the algorithm decreases the learning rate during training, doesn't it?
  • Monday, July 30, 2012 6:47 AM
     
     

    Does no one know anything about the learning rate in the Microsoft Neural Network?