none
Strongest relationship for predicting

    Question

  • Hello.

     

    I need a little hint, for instance, I've 3 series of data.  I need to find which serie that has most influence on the key-serie-value (is that the right term?)

     

    I.e. if the butter price goes up, was it then because the meet price or the milk price went up?`

     

     

     

     

    This is one of my first postings regarding DM, so please bear over with me if my questions are trivial

    /Janus

     

     

    Thursday, July 17, 2008 8:26 PM

Answers

  •  

    Hi Janus

     

    Thanks for posting the question.  Your problem is a common one in data analysis, and while I immediately had an answer, I did not know what the answer would be using MS Data Mining tools.  Thus, permit me to use your question as a test for how a typical user might approach the rich data mining toolset.  I will describe what I did.

     

    I chose to use Excel 2007 with the Data Mining addon for SQL Server 2005.  I constructed a sample dataset with transaction numbers, butter prices (using the RANDBETWEEN function), meat prices directly related to the butter prices (I used a 2.4 multiplier) and milk prices (at random).  Now I asked the question what could help me discover the relationship among the price ranges.

     

    I went down several data mining model paths including clustering, decision trees, and neural nets.  I did not find a satisfactory immediate answer based on those tools, but perhaps neural nets was the most intellectually curious type of output.  You could try a similar exercise yourself, and see what these wizards provide.  The interpretations of the data mining models varies depending on how these input variables are presented for modeling.  All the output is valid, but has various interpretations (I have decided to respond on each of the valid data mining models as a personal exercise, and I will be adding another post when it is done, including a good explanation of what data mining tells you about my sample dataset).

     

    The central question is what data mining output matches your original question.  Based on your original varibles, it is possible to "milk" an answer from the Data Mining "meat" provided, given that you grease the models with enough "butter", but in statistical modeling I believe that less is more.  The general rule to remember is that the power of Data Mining is for multivariate problems, and based on my simple dataset, I only have three variables (or columns) and 59 observations (or rows).  The only way to know that there is more is to start with a good statistical education since many common and powerful techniques are not obviously surfaced using SSAS and data mining.  I do not view this "omission" as a fault of the SSAS design, because most of my background is using SAS, where in display manager you get a blank screen, and in SAS Enterprise Guide, you get something similar to SPSS, lots of pull-down menus and wizards which do not advise you on how to proceed or where to go first.

     

    My recommended answer is to first generate a correlation matrix with the three variables -- butter, milk, and meat -- as input columns.  A correlation value goes from negative one to positive one, and the single number is a two variable (or "attribute") comparison of how one value rises or falls with another.  Since I made the meat prices directly relate to the butter prices, the correlation should be one.  However, there should be lower correlation among the other price categories. 

     

    Corrleation matricies are available in the "Analysis Tool Pack" add-in in Microsoft Excel.  You need to specifically turn on the feature using the "Add-In" Wizard.  You can search MS Excel help or a good Excel book on how to enable add-ins.  The tools in the Analysis Pack do many fundamental analysis functions without needing to program (you just use the wizards).  Again, a analysis wizard (whether in SSAS or Excel or SAS or SPSS) will only guide you on how to complete a specific analysis task, but only a statistical education will best inform on why and when an analytic technique should be used and how to interpret the results. 

     

    If you feel the need to use Analysis Services to make your correlation matrix, then take a look at:

    http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=12

     

    The link provides some stored procedures which produce either a covariance or correlation matrix.  I recommend the correlation matrix, which is normalized to run between negative and positive one.  The covariance matrix carries information based on the original value sizes.  I wish correlation matrices came standard in the data mining suite because it, along with multivariate regression, forms the second common way that statistical analysts look at data.  People interested in correlations are looking for associations, and in the literature are represented by people who do factor analysis.  Regression/ANOVA is a different philosophical direction, but I believe both are complementary philosophical approaches (which is why I used both in my own dissertation).

     

    For my sample dataset, SSAS is overkill to do something Excel can do.  Naturally, the advatage of SSAS in general is putting code into production when you need to automate the solution under the security and transactional environment that SQL Server provides.  My answer therefore is to use Excel for simple calculations, but step up to SSAS for datasets beyond an Excel workbook, or for answers needing to be put into production.  (In my data mining analysis -- to be posted in a future separate post -- I also only use Excel 2007 with the data mining addins since my intention is to demonstrate what someone might do with a small Excel-sized dataset.)

     

    If you are looking at putting a solution into production, and the issue is urgent, consider hiring a consultant to help get you started.

     

    Here are my results for my correlation matrix (column 1 is butter, column 2 is meat, and colum 3 is milk) -- note that, as expected, the correlation between butter and meat is positive one, and the correlation between milk-butter and milk-meat is just about zero (there are ways to determine a confidence interval on correlation values based on the sample size, with larger samples providing a narrower interval range).  Also, the autocorrelations (butter-butter or milk-milk) are all positive one (since a value is exactly correlated with itself, as predicted).  Note too that the matrix is a triangle shape since the upper triangle provides the same values.  Check a book on statistics on the use of correlation matrices:

     

      Column 1 Column 2 Column 3
    Column 1 1
    Column 2 1 1
    Column 3 -0.05469 -0.05469 1

     

     

    Here is my sample originally tabbed dataset (sorry if the tabs do not persist in HTML):

    Transaction Butter Meat Milk
    1 9 21.6 1
    2 7 16.8 2
    3 6 14.4 2
    4 9 21.6 1
    5 10 24 1
    6 7 16.8 1
    7 5 12 1
    8 9 21.6 1
    9 10 24 4
    10 5 12 2
    11 7 16.8 2
    12 8 19.2 4
    13 5 12 1
    14 10 24 2
    15 6 14.4 4
    16 7 16.8 2
    17 10 24 3
    18 8 19.2 1
    19 7 16.8 2
    20 8 19.2 4
    21 7 16.8 1
    22 8 19.2 3
    23 8 19.2 2
    24 5 12 1
    25 8 19.2 1
    26 6 14.4 3
    27 6 14.4 4
    28 9 21.6 4
    29 5 12 3
    30 5 12 2
    31 10 24 2
    32 9 21.6 4
    33 5 12 4
    34 9 21.6 2
    35 5 12 4
    36 9 21.6 1
    37 7 16.8 4
    38 7 16.8 3
    39 8 19.2 3
    40 10 24 3
    41 5 12 4
    42 5 12 3
    43 5 12 3
    44 7 16.8 3
    45 7 16.8 3
    46 9 21.6 4
    47 9 21.6 2
    48 7 16.8 1
    49 6 14.4 4
    50 7 16.8 3
    51 5 12 1
    52 6 14.4 2
    53 7 16.8 3
    54 8 19.2 4
    55 10 24 3
    56 7 16.8 2
    57 5 12 4
    58 8 19.2 1
    59 6 14.4 1
    Saturday, July 19, 2008 10:03 PM

All replies

  •  

    Hi Janus

     

    Thanks for posting the question.  Your problem is a common one in data analysis, and while I immediately had an answer, I did not know what the answer would be using MS Data Mining tools.  Thus, permit me to use your question as a test for how a typical user might approach the rich data mining toolset.  I will describe what I did.

     

    I chose to use Excel 2007 with the Data Mining addon for SQL Server 2005.  I constructed a sample dataset with transaction numbers, butter prices (using the RANDBETWEEN function), meat prices directly related to the butter prices (I used a 2.4 multiplier) and milk prices (at random).  Now I asked the question what could help me discover the relationship among the price ranges.

     

    I went down several data mining model paths including clustering, decision trees, and neural nets.  I did not find a satisfactory immediate answer based on those tools, but perhaps neural nets was the most intellectually curious type of output.  You could try a similar exercise yourself, and see what these wizards provide.  The interpretations of the data mining models varies depending on how these input variables are presented for modeling.  All the output is valid, but has various interpretations (I have decided to respond on each of the valid data mining models as a personal exercise, and I will be adding another post when it is done, including a good explanation of what data mining tells you about my sample dataset).

     

    The central question is what data mining output matches your original question.  Based on your original varibles, it is possible to "milk" an answer from the Data Mining "meat" provided, given that you grease the models with enough "butter", but in statistical modeling I believe that less is more.  The general rule to remember is that the power of Data Mining is for multivariate problems, and based on my simple dataset, I only have three variables (or columns) and 59 observations (or rows).  The only way to know that there is more is to start with a good statistical education since many common and powerful techniques are not obviously surfaced using SSAS and data mining.  I do not view this "omission" as a fault of the SSAS design, because most of my background is using SAS, where in display manager you get a blank screen, and in SAS Enterprise Guide, you get something similar to SPSS, lots of pull-down menus and wizards which do not advise you on how to proceed or where to go first.

     

    My recommended answer is to first generate a correlation matrix with the three variables -- butter, milk, and meat -- as input columns.  A correlation value goes from negative one to positive one, and the single number is a two variable (or "attribute") comparison of how one value rises or falls with another.  Since I made the meat prices directly relate to the butter prices, the correlation should be one.  However, there should be lower correlation among the other price categories. 

     

    Corrleation matricies are available in the "Analysis Tool Pack" add-in in Microsoft Excel.  You need to specifically turn on the feature using the "Add-In" Wizard.  You can search MS Excel help or a good Excel book on how to enable add-ins.  The tools in the Analysis Pack do many fundamental analysis functions without needing to program (you just use the wizards).  Again, a analysis wizard (whether in SSAS or Excel or SAS or SPSS) will only guide you on how to complete a specific analysis task, but only a statistical education will best inform on why and when an analytic technique should be used and how to interpret the results. 

     

    If you feel the need to use Analysis Services to make your correlation matrix, then take a look at:

    http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=12

     

    The link provides some stored procedures which produce either a covariance or correlation matrix.  I recommend the correlation matrix, which is normalized to run between negative and positive one.  The covariance matrix carries information based on the original value sizes.  I wish correlation matrices came standard in the data mining suite because it, along with multivariate regression, forms the second common way that statistical analysts look at data.  People interested in correlations are looking for associations, and in the literature are represented by people who do factor analysis.  Regression/ANOVA is a different philosophical direction, but I believe both are complementary philosophical approaches (which is why I used both in my own dissertation).

     

    For my sample dataset, SSAS is overkill to do something Excel can do.  Naturally, the advatage of SSAS in general is putting code into production when you need to automate the solution under the security and transactional environment that SQL Server provides.  My answer therefore is to use Excel for simple calculations, but step up to SSAS for datasets beyond an Excel workbook, or for answers needing to be put into production.  (In my data mining analysis -- to be posted in a future separate post -- I also only use Excel 2007 with the data mining addins since my intention is to demonstrate what someone might do with a small Excel-sized dataset.)

     

    If you are looking at putting a solution into production, and the issue is urgent, consider hiring a consultant to help get you started.

     

    Here are my results for my correlation matrix (column 1 is butter, column 2 is meat, and colum 3 is milk) -- note that, as expected, the correlation between butter and meat is positive one, and the correlation between milk-butter and milk-meat is just about zero (there are ways to determine a confidence interval on correlation values based on the sample size, with larger samples providing a narrower interval range).  Also, the autocorrelations (butter-butter or milk-milk) are all positive one (since a value is exactly correlated with itself, as predicted).  Note too that the matrix is a triangle shape since the upper triangle provides the same values.  Check a book on statistics on the use of correlation matrices:

     

      Column 1 Column 2 Column 3
    Column 1 1
    Column 2 1 1
    Column 3 -0.05469 -0.05469 1

     

     

    Here is my sample originally tabbed dataset (sorry if the tabs do not persist in HTML):

    Transaction Butter Meat Milk
    1 9 21.6 1
    2 7 16.8 2
    3 6 14.4 2
    4 9 21.6 1
    5 10 24 1
    6 7 16.8 1
    7 5 12 1
    8 9 21.6 1
    9 10 24 4
    10 5 12 2
    11 7 16.8 2
    12 8 19.2 4
    13 5 12 1
    14 10 24 2
    15 6 14.4 4
    16 7 16.8 2
    17 10 24 3
    18 8 19.2 1
    19 7 16.8 2
    20 8 19.2 4
    21 7 16.8 1
    22 8 19.2 3
    23 8 19.2 2
    24 5 12 1
    25 8 19.2 1
    26 6 14.4 3
    27 6 14.4 4
    28 9 21.6 4
    29 5 12 3
    30 5 12 2
    31 10 24 2
    32 9 21.6 4
    33 5 12 4
    34 9 21.6 2
    35 5 12 4
    36 9 21.6 1
    37 7 16.8 4
    38 7 16.8 3
    39 8 19.2 3
    40 10 24 3
    41 5 12 4
    42 5 12 3
    43 5 12 3
    44 7 16.8 3
    45 7 16.8 3
    46 9 21.6 4
    47 9 21.6 2
    48 7 16.8 1
    49 6 14.4 4
    50 7 16.8 3
    51 5 12 1
    52 6 14.4 2
    53 7 16.8 3
    54 8 19.2 4
    55 10 24 3
    56 7 16.8 2
    57 5 12 4
    58 8 19.2 1
    59 6 14.4 1
    Saturday, July 19, 2008 10:03 PM
  • Problem
     
    On the free MSDN Data Mining forum, a user nicknamed “Janus” posted the following basic but important challenge, paraphrased as follows:
      

    I need a little hint, for instance, I've got 3 series of data.  I need to find which series that has most influence on the key-series-value (is that the right term?)  For example, if the butter price goes up, was it then because the meat price or the milk price went up?


     
    The problem is therefore one with three attributes (or variables, or columns).  Attribute one is a prediction (only) column called Butter.  Attributes two and three are input variables called Meat and Milk.  All three attributes are prices, assumed to be continuous, and positive values.  The question is what could someone use in the Microsoft Data Mining suite to answer the relationship among the variables.
     
    Purpose Statement
     
    Using Excel 2007, I constructed a sample dataset of 59 rows (observations) using the construct given above.  I intentionally made the meat prices to be a fixed multiple of the randomized butter price, while milk has an independently random value not necessarily (or not intentionally) associated with meat or butter prices.  The sample table size of 59 comes from my desire to surpass 30 (considered to be a threshold value where the t-distribution approaches the z-distribution for normal variables).  Many statistical analyses recommend having cell (or case) sizes of 5 or greater, though my sample table does not meet this second criteria. 
     
    Based on empirical results, I was satisfied that this sample data provides the ability to compare and contrast results from the Microsoft Data Mining algorithms.  My sample data does not have any null or missing values, and does not intentionally include any outliers.  Thus, the typical first step of cleaning the data is not needed for this demonstration.
     
    Since this table is small, I also use Excel 2007 to analyze the dataset using the full list of available algorithms in Microsoft Data Mining.  I use all the Data Mining Algorithms except for Time Series and Cluster Sequencing.  This exercise provides basic insight into what the standard algorithms tell you about a small and known dataset.
     
    I did all of these samples from MS Excel 2007 using the Data Mining add-in (and SQL Server 2005 providing the processing), and the dataset I created.  As a baseline comparison, I also am using two features from the Excel 2007 Analysis Pack Add-In, the correlation and the linear regression.  The correlation results show that indeed there is a 1.00 correlation between the Butter and Meat values, while the correlation between either Milk and Butter, or Milk and Meat, is -0.05469 (these correlations are identical since Butter and Meat are mathematical multiples of one another).  Having this type of sample dataset helps a new Data Mining user intuitively validate the results without having to rely only on equations.
     
    Under the Data Mining tab, choose the “Advanced” button, and then “Create Mining Model”.  This list provides all the algorithms available, and more than the other buttons in the “Data Modeling” area, may allow for additional parameters to be set (the default buttons make default choices).  Under the mining model creation choice, the default is "Microsoft Decision Trees", but I will be listing the results in alphabetical order.
     
     
     
    1) Microsoft Association Rules
     
    Change the usage for butter to "predict only" and using the ellipses, make all the variables "not null" (since none of them are missing in my demo dataset).  Association rules does not accept continuous variables, but only discrete or “discretized” (meaning, made to be discrete in bins).  For Microsoft Data Mining, “discrete” means categorized and does not mean ordinal, so I chose “discretized” which will put the values into numeric bins which have an order.
     
    The final model on the dependency network tab shows the results, and in this problem meat always points to butter and milk is a non-factor (and milk does not show on my runs at all).  Using the link slider shows which associations are the strongest, meaning having the most itemsets.  In my dataset, butter=6 and butter=10 have the same number of itemsets, each the most in the dataset.  Thus, unlike correlation where the whole variable as a set are considered, in the MS Association Rules, each pair of observations (for example, butter=6 while meat=14.4) is separately considered as an individual itemset.  These pairs compete with one another to be the "strongest" itemset in the group.  By contrast, correlation calculations consider all the milk points to be together, and all the butter points to be together, and all the meat points to be together.  The Association Rules algorithm instead will compare itemsets of two or three (since there are only three variables) against one another, showing a affinity preference to the row (or observation) instead of the column (or variable).
     
    The rules tab produces what is expected, where the strongest links have only meat and butter together.  When milk is added to the itemsets, there still may be a connection.  For example, my results returned: Meat >= 22.65, Milk >= 3 -> Butter >= 9.   This triplet shows a connection for all three values happening simultaneously, and it means that when meat is greater than or equal to 22.65 and milk is greater than or equal to 3 then butter is greater than or equal to 9.  That conclusion is a rule and surfaces a relationship in the dataset.
     
    The itemsets tab also shows milk having the most support, meaning the most individual observations.  I limited the range of milk values, and so those cases show the most "support", with the top value of 29 for milk being greater than or equal to 3.  However, note that you do not have a sense here that milk is actually not related to the value of butter.  Because of the small dataset size, the results appear to indicate a relationship, but that type of result could even happen in a much larger dataset population where the case sample would, by random chance, produce a connection.  Such connections may be important to a data analyst, and therefore there is an argument for using this type of algorithm for information that a regular correlation might hide. 
     
    2) Microsoft Cluster Analysis – Continuous
     
    For this example, I use butter as a predict only variable, and milk and meat as input variables.  Also, I am setting all variables to continuous and not null.  On the Cluster Profiles tab, the algorithm makes clusters based on values of all variables.  As we know, in this dataset, the milk variable has no effect (from the correlation results).  However, the cluster profiles show that the value and variance of the butter and meat variables have a common pattern, but note that the milk variable is distinctly different.  This pattern shows that there is some type of relationship between butter and meat, but it is not immediately apparent from the screen.
     
    On the cluster characteristics tab, we see this pattern show where the butter and the meat probabilities are always showing up in pairs.  Sometimes, due to random luck, all three (butter meat and milk) appear to have similar if not identical probabilities, and that pattern may indeed be true for some combinations in my demo dataset.  However, the pairing of butter and meat is consistent for each cluster.  What could be confusing here is that each cluster will typically have two or more pairs of butter-meat on it, only because I sent the clustering algorithm to include the meat factor too.
    On the cluster discrimination tab, you can check clusters against one another.  In my case, I pick my named cluster 1 and 2 to compare, and note that again the butter and meat variables are pairing together, while milk values are operating on their own.  You can check other combinations, but this display also shows how the itemset pairings work together for butter and meat.


    The Cluster Diagram shows the strength of association between clusters, and in my run, clusters 4 and 5 have the strongest association.  Looking at the cluster profile page, these clusters correspond to the lowest values of butter-milk, and therefore it makes sense that these two clusters are considered strong with one another.


    3) Microsoft Cluster Analysis – Discrete


    Discrete is the default for MS Cluster Analysis, but since I decided that prices are continuous, I started with that continuous example first.  I now assume that all variables are discrete and not null, and that assumption may make sense where you do not expect prices to be changing, but instead would like to see what patterns emerge from this static assumption. Starting with the Cluster Profile page, note that the color combinations are the same for butter and meat.  In time sequence, I actually did this discrete combination first, and the association between butter and meat is shown in these color bars.  The strength of this algorithm is for many variables, and the colors do not provide an exact number of strength as a correlation would.
     
    Moving to Cluster Characteristics, you can see that meat and butter again move together.  The milk value may or may not be in synch, and in my cluster 3, one milk value appears to be close to the butter and meat values.  Note that this algorithm wants to put these values together, and will make strong efforts to uncover those unlikely itemsets where despite a weak correlation there may be combinations that cluster well together.  My sample data only has 59 observations, but multiply the variables by 50 and the observations by 100,000 and that’s the size dataset where data mining might achieve discovering a hidden association in a large dataset.
     
    On the Cluster Discrimination tab, again the butter and meat have a relationship, and milk may or may not be a factor.  The algorithm wants to put them together, and so you should discover that clustering will indeed find instances where – by random luck – milk happens to be part of the group.
     
    On the Cluster Diagrams page, my clusters 4 and 5 have the strongest connection.  I can see on the Cluster Profiles page that these two clusters have the most similar distributions on the milk variable.  The comparison is graphical, and provides me with information that these strengths relate to wanting to include milk in the final analysis.
     
    4) Microsoft Decision Trees -- Continuous
     
    For Decision Trees, the transaction value can be included as the key and the butter is a predict only value; the other variables are input.
     
    Using the ellipses, change the butter, meat, and milk values: continuous and “not null” and “regressor”.  I picked “regressor” for butter also, but since that variable is “predict only”, it will not be allowed to be a regressor.  The wizard actually should not allow me to see “regressor” if I have previously picked “predict only” but I correctly guessed the intention of the algorithm.  Technically, “regressor” did not show the first time I picked the ellipses, but when I went back to look again, the choice for “regressor” was there.
     
    In my case, I received the following regression equation in this form:  Butter = 7.261+0.417*(Meat-17.410)-0.000010*(Milk-2.475).  I note that this equation is mathematically equivalent to:  Butter = 0.00105475 + 0.417*Meat – 0.000010*(Milk).  I do not know why the constants were separated.
     
    Someone might make the conclusion that the 0.00010 regressor value for milk shows that this milk variable is not important, but it is important to first consider the magnitude of the average value of milk for a fair comparison with the other regressor (0.417).  In this case, the values of milk and meat are within the same range (using a powers of 10 comparison) and thus the equation shows that meat indeed is the primary factor explaining butter.
     
    The dependency network tab shows this simple relationship.  Using the link slider, we can make the link between milk and butter disappear.
     
    This continuous decision tree analysis provides the same type of result as a correlation because the mathematics for multivariate linear regression are built on the concept of correlation.  This approach contrasts with what other algorithms might provide.
     
    5) Linear Regression – From Excel Analysis Pack
     
    Using the Excel Analysis Pack, I ran a regression to get the classic results for regression.  Choose the Data tab, then the Analysis area, and the “Data Analysis” choice.  The output not only gives probability values but also a full suite of residual plots too.  In my case I had significant p-values well below the nominal five percent.  
     
     

     

    Coefficients

    Standard Error

    t Stat

    P-value

    Intercept

    2.5E-15

    3.04E-16

    8.214595

    3.43E-11

    X Variable 1

    0.416667

    1.5E-17

    2.78E+16

    0

    X Variable 2

    6.84E-17

    5.25E-17

    1.304156

    0.197515


     
    In words, my results provided Butter = 2.5E-15 + 0.417*(Meat) – 6.84E-17*(Milk).  I compare this result with the decision trees result since the meat regressor value is the same, but the intercept and the milk regressor values are slightly different.  In both instances, the intercept value is about zero, and the milk regressor is negative.  I do not know why there is a difference, but functionally the results are comparable.  My residual plots from the Analysis Pack Add-In show enough of a random distribution given that the input variables only take on certain values.  Residual plotting is recommended for testing normality and independence (for more information on regression interpretation, see a book on this topic).
     
    6) Microsoft Linear Regression

     
    The Microsoft Linear Regression model is intentionally connected with the Microsoft Decision Trees model.  The bottom line is that my results are identical to the Microsoft Decision Trees model.  As in that previous run, for milk and meat I had to use the ellipses to choose “continuous” and “not null” for the input variable, leave that screen, and return again to choose “regressor”. 
     
    If you don’t go back to choose “regressor” then the algorithm simply produces the average value for butter, and in my case using the algorithm without regressors produces:


    All
    Existing Cases: 59
    Missing Cases: 0
    Butter = 7.254


    7) Microsoft Logistic Regression
     
    For logistic regression, I again choose continuous values for input (milk, meat) and predict only (butter) variables.  The output is tricky to understand, so for my run, I choose “All” in the Input/Attribute window, and the output attribute is “butter” with either value 1 or 2 set to “missing”.  In my case, I found the best analysis from the screen for value 1 being “6.132-7.254” in the drop-down box and value 2 being “missing”.   On this screen, the attributes column shows one value for meat and four values for milk.  The interpretation can be that the single value of meat (valued 14.718 to 17.410) leads to the output value of butter at 6.132 to 7.254.  This type of result is similar to Microsoft Association Rules, which also uses itemsets to determine connections.  Thus, this type of analysis might uncover an important relationship at certain combinations of input and predicted variable.  The multiple values of milk indicate that that this algorithm cannot determine which of the milk values would lead to the output butter value, and indeed the lack of relationship between input milk and output butter verifies this outcome.
     
    8) Microsoft Naïve Bayes
     
    For Naïve Bayes, I choose the butter to be predict only, and milk and meat to be input values.
     
    Using the ellipses, I choose “Discretized” and “Not Null” for the input and prediction attributes.  It is OK to leave the transaction column (or variable) as  “key” value.
     
    The output on the dependency network tab shows an accurate result, namely that meat leads to butter.  Using the link slider does not change this strong single result.
     
    The other tabs verify this single result.  The Microsoft Naïve Bayes algorithm matches the assumptions from this dataset, namely that the input columns (milk and meat) are independent.  Many tables would not be correct to analyze that way, but the results in this case prove that connection.  The mathematics underneath Naïve Bayes consider the attributes (or variables) to be considered as units, and therefore return results closer to the correlation analysis than, for example, Microsoft Association Rules.
     
    9) Microsoft Neural Network
     
    As with Microsoft Logistic Regression, I again choose continuous values for input (milk, meat) and predict only (butter) variables.  As with the intentionally related Logistic Regression algorithm, the output here is tricky to interpret for this example.  I find the most insight for choosing “missing” as one of the values, and scrolling over different values of butter.  A strong connection with “missing” means that an input variable would rather be associated with missing than with a specific value of butter.  In other words, strong connections between input and “missing” means that there is no or low connection with the output variable for that range of values.
     
    The Microsoft Neural Network will attempt to find a model location for all attributes.  In other words, for each value of the output, the algorithm searches for the appropriate relationship between that output and the input.  Since two attributes (milk and meat) contribute to the outcome, the algorithm will optimize output to include a place for all these attributes.  This type of output provides contrasting information to the correlation calculation.
     
    Conclusion
     
    The main conclusion is that Microsoft Data Mining algorithms (and all data mining algorithms in general) produce differing but comparable types of results based on the algorithm’s statistical assumptions.  I believe it is important to study and know what each algorithm promises to provide.  All the algorithms (even the ones not used) provide insight to the relationships among variables.  Microsoft has done a good job in documenting each algorithm both in the SQL Server Books Online, as well as the online technical help at Microsoft.com.  Depending on the nature of the attributes, and the type of problem, these algorithms can collectively provide complementary and contrasting models for large multivariate tables.
     
    As a further exercise, I recommend any of the following options.  First, extend my sample dataset to be about 500 observations and see what the results are.  Second, try out the algorithms using a publicly available dataset (you can search for “machine learning data” or see http://lib.stat.cmu.edu/datasets/). Many of these publicly available datasets have been used in academic literature, and that is a place to validate your results. Finally, use the Microsoft Data Mining tools with your own data.  You can perhaps start with a sample first if your total amount of data is large.  Many options are available, and online help is abundant.  Microsoft intends that these tools be moved from this type of development testing into a production environment, a feature which sets apart these data mining algorithms from many other free or paid standalone tools available.   A regularly refreshed model supported into a production flow (for example a .NET application or a SSIS package) provides maximum use of the technology available.

     

    Tuesday, July 29, 2008 4:20 AM