Friday, January 06, 2012 7:51 PM
I'm trying to understand how the COMPLEXITY_PENALTY parameter works inside the Decision Trees algorithm. I read the book "Data Mining with SQL Server 2008", but it only explains how to use this parameter and says that it is somehow related to the Bayesian score. From what I have read on the forums about how the "forward pruning" method works, it seems that COMPLEXITY_PENALTY is a threshold, and the tree stops splitting when the value of some formula becomes larger than the threshold. Is that right? And what exactly is the formula?
Sunday, January 15, 2012 6:42 AM Moderator
Yes, the parameter is used to inhibit the growth of the decision tree. Decreasing the value increases the likelihood of a split, while increasing the value decreases the likelihood. As for the exact formula, that requires in-depth knowledge of the data mining internals. I will consult the internal team and hope to get an answer.
Tuesday, January 24, 2012 1:04 AM
As documented, the Microsoft Decision Trees algorithm is a classification and regression algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling of both discrete and continuous attributes. It builds a data mining model by creating a series of splits in the tree. These splits are represented as nodes. The algorithm adds a node to the model every time that an input column is found to be significantly correlated with the predictable column. The way that the algorithm determines a split is different depending on whether it is predicting a continuous column or a discrete column.
For discrete attributes, the algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, known as states, of those columns to predict the states of a column that you designate as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.
One of the parameters that controls the tree size during the tree growth stage is COMPLEXITY_PENALTY. According to MSDN (http://msdn.microsoft.com/en-us/library/cc645868.aspx):
The COMPLEXITY_PENALTY parameter is a floating point number with a range between 0 and 1. It is used to inhibit the growth of the decision tree: the value is subtracted from 1 and used as a factor in determining the likelihood of a split. The deeper the branch of a decision tree, the less likely a split becomes; the complexity penalty influences that likelihood. A low complexity penalty increases the likelihood of a split, while a high complexity penalty decreases the likelihood of a split.
The default value is based on the number of attributes for a given model: for 1 to 9 attributes, the value is 0.5; for 10 to 99 attributes, the value is 0.9 and for 100 or more attributes, the value is 0.99.
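To make the defaults above concrete, here is a small Python sketch. It only restates the documented rules: the default penalty per attribute count, and the "subtracted from 1" factor. The function names are hypothetical, and how the algorithm combines that factor with its split score is not documented here, so this is an illustration and not the actual implementation.

```python
def default_complexity_penalty(attribute_count: int) -> float:
    """Documented default COMPLEXITY_PENALTY for a model (hypothetical helper)."""
    if attribute_count < 1:
        raise ValueError("a model needs at least one attribute")
    if attribute_count <= 9:
        return 0.5   # 1 to 9 attributes
    if attribute_count <= 99:
        return 0.9   # 10 to 99 attributes
    return 0.99      # 100 or more attributes


def split_factor(complexity_penalty: float) -> float:
    """The '1 - penalty' factor from the documentation (hypothetical helper).

    Assumption: only the subtraction is documented; how this factor enters
    the split decision is not specified in this thread.
    """
    if not 0.0 <= complexity_penalty <= 1.0:
        raise ValueError("COMPLEXITY_PENALTY must be between 0 and 1")
    return 1.0 - complexity_penalty


if __name__ == "__main__":
    for n in (5, 50, 500):
        penalty = default_complexity_penalty(n)
        print(n, "attributes: penalty", penalty, "factor", round(split_factor(penalty), 2))
```

A higher penalty gives a smaller factor, which matches the statement that a high complexity penalty decreases the likelihood of a split.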
Hope this helps.
Meer Al - MSFT
Monday, February 06, 2012 7:33 PM
Thanks, but it's not exactly what I needed.