Microsoft Developer Network >
포럼 홈
>
Data Mining
>
Clustering vs. Decision Tree Algorithm for exception detection
Clustering vs. Decision Tree Algorithm for exception detection
- I read that Highlight Exceptions function in the Data Mining ribbon (Analyze tab) of Excel 2007 Addin for SQL Server 2008 was made with Microsoft Clustering Algorithm.
Could it (exception detection) be done with Microsoft Decision Tree Algorithm?
Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин- 편집됨Guennadiy Vanine 2009년 7월 1일 수요일 오전 10:09
- 편집됨Guennadiy Vanine 2009년 7월 1일 수요일 오전 7:42
- 편집됨Guennadiy Vanine 2009년 7월 1일 수요일 오전 7:45
답변
- Hi Guennadiy
I do not think there are any papers or indeed too many discussions around the "Detect Anomlies" ribbon button. As for what it does.
Creates a Session Mining Structire over the columns in the XL list you specify
Adds a Clustering model on top with drillthrough and cluster count = 0
loads the structure with the list
Trains the model
It then issues a query similar to
@ParamTable being the XL list.SELECT T.[GeographyKey], Predict([GeographyKey]), PredictVariance([GeographyKey]), T.[MiddleName], PredictHistogram([MiddleName]), PredictProbability([MiddleName]), T.[NameStyle], PredictHistogram([NameStyle]), PredictProbability([NameStyle]), T.[BirthDate], Predict([BirthDate]), PredictVariance([BirthDate]), T.[MaritalStatus], PredictHistogram([MaritalStatus]), PredictProbability([MaritalStatus]), T.[Gender], PredictHistogram([Gender]), PredictProbability([Gender]), T.[YearlyIncome], Predict([YearlyIncome]), PredictVariance([YearlyIncome]), T.[TotalChildren], PredictHistogram([TotalChildren]), PredictProbability([TotalChildren]), T.[NumberChildrenAtHome], PredictHistogram([NumberChildrenAtHome]), PredictProbability([NumberChildrenAtHome]), T.[EnglishEducation], PredictHistogram([EnglishEducation]), PredictProbability([EnglishEducation]), T.[SpanishEducation], PredictHistogram([SpanishEducation]), PredictProbability([SpanishEducation]), T.[FrenchEducation], PredictHistogram([FrenchEducation]), PredictProbability([FrenchEducation]), T.[EnglishOccupation], PredictHistogram([EnglishOccupation]), PredictProbability([EnglishOccupation]), T.[SpanishOccupation], PredictHistogram([SpanishOccupation]), PredictProbability([SpanishOccupation]), T.[FrenchOccupation], PredictHistogram([FrenchOccupation]), PredictProbability([FrenchOccupation]), T.[HouseOwnerFlag], PredictHistogram([HouseOwnerFlag]), PredictProbability([HouseOwnerFlag]), T.[NumberCarsOwned], PredictHistogram([NumberCarsOwned]), PredictProbability([NumberCarsOwned]), T.[DateFirstPurchase], Predict([DateFirstPurchase]), PredictVariance([DateFirstPurchase]), T.[CommuteDistance], PredictHistogram([CommuteDistance]), PredictProbability([CommuteDistance]), T.[Region], PredictHistogram([Region]), PredictProbability([Region]), T.[Age], Predict([Age]), PredictVariance([Age]), T.[BikeBuyer], PredictHistogram([BikeBuyer]), PredictProbability([BikeBuyer]), PredictCaseLikelihood() FROM [Table1_863200_CL_33744] NATURAL PREDICTION JOIN @ParamTable as T
The UI now has everything it needs. You get to specify how many exceptions you want and this will probably be chosen in order of their PredictCaseLikelihood() value. (ASC)
You can see all this being played out in profiler. Now. To do this yourself and to have more control over what happens you will need to do the exact same thing yourself but building the structire and model yourself specifying the clustering options yourself.
I know you feel you do not have compllete control over the algorithms (your previous posts) but this is where I would be starting if I understand what you want to do.
Allan
- 답변으로 표시됨Jin ChenMSFT, 중재자2009년 8월 26일 수요일 오전 9:17
- A few more detail on top of Allan's comments:
First -- In the context of SQL erver data mining, clustering has some built in feature for anomaly detection, i.e. the PredictCaseLikelihood function, the last selection in the query posted by Allan. This function returns for each input data point a number between 0 and 1 (normalized). The number tells in essence how close the point is to one of the cluster centroids. A value of 1 means a point is close or identical to a centroid. A value of 0 means that it is not far from all centroids. That is the only factor which determines whether a point ia an outlier or not. Everything else in the query has the purpose of determining the most probable cause for the anomaly, i.e. the attribute which makes the point most unlike the clusters. More details on PredictCaseLikelihood here: http://msdn.microsoft.com/en-us/library/ms132052.aspx
Then for each point that is an outlier, for each attribute, the add-in computes the distance from the predicted value of the attribute and the actual one. The code for this computation can be downloaded from the Data Validation sample at sqlserverdatamining.com, available here: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=99&Id=313
The fact that the Microsoft Clustering algorithm can be used for predictions (based on the distributions inside each cluster, weighed with the distance from the current point to each cluster) allows executing a prediction for each point. As all the columns are input and predict, each prediction is based on all other columns.
So, for each attribute, the add-in computes a likelihood of the actual value with the predicted value. This likelihood is compared across attributes and the attribute which yields the lowest likelihood is considered the most probable cause for the anomaly.
The likelihood calculation differs for discrete and continuous attributes.
For discrete attributes, the add-in needs to compute the ratio between the predicted probability of the current value and the predicted probability of the most likely value.
Effectively, this can now be expressed in DMX as PredictProbability([BikeBuyer], T.[BikeBuyer])/PredictProbability([BikeBuyer]).
However, at the time of the add-in implementation (2005 timeframe), using an input rowset column as a query parameter was not supported, so we ended up asking for the actual attribute value as well as the whole histogram, and the lookup to find the actual state's predicted probability happends on the client.
Therefore, the DMX statement looks likeL
T.[BikeBuyer], PredictHistogram([BikeBuyer]), PredictProbability([BikeBuyer]),
Again, the first column (T.BikeBuyer) is only included to allow the client side code to look up the specific probability of the current input's Bike Buyer in the prediction histogram -- a workaround for the fact that DMX does not support the JOIN operator.
With this calculation, the score for a discrete is a number between 0 and 1(1 if the actual state is the most likely state as predicted).
For a continuous attribute, the likelihood is computed based on actual value, predicted value and prediction variance, like below:
distCol = Math.Exp(- Math.Pow((actualCol- predCol),2) / (2 * varCol) );
hence, the square of the distance between actual and predicted divided by twice the variance. The negative exponential assures that, when actual is close to predicted, the likelihood is closer to 1, while if the distance grows the likelihood tends to 0
Now, the likelihood scores are compared for all attributes that make a row, and the least likely one is marked as the cause of the exception (yellow highlighting)
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm- 답변으로 표시됨Jin ChenMSFT, 중재자2009년 8월 26일 수요일 오전 9:17
모든 응답
- I would be interested in whether MS's version of the DT algorithm could do this and how because certainly you can use DTs to do "Anomaly Detection".You are right that "The Ribbon" uses Clustering.
I would be interested in whether MS's version of the DT algorithm could do this and how because certainly you can use DTs to do "Anomaly Detection".
Allan, thanks for your participation and backup of this interest.You are right that "The Ribbon" uses Clustering.
My question implied the subquestion on explanation why the Microsoft Clustering Algorithm is always being chosen for anomaly detection.
(in all examples and samples I could find on exceptions detection).
What are pros. and contras? And what is the process of choosing the one or another algorithm, how it is being done...
Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин- I would say that clustering is used in examples of anomaly detection, because it is easier to demonstrate the whole process using this technique - it's more straightforward than decision trees. The same applies also for code of clustering algorithm itself.Decision trees can definitely be used for anomaly detection and ofter are, together with clustering. There is a lot of papers about the subject of clustering vs. DT on the internet and usually combination of them is used or recommended.Regards,Vladimir Cupal
I would say that clustering is used in examples of anomaly detection, because it is easier to demonstrate the whole process using this technique - it's more straightforward than decision trees. The same applies also for code of clustering algorithm itself.
Hi, Vladimir,Decision trees can definitely be used for anomaly detection and ofter are, together with clustering. There is a lot of papers about the subject of clustering vs. DT on the internet and usually combination of them is used or recommended.Regards,Vladimir Cupal
The problem that I do not see how it is configured in video on Excel 2007 Add-in (Exception Highlighting).
Even more, I cannot manage, control the tuning even for the simplest and prestaged cases, see:
http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/efe75552-8b2d-4fa6-bebc-10bac9103fef
(help me to interpret and tune up predictions (exceptions))
and
http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/b9bbe2a2-0c47-4d6b-ae73-8b88f8455654
(continuous vs. discrete attribute in DM)
"There is a lot of papers about the subject of clustering vs. DT on the internet"
I could not find anything in context of practical steps employing DM SSAS of MSSQLServer.
Can you give any links?
Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин- Hi Guennadiy
I do not think there are any papers or indeed too many discussions around the "Detect Anomlies" ribbon button. As for what it does.
Creates a Session Mining Structire over the columns in the XL list you specify
Adds a Clustering model on top with drillthrough and cluster count = 0
loads the structure with the list
Trains the model
It then issues a query similar to
@ParamTable being the XL list.SELECT T.[GeographyKey], Predict([GeographyKey]), PredictVariance([GeographyKey]), T.[MiddleName], PredictHistogram([MiddleName]), PredictProbability([MiddleName]), T.[NameStyle], PredictHistogram([NameStyle]), PredictProbability([NameStyle]), T.[BirthDate], Predict([BirthDate]), PredictVariance([BirthDate]), T.[MaritalStatus], PredictHistogram([MaritalStatus]), PredictProbability([MaritalStatus]), T.[Gender], PredictHistogram([Gender]), PredictProbability([Gender]), T.[YearlyIncome], Predict([YearlyIncome]), PredictVariance([YearlyIncome]), T.[TotalChildren], PredictHistogram([TotalChildren]), PredictProbability([TotalChildren]), T.[NumberChildrenAtHome], PredictHistogram([NumberChildrenAtHome]), PredictProbability([NumberChildrenAtHome]), T.[EnglishEducation], PredictHistogram([EnglishEducation]), PredictProbability([EnglishEducation]), T.[SpanishEducation], PredictHistogram([SpanishEducation]), PredictProbability([SpanishEducation]), T.[FrenchEducation], PredictHistogram([FrenchEducation]), PredictProbability([FrenchEducation]), T.[EnglishOccupation], PredictHistogram([EnglishOccupation]), PredictProbability([EnglishOccupation]), T.[SpanishOccupation], PredictHistogram([SpanishOccupation]), PredictProbability([SpanishOccupation]), T.[FrenchOccupation], PredictHistogram([FrenchOccupation]), PredictProbability([FrenchOccupation]), T.[HouseOwnerFlag], PredictHistogram([HouseOwnerFlag]), PredictProbability([HouseOwnerFlag]), T.[NumberCarsOwned], PredictHistogram([NumberCarsOwned]), PredictProbability([NumberCarsOwned]), T.[DateFirstPurchase], Predict([DateFirstPurchase]), PredictVariance([DateFirstPurchase]), T.[CommuteDistance], PredictHistogram([CommuteDistance]), PredictProbability([CommuteDistance]), T.[Region], PredictHistogram([Region]), PredictProbability([Region]), T.[Age], Predict([Age]), PredictVariance([Age]), T.[BikeBuyer], PredictHistogram([BikeBuyer]), PredictProbability([BikeBuyer]), PredictCaseLikelihood() FROM [Table1_863200_CL_33744] NATURAL PREDICTION JOIN @ParamTable as T
The UI now has everything it needs. You get to specify how many exceptions you want and this will probably be chosen in order of their PredictCaseLikelihood() value. (ASC)
You can see all this being played out in profiler. Now. To do this yourself and to have more control over what happens you will need to do the exact same thing yourself but building the structire and model yourself specifying the clustering options yourself.
I know you feel you do not have compllete control over the algorithms (your previous posts) but this is where I would be starting if I understand what you want to do.
Allan
- 답변으로 표시됨Jin ChenMSFT, 중재자2009년 8월 26일 수요일 오전 9:17
- A few more detail on top of Allan's comments:
First -- In the context of SQL erver data mining, clustering has some built in feature for anomaly detection, i.e. the PredictCaseLikelihood function, the last selection in the query posted by Allan. This function returns for each input data point a number between 0 and 1 (normalized). The number tells in essence how close the point is to one of the cluster centroids. A value of 1 means a point is close or identical to a centroid. A value of 0 means that it is not far from all centroids. That is the only factor which determines whether a point ia an outlier or not. Everything else in the query has the purpose of determining the most probable cause for the anomaly, i.e. the attribute which makes the point most unlike the clusters. More details on PredictCaseLikelihood here: http://msdn.microsoft.com/en-us/library/ms132052.aspx
Then for each point that is an outlier, for each attribute, the add-in computes the distance from the predicted value of the attribute and the actual one. The code for this computation can be downloaded from the Data Validation sample at sqlserverdatamining.com, available here: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=99&Id=313
The fact that the Microsoft Clustering algorithm can be used for predictions (based on the distributions inside each cluster, weighed with the distance from the current point to each cluster) allows executing a prediction for each point. As all the columns are input and predict, each prediction is based on all other columns.
So, for each attribute, the add-in computes a likelihood of the actual value with the predicted value. This likelihood is compared across attributes and the attribute which yields the lowest likelihood is considered the most probable cause for the anomaly.
The likelihood calculation differs for discrete and continuous attributes.
For discrete attributes, the add-in needs to compute the ratio between the predicted probability of the current value and the predicted probability of the most likely value.
Effectively, this can now be expressed in DMX as PredictProbability([BikeBuyer], T.[BikeBuyer])/PredictProbability([BikeBuyer]).
However, at the time of the add-in implementation (2005 timeframe), using an input rowset column as a query parameter was not supported, so we ended up asking for the actual attribute value as well as the whole histogram, and the lookup to find the actual state's predicted probability happends on the client.
Therefore, the DMX statement looks likeL
T.[BikeBuyer], PredictHistogram([BikeBuyer]), PredictProbability([BikeBuyer]),
Again, the first column (T.BikeBuyer) is only included to allow the client side code to look up the specific probability of the current input's Bike Buyer in the prediction histogram -- a workaround for the fact that DMX does not support the JOIN operator.
With this calculation, the score for a discrete is a number between 0 and 1(1 if the actual state is the most likely state as predicted).
For a continuous attribute, the likelihood is computed based on actual value, predicted value and prediction variance, like below:
distCol = Math.Exp(- Math.Pow((actualCol- predCol),2) / (2 * varCol) );
hence, the square of the distance between actual and predicted divided by twice the variance. The negative exponential assures that, when actual is close to predicted, the likelihood is closer to 1, while if the distance grows the likelihood tends to 0
Now, the likelihood scores are compared for all attributes that make a row, and the least likely one is marked as the cause of the exception (yellow highlighting)
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm- 답변으로 표시됨Jin ChenMSFT, 중재자2009년 8월 26일 수요일 오전 9:17

