Sequence Clustering in SSAS 2005
-
Wednesday, May 27, 2009 12:12 AMI have had no problem with most of the algorithms, but sequence clustering is proving to be a challenge. I would like to use some healthcare data to predict the likely state of someone in the next year. States can be on dialysis, dead, low-cost, high-cost, etc. I am linking by year as the key sequence. If I make it discrete, I get an error saying it should be continuous; if I make it continuous, it tells me that a key sequence needs to be discrete.
I have looked at the Adventureworks tutorial, and it seems to me that it should be varchar not int. I am not sure what goes in the nested table and what goes in the case table.
It appears to me that there should be 3 elements in the nested table: primary key, sequence key, and predicted value and that all the predictors should be in the case table. Could someone give me some guidance on the data types, content types, and relationships to make the model run?
All Replies
-
Thursday, May 28, 2009 8:46 PMModerator
Dear MS DM team
This is a adequate question and are we not here to demystify data mining?
A good answere on this question will help dm-rookies like my self to understand this algorithm.
Best regards
Thomas Ivarsson -
Saturday, May 30, 2009 12:50 AMModerator
The nested table should look, more or less, like
StateTable TABLE PREDICT
(
Year LONG KEY SEQUENCE
State TEXT DISCRETE PREDICT
)
Now, if you use the raw year value (such as 1997) you will not get good results. The sequence column should be the ordinal in a sequence. For your situation, I would suggest the following:
- for each patient, detect the first year with information (YearStart)
- Convert, for each patient, the year information to something like [Year-YearStart]
Therefore, your patient data should look like below (before training)
Patient 1 (first record in 1997)
0 (=1997-1997), State 1
1 (=1998-1997), State 2
2 (=1999-1997), State 3
Regarding the data relationships:
- your relational case table will probably contain the patient id (as key) as well as any other information about the patient
- your relational nested table should contain the patient id, the calculated sequence year (year-yearStart) and the state for each year.
Effectively, you have the standard one to many relationship on Patient ID between case and nested table. The year (Sequence) column is the ordinal in the sequence for each patient
Hope this helps
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm- Marked As Answer by Thomas IvarssonModerator Saturday, June 13, 2009 12:04 PM
-
Thursday, June 04, 2009 6:57 PMBogdan,
Thanks--we have made progress but are not there yet. The model deployed and processed; however the cluster viewer gave me error messages or was empty. And it was as if there was no model to predict from on that tab. Not sure if I can attach the Word .doc with the error messages, but I have saved it.
Text:
Under Cluster Diagram: Load cluster profile failed: Specified cast is not valid
Under State Transitions: Object reference not set to an instance of an object
Under Microsoft Mining Content viewer: File System Error: The record id is incorrect.
I have looked in SQL Server Books Online and note that there does not seem to be any guidance on what the common error messages are and what the likely cause is. Something like that would be very helpful, since this is "data mining for the masses." Thanks.
Regards,
Sam
-
Tuesday, June 09, 2009 8:27 PMBogdan,
It took some doing and detective work, but I figured out why my model wasn't running--I got it to run with complete results. Only problem is that there was a single cluster.
I have had this problem with the Cluster algorithm as well:
What makes a model create clusters that it can differentiate?
Thanks,
Sam -
Wednesday, June 10, 2009 4:39 AMModeratorSam
Sorry for the late reply -- how many cases do you have in your training set?
You can try playing with the MINIMUM_SUPPORT parameter and lower the default value (which is 10) as well as try to chnage the number of initial clusters by changing the value of the CLUSTER_COUNT parameter
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm -
Wednesday, June 10, 2009 5:47 AM
Bogdan,
Thanks, although I was a bit panicked, I have been messing with the parameters and am starting to understand some of it. (BTW I know that you and Jamie have to be stretched thin. ) I started setting the defaults to 0 for CLUSTER_COUNT to see what happened (30 plus clusters) and then began altering minimum support. One thing that I can't get to work is the predictsequence(). Went to http://technet.microsoft.com/en-us/library/ms132157(sql.90).aspx but can't seem to get the syntax right. I assume that if I want a Markov model, then n=1 but I am wondering what the "table column referenc" should be--is it my predicted state or the state in the data?
Also, should I use the standard 70-30 split between training and test sets, or is that not relevant here?The other thing that is strange is there is very little interaction among the clusters, there is a self-directed (u-turn arrow) on all of them and one or two have probabilities coming from elsewhere, but little interaction in the last tab to the right of the mining model viewer (am home and doing this from memory)--not sure if that means that states don't change from period to period or what.
The major reason I am trying the sequence clustering is that if we want to predict severity of illness for chronically ill people, we are faced with the problem of regression to the mean--just because a person has high medical expenses one period doesn't mean he/she will continue to be sick or have high expenses the next period, because expenses regress to the mean. By the same token, a person with low expenses and other characteristics that I was hoping the algorithm could pick up, might die or being expensively sick in the next period.
Would like to iron out the details of the model--format, syntax, etc. and then go back to see if I can find better variables. This algorithm is more challenging than the others--haven't done any time series. Thoughts on these questions and whether I am using the right approach?
-
Thursday, June 11, 2009 4:25 AMModerator
(Some partial answers, more will follow)
- query:
Assuming your model has a structure similar to what I suggested in a previous post, the prediction query should look like:
SELECT PredictSequence(StateTable, 2) AS nextSteps
FROM YourModel NATURAL PREDICTION JOIN
(
SELECT 50 AS Age, 'Brown' AS HairColor ..., // Optional case level input attributes -- if none, leave just the SELECT word
(
SELECT 1 AS Year, 'State1' AS State UNION
SELECT 2 AS Year, 'State2' AS State UNION
SELECT 3 AS Year, 'State3' AS State
) AS StateTable // the nested table containing the historical states
) AS T
- 70/30 split -- Unfortunately, there is no built-in feature for SQL Server data mining to use the holdout set for sequence prediction accuracy measurements. You could write your own code, but unless you intend to do this, there is no real advantage in using a holdout data set.
On the general modeling questions -- did you consider using regular regression models? I assume you only use information over a fixed number of years -- you could probably use the nested table feature to model the sequential information in input attributes that predict the state or expenses for the next year
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm -
Thursday, June 11, 2009 10:24 PMBogdan,
I came to work with high hopes but am still not getting there. Here is the model using BI Studio that I have--fieldnames made generic. (I will try using the other algorithms with nested tables while waiting for your help with this.) It is clear to me that the Sequence Clustering algorithm is at a different level than the others, which seem so much easier by comparison.
Code is below.
Thanks,
Sam
SELECT
t.[m_id],
t.[years],
t.[state],
PredictSequence([State Flag],2),
(SELECT [State Flag] FROM [Cms Data1 suggested SCL].[Years1])
From
[Cms Data1 suggested SCL]
PREDICTION JOIN
SHAPE {
OPENQUERY([Test Sam],
'SELECT
[m_id],
[years],
[state],
[gender],
[race],
[count1],
[count2],
[count3],
[sum],
[count4],
[diag_sum],
[code1],
[code2],
[count5],
[region]
FROM
[dbo].[cms_data1]
ORDER BY
[m_id]')}
APPEND
({OPENQUERY([Test Sam],
'SELECT
[seq],
[state],
[m_id]
FROM
[dbo].[years1]
ORDER BY
[m_id]')}
RELATE
[m_id] TO [m_id])
AS
[years1] AS t
ON
[Cms Data1 suggested SCL].[Gender] = t.[gender] AND
[Cms Data1 suggested SCL].[Race] = t.[race] AND
[Cms Data1 suggested SCL].[count1] = t.[count1] AND
[Cms Data1 suggested SCL].[count2] = t.[count2] AND
[Cms Data1 suggested SCL].[count3] = t.[count3] AND
[Cms Data1 suggested SCL].[discrete1] = t.[discrete1] AND
[Cms Data1 suggested SCL].[count4] = t.[count4] AND
[Cms Data1 suggested SCL].[Sum] = t.[sum] AND
[Cms Data1 suggested SCL].[Code1] = t.[code1] AND
[Cms Data1 suggested SCL].[Code2] = t.[code2] AND
[Cms Data1 suggested SCL].[count5] = t.[count5] AND
[Cms Data1 suggested SCL].[Region] = t.[region] AND
[Cms Data1 suggested SCL].[Years1].[Seq] = t.[years1].[seq] AND
[Cms Data1 suggested SCL].[Years1].[State] = t.[years1].[state_flag]
-
Friday, June 12, 2009 5:48 AMModeratorSo, I understand your model looks like below, correct?
Model [Cms Data1 suggested SCL]
(
Gender,
Race,
count1,
count2
count3,
discrete1,
count4,
Sum,
Code1,
Code2,
count5,
Region,
Years1 TABLE
(
Seq,
State
)
)
If this is the case, then the query is slightly wrong. Change the beginning to something like below:
SELECT
t.[m_id],
t.[years],
t.[state],
PredictSequence([Years1],2),
From
[Cms Data1 suggested SCL]
PREDICTION JOIN -- rest of your statement here
PredictSequence takes as argument the nested table name, and returns sequences together with their qualifiers (I assume this is what you wanted to obtain with the subselect, SELECT [State Flag] FROM [Cms Data1 suggested SCL].[Years1]))
Apologies for the shameless plug, but I believe you will find very interesting the Sequence Clustering chapter in the Data Mining with SQL Server 2008 book. You can use the wonderful Look Inside stuff on Amazon to search for PredictSequence and see some more complex examples
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm -
Friday, June 12, 2009 9:44 PMBogdan,
I now have results and I used "FLATTENED" so I could put them back in the server and play with them.
One oddity--one of my states, as this is healthcare, is Death. Is there any way to tell the model that if it hits or predicts that, it can't have low or moderate or even high healthcare costs in the next period--i.e. that death is terminal, so the state stays there?
Also, do I have to use a sub-select to get a predictprobability() in sequence clustering? It doesn't like the syntax I am using, particularly since the status_flag (state) is discrete and alphabetical.
If, on the other hand, the model knows something about bringing folks back, I think both the medical and religious communities would be very interested in ouor joint research paper. :)
Thanks,
Sam -
Friday, June 12, 2009 10:12 PMModeratorFot the probability, try this:
SELECT FLATTENED
t.[m_id],
t.[years],
t.[state],
(
SELECT $SEQUENCE,
(SELECT [status_flag], $Probability, $Support FROM PredictHistogram([status_flag])) AS PredictedSequence
FROM PredictSequence([Years1],2)) AS [Sequences]
From[Cms Data1 suggested SCL]
PREDICTION JOIN ...
or, for complete histogram results:SELECT FLATTENED
t.[m_id],
t.[years],
t.[state],
(
SELECT $SEQUENCE, PredictHistogram([status_flag]) AS Histogram
FROM PredictSequence([Years1],2)) AS [Sequences]
From[Cms Data1 suggested SCL]
PREDICTION JOIN ...
Unfortunately, there is no way to tell the model that a certain state is always a sequence end.
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm -
Thursday, June 18, 2009 5:07 PMBogdan, Thanks--I got the model to run once and get the results I need--although the second time it crashed with an out-of-memory exception. Not sure if there is anyone here savvy enough to figure it out--I have a feeling I surprised a few folks by sending them the DMX query that caused the problem. Two quick questions:
1. Is there a convenient way to compare models? Even when I use multiple Seq CL models with different variables ignored in BI Studio, I can't get an ROC curve or Correct-Incorrect tables to generate.
2. What is the best way to export output from a DMX query in Management Studio--I don't have the option of exporting to text rather than the grid as I do with SQL?
Sam -
Friday, June 19, 2009 7:47 AMModerator
Could you please provide some more details about the out of memory exception? I would like to at least figure out whether it comes from the server or from the tools. Depending on the source, I may be able to provide a workaround.
1. Unfortunately, our model evaluation tools are mostly working for predictive models. I believe you are treating you model as a predictor (you care mostly about the next step in the sequence) -- if this is correct you could try to manually replicate the code for the classification matrix or for lift charts / ROC, but there is no feature in BI Dev Studio allowing you to easily evaluate/compare sequence clustering models
2. If you execute the query in BI Dev Studio instead of Management Studio (just go to the Mining Model Prediction tab and select the SQL button to type in your statement) then, once you get the result, you can save it to a database. In the results pane there is a Save button ressembling a floppy -- it allows you to chose or create a new data source. Alternatively, you could use the Data Mining Query Task in Integration Services which is intended exactly for this purpose
the Data Mining Query task is the only option which scales well for large result sets
bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm -
Monday, June 22, 2009 9:09 PMBogdan,
I continue to experience the error, which comes back as:
Executing the query ...
Obtained object of type: Microsoft.AnalysisServices.AdomdClient.AdomdDataReader
Exception of type 'System.OutOfMemoryException' was thrown.
The DMX query is of the form:SELECT FLATTENED
t.[m_id],
t.[years],
t.[status_flag],
(
SELECT $SEQUENCE,
(SELECT [Status Flag], $Probability, $Support FROM PredictHistogram([Status Flag])) AS PredictedSequence
FROM PredictSequence([Years1],2)
) AS [Sequences]
From
[Cms Data1 suggested SCL]
PREDICTION JOIN
SHAPE {
OPENQUERY([Test Sam],
'SELECT
[m_id],
[years],
[status_flag],
[age],
[gender],
[race],
[H_count],
[I_count],
[M_count],
[O_count],
[discrete1],
[D_count],
[d_sum],
[r_paid],
[discrete_code],
[sum_paid],
[E_count],
[e_sum],
[amt_paid],
[change06_07],
[change07_08],
[z_score],
[f_code],
[region]
FROM
[dbo].[data1]
ORDER BY
[m_id]')}
APPEND
({OPENQUERY([Test Sam],
'SELECT
[seq],
[status_flag],
[m_id]
FROM
[dbo].[years1]
ORDER BY
[m_id]')}
RELATE
[m_id] TO [m_id])
AS
[years1] AS t
ON
[Cms Data1 suggested SCL].[Gender] = t.[gender] AND
[Cms Data1 suggested SCL].[Race] = t.[race] AND
[Cms Data1 suggested SCL].[H count] = t.[H_count] AND
[Cms Data1 suggested SCL].[I count] = t.[I_count] AND
[Cms Data1 suggested SCL].[O count] = t.[O_count] AND
[Cms Data1 suggested SCL].[Discrete1] = t.[discrete1] AND
[Cms Data1 suggested SCL].[D count] = t.[d_count] AND
[Cms Data1 suggested SCL].[D Sum] = t.[d_sum] AND
[Cms Data1 suggested SCL].[Discrete Code] = t.[discrete_code] AND
[Cms Data1 suggested SCL].[E count] = t.[e_count] AND
[Cms Data1 suggested SCL].[Region] = t.[region] AND
[Cms Data1 suggested SCL].[F Code] = t.[f_code] AND
[Cms Data1 suggested SCL].[Years1].[Seq] = t.[years1].[seq] AND
[Cms Data1 suggested SCL].[Years1].[Status Flag] = t.[years1].[status_flag]
And the result set changes to a large red X

