Background: I have a table called error_events. Each row represents an event that occurred, and the columns contain the error codes that were generated for that event. Error codes are text strings. Most events have only one or two error codes listed, but the table allows for up to ten, one error code per column. The order of the error codes is important. The final column in the table is called chosen_code, and its value can be determined by looking at the list of error codes and applying certain rules. There are over 1,000 possible error codes, so I can't just write a simple if...then...else statement to populate chosen_code!
My question is: Can I use a data mining model to help with assigning a value to the chosen_code column? If so, which algorithm would be best (sequence clustering, perhaps)?
I have access to historical data (500,000 rows) which contains data that has been manually checked (twice!). I was thinking that I could use this to teach the model.
End result I'm hoping for: I give the model some data with the chosen_code column blank. The model looks at previous events with the same sequence of error codes and assigns a chosen_code based on what has happened in the past.
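To make the idea concrete, here is a minimal Python sketch of the simplest version of that behaviour: group the historical rows by their exact ordered sequence of error codes and pick the chosen_code that was assigned most often for that sequence. The function names and the error codes (E100, E205, and so on) are made up for illustration, and the sketch assumes each row has already been read into a list of codes with the empty columns dropped.

```python
from collections import Counter, defaultdict


def build_lookup(history):
    """Map each exact, ordered sequence of error codes to its most common chosen_code.

    history: iterable of (codes, chosen_code) pairs, where codes is the ordered
    list of error codes for one event (hypothetical layout, empty columns dropped).
    """
    votes = defaultdict(Counter)
    for codes, chosen in history:
        # A tuple keeps the order, so ("E100", "E205") and ("E205", "E100") differ.
        votes[tuple(codes)][chosen] += 1
    return {seq: counter.most_common(1)[0][0] for seq, counter in votes.items()}


def assign(lookup, codes):
    """Return the historical chosen_code for this sequence, or None if never seen."""
    return lookup.get(tuple(codes))


# Made-up historical rows, purely for illustration.
history = [
    (["E100", "E205"], "E205"),
    (["E100", "E205"], "E205"),
    (["E205", "E100"], "E100"),
    (["E300"], "E300"),
]
lookup = build_lookup(history)
```

Here assign(lookup, ["E100", "E205"]) gives "E205", while the reversed sequence gives "E100". A sequence that never appears in the history returns None, which is the case where a real model (or a similarity-based fallback) would be needed.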
Hope this makes sense.
Could you please give more details to explain why the order of the error codes is important and how the code in the final column is determined? Of the algorithms provided by Microsoft SQL Server Analysis Services, I think the Microsoft Association Rules algorithm could be the right one. But if the order of the error codes is the key factor for chosen_code, the Association algorithm can't be used here. Also, what inputs are used to determine the value of the chosen_code column? Could you give us some example data to illustrate the question?
Thanks for the reply.
The order of the columns is important because it gives you a sequence of events. The code of the final column is determined by a complex set of rules. Sorry, I can't provide examples. I've had a look at some other data mining algorithms and I think nearest neighbor is probably the way forward. Unfortunately, I don't think this algorithm is present in SSAS (2008 R2).
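For what it's worth, a nearest-neighbor approach over ordered code sequences can be sketched outside SSAS in a few lines of Python. This is only an illustration under assumptions: the similarity measure (difflib.SequenceMatcher from the standard library, which respects order), the value of k, the function names and the sample codes are all my own choices, not anything taken from the real rules.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Order-sensitive similarity between two lists of error codes, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def knn_chosen_code(history, codes, k=3):
    """Vote among the k historical rows whose code sequences are most similar.

    history: list of (codes, chosen_code) pairs (hypothetical layout).
    """
    ranked = sorted(history, key=lambda row: similarity(row[0], codes), reverse=True)
    votes = Counter(chosen for _, chosen in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up historical rows, purely for illustration.
history = [
    (["E100", "E205", "E310"], "E310"),
    (["E100", "E205"], "E205"),
    (["E100", "E205", "E311"], "E205"),
    (["E400"], "E400"),
]

print(knn_chosen_code(history, ["E100", "E205", "E999"], k=3))
```

Note that this scans every historical row for each new event, so with 500,000 rows it would probably need an exact-match lookup first, with the similarity search used only for sequences that have never been seen.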