datamining a blackbox or blackboxing a datamine
I was given a huge table (a few GB) with 44 columns.
The meaning of most of the columns is unclear to me
(they have "unhuman" headings like VGJNPK).
The task is to data mine the table and detect exceptions (anomalies or outliers)
using a black-box approach, but with the Microsoft Clustering algorithm of MS SQL Server 2008. Although I know that more than half of the columns are IDs of transactions, account numbers, credit card numbers, etc.,
the client insists on black-box detection of exceptions.
Also, one can imagine that it would be possible to find typos in credit card numbers (?)
So I'd like to discuss the pros and cons, and the possible approaches to such a task...
Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
Edited by Guennadiy Vanine, Wednesday, July 1, 2009, 7:12
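On the question of typos in credit card numbers: payment card numbers end in a Luhn check digit, so a mistyped number can often be caught without knowing anything else about the row. The Luhn check detects any single wrong digit and most swaps of two adjacent digits. Below is a minimal sketch in Python, outside SQL Server; the function names `luhn_valid` and `luhn_pass_rate` and the 12-digit cutoff are illustrative assumptions, not part of SSAS.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


def luhn_pass_rate(values, min_digits=12):
    """Share of long digit strings in a column that pass the Luhn check.

    min_digits is a hypothetical cutoff to skip short codes.
    """
    candidates = [str(v) for v in values
                  if sum(ch.isdigit() for ch in str(v)) >= min_digits]
    if not candidates:
        return 0.0
    return sum(luhn_valid(v) for v in candidates) / len(candidates)
```

Random digit strings pass the check about one time in ten, so an unnamed column whose pass rate is close to 1.0 is probably a card number column, and the few rows that fail are candidate typos.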
All replies
When building the structure/model, you can have SSAS determine the type of data in each column. The column names are irrelevant: I guess that when you hand back your results the columns will be decoded into something in English, and I don't think the algorithms would know a credit card number from a National Insurance number.
The reason you may like to know the column names is so that you can specify the usage type (Predict, PredictOnly). A lot of the datasets you can pick up on the web come with useless column names. They can be decoded later. In short, I think the names are nice to have but not necessarily essential.
Allan
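The idea of letting the tool work out the type of each column can be imitated outside SSAS to get a first look at an unnamed table. The sketch below profiles one column from its values alone and assigns a rough label that loosely mirrors SSAS content types such as Key, Discrete and Continuous. The function name `profile_column` and both thresholds are assumptions chosen for illustration; this is not how SSAS itself detects types.

```python
def profile_column(values, key_uniqueness=0.9, max_discrete=20):
    """Summarise one column without using its name.

    key_uniqueness and max_discrete are hypothetical thresholds.
    """
    non_null = [v for v in values if v not in (None, "")]
    n = len(non_null)
    distinct = len(set(non_null))

    # Count the values that can be read as numbers.
    numeric = 0
    for v in non_null:
        try:
            float(v)
            numeric += 1
        except (TypeError, ValueError):
            pass

    uniqueness = distinct / n if n else 0.0
    numeric_share = numeric / n if n else 0.0

    if uniqueness > key_uniqueness:
        kind = "key-like"      # almost every row differs: IDs, card numbers
    elif distinct <= max_discrete:
        kind = "discrete"      # few distinct values: flags, codes
    elif numeric_share == 1.0:
        kind = "continuous"    # many distinct numeric values: amounts
    else:
        kind = "text"
    return {"rows": n, "distinct": distinct, "uniqueness": uniqueness,
            "numeric_share": numeric_share, "kind": kind}
```

Columns labelled key-like are usually poor clustering inputs, because an identifier that differs on every row says nothing about which rows resemble each other.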
Allan,
as I could see from other discussions, I should know what I can and cannot predict, based on which columns, and how.
If I do not decide from the column names, then I decide by looking at the results of data mining.
So my interest is in a description of such processing: choosing the model and tuning its configuration.
For example,
Bogdan Crivat, in the thread
http://social.msdn.microsoft.com/Forums/fr-FR/sqldatamining/thread/a8b8e476-534b-426b-b4bd-bf26e8f43f38
"Question Regarding Scenario On DataMining",
says:
- "One more suggestion: if IsApproved and IsDeclined describe basically the same thing (i.e., IsApproved = 1 - IsDeclined), I would ignore the IsDeclined column and only build a mining model with IsApproved as target. This way, the model will predict 1 or 0 for the IsApproved column, and you can conclude that 0 means IsDeclined."
What is wrong with having one column depend on another? Look at it from the point of view of finding possible typos.
So, if I have a strong dependence (correlation) between columns, it is repetition and hence not important.
At the other extreme there is no correlation at all; for example, a credit card number is a quite independent attribute.
How would I repeat or generalize such decisions and steps with black-box columns before me?
And why should I do it at all?
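Both readings of a strong dependence can be tried on unnamed columns. A nearly perfect correlation marks a pair as redundant for modelling, and the few rows that break the dependence are exactly the candidate typos. Here is a minimal sketch in plain Python, assuming the columns are already numeric; `pearson`, `redundant_pairs`, `rule_violations` and the 0.95 threshold are illustrative choices, not an SSAS feature.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def redundant_pairs(table, threshold=0.95):
    """Column pairs with |r| >= threshold (hypothetical cutoff).

    `table` maps a column name to its list of numeric values.
    """
    found = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


def rule_violations(xs, ys):
    """Row indices that break the expected rule y = 1 - x."""
    return [i for i, (x, y) in enumerate(zip(xs, ys)) if x + y != 1]
```

With the IsApproved and IsDeclined example from the quote, `redundant_pairs` would report the pair, so one of the two columns can be left out of the model, while `rule_violations` lists the rows where both flags agree and which therefore deserve a second look.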
Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
So, from your description:
- You know nothing about the dataset
- You know nothing about the problem (is this a classification problem?)
Now, the way I might look at this would be to say "Tell me something". I would use clustering to do this. I would then have something I can hang my hat on, i.e. the cluster. I would then look at something like Naive Bayes to ask "Given the value in [Cluster], which attributes are important to the cluster value?"
As I said in the first post, it would be very nice to know what the column names are so that we can make decisions on them, but then again, just because we have "Bank Balance" as an attribute doesn't mean we want to predict it.
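To make the "cluster first, then look for exceptions" idea concrete, here is a rough sketch that uses a plain k-means as a stand-in; it is not the Microsoft Clustering algorithm. Rows are clustered, each row is scored by its distance to the nearest cluster centre, and rows far beyond the typical distance are flagged. The names `kmeans`, `outlier_scores` and `flag_outliers`, the first-k-rows initialisation and the factor of 3 are all assumptions made for illustration.

```python
from math import dist  # Python 3.8+


def kmeans(points, k, iterations=20):
    """Very small k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def outlier_scores(points, centroids):
    """Distance from each point to its nearest centroid."""
    return [min(dist(p, c) for c in centroids) for p in points]


def flag_outliers(points, k=2, factor=3.0):
    """Indices of points whose score exceeds factor * median score."""
    scores = outlier_scores(points, kmeans(points, k))
    median = sorted(scores)[len(scores) // 2]
    return [i for i, s in enumerate(scores) if s > factor * median]


rows = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1),
        (10, 11), (11, 10), (11, 11), (5, 30)]
print(flag_outliers(rows))  # the last row lies far from both groups
```

In practice the numeric columns would need scaling first and key-like columns would be left out. The choice of k also matters: with too many clusters an outlier can end up as the centre of its own cluster and score as perfectly normal.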

