General discussion: datamining a blackbox or blackboxing a datamine

  • Wednesday, July 1, 2009 07:12, Guennadiy Vanine
     

    I was given a huge table (a few GB) with 44 columns.

    The meaning of most of the columns is unclear to me
    (they have "unhuman" column headings like VGJNPK).

    The task is to data-mine the table and find exceptions (anomalies or outliers),
    treating it as a black box but using the Microsoft Clustering algorithm of MS SQL Server 2008.

    Though I know that more than half of the columns are IDs of transactions, account numbers, credit cards, etc.,
    the client insists on black-box detection of exceptions.

    Also, one can imagine that it would be possible to find typos in credit card numbers (?)

    So, I'd like to discuss the pros and cons, and what the possible approaches to such a task are...


    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
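
On the question of typos in credit card numbers: if one of the opaque columns really does hold payment card numbers, the standard Luhn check digit gives a cheap test that needs no clustering at all. The sketch below is only an illustration under that assumption; the function names `luhn_valid` and `suspect_values` are hypothetical, and which column (if any) holds card numbers is not known from the post. Luhn detects every single-digit error and most swaps of adjacent digits, so it will not catch every typo.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces, dashes and any other non-digit characters.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


def suspect_values(values):
    """Return the values that fail the Luhn check (candidate typos)."""
    return [v for v in values if not luhn_valid(v)]


if __name__ == "__main__":
    # Hypothetical column content; the first entry is a well-known test number.
    column = ["4111111111111111", "4111111111111121"]
    print(suspect_values(column))
```

A column in which almost every value passes the check is probably a card-number column, and the few values that fail are the ones worth reviewing by hand.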

All replies

  • Wednesday, July 1, 2009 18:02, Allan Mitchell (MVP)
     

    When building the structure/model, you can have SSAS determine the type of data in each column. The column names are irrelevant: I guess that when you hand back your results the columns will be decoded into something in English, and I don't think the algorithms would know a credit card number from a National Insurance number.

    The reason you may like to know the column names is so that you can specify the usage type (Predict, PredictOnly). A lot of the datasets you can pick up on the web are given to you with useless column names; they can be decoded later.

    In short, I think the names are nice to have but not essential.

    Allan
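
As a rough illustration of working without column names, a column can be profiled from its values alone. The sketch below is not what SSAS does internally; it is a standalone stand-in with hypothetical names (`profile_column`, `likely_identifier`) and an assumed threshold of 0.95. The idea is that a column whose values are almost all distinct behaves like an identifier (transaction ID, account number) and probably deserves a different usage than a low-cardinality attribute.

```python
def profile_column(values):
    """Summarise one opaque column: inferred kind and share of distinct values."""
    # Treat None and empty strings as missing.
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return {"kind": "empty", "distinct_ratio": 0.0}

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    kind = "numeric" if all(is_number(v) for v in non_null) else "text"
    distinct_ratio = len(set(non_null)) / len(non_null)
    return {"kind": kind, "distinct_ratio": distinct_ratio}


def likely_identifier(profile, threshold=0.95):
    """Heuristic (an assumption): near-unique columns behave like IDs."""
    return profile["distinct_ratio"] >= threshold


if __name__ == "__main__":
    # Hypothetical sample values for two unnamed columns.
    print(profile_column(["A1", "A2", "A3", "A4"]))
    print(profile_column(["1", "0", "1", "1"]))
```

On a table of several GB this would normally be run on a sample, since the set of distinct values has to fit in memory.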
  • Thursday, July 2, 2009 03:16, Guennadiy Vanine
     
    Allan,
    as I could see from other discussions, I should know what I can and cannot predict, based on which columns, and how.

    If I do not decide from the column names, then I decide by looking at the results of data mining.

    So, my interest is in the description of such processing: choosing the model and tuning its configuration.

    For example,
    Bogdan Crivat, in the thread
    http://social.msdn.microsoft.com/Forums/fr-FR/sqldatamining/thread/a8b8e476-534b-426b-b4bd-bf26e8f43f38
    "Question Regarding Scenario On DataMining",
    says:
    • "One more suggestion: if IsApproved and IsDeclined describe basically the same thing (i.e., IsApproved = 1 - IsDeclined), I would ignore the IsDeclined column and only build a mining model with IsApproved as target. This way, the model will predict 1 or 0 for the IsApproved column, and you can conclude that 0 means IsDeclined."
    OK, how would I have discovered such dependencies by modelling? By which procedures?

    What is wrong with having one column depend on another? Look at it from the point of view of finding possible typos.

    So, if there is a strong dependence (correlation) between two columns, it is repetition and hence not important.
    At the other extreme there is no correlation at all; for example, a credit card number is a quite independent attribute.

    How would I repeat or generalize such decisions and steps with black-box columns before me?
    And why should I do it at all?
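
One possible procedure for spotting a dependency such as IsApproved = 1 - IsDeclined without knowing the column names is to compute a pairwise correlation over the numeric columns and list the pairs that are almost perfectly correlated, positively or negatively. The sketch below is a minimal illustration, not a method taken from the quoted thread; the helper names, the 0.98 threshold, and the column headings other than VGJNPK are hypothetical. Pearson correlation only captures linear dependence, so other kinds of dependency would need a different measure.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def redundant_pairs(table, threshold=0.98):
    """List column pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


if __name__ == "__main__":
    # Hypothetical data: QXZTRL is exactly 1 - VGJNPK, HHWQPD is unrelated.
    table = {
        "VGJNPK": [1, 0, 1, 1, 0],
        "QXZTRL": [0, 1, 0, 0, 1],
        "HHWQPD": [3, 1, 4, 1, 5],
    }
    for a, b, r in redundant_pairs(table):
        print(a, b, round(r, 3))
```

This also bears on the typo question: once a pair is known to be strongly dependent, the few rows that break the dependency are natural candidates for data-entry errors, while one of the two columns can be dropped as a model input.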
      

    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
  • Thursday, July 2, 2009 04:52, Allan Mitchell (MVP)
     
    So, from your description:

    • You know nothing about the dataset
    • You know nothing about the problem (is this a classification problem?)

    Now, the way I might look at this would be to say "Tell me something". I would use clustering to do this. I would then have something I can hang my hat on, i.e. the cluster. I would then look at something like Naive Bayes to ask "Given the value in [Cluster], which attributes are important to the cluster value?"

    As I said in the first post, it would be very nice to know what the column names are so we can make decisions on them, but then again, just because we have "Bank Balance" as an attribute doesn't mean we want to predict it.
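
To make the "cluster first, then look at what stands out" idea concrete, here is a minimal sketch of clustering-based exception detection. It uses plain k-means written from scratch, which is not the Microsoft Clustering algorithm, and the function names, the seeding with the first k rows, and the "3 times the median distance" rule are all assumptions made for illustration. Rows that sit far from every cluster centre are reported as candidate exceptions.

```python
from math import dist  # available from Python 3.8


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic initial centroids."""
    # Assumption: seed the centroids with the first k points for reproducibility.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def flag_outliers(points, centroids, factor=3.0):
    """Indices of points farther from their nearest centroid than factor * median distance."""
    distances = [min(dist(p, c) for c in centroids) for p in points]
    median = sorted(distances)[len(distances) // 2]
    return [i for i, d in enumerate(distances) if d > factor * median]


if __name__ == "__main__":
    # Hypothetical two-column numeric data: two tight groups and one stray row.
    points = [
        (1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (9.1, 8.9),
        (8.8, 9.2), (1.1, 1.0), (9.0, 9.1), (5.0, 30.0),
    ]
    centroids = kmeans(points, k=2)
    print(flag_outliers(points, centroids))
```

Real columns would need encoding and scaling before distances mean anything, and near-unique ID columns would probably have to be left out, since they make every row look equally far from every other.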