none
continuous vs. discrete attribute in DM

    Question

  • I have a huge table for SSAS DM detecting exceptions (anomalies or outliers) with Microsoft Clustering Algorithm.

    One of the column has dates in char(8) with values like "20060412".

    Should I care about converting this coulmn to DateTime?

    What are possible cons. and pros?

    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
    Wednesday, July 01, 2009 6:54 AM

All replies

  • Hi

    I would be inclined to "enhance" the attribute.  I would probably break it out into the constituent parts.

    Month Name
    Year
    Day Of Month
    Day of Week
    etc

    still hold the original value as well but by breaking it out it will be easier to see meaning (I think)


    Allan
    Thursday, July 02, 2009 8:25 AM

  • I would be inclined to "enhance" the attribute.  I would probably break it out into the constituent parts.

    Month Name
    Year
    Day Of Month
    Day of Week
    etc

    still hold the original value as well but by breaking it out it will be easier to see meaning (I think)
    Thanks, Allan,
    my client insists that I should restore back to original state his table in final DM solution.

    And in original table the date column is in char(8) (as well as there is no unique key).

    So, I'd like to understand what is giving me the conversion of char(8) column to date format and what I am loosing without it.
    For me, 8 symbols of date is very similar to discrete values.
    But during  DM modeling
    http://social.msdn.microsoft.com/Forums/en-US/sqldatamining/thread/efe75552-8b2d-4fa6-bebc-10bac9103fef
    I noticed that this (almost discrete, by dates) date format values (in source data) are treated as datetime in DM giving rather weird results.


    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
    Friday, July 03, 2009 8:03 AM
  • Hi, Guennadi

    On preserving the data type -- in your Data Source View, you can replace a table with a named query. Just right click on the table and select Replace with Named Query. In the named query, you could use T-SQL syntax to split your date column into components and convert each component to a numeric type, or convert the whole column to a SQL date type, without altering the actual table data

    A numeric or datetime column allows the data mining algorithms to treat the column as continuous. Continuous columns have a different distribution, characterized by mean and standard deviation and, in certain algorithms (such as the Microsoft Decision Trees), can be represented as intervals with different properties. For example, you can get rules such as "if DATE is smaller than 7/7/2009 THEN PredictableColumn=Value". This kind of analysis cannot be performed by the data mining algorithm if the values appear as distinct strings
    bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm
    Tuesday, July 07, 2009 3:51 AM
  • A numeric or datetime column allows the data mining algorithms to treat the column as continuous. Continuous columns have a different distribution, characterized by mean and standard deviation and, in certain algorithms (such as the Microsoft Decision Trees), can be represented as intervals with different properties. For example, you can get rules such as "if DATE is smaller than 7/7/2009 THEN PredictableColumn=Value". This kind of analysis cannot be performed by the data mining algorithm if the values appear as distinct strings
    bogdan crivat [sql server data mining] / http://www.bogdancrivat.net/dm


    Hi, Bogdan,
    thanks!
    This starts to be more exciting!

    I am interested in enforcing the rules, though I am not that much proficient  in them.
     Where can I read more on their application.

    Are the applicable to models based on any algorithms
    (I am interested in DT and clustering one)?

    suppose I am interested to control the distribution?

    Suppose I am DM-ing (predicting) on a flat table which has column with 3 fixed char(3) values:
    'abc', 'def' and 'xyz'.
    I need to enforce the rule that predicted values for this (discrete) attribute to be always amongst one of these 3 known values and in the case of querying for anotheк (out of these three) the predicted probability of "another" value to be always staright 0 (or, at least, very low).

    Possible generalizations to this:
    - the "prohibited" values might be any pre-determined values with known (fixed for discrete and ranges for continuous) prohibited exceptions;
    - there are some values (for dicrete attributes) or ranges (for continuous) which I'd like to enforce as predicted with high(er) probabilities                                                                                                             How

    How would I enforce it in DM modeling?

    In other words how can I make the predictions manageable (and manageable according to some descriptions or expressions)?                                                                                                                                                                                                                    


    Guennadi Vanine -- Gennady Vanin -- Геннадий Ванин
    Wednesday, July 08, 2009 7:20 PM