Question: Mining Structure Design

  • Wednesday, 27 May, 2009 16:20 Mr.QuestionMark
     
    My task is to find exceptions in sales data that has area# and sales columns. For example:
    area#,sales
    1,2
    1,100
    1,23
    1,11
    2,12
    2,13
    4,12
    ..,..

    For each area there are around 200 rows.

    I think this is a typical clustering task. I can create a clustering model and use the PredictCaseLikelihood function to find the exceptions. I tried one area: I created a small table with RowNo (Key) and sales (Input, Predict) for Area# 1, and used the Microsoft Clustering algorithm (CLUSTER_COUNT set to 0). It works well. The result is exactly the same as when I use Highlight Exceptions in the Excel Table Analysis Tools.

    Then I tried to build a mining structure for all my data (i.e. all areas). I created a table with RowNo (Key), AreaNo (text, Input), and sales (Input, Predict), and used the same approach as before. The issue is that the result for Area# 1 is different from the first model's result.

    My guess is that in the second structure AreaNo is added as an input, and that affects the calculation. But how can I avoid this effect? I believe the first result is more accurate. But that way, I have to loop over every area to create a model and query the result. That does not make sense.
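
    One way to see why a single model over all areas can score an Area# 1 row differently is to compare a per-area fit with a pooled fit. The sketch below is a simplified stand-in that fits one normal distribution per group; it is not the Microsoft Clustering algorithm and not PredictCaseLikelihood. The rows are the sample values from this post, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def gaussian_log_density(x, values):
    """Log density of x under a normal distribution fitted to values."""
    mu, sigma = mean(values), stdev(values)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# (area, sales) rows copied from the example in the post
rows = [(1, 2), (1, 100), (1, 23), (1, 11), (2, 12), (2, 13), (4, 12)]

area1_sales = [sales for area, sales in rows if area == 1]
all_sales = [sales for _, sales in rows]

# Score the same Area# 1 row (sales = 23) against each fit
per_area_score = gaussian_log_density(23, area1_sales)
pooled_score = gaussian_log_density(23, all_sales)
print(per_area_score, pooled_score)
```

    The two scores differ because the pooled fit estimates its mean and spread from every area's rows, so the same sales value is judged against a different distribution.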

    I am wondering if I can use a nested table. I tried it, but I could not get it to work.

    Any suggestions?

    Thanks,
    :)
    • Moved by Darren Gosbell (MVP), Thursday, 28 May, 2009 3:08: is a data mining question (From: SQL Server Analysis Services)

All Replies

  • Wednesday, 27 May, 2009 18:39 Thomas Ivarsson (MVP, Moderator)
     
    Hi,

    This is the correct discussion group for data mining questions:

    http://social.msdn.microsoft.com/forums/en-US/sqldatamining/threads/

    BR
    Thomas Ivarsson
  • Friday, 29 May, 2009 1:24 Richarl
     
    Could you explain what an exception would look like? Is the second row of the input data an exception, with its large sales value? If so, would the average and standard deviation (SD) of sales for each Area# give you a basis for finding outliers? I.e., if a sales value is more than n*SD from that Area#'s average sales, then it's an exception?
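
    The per-area rule described above (flag a sales value that lies more than n standard deviations from its area's mean) can be sketched roughly as follows. This is a minimal illustration, assuming the data is available as (area, sales) pairs; the function name and the `n_sd` parameter are hypothetical, and the rows are the sample values from the first post.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_outliers(rows, n_sd=2.0):
    """Return (area, sales) pairs more than n_sd sample SDs from their area mean."""
    by_area = defaultdict(list)
    for area, sales in rows:
        by_area[area].append(sales)

    flagged = []
    for area, values in by_area.items():
        if len(values) < 2:
            continue  # the sample SD is undefined for a single row
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # all values identical, nothing to flag
        flagged.extend((area, v) for v in values if abs(v - mu) > n_sd * sd)
    return flagged

rows = [(1, 2), (1, 100), (1, 23), (1, 11), (2, 12), (2, 13), (4, 12)]
print(flag_outliers(rows, n_sd=1.0))  # prints [(1, 100)]
```

    With only four rows in an area, no value can sit more than 1.5 sample SDs from the mean, which is why the demo uses `n_sd=1.0`; with around 200 rows per area a threshold of 2 or 3 would be more typical.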

    Richard
  • Friday, 29 May, 2009 12:50 Mr.QuestionMark
     
    Hi Richard,

    I do not know what an exception will look like. I just want to use the Microsoft Clustering algorithm to find the natural groups for sales, query the likelihood of each sales value, and QA the low-likelihood data, which may or may not be exceptions. From my data there is no way to define an exact outlier rule to determine the exceptions. So I just want to find the data that is likely to be an exception and let someone QA it.

    The issue is that the result from the model using area# and sales as inputs is different from the one using sales only. I prefer the model that uses sales only. But that way, I have to train a model as many times as there are areas, and query as many times, to get the result. That is why I tried adding area# as an input and querying the data based on the area# input. But the result is different.

    What is the solution?
    :)
  • Friday, 5 June, 2009 18:23 Mr.QuestionMark
     
    Any suggestions?
    :)
  • Friday, 5 June, 2009 20:31 Richarl
     

    If you are new to data mining, I would suggest that your first few models be built on relationships that you understand. Even once you become familiar with what data mining algorithms can do, you will tend to train them on data for which you have some intuition that a relationship might exist. For example, you might be predicting sales items based on other items in a shopping basket, because you have some "intuition" that customers who buy certain products also tend to buy other products. I.e. customers who buy coffee also buy chocolate, customers who buy ham and cheese also buy sausages, etc. Data mining algorithms should discover these relationships if they exist, and they will tend to find many others that you weren't totally aware of. One thing I like about data mining algorithms is that they can quantify the strength of a given relationship.

    So what I would suggest is that you start with some data that has some rather obvious relationships, such as retail shopping baskets. It also helps very much to have real data. Synthetically generated data is not interesting to mine, and mining it will generally only discover relationships that were synthetically generated. I have used my IIS web logs for all sorts of data mining exercises: predicting response time, time series, sequence clustering for predicting the next page, and predicting the geography of a client based on pages visited, response time and browser. If you don't have any good training data, you might want to download my sample database (it's real data from my web site) and follow the tutorial, or design your own model, from http://technet.microsoft.com/en-us/library/dd883232.aspx. I have a couple of demonstration models on http://RichardLees.com.au/Sites/Demonstrations, including the web response time prediction for the last 50 hits to the site. This is structurally the same as a model that could predict customer profitability from demographics and first sale.

    There is a very good text on data mining: http://richardlees.blogspot.com/2008/12/book-review-data-mining-with-sql-server.html

    Hope that helps you


    Richard
  • Monday, 8 June, 2009 12:58 Mr.QuestionMark
     
    Many thanks, Richarl.

    In my case, all my data is projected by my methodology. I do not have real/correct data to validate it against, or it is hard to validate. So I am trying to use a clustering task to find the most unlikely cases and QA them. I am not sure this is a good approach to QA. Do you have any suggestions for this scenario?
    :)
  • Tuesday, 9 June, 2009 9:31 SQLUSA
     
    You may take a look at the Highlight Exceptions function in the Data Mining ribbon (Analyze tab) of Excel 2007 Addin for SQL Server 2008. It highlights outliers.
    Kalman Toth, SQL Server 2008 Training, SSAS, SSIS, SSRS, BI: www.SQLUSA.com
  • Tuesday, 9 June, 2009 13:17 Mr.QuestionMark
     

    Hi SQLUSA,

    Yes. Actually, I used it for one area. In the same way, I created a Microsoft Clustering model (with only sales set to predict) and got the same results as Excel.

    But please take a look at my first post. My point is that I have thousands of areas, and I am trying to create one model for all of them, so I have to set both area# and sales as inputs and predict sales. Compared with the previous model, the result is different. HOW CAN I GET THE SAME RESULT?

    Thanks in advance.


    :)
  • Tuesday, 30 June, 2009 13:37 Mr.QuestionMark
     
    Any suggestions?
    :)