Continuous learning for email text mining and categorization


  • Monday, September 10, 2012 7:42 AM
     
     

    Hi,

    We are trying to come up with an approach for mining emails sent to a discussion forum so that we can categorize them into topics, and we want the mining model to keep learning and improving over time. To build the mining structure, we are thinking of using SSIS to build a dictionary of terms and associate the terms and their frequencies with each email. With that in place, we can train one of the SSAS classification algorithms on a training data set.
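
    To make the dictionary idea concrete, here is a minimal Python sketch of the term/frequency step. It is only an illustration of the data shape, not SSIS itself (in SSIS the Term Extraction and Term Lookup transformations would play this role). The stop-word list, the `min_doc_count` cutoff, and the sample emails are all made-up assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "to", "of", "and", "is", "in", "on"})

def term_frequencies(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def build_dictionary(emails, min_doc_count=2):
    """Keep only terms that occur in at least min_doc_count emails."""
    doc_counts = Counter()
    for body in emails.values():
        doc_counts.update(set(term_frequencies(body)))
    return {term for term, n in doc_counts.items() if n >= min_doc_count}

# Made-up sample data: email id -> body text.
emails = {
    1: "Cube processing fails in SSAS",
    2: "SSAS cube deployment error",
    3: "SSIS package fails on schedule",
}

dictionary = build_dictionary(emails)

# One sparse term-frequency row per email, restricted to dictionary terms.
vectors = {
    email_id: {t: c for t, c in term_frequencies(body).items() if t in dictionary}
    for email_id, body in emails.items()
}
```

    Each entry in `vectors` corresponds to the (email, term, frequency) rows that a nested table in the mining structure would hold.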

    We also want to monitor and improve the training set continuously, partly automatically and partly with manual guidance. Here are two approaches that come to mind. Please share your thoughts on them, and let me know if there is a better approach.

    a) We create an interface where the forum owners review the email classifications periodically (say, weekly) and approve or reclassify them. On submission, the model is retrained on the old data plus the newly classified data.

    b) Is there a way for SSAS to attach a confidence score to each classification, i.e. a probability of x% that the assigned topic is correct? If so, we could automatically add every categorization scored at, say, 90% or above to the training data and send the rest for manual review.
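
    A rough Python sketch of the routing in approach (b), assuming each prediction arrives with a probability (in DMX, `PredictProbability` is one way to get such a value from a mining model). The function name, the tuple layout, and the sample predictions are hypothetical. Note that the score is the model's own estimate, not a measured accuracy.

```python
def route_predictions(predictions, threshold=0.90):
    """Split predictions into auto-accepted rows and rows needing manual review.

    predictions: iterable of (email_id, topic, probability) tuples.
    Rows at or above the threshold go to the training set; the rest go to review.
    """
    auto_accept, needs_review = [], []
    for email_id, topic, probability in predictions:
        if probability >= threshold:
            auto_accept.append((email_id, topic))
        else:
            needs_review.append((email_id, topic, probability))
    return auto_accept, needs_review

# Made-up predictions for illustration only.
predictions = [
    (101, "SSAS", 0.97),
    (102, "SSIS", 0.62),
    (103, "SSRS", 0.90),
]

accepted, review_queue = route_predictions(predictions)
```

    The review queue is what the forum owners would see in the interface from approach (a), so the two approaches can be combined rather than treated as alternatives.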

    Please let me know what you think.

All Replies

  • Thursday, September 13, 2012 8:51 AM
    Moderator
     
     Answered

    Hi MSBI Dev 2012,

    I suggest trying the Microsoft Association algorithm in SSAS. It is an association algorithm provided by Analysis Services that is useful for recommendation engines. For each itemset, the algorithm creates scores that represent support and confidence. These scores can be used to rank the itemsets and derive interesting rules from them. For more information, please see:
    http://msdn.microsoft.com/en-us/library/ms174916.aspx
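
    As a rough illustration of what support and confidence mean, here is a small Python sketch using the textbook definitions. Analysis Services may report these values on a different scale, and the sample term sets below are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Made-up data: each set holds the terms found in one email.
transactions = [
    {"cube", "ssas"},
    {"cube", "ssas", "error"},
    {"ssis", "package"},
    {"ssas", "error"},
]

s = support(transactions, {"cube", "ssas"})       # 2 of 4 emails
c = confidence(transactions, {"cube"}, {"ssas"})  # every "cube" email also has "ssas"
```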

    Please feel free to ask if you have any questions.

    Thanks,
    Eileen



  • Friday, September 14, 2012 3:17 PM
     
     Answered

    Hi MSBI Dev,

    Please refer to this thread. In my experience with text classification, you can't rely completely on any algorithm; you have to build a good knowledge-based waterfall approach.

    http://social.technet.microsoft.com/Forums/en/sqldatamining/thread/1ff1e900-f4de-42b0-83ff-7021f952ffe3 

    hth

