Continuous learning for email text mining and categorization
-
Monday, September 10, 2012 7:42 AM
Hi,
We are trying to come up with an approach for mining emails sent to a discussion forum, categorizing them into topics, and letting the mining model learn and improve continuously. To build the mining structure, we are thinking of using SSIS to build the dictionary of terms and to associate terms and their frequencies with the emails. With that in place, we can train one of the SSAS classification algorithms on a training data set.
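To make the "dictionary of terms plus per-email frequencies" idea concrete, here is a minimal sketch in plain Python rather than SSIS. The function names, the `min_doc_freq` cutoff, and the sample emails are all hypothetical and only illustrate the shape of the table that the mining structure would consume.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_term_table(emails, min_doc_freq=1):
    """Return (dictionary, rows).

    dictionary: sorted terms that appear in at least min_doc_freq emails.
    rows: one (email_id, term, frequency) tuple per kept term per email,
          i.e. a nested-table layout keyed by email.
    """
    per_email = {eid: Counter(tokenize(body)) for eid, body in emails.items()}

    # Document frequency: in how many emails does each term occur?
    doc_freq = Counter()
    for counts in per_email.values():
        doc_freq.update(counts.keys())

    dictionary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    keep = set(dictionary)
    rows = [
        (eid, term, freq)
        for eid, counts in per_email.items()
        for term, freq in sorted(counts.items())
        if term in keep
    ]
    return dictionary, rows

# Hypothetical sample data, for illustration only.
emails = {
    1: "Cube processing fails",
    2: "Processing the cube is slow, slow",
}
dictionary, rows = build_term_table(emails, min_doc_freq=2)
```

With `min_doc_freq=2`, only terms shared by both sample emails stay in the dictionary; a real pipeline would likely also drop stop words and normalize word forms.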
We also want to constantly monitor and improve the training set in a partly automated, partly manually guided manner. Here are two approaches that come to mind. Please share your thoughts on them and let me know if there is a better approach.
a) We can create an interface for the forum owners to review the email classifications periodically (say, weekly) and approve or reclassify them. On submission, the model is retrained on the old plus the newly classified data.
b) Is there a way for SSAS to associate a confidence score with each classification, indicating that the topic categorization has an x% probability of being accurate? If so, we could add all categorizations with, say, 90% or higher confidence to the training data and manually review the rest.
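Approach (b) can be sketched as a simple routing step, assuming each prediction already comes with a probability (in SSAS this would typically come from the prediction query, for example via the DMX `PredictProbability` function). The function name, tuple layout, and sample values below are hypothetical. Note that a predicted probability is the model's own estimate, not a measured accuracy.

```python
def route_predictions(predictions, threshold=0.90):
    """Split predictions into auto-accepted and manual-review lists.

    predictions: iterable of (email_id, topic, probability) tuples.
    Items with probability >= threshold go to the auto-accepted list;
    everything else is queued for manual review by the forum owners.
    """
    auto_accepted, needs_review = [], []
    for email_id, topic, probability in predictions:
        target = auto_accepted if probability >= threshold else needs_review
        target.append((email_id, topic, probability))
    return auto_accepted, needs_review

# Hypothetical predictions, for illustration only.
predictions = [
    (101, "SSAS", 0.97),
    (102, "SSIS", 0.62),
    (103, "SSRS", 0.90),
]
auto_accepted, needs_review = route_predictions(predictions)
```

Only the auto-accepted rows, plus whatever the owners approve or correct from the review queue, would then be appended to the training set before retraining. Spot-checking a sample of the auto-accepted rows is probably still worthwhile, since a model can be confidently wrong.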
Please let me know what you think.
All Replies
-
Thursday, September 13, 2012 8:51 AM Moderator
Hi MSBI Dev 2012,
I suggest trying the Microsoft Association algorithm in SSAS. It is an association algorithm provided by Analysis Services that is useful for recommendation engines. For each itemset, the algorithm creates scores that represent support and confidence. These scores can be used to rank the itemsets and derive interesting rules from them. For more information, please see:
http://msdn.microsoft.com/en-us/library/ms174916.aspx
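As a rough illustration of what support and confidence mean, here is a minimal sketch using the textbook definitions. It is not the Analysis Services implementation; the function names and sample data are hypothetical. Each email is treated as a set of terms.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies; avoid division by zero
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base

# Hypothetical term sets, one per email.
transactions = [
    {"cube", "processing", "error"},
    {"cube", "processing"},
    {"report", "error"},
    {"cube", "mdx"},
]
pair_support = support(transactions, {"cube", "processing"})
rule_confidence = confidence(transactions, {"cube"}, {"processing"})
```

Here the itemset {cube, processing} occurs in two of the four emails, and two of the three emails containing "cube" also contain "processing".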
Please feel free to ask if you have any questions.
Thanks,
Eileen
Please remember to mark the replies as answers if they help and unmark them if they provide no help. This can be beneficial to other community members reading the thread.
- Marked As Answer by Eileen Zhao (Microsoft Contingent Staff, Moderator) Monday, September 24, 2012 8:58 AM
-
Friday, September 14, 2012 3:17 PM
Hi MSBI Dev,
Please refer to this thread. My experience with text classification is that you can't rely completely on any algorithm; you have to create a good knowledge-based waterfall approach.
hth
please remember to mark as answered if the post helped resolve the issue.
- Marked As Answer by Eileen Zhao (Microsoft Contingent Staff, Moderator) Monday, September 24, 2012 8:58 AM

