Naive Bayes Algorithm
- I am facing issue with text classification by using SQL server Naive Bayes algorithm. When i try to find terms which are there in two different classes, it is giving wrong results. It is always giving same probability for each class.
I am using SQL Server 2008 Analysis services. I have 3 classes - Dog, Cow, Cat. Each has got 2 terms. When i try to find 2 terms in model from two different classes, it gives equal probability of being in all three classes.
Any pointers will be helpful.
Amit
Answers
1. Bayesian theory treats all your input attributes independent from each other. Since, NB doesn't support continuous attribute and it couldn't tell which term existed how many times, so its calculating the probability of likelihood for each state from input attributes(terms) (which in your case is same for all three states)
2. Although, predict and predicthistogram functions are not supported by SVM, you can get your prediction from prediction join. follow the link and you'll understand how.
http://www.codeplex.com/svmplugin/Thread/View.aspx?ThreadId=39413
hth,
Rok- Marked As Answer byJin ChenMSFT, ModeratorWednesday, November 11, 2009 9:37 AM
- Proposed As Answer byrok1 Tuesday, November 03, 2009 6:41 PM
All Replies
Although naive bayes is one of the most popular algorithm, for text mining, it is only ideal for a general classification because it doesn't account dependencies among inputs.
My suggestion is to use Neural network or logistic regression. The most popular algorithm for text classification is SVM (support vector machine).
You can download SVM plug-in for Analysis services from this link
http://www.codeplex.com/svmplugin
The best result you'll get is from SVM and Neural network (well, depending upon the number of input attributes and predictable states) the training time will take some time so try with different parameters and you'll get amazing results!
Sometimes logistic regression which is similar to Neural network (without the hidden nodes) can get you what you need also.
hth,
Rok
Please mark it as answer if you find this helpful!- Thanks Rok for detailed reply, I will try SVM and LR
I also wanted to understand why NB is behaving like this. I have taken very small and simple example with only 2 terms in each class( 3 classes). It will be great help if you can provide me some details about it.
Amit - Also Predict, PredictHistogram functions are not working for SVM. Can you please let me know which all DMXfunction wil work while predicting from the model.
Amit
1. Bayesian theory treats all your input attributes independent from each other. Since, NB doesn't support continuous attribute and it couldn't tell which term existed how many times, so its calculating the probability of likelihood for each state from input attributes(terms) (which in your case is same for all three states)
2. Although, predict and predicthistogram functions are not supported by SVM, you can get your prediction from prediction join. follow the link and you'll understand how.
http://www.codeplex.com/svmplugin/Thread/View.aspx?ThreadId=39413
hth,
Rok- Marked As Answer byJin ChenMSFT, ModeratorWednesday, November 11, 2009 9:37 AM
- Proposed As Answer byrok1 Tuesday, November 03, 2009 6:41 PM


