# What algorithm to use for classifying where an attribute has a large number of states?

• Wednesday, March 21, 2012 2:26 PM

Hi,

I've got a problem where I have a large number of distinct items that are mapped to codes. I started by splitting the item-name training data into words and then using a naive Bayes classifier to figure out, based on which words are present, which code should be returned as the most probable.

I think I'm running into a problem, though: even though I'm modeling a single attribute, I have millions of states in my training data. I believe SSAS limits this based on a parameter I set (MAXIMUM_STATES=0), which still has a maximum of about 65k.

That cap seems too low given that the number of items in my training data is in the millions. Is there anything I can do to raise this limit, or should I start looking at a different algorithm? Or should I perhaps train N naive Bayes classifiers that each focus on a specific section of my classification codes?
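To make the approach concrete, here is a minimal sketch of word-based naive Bayes classification in plain Python (the item names and codes below are hypothetical examples, not my real data):

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (item name, code) pairs.
training = [
    ("construction glue", "C100"),
    ("super glue", "C100"),
    ("construction helmet", "C200"),
    ("safety helmet", "C200"),
]

# Per-code word counts and overall code frequencies.
word_counts = defaultdict(Counter)
code_counts = Counter()
vocab = set()
for name, code in training:
    code_counts[code] += 1
    for word in name.lower().split():
        word_counts[code][word] += 1
        vocab.add(word)

def classify(name):
    """Return the code with the highest posterior, using Laplace smoothing."""
    total_cases = sum(code_counts.values())
    best_code, best_score = None, float("-inf")
    for code in code_counts:
        # Log prior for the code.
        score = math.log(code_counts[code] / total_cases)
        denom = sum(word_counts[code].values()) + len(vocab)
        # Add the log likelihood of each word given the code.
        for word in name.lower().split():
            score += math.log((word_counts[code][word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

print(classify("glue stick"))  # → C100
```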

Any suggestions are appreciated. Thank you.

mj

### All Replies

• Wednesday, March 21, 2012 5:08 PM

How many distinct values do you have in that column?

How many rows of training data do you have?

Is there a way to group the codes (states)?

Tatyana Yakushev [PredixionSoftware.com]

• Wednesday, March 21, 2012 5:11 PM

Hi Tatyana,

The codes aren't the problem; there are ~40,000 of them. It's the items that are the problem: there are millions of them, and dropping down to 65k cuts out a lot of the model's fidelity.

Does the Microsoft Naive Bayes algorithm just not cut it for me?

mj

• Thursday, March 22, 2012 5:52 PM

Take a look at the following tutorial: http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=94&Id=164

You will need to use a nested table to work with that many items (the word is the key of the nested table). Depending on which algorithm you use, you will need to modify parameters: MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT_ATTRIBUTES for any algorithm except Association Rules, or MAXIMUM_ITEMSET_COUNT and MAXIMUM_ITEMSET_SIZE for Association Rules.
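To illustrate the shape the nested table takes (the data, table, and column names here are hypothetical), each case row carries the code, and the nested table carries one row per word, with the word as the nested key:

```python
# Hypothetical raw data: (item_id, item_name, code).
raw = [
    (1, "construction glue", "C100"),
    (2, "safety helmet", "C200"),
]

# Case table: one row per item, carrying the predictable Code column.
case_rows = [(item_id, code) for item_id, _, code in raw]

# Nested table: one (item_id, word) row per word; the word is the nested key.
nested_rows = [
    (item_id, word)
    for item_id, name, _ in raw
    for word in name.lower().split()
]

print(case_rows)    # [(1, 'C100'), (2, 'C200')]
print(nested_rows)  # [(1, 'construction'), (1, 'glue'), (2, 'safety'), (2, 'helmet')]
```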

Tatyana Yakushev [PredixionSoftware.com]

• Friday, March 23, 2012 2:03 AM
Moderator

Could you please give an example that illustrates your project, and state exactly what result you want to get and the problem you are running into? That way we can investigate further.

Regards,
Jerry

• Friday, March 23, 2012 2:22 PM

Hi Jerry,

I've got a list of items (stuff you can buy at a retail store) that are mapped to codes. There are millions of items and approximately 40k codes.

I created a case table with an ID and a Code, and a nested table with an ID (foreign key) and an Item Word Part. I say "part" because I split the item names into individual words in order to figure out which words go with which codes. For example, there's a code for "construction glue", and any training row containing "construction glue" is split into "construction" and "glue", which go into the nested table for training.

I then created a naive Bayes classifier and set MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and MAXIMUM_STATES all to 0.

I finally trained the model in BIDS. Upon completion of the training, I get the following warning. This is strange because I only have two attributes (not counting the ID). My attributes do have a large number of states, but I don't see any errors about that.

Informational (Data mining): Automatic feature selection has been applied to model, Report Staging V due to the large number of attributes.  Set MAXIMUM_INPUT_ATTRIBUTES and/or MAXIMUM_OUTPUT_ATTRIBUTES to increase the number of attributes considered by the algorithm.
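For what it's worth, my rough understanding of what that automatic feature selection amounts to is keeping only the top-scoring words and dropping the rest; the sketch below is a simplification with hypothetical data and an artificially tiny cap (SSAS actually ranks attributes by an interestingness score, not raw frequency):

```python
from collections import Counter

# Hypothetical (case_id, word) rows from the nested table.
nested_rows = [
    (1, "construction"), (1, "glue"),
    (2, "construction"), (2, "helmet"),
    (3, "glue"), (4, "rare_widget"),
]

MAXIMUM_INPUT_ATTRIBUTES = 2  # illustrative cap, far below the real default of 255

# Keep only the most frequent words; everything else is dropped from the model.
freq = Counter(word for _, word in nested_rows)
kept = {word for word, _ in freq.most_common(MAXIMUM_INPUT_ATTRIBUTES)}

print(sorted(kept))  # → ['construction', 'glue']
```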

Any suggestions are greatly appreciated. Thanks,

mj

• Edited Friday, March 23, 2012 5:42 PM
• Wednesday, April 25, 2012 8:17 AM

Hi,

The system notice is strange.

Do any of your attributes have more than 65,535 distinct values? With millions of items split into separate words, you can easily end up with more than 65,535 distinct values. The less popular states will then be treated as the Missing state, which may trigger the feature-selection warning.
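A quick way to check is to count the distinct words across all item names before training; a sketch with hypothetical data (in practice you would stream the names from your source table):

```python
# Hypothetical item names.
item_names = ["construction glue", "super glue", "safety helmet"]

# Each distinct word becomes one state of the nested-key attribute.
distinct_words = {w for name in item_names for w in name.lower().split()}

STATE_CAP = 65535  # the per-attribute state limit discussed above
print(len(distinct_words), len(distinct_words) > STATE_CAP)  # → 5 False
```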

Regards,

gc