Where to start?
21 May 2012 12:55
Hi All,
I have a fairly large database filled with historical data concerning aptitude tests and have been tasked with running some sort of data mining to find predictive patterns. To put it in simple terms, the data is about classes filled with students who are graded on a curve and we want to determine (if at all possible) what attributes are most likely to make a student finish at the top of the class - and preferably come up with a way to predict future results.
We have all sorts of demographic data about the students, past test results, background information etc. All of this can be flattened to a tabular view if necessary. Our database is in SQL Server 2008 R2 but we have the option to upgrade to SQL Server 2012.
My problem is getting started on this. I've been researching data mining in SQL Server and at first thought the ideal solution would be some sort of Decision Tree model, but the more I think about it the less sure I am - Decision Trees probably wouldn't be able to take the "competing" students in the class into account, which is important as they're graded relative to each other. There is only ever one top student per class.
I'm also having trouble just finding a simple "getting started" step by step tutorial to set something like this up - I'm a fairly seasoned SQL developer but I'm a total newbie when it comes to data mining and don't even know how to find out exactly what pieces I need installed.
Questions:
1. What model would you choose to analyse this data in order to make predictions?
2. Are there any simple examples of building and using such a model against a database? Is "simple" even a term that could ever be used in the same sentence as data mining?
All Replies
28 May 2012 7:40
Hi,
You can find some resemblance to the fraud detection problem. In fraud detection there are far fewer fraudulent events than transactions overall, just as you have only one top student per class. So, regarding your first question, this imbalance will generally be a problem. But you can always choose a group of students that is likely to contain the best one (those for whom the probability of being the best exceeds some threshold).
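The shortlist idea could be sketched roughly as follows. This is only an illustration under assumptions: the function name `shortlist_by_class`, the sample scores, and the 0.3 threshold are all hypothetical, and the probabilities are assumed to come from whatever model is eventually trained. Falling back to the single highest-scoring student when nobody reaches the threshold is also an assumption, not something stated above.

```python
from collections import defaultdict


def shortlist_by_class(scores, threshold=0.3):
    """Pick, per class, the students whose predicted probability of
    finishing top is at least `threshold`.

    scores: iterable of (class_id, student_id, probability) tuples.
    The probabilities are assumed to come from some already-trained model.
    """
    by_class = defaultdict(list)
    for class_id, student_id, prob in scores:
        by_class[class_id].append((student_id, prob))

    shortlist = {}
    for class_id, members in by_class.items():
        # Rank students within their own class, highest probability first.
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        picked = [sid for sid, p in ranked if p >= threshold]
        # Assumption: if nobody reaches the threshold, keep the best candidate.
        shortlist[class_id] = picked or [ranked[0][0]]
    return shortlist


# Made-up scores for two hypothetical classes.
example_scores = [
    ("A", "s1", 0.55), ("A", "s2", 0.35), ("A", "s3", 0.10),
    ("B", "s4", 0.20), ("B", "s5", 0.15),
]
result = shortlist_by_class(example_scores, threshold=0.3)
print(result)
```

Lowering the threshold makes each shortlist longer and more likely to contain the real top student, at the cost of being less specific.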
Regards,
gc