DM Project for CS240B (Revised CS240A Take Home final)

Your final project is to build an efficient Naive Bayesian classifier for a data set of your choice using WEKA.
UCI/KDD and Weka, for instance, are two good sources of data sets.

Good results were reported in the past with data sets such as led, mushrooms, splice, titanic, waveform, abalone, letter, and census. But data are continuously being revised and upgraded, and you are encouraged to try new data sets. However, make sure that your data set is not small; otherwise your performance experiments will not be interesting.

You are encouraged to try new applications, and if you have your own interesting application, you should consider using it.

Your specific tasks are as follows (you should try to implement them using clean and compact SQL):

  1. Perform a preliminary analysis of your data and decide how you are going to deal with missing values, and whether you are going to discretize continuous values or assume a Gaussian (or some other) distribution for them.

  2. Partition your data into two sets MS and PS. The set MS will be used to build your classifier. The set PS will be used to predict its accuracy by testing.

  3. Derive your Naive Bayesian classifier from MS and determine its accuracy on PS.

  4. Repeat the last step with other kinds of classifiers (e.g., decision trees).

  5. [Ensemble-based bagging] See if you can get better decisions by using voting ensembles of classifiers (possibly by assigning weights to the votes of each classifier).

  6. Write a clear report describing the various steps of analysis, implementation, and testing that you have performed, and what you have learned in the course of this project.
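
As an illustration of tasks 2 and 3, here is a minimal sketch of a Naive Bayesian classifier written mostly in SQL. It is not a prescribed solution, and everything specific in it is an assumption made for the example: the SQL is run through Python's sqlite3 module, the eight data rows are made up, the vertical schema (tables obs and label) and the id-modulo split into MS and PS are hypothetical choices, and Laplace (add-one) smoothing is just one way to avoid zero probabilities.

```python
import math
import sqlite3

# Made-up toy rows (id, outlook, windy, class); purely illustrative.
ROWS = [
    (1, "sunny", "no", "play"), (2, "sunny", "yes", "stay"),
    (3, "rainy", "yes", "stay"), (4, "rainy", "no", "play"),
    (5, "sunny", "no", "play"), (6, "rainy", "yes", "stay"),
    (7, "sunny", "no", "play"), (8, "rainy", "yes", "stay"),
]

conn = sqlite3.connect(":memory:")
# Register ln() ourselves; not every SQLite build ships math functions.
conn.create_function("ln", 1, math.log)

conn.executescript("""
CREATE TABLE obs(id INTEGER, attr TEXT, val TEXT);  -- one row per attribute value
CREATE TABLE label(id INTEGER PRIMARY KEY, cls TEXT);
""")
for rid, outlook, windy, cls in ROWS:
    conn.execute("INSERT INTO label VALUES (?, ?)", (rid, cls))
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                     [(rid, "outlook", outlook), (rid, "windy", windy)])

conn.executescript("""
-- Task 2: every fourth tuple goes to PS (testing), the rest to MS (model building).
CREATE VIEW ms AS SELECT id FROM label WHERE id % 4 <> 0;
CREATE VIEW ps AS SELECT id FROM label WHERE id % 4 = 0;

-- Task 3a: class counts and log-priors, computed from MS only.
CREATE TABLE prior AS
SELECT cls, COUNT(*) AS n,
       ln(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM ms)) AS lp
FROM label WHERE id IN (SELECT id FROM ms) GROUP BY cls;

-- Task 3b: P(attr = val | cls) with add-one smoothing. The value domain is
-- taken from all of obs so that unseen (val, cls) pairs get a nonzero probability.
CREATE TABLE cond AS
SELECT d.attr, d.val, p.cls,
       (COALESCE(c.n, 0) + 1.0) / (p.n + k.nvals) AS prob
FROM (SELECT DISTINCT attr, val FROM obs) d
CROSS JOIN prior p
JOIN (SELECT attr, COUNT(DISTINCT val) AS nvals FROM obs GROUP BY attr) k
  ON k.attr = d.attr
LEFT JOIN (SELECT o.attr, o.val, l.cls, COUNT(*) AS n
           FROM obs o JOIN label l ON l.id = o.id
           WHERE o.id IN (SELECT id FROM ms)
           GROUP BY o.attr, o.val, l.cls) c
  ON c.attr = d.attr AND c.val = d.val AND c.cls = p.cls;

-- Task 3c: log-score of each class for each PS tuple.
CREATE TABLE score AS
SELECT o.id, c.cls, p.lp + SUM(ln(c.prob)) AS logp
FROM obs o
JOIN cond c ON c.attr = o.attr AND c.val = o.val
JOIN prior p ON p.cls = c.cls
WHERE o.id IN (SELECT id FROM ps)
GROUP BY o.id, c.cls, p.lp;

-- Predicted class = highest score (ties would yield several rows per id).
CREATE TABLE pred AS
SELECT s.id, s.cls FROM score s
WHERE s.logp = (SELECT MAX(logp) FROM score WHERE id = s.id);
""")

predictions = dict(conn.execute("SELECT id, cls FROM pred ORDER BY id"))
accuracy = conn.execute(
    "SELECT AVG(p.cls = l.cls) FROM pred p JOIN label l ON l.id = p.id"
).fetchone()[0]
print(predictions, accuracy)
```

Storing one row per (tuple, attribute, value) keeps the SQL independent of the number of columns in the data set, at the cost of a larger table. On a real data set you would also want a randomized split rather than the id-modulo rule used here, and a tie-breaking rule in the final step.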