Data classification [Wei98a,Wei98b] is used in data analysis and pattern recognition. It is a form of supervised learning: given attributes and training instances, it outputs predictions. The attributes consist of independents, dependents and targets. An example set of attributes is:
Here the target is PlayTennis, and it depends on attributes such as Outlook, Temperature, Humidity and Wind. An example of a training instance is:
Data classification outputs expressions of the predictions made from dependents to targets. One such expression could be:
It has been shown that probability can be used for prediction. The Naive Bayesian classifier [Elk97], which uses probability for prediction, has shown significant results in terms of predictive performance. The next paragraph describes the fundamentals of the Naive Bayesian classifier.
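Let $A_1$ through $A_k$ be attributes, with discrete values, used to predict a discrete class $C$. Given an example with observed attribute values $a_1$ through $a_k$, the optimal prediction is the class value $c$ such that $P(C = c \mid A_1 = a_1 \wedge \dots \wedge A_k = a_k)$ is maximal. Using Bayes' rule, this probability can be rewritten as:
$$P(C = c \mid A_1 = a_1 \wedge \dots \wedge A_k = a_k) = \frac{P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c)\, P(C = c)}{P(A_1 = a_1 \wedge \dots \wedge A_k = a_k)}$$
Here $P(C = c)$ can be easily estimated from the training data, and the denominator $P(A_1 = a_1 \wedge \dots \wedge A_k = a_k)$ is independent of $c$, so it does not affect which class value is maximal.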
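Assuming that the attributes are independent given the class, $P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c)$ can be written as:
$$P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c) = \prod_{j=1}^{k} P(A_j = a_j \mid C = c)$$
where each probability can be estimated from the training data using:
$$P(A_j = a_j \mid C = c) = \frac{\mathrm{count}(A_j = a_j \wedge C = c)}{\mathrm{count}(C = c)}$$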
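The count-based estimate and the product over attributes can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names `train_counts` and `predict` are hypothetical, the rows in `ROWS` are made-up toy data (not Table 1.1), and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter

def train_counts(rows, class_field):
    """Collect count(C=c) and count(A_j=a_j and C=c) from a list of dict rows."""
    class_counts = Counter()
    joint_counts = Counter()
    for row in rows:
        c = row[class_field]
        class_counts[c] += 1
        for attr, value in row.items():
            if attr != class_field:
                joint_counts[(attr, value, c)] += 1
    return class_counts, joint_counts

def predict(example, class_counts, joint_counts):
    """Return (best_class, scores) with score(c) = P(C=c) * prod_j P(A_j=a_j | C=c)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # estimate of P(C=c)
        for attr, value in example.items():
            # estimate of P(A_j=a_j | C=c) = count(A_j=a_j and C=c) / count(C=c)
            score *= joint_counts[(attr, value, c)] / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy rows for illustration only (NOT Table 1.1).
ROWS = [
    {"Outlook": "Sunny", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Rain", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Sunny", "Wind": "Weak", "PlayTennis": "Yes"},
]

class_counts, joint_counts = train_counts(ROWS, "PlayTennis")
best, scores = predict({"Outlook": "Sunny", "Wind": "Weak"}, class_counts, joint_counts)
print(best, scores)
```

The scores are unnormalized because the denominator of Bayes' rule is the same for every class and is dropped; only their relative order matters for the prediction.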
Here, if the training data is in a table T with $A_1, \dots, A_k, C$ as fields, then $\mathrm{count}(A_j = a_j \wedge C = c)$ can be computed with a SQL query over T, thus requiring a separate computation for each column. Let Table 1.1 be used as the training data.
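A per-column counting query of this kind might look like the sketch below. It is an assumption about the shape of the query, not the original listing: SQLite (through Python's standard `sqlite3` module) stands in for the database system, the helper names `count_joint`, `count_class` and `make_toy_table` are hypothetical, and the inserted rows are made-up toy data rather than Table 1.1.

```python
import sqlite3

def count_joint(conn, table, attr, value, class_field, class_value):
    """count(attr = value AND class_field = class_value); each call is one scan of the table."""
    # Identifiers cannot be bound as parameters, so they are placed in the SQL text;
    # only pass trusted table and column names.
    sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{attr}" = ? AND "{class_field}" = ?'
    return conn.execute(sql, (value, class_value)).fetchone()[0]

def count_class(conn, table, class_field, class_value):
    """count(class_field = class_value); again one scan of the table."""
    sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{class_field}" = ?'
    return conn.execute(sql, (class_value,)).fetchone()[0]

def make_toy_table():
    """Build an in-memory table with hypothetical rows (NOT Table 1.1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tennis (Outlook TEXT, Wind TEXT, PlayTennis TEXT)")
    conn.executemany(
        "INSERT INTO Tennis VALUES (?, ?, ?)",
        [
            ("Sunny", "Strong", "No"),
            ("Sunny", "Weak", "No"),
            ("Rain", "Weak", "Yes"),
            ("Rain", "Strong", "No"),
            ("Overcast", "Weak", "Yes"),
            ("Sunny", "Weak", "Yes"),
        ],
    )
    return conn

conn = make_toy_table()
n_joint = count_joint(conn, "Tennis", "Outlook", "Sunny", "PlayTennis", "No")
n_class = count_class(conn, "Tennis", "PlayTennis", "No")
print(n_joint, n_class, n_joint / n_class)  # estimate of P(Outlook=Sunny | PlayTennis=No)
```

Every distinct count issues its own query, so the number of table scans grows with the number of attribute and class value combinations that a prediction needs.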
Now we want to predict which of the following is more probable:
To compute the above probabilities using the Naive Bayesian approach, we have to compute the following probabilities:
Or, in other words, the following counts:
These counts can be computed using the SQL code given earlier in this section. Computed that way, the above predictions require 10 passes over the table Tennis. With user-defined aggregates, however, all the counts can be computed together in a single pass.
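The single-pass idea can be illustrated with a user-defined aggregate in SQLite, registered through Python's standard `sqlite3.Connection.create_aggregate`. This is a sketch under assumptions, not the aggregate used in this work: the class `AllCounts`, the key format, the attribute subset in `ATTRS` and the toy rows are all hypothetical, and the result is returned as JSON text because a SQLite aggregate must return a single scalar.

```python
import json
import sqlite3
from collections import Counter

ATTRS = ("Outlook", "Wind")  # hypothetical attribute subset for the illustration

class AllCounts:
    """User-defined aggregate that accumulates every needed count in one scan."""

    def __init__(self):
        self.counts = Counter()

    def step(self, *values):
        # Called once per row: attribute values in ATTRS order, then the class value.
        *attr_values, c = values
        self.counts[f"C={c}"] += 1
        for attr, value in zip(ATTRS, attr_values):
            self.counts[f"{attr}={value}|C={c}"] += 1

    def finalize(self):
        # Serialize the counts, since the aggregate must return one scalar value.
        return json.dumps(self.counts)

def make_toy_table():
    """Build an in-memory table with hypothetical rows (NOT Table 1.1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Tennis (Outlook TEXT, Wind TEXT, PlayTennis TEXT)")
    conn.executemany(
        "INSERT INTO Tennis VALUES (?, ?, ?)",
        [
            ("Sunny", "Strong", "No"),
            ("Sunny", "Weak", "No"),
            ("Rain", "Weak", "Yes"),
            ("Rain", "Strong", "No"),
            ("Overcast", "Weak", "Yes"),
            ("Sunny", "Weak", "Yes"),
        ],
    )
    return conn

def all_counts_single_pass(conn):
    """Run one query, and therefore one table scan, that returns all counts."""
    conn.create_aggregate("all_counts", len(ATTRS) + 1, AllCounts)
    row = conn.execute(
        "SELECT all_counts(Outlook, Wind, PlayTennis) FROM Tennis"
    ).fetchone()
    return json.loads(row[0])

counts = all_counts_single_pass(make_toy_table())
print(counts)
```

The aggregate keeps one counter per (attribute value, class value) pair it encounters, so memory grows with the number of distinct pairs while the table itself is read only once.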