Data Classification



Data classification [Wei98a,Wei98b] is used in data analysis and pattern recognition. It is a form of supervised learning: given a set of attributes and training instances, it outputs predictions. The attributes consist of independents, dependents and targets. An example set of attributes is (Outlook, Temperature, Humidity, Wind, PlayTennis).

Here the target is PlayTennis, and it depends on the attributes Outlook, Temperature, Humidity and Wind. An example of a training instance is:

Data classification outputs expressions that predict the targets from the dependents. One such expression could be:

It has been shown that probability can be used for prediction. The naive Bayesian classifier [Elk97], which uses probability for prediction, has shown significant results in terms of predictive performance. The next paragraph describes the fundamentals of the naive Bayesian classifier.

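Let $A_1, \ldots, A_k$ be attributes, with discrete values, used to predict a discrete class $C$. Given an example with observed attribute values $a_1$ through $a_k$, the optimal prediction is the class value $c$ such that $P(C = c \mid A_1 = a_1 \wedge \cdots \wedge A_k = a_k)$ is maximal. Using Bayes' rule, this probability can be rewritten as:

$$P(C = c \mid A_1 = a_1 \wedge \cdots \wedge A_k = a_k) = \frac{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c)\, P(C = c)}{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k)}$$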

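Here $P(C = c)$ can be easily computed from the training data, and $P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k)$ is independent of $c$. Using the assumption that the attributes are independent of one another given the class, $P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c)$ can be written as:

$$P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c) = \prod_{j=1}^{k} P(A_j = a_j \mid C = c)$$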

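where each probability $P(A_j = a_j \mid C = c)$ can be estimated from the training data using:

$$P(A_j = a_j \mid C = c) = \frac{\mathrm{count}(A_j = a_j \wedge C = c)}{\mathrm{count}(C = c)}$$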
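The count-based estimate can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the rows are hypothetical toy values (not the contents of Table 1.1), and the names `ROWS`, `train`, `score` and `predict` are invented here. The score omits the denominator of Bayes' rule, since that term is the same for every class value.

```python
from collections import Counter

# Hypothetical toy rows (Outlook, Temperature, Humidity, Wind, PlayTennis).
# Illustrative values only; they are NOT the contents of Table 1.1.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
]


def train(rows):
    """Collect count(C = c) and count(A_j = a_j and C = c) from the rows."""
    class_counts = Counter()
    joint_counts = Counter()
    for *attrs, c in rows:
        class_counts[c] += 1
        for j, a in enumerate(attrs):
            joint_counts[(j, a, c)] += 1
    return class_counts, joint_counts


def score(attrs, c, class_counts, joint_counts):
    """P(C = c) times the product of P(A_j = a_j | C = c), from raw counts."""
    total = sum(class_counts.values())
    p = class_counts[c] / total
    for j, a in enumerate(attrs):
        # count(A_j = a_j and C = c) / count(C = c); unseen pairs count as 0
        p *= joint_counts[(j, a, c)] / class_counts[c]
    return p


def predict(attrs, class_counts, joint_counts):
    """Return the class value with the largest naive Bayesian score."""
    return max(
        class_counts,
        key=lambda c: score(attrs, c, class_counts, joint_counts),
    )
```

With these toy rows, `predict(("Rain", "Cool", "Normal", "Weak"), *train(ROWS))` returns `"Yes"`. The sketch uses raw counts without smoothing, so an attribute value never seen with a class drives that class's score to zero.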

Here, if the training data is in a table T with $A_1, \ldots, A_k, C$ as fields, then we can compute $\mathrm{count}(A_j = a_j \wedge C = c)$ as

thus requiring separate computation for each column. Let Table 1.1 be used as the training data.
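One plausible shape for such a per-column count query is sketched below using Python's standard `sqlite3` module. This is an assumption for illustration, not necessarily the query the text refers to: the `SELECT COUNT(*) ... WHERE` form, the helper names `load`, `joint_count` and `class_count`, and the toy rows (which are not the contents of Table 1.1) are all hypothetical.

```python
import sqlite3

# Hypothetical toy rows (Outlook, Temperature, Humidity, Wind, PlayTennis).
# Illustrative values only; they are NOT the contents of Table 1.1.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
]
COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind")


def load(rows):
    """Create an in-memory table named Tennis and fill it with the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Tennis (Outlook TEXT, Temperature TEXT, "
        "Humidity TEXT, Wind TEXT, PlayTennis TEXT)"
    )
    conn.executemany("INSERT INTO Tennis VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def joint_count(conn, column, value, target):
    """count(A_j = a_j AND C = c); each call is one scan of the table."""
    if column not in COLUMNS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown column: {column}")
    sql = f"SELECT COUNT(*) FROM Tennis WHERE {column} = ? AND PlayTennis = ?"
    return conn.execute(sql, (value, target)).fetchone()[0]


def class_count(conn, target):
    """count(C = c); again one scan of the table."""
    sql = "SELECT COUNT(*) FROM Tennis WHERE PlayTennis = ?"
    return conn.execute(sql, (target,)).fetchone()[0]
```

For example, `joint_count(conn, "Outlook", "Rain", "Yes") / class_count(conn, "Yes")` estimates the conditional probability of Outlook = Rain given PlayTennis = Yes on the toy rows. Each attribute column needs its own query, which is the per-column cost the text points out.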

Table 1.1: Tennis Table

Now we want to predict which of the following will be more probable:

To compute the above probabilities using the naive Bayesian approach, we have to compute the following probabilities:

Or, in other words, we have to compute the following counts:

These counts can be computed using the SQL code given earlier in this section. Computed that way, the above predictions require 10 passes over the table Tennis. With user-defined aggregates, however, all the counts can be computed together in a single pass.
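The single-pass idea can be sketched with a user-defined aggregate in Python's standard `sqlite3` module, registered through `Connection.create_aggregate`. Everything specific here is an assumption made for illustration: the aggregate name `all_counts`, the class `AllCounts`, the JSON key format, and the toy rows (which are not the contents of Table 1.1). With four attributes and two class values, one prediction presumably needs eight joint counts and two class counts, which would account for the 10 passes; the aggregate below gathers every such count in one scan.

```python
import json
import sqlite3

# Hypothetical toy rows (Outlook, Temperature, Humidity, Wind, PlayTennis).
# Illustrative values only; they are NOT the contents of Table 1.1.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
]


class AllCounts:
    """User-defined aggregate: gathers class and joint counts in one scan."""

    def __init__(self):
        self.counts = {}

    def _bump(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def step(self, *values):
        # Called once per row; the last argument is the class column.
        *attrs, target = values
        self._bump(f"C={target}")
        for j, a in enumerate(attrs, start=1):
            self._bump(f"A{j}={a},C={target}")

    def finalize(self):
        # SQLite needs a scalar result, so the counts are returned as JSON.
        return json.dumps(self.counts, sort_keys=True)


def all_counts(rows):
    """Load the rows, then compute every count with a single query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Tennis (Outlook TEXT, Temperature TEXT, "
        "Humidity TEXT, Wind TEXT, PlayTennis TEXT)"
    )
    conn.executemany("INSERT INTO Tennis VALUES (?, ?, ?, ?, ?)", rows)
    conn.create_aggregate("all_counts", 5, AllCounts)
    (result,) = conn.execute(
        "SELECT all_counts(Outlook, Temperature, Humidity, Wind, PlayTennis) "
        "FROM Tennis"
    ).fetchone()
    conn.close()
    return json.loads(result)
```

Calling `all_counts(ROWS)` returns a dictionary keyed by strings such as `"C=Yes"` and `"A1=Rain,C=Yes"`, from which every conditional probability can be formed without rescanning the table. Returning JSON is just one way to pass many counts back through a scalar aggregate result.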


Punit Bhargava
Wed Mar 11 18:50:53 PST 1998