UCLA Computer Science Department

D. Stott Parker, Jr.
UCLA Computer Science Dept.
3532 Boelter Hall
(310) 825-6871 (ofc)
(310) 825-1322 (sec)
(310) 794-5056 (fax)

Data Mining

Empirical Comparisons of Various Voting Methods in Bagging

K. Leung, D.S. Parker
Proc. KDD, 2003.

Finding effective methods for developing an ensemble of models has been an active research area of large-scale data mining in recent years. Models learned from data are often subject to some degree of uncertainty, for a variety of reasons. In classification, ensembles of models provide a useful means of averaging out error introduced by individual classifiers, hence reducing the generalization error of prediction.

The plurality voting method is often chosen for bagging because it is simple to implement. However, the plurality approach to model reconciliation is ad hoc. There are many other voting methods to choose from, including the anti-plurality method, the plurality method with elimination, the Borda count method, and Condorcet's method of pairwise comparisons. Any of these could lead to a better method for reconciliation.

In this paper, we analyze the use of these voting methods in model reconciliation. We present empirical results comparing performance of these voting methods when applied in bagging. These results include some surprises, and among other things suggest that (1) plurality is not always the best voting method; (2) the number of classes can affect the performance of voting methods; and (3) the degree of dataset noise can affect the performance of voting methods. While it is premature to make final judgments about specific voting methods, the results of this work raise interesting questions, and they open the door to the application of voting theory in classification theory.
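As an illustration only (not code from the paper), three of the voting methods named above can be sketched in Python. The sketch assumes each bagged classifier supplies a full ranking of the classes, best first; how such a ranking is obtained (for example, from estimated class probabilities) is an assumption here, and the function names and sample ballots are hypothetical.

```python
from collections import Counter

def plurality(ballots):
    """The class with the most first-place votes wins (ties broken alphabetically)."""
    tally = Counter(b[0] for b in ballots)
    return max(sorted(tally), key=lambda c: tally[c])

def anti_plurality(ballots):
    """The class ranked last least often wins (ties broken alphabetically)."""
    classes = sorted(ballots[0])
    last = Counter(b[-1] for b in ballots)
    return min(classes, key=lambda c: last[c])

def borda(ballots):
    """A class at position i of a k-class ranking earns k-1-i points; highest total wins."""
    scores = Counter()
    for b in ballots:
        k = len(b)
        for i, c in enumerate(b):
            scores[c] += k - 1 - i
    return max(sorted(scores), key=lambda c: scores[c])

# Hypothetical rankings from seven bagged classifiers over classes A, B, C.
ballots = (
    [["A", "B", "C"]] * 3
    + [["B", "C", "A"]] * 2
    + [["C", "B", "A"]] * 2
)

print(plurality(ballots), anti_plurality(ballots), borda(ballots))
```

On these made-up ballots, plurality selects A, while anti-plurality and Borda both select B, which shows how the choice of voting method alone can change the ensemble's prediction.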

SQL/LPP+: a Language for Temporal Correlation Verification in Representing Time Series by Landmarks

C.-S. Perng, D.S. Parker
Proc. DAWAK, 2000.

Time series data is often generated by continuous sampling or measurement of natural or social phenomena. The resulting quantified records are the basis of scientific analysis and theory construction. In most cases, events are not represented by individual records, and we argue that events can be better represented by time series segments (patterns, temporal intervals). A consequence of this segment-based approach is that the study of temporal correlation among events can be reduced to verifying temporal coupling among occurrences of time series patterns that represent the events.

A major obstacle on the path toward temporal correlation analysis is the inability to define interesting time series patterns. We have introduced SQL/LPP [14], which provides fairly strong expressive power for time series pattern queries, and are now able to attack the problem of specifying queries that analyze temporal correlation.

In this paper, we propose SQL/LPP+, a temporal correlation verification language for time series databases. SQL/LPP+ is an extension of SQL/LPP and inherits its ability to define time series patterns. SQL/LPP+ enables users to cascade multiple patterns using one or more of Allen's temporal relationships [1], and obtain the desired aggregates or meta-aggregates of the composition. The issues of pattern composition control are also discussed.
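For readers unfamiliar with Allen's temporal relationships, the following Python sketch classifies the relation between two intervals. It is not SQL/LPP+ syntax and is not taken from the paper; the function name `allen_relation` and the `(start, end)` tuple encoding are assumptions made for illustration.

```python
def allen_relation(a, b):
    """Return which of Allen's 13 relations holds from interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Only the two partial-overlap cases remain.
    return "overlaps" if s1 < s2 else "overlapped-by"

# Two hypothetical pattern occurrences, given as time intervals.
print(allen_relation((1, 4), (3, 6)))
```

Verifying temporal coupling between two patterns then amounts to checking which relation holds between the intervals of their occurrences.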

Landmark: A New Technique for Similarity-based Pattern Querying in Time Series Databases

C.-S. Perng, H. Wang, S. Zhang, D.S. Parker
Proceedings, ICDE, 2000.

In this paper we present the Landmark, a new technique for similarity-based time series pattern querying. The Landmark does not follow traditional similarity models, which rely on the point-wise Euclidean distance function. Instead, it employs Landmark Similarity, a new similarity model based on human intuition and episodic memory. We show that Landmark Similarity is more general than the Euclidean-based similarity model.

The Landmark is applicable even under six transformations, namely Shifting, Uniform Amplitude Scaling, Uniform Time Scaling, Uniform Bi-scaling, Time Warping and Non-uniform Amplitude Scaling. A method of identifying features that are invariant under these transformations is proposed. We also discuss a generalized approach for removing noise from raw time series without smoothing out the peaks and bottoms. Besides these new capabilities, our experiments show that Landmark Indexing is faster than DFT-based techniques.

Representing Time Series by Landmarks

C.-S. Perng, D.S. Parker, K. Leung.
Proceedings, CIKM, 1999.

In this paper, we propose a new representation for time series, called LANDMARK Representation, as a basis for querying time series patterns. LANDMARK Representation is based on the mechanism of human episodic memory, which processes episodes by events with significant meaning. LANDMARK Representation is not only a way to represent time series data; it also defines a similarity measurement. LANDMARK Representation remains applicable under six transformations, namely Shifting, Uniform Amplitude Scaling, Uniform Time Scaling, Uniform Bi-scaling, Time Warping and Non-uniform Amplitude Scaling. A method of identifying features that are invariant under these transformations is proposed. We also discuss a generalized approach to removing noise from raw time series without smoothing out the peaks and bottoms.
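A minimal Python sketch of the idea follows. It takes landmarks to be the endpoints and strict turning points (local peaks and bottoms) of a series; this is a simplifying assumption made for illustration and may be narrower than the paper's definition. The function name `landmarks` and the sample data are hypothetical.

```python
def landmarks(series):
    """Return (index, value) pairs for the endpoints and strict turning points."""
    if len(series) < 3:
        return list(enumerate(series))
    result = [(0, series[0])]
    for i in range(1, len(series) - 1):
        rise_in = series[i] - series[i - 1]
        rise_out = series[i + 1] - series[i]
        # A sign change in the slope marks a peak or a bottom.
        if rise_in * rise_out < 0:
            result.append((i, series[i]))
    result.append((len(series) - 1, series[-1]))
    return result

series = [1, 3, 2, 2.5, 5, 4]
shifted_and_scaled = [2 * x + 10 for x in series]

# Landmark positions are unchanged by shifting and positive amplitude scaling.
print([i for i, _ in landmarks(series)])
print([i for i, _ in landmarks(shifted_and_scaled)])
```

In this sketch, the positions of the turning points survive shifting and uniform amplitude scaling by a positive factor, which is the kind of invariant feature the abstract refers to.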

Term Domain Distribution Analysis, a Data Mining Tool for Text Databases: A Case History in a Thoracic Lung Cancer Text Database.

J.A. Goldman, W.W. Chu, D.S. Parker, R.M. Goldman
Proceedings, First Intnl. Conf. on Discovery Science, 1998.

In this paper, we give a case history illustrating the real world application of a useful technique for data mining in text databases. The technique, which we call Term Domain Distribution Analysis (TDDA), consists of keeping track of term frequencies for specific finite domains, and announcing significant differences from standard frequency distributions over these domains as a discovery.

In the case study here, the domain of terms was the pair {right,left}, over which we expected a uniform distribution. In analyzing term frequencies in a thoracic lung cancer database, the TDDA technique led to the surprising discovery that primary thoracic lung cancer tumors appear in the right lung more often than in the left lung, with a ratio of 3:2.

Treating the text discovery as a hypothesis, we verified this hypothesis against the medical literature in which primary lung tumor sites were reported, using a standard chi-squared statistic. We subsequently developed a model of lung cancer that may explain the discovery. This discovery and our model may change how oncologists view the mechanisms of primary lung tumor location.
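The counting and testing steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the tokenization, the sample reports, and the counts of 60 and 40 are all hypothetical and were chosen only to reproduce a 3:2 ratio; they are not data from the paper.

```python
from collections import Counter

def domain_counts(documents, domain):
    """Count how often each term of a finite domain occurs across the documents."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            word = token.strip(".,;:()")
            if word in domain:
                counts[word] += 1
    return [counts[term] for term in domain]

def chi_squared(observed, expected_probs):
    """Pearson chi-squared goodness-of-fit statistic against expected proportions."""
    total = sum(observed)
    return sum(
        (o - total * p) ** 2 / (total * p)
        for o, p in zip(observed, expected_probs)
    )

# Hypothetical report snippets, for the counting step only.
reports = ["Mass in the right upper lobe.", "Left lung clear; right lung nodule."]
print(domain_counts(reports, ("right", "left")))

# Hypothetical counts in a 3:2 ratio, tested against a uniform distribution.
statistic = chi_squared([60, 40], [0.5, 0.5])
print(statistic, statistic > 3.841)  # 3.841 is the 5% critical value for 1 degree of freedom
```

With two terms there is one degree of freedom, so a statistic above 3.841 would reject the uniform distribution at the 5% level. Whether a 3:2 ratio is significant depends on the sample size, which is why the made-up counts here say nothing about the paper's actual result.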

Knowledge discovery in an earthquake text database: correlation between significant earthquakes and the time of day.

J.A. Goldman, D.S. Parker, W.W. Chu
Proceedings. Ninth Intnl. Conf. on Scientific and Statistical Database Management, pp. 12-21, 1997.

In this paper, we take a real world application from a text database and present a case history. The techniques ultimately led to a discovery contradicting an accepted paradigm in seismology. Using simple, tailored keyword extraction, we examined a text collection of earthquake data. A discovery was made when an unusual pattern emerged from the text. We then tested a more comprehensive numerical database, treating the text discovery as a hypothesis. It was verified using a standard chi-squared statistic. The hypothesis was that significant earthquakes in the longitude regions that include California occur more often in the morning hours than at any other time of day.

Sample result: 32.8% of statistics are inaccurate.
D. Stott Parker (stott@cs.ucla.edu)
Mon Oct 13 21:55:44 PDT 2003