Stream Databases --- Publications

Stream Databases

NSF IRI 89-17907, 9/15/90 to 8/31/93.

The results of this grant include the publications and software described below.

These materials are all available by anonymous ftp from {\tt pop.cs.ucla.edu (131.179.96.101)}. Filename paths are indicated below.

Publications

D. Stott Parker, Stream Data Analysis in Prolog, in L. Sterling, ed., The Practice of Prolog, Cambridge, MA: MIT Press, 1990.

Today many applications routinely generate large quantities of data. The data often takes the form of a time series, or more generally just a stream -- an ordered sequence of records. Analysis of this data requires stream processing techniques, which differ in significant ways from what current database query languages and statistical analysis tools support. There is a real need for better stream data analysis systems.

Stream analysis, like most data analysis, is best done in a way that permits interactive exploration. It must support `ad hoc' queries by a user, and these queries should be easy to formulate and run. It seems then that stream data analysis is best done in some kind of powerful programming environment.

A natural approach here is to analyze data with the stream processing paradigm of transducers (functional transformations) on streams. Data analyzers can be composed from collections of functional operators (transducers) that transform input data streams to output streams. A modular, extensible, easy-to-use library of transducers can be combined in arbitrary ways to answer stream data analysis queries of interest.

Prolog offers an excellent start for an interactive data analysis programming environment. However, most Prolog systems have limitations that make the development of real stream data analysis applications challenging.

We describe an approach for doing stream data analysis that has been taken in the Tangram project at UCLA. Transducers are implemented not directly in Prolog, but in a functional language called Log(F) that can be translated to Prolog. Many stream processing programs are easy to develop this way. A by-product of our approach is a practical way to interface Prolog and database systems.
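
As a rough illustration of the transducer style -- a minimal sketch in plain Prolog, not the chapter's Log(F) code, and using ordinary (non-lazy) lists to stand in for streams -- each transducer is a relation between an input stream and an output stream, and analyses are formed by composing transducers:

    % Minimal sketch (plain Prolog, not Log(F)); a stream is just a list here.
    % scale/3: multiply every element of the input stream by a factor.
    scale(_, [], []).
    scale(K, [X|Xs], [Y|Ys]) :- Y is K * X, scale(K, Xs, Ys).

    % running_sum/2: emit the running (prefix) sums of the input stream.
    running_sum(Xs, Ys) :- running_sum(Xs, 0, Ys).
    running_sum([], _, []).
    running_sum([X|Xs], Acc, [S|Ys]) :- S is Acc + X, running_sum(Xs, S, Ys).

    % A composed analysis query: scale the data, then accumulate.
    % ?- scale(2, [1,2,3], S1), running_sum(S1, S2).
    %    S2 = [2, 6, 12].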

A. Bostani, CFC: An Efficient Stream-Processing Environment, M.S. Thesis, CSD-920099, UCLA Computer Science Department, March 1992.

This thesis describes the CFC, an efficient stream-processing environment. Stream-processing applications are developed using F*, a new programming language which combines logic programming, rewriting, and lazy evaluation. Our primary focus in this work is to develop an environment for the efficient execution of F* programs and, where necessary, to provide extensions to the F* language itself. In the course of this research, we have developed a compiler that translates a class of F* programs called DF* into instructions for an abstract machine, called DFAM. We show that it is possible to directly translate DFAM programs into C programs which are extremely portable and efficient. CF* is introduced as a novel extension to the C programming language, providing a non-deterministic function call mechanism. We show that general F* programs can be compiled into an extension of DFAM, called FAM. A compiler is described which compiles FAM programs into CF*. Furthermore, we show that it is possible to efficiently implement non-deterministic control structures, such as those found in CF*, on conventional machine architectures. Finally, we have extended F* to make it suitable as a general purpose programming language. Also, unlike the early implementations of F*, this extended F* programming environment does not rely on the availability of Prolog as a host environment.

D. Stott Parker, Eric Simon, and Patrick Valduriez, ``SVP -- a Model Capturing Sets, Streams, and Parallelism'', Full version: Technical Report CSD-9200020, 48pp, UCLA Computer Science Department, April 1992. Proc. 18th International Conference on Very Large Data Bases, pp. 115--126, Vancouver, British Columbia, August 1992.

We describe the SVP data model. The goal of SVP is to model both set and stream data, and to model parallelism in bulk data processing. SVP also shows promise for other parallel processing applications.

SVP models collections, which include sets and streams as special cases. Collections are represented as ordered tree structures, and divide-and-conquer mappings are easily defined on these structures. We show that many useful database mappings (queries) have a divide-and-conquer format when specified using collections, and that this specification exposes parallelism.
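
To make the divide-and-conquer format concrete, here is a minimal sketch in Prolog with a hypothetical collection representation (not SVP's actual one): a collection is empty, a singleton, or an ordered combination of two sub-collections, and an aggregate is defined by one case per constructor:

    % Hypothetical representation (not SVP's): a collection is empty,
    % a singleton leaf(X), or an ordered combination node(L, R).
    dc(_Op,  E, empty,      E).
    dc(_Op, _E, leaf(X),    X).
    dc( Op,  E, node(L, R), V) :-
        dc(Op, E, L, VL),        % the two recursive calls are independent,
        dc(Op, E, R, VR),        % so they can be evaluated in parallel
        call(Op, VL, VR, V).

    plus(A, B, C) :- C is A + B.
    % ?- dc(plus, 0, node(leaf(1), node(leaf(2), leaf(3))), Sum).
    %    Sum = 6.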

We formalize a class of divide-and-conquer mappings on collections called SVP-transducers. SVP-transducers generalize aggregates, set mappings, stream transductions, and scan computations. At the same time, they have a rigorous semantics based on continuity with respect to collection orderings, and permit implicit specification of both independent and pipeline parallelism.

Paul R. Eggert, D. Stott Parker, ``An Intensional File System'', Proceedings of the First USENIX File Systems Workshop, Ann Arbor, MI, May 1992.

This paper summarizes a general file system extension of Unix called IFS, implemented and available at UCLA. IFS permits files to be represented as descriptions (arbitrary user programs) that yield the desired object file as output. This gives users a variety of powerful extensions of the kind currently being proposed in many ad hoc file system research proposals.

Today's file systems store files extensionally, as a sequence of data items like bytes or blocks. This scheme is the simplest and most efficient method for traditional computers, where main memory can be assumed to be limited and disk-to-memory bandwidth adequate. However, these traditional assumptions are becoming obsolete: in more modern, distributed systems, file system clients typically have both cycles to burn and adequate memory for caches. In this new environment, it is often more efficient and convenient to store files intensionally, that is, as a description of how to compute the data items instead of as the data items themselves. In other words, an intensional file is a program that produces the corresponding extensional file as output.

This new opportunity meets a long-standing need for user-extensible file systems, a need that was recognized by earlier researchers. For example, this need is partially addressed by mechanisms provided by recent work in object-oriented file systems. However, most previously proposed mechanisms suffer two major drawbacks: they require kernel modifications or root privileges, and they are too complicated to be given to ordinary users. Much of this complexity stems from the ad hoc nature of the mechanisms' designs. Intensional file systems provide a simple, easy-to-explain abstraction for implementing one file system on top of another, an abstraction that provides motivating design principles for file system designers and makes it easy for programmers to define new file system types.

D. Stott Parker, Bop 0.2 Manual, Technical Report CSD-9200027, UCLA Computer Science Department, June 1992.

Bop is a portable extension of Edinburgh-style Prolog environments that integrates term rewriting with Prolog. Bop rests on a kind of conditional rewrite rule that supports various term rewriting styles; the default rewriting mechanism is nondeterministic and lazy. Bop was developed particularly to support research on stream processing and constraint processing, but it should also be of help in exploring other paradigmatic extensions of logic programming. This manual gives an overview of the system and describes the primitives available, emphasizing their application to stream processing problems.

S-T. Huang, D. Stott Parker, ``A Study of Variants of the Lambek Calculus and an Implementation in Prolog'', Technical Report CSD-9200028, UCLA Computer Science Department, June 1992.

The Lambek Calculus was proposed as a syntactic calculus of expressions of natural languages. The study of one of its variants leads to the discovery of the property of local permutability, and the theorems and proofs suggest directions for useful extensions of the system to meet the concerns of specific applications beyond linguistics. A couple of such possible variants are suggested and discussed. Some recently proposed linguistic extensions are shown to be achievable in those variants, from an algebraic consideration independent of the linguistic proposal. The study also pinpoints the similarities and differences between Categorial Grammars and Phrase Structure Grammars from a computational perspective. An implementation in Prolog is discussed and linked with the modularity and compositionality features of the calculus, an instance of gain through interdisciplinary interplay.
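
For readers unfamiliar with the calculus, a standard Categorial Grammar example (not drawn from the report) assigns noun phrases the type NP and a transitive verb the type (NP\S)/NP, read as ``something that takes an NP on the right and then an NP on the left to yield a sentence S''; the Lambek Calculus then derives sentencehood from the types alone:

\[
NP,\;\; (NP\backslash S)/NP,\;\; NP \;\vdash\; S .
\]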

S-T. Huang, D. Stott Parker, ``Types in Lazy Logic Programs: Computation and Infinite Objects'', Technical Report CSD-9200029, UCLA Computer Science Department, June 1992.

A framework for types in logic programs, in particular those embodying lazy computation and infinite objects, is proposed. Examples are given to illustrate the need. Starting with two methodological views, Types As Sets (subsets of the Herbrand Universe) and Types for The Prolog Computational Model, we give a simple monomorphic type structure and extend it with principles of polymorphism and homogeneity. The SVP model of Parker, Simon and Valduriez is given a detailed type analysis as an application of this framework. Common logic programs serve as other instances of application. Directions for enriching the type structure (Herbrand Type Universe) are discussed.

X. Wang, D. Stott Parker, ``Computing Least Fixed Points by Asynchronous Iterations'', Technical Report CSD-9200030, UCLA Computer Science Department, June 1992.

We establish conditions for convergence of asynchronous iterations and random (chaotic) iterations, such as might be used in evaluation of database queries.
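
For orientation, in the standard formulation (not necessarily the report's notation), a synchronous iteration computes the least fixed point of a continuous operator f on a complete partial order by iterating from the bottom element:

\[
x^{(0)} = \bot, \qquad x^{(k+1)} = f(x^{(k)}), \qquad
\mathrm{lfp}(f) = \bigsqcup_{k \ge 0} f^{k}(\bot) .
\]

An asynchronous or chaotic iteration instead updates different components of x at different times, possibly using out-of-date values of the other components; the report establishes conditions under which such schedules still converge to the same least fixed point.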

H.L. Chau, D.S. Parker, ``Narrowing Grammar: Theory, Implementation, and Applications'', J. Logic Programming 14, 253-286, November 1992.

This paper describes Narrowing Grammar, a new kind of grammar that combines concepts from logic programming, rewriting, lazy evaluation, and logic grammar formalisms such as Definite Clause Grammar (DCG). A Narrowing Grammar is a finite set of rewrite rules. The semantics of Narrowing Grammar is defined by a specialized kind of outermost rewriting strategy called NU-narrowing.

Narrowing Grammar is directly executable, like many logic grammars. In fact, Narrowing Grammar rules can be compiled to Prolog and executed by existing Prolog interpreters as generators or acceptors. Unlike many logic grammars, Narrowing Grammar also permits higher-order specification and modular composition, and provides lazy evaluation by virtue of its rewriting strategy. Lazy evaluation is important in certain language acceptance situations, such as in coroutined matching of multiple patterns against a stream.
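
Narrowing Grammar's own rule syntax is not reproduced here; for comparison, the sketch below shows the analogous situation for ordinary Definite Clause Grammar rules, which Prolog systems likewise compile to clauses usable both as acceptors and as generators:

    % A standard DCG (not Narrowing Grammar syntax), shown only for comparison:
    % the language of n a's followed by n b's.
    ab --> [].
    ab --> [a], ab, [b].

    % Prolog compiles each rule to a clause with two extra arguments that
    % thread the input stream, roughly:
    %   ab(S, S).
    %   ab([a|S0], S) :- ab(S0, S1), S1 = [b|S].

    % ?- phrase(ab, [a,a,b,b]).     % acceptance
    % ?- phrase(ab, L).             % generation: L = [] ; L = [a,b] ; ...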

We compare Narrowing Grammar with several established logic grammars: Definite Clause Grammar, Metamorphosis Grammar, Extraposition Grammar and Gapping Grammar, showing how these logic grammars can be transformed to Narrowing Grammar. We also investigate the versatility of Narrowing Grammar in language analysis by applying it to several natural language examples.

Paul R. Eggert, D. Stott Parker, ``File Systems in User Space'', Proceedings of the Winter 1993 USENIX Conference, pp. 229--239, San Diego, CA, January 1993.

Current methods for interfacing file systems to user programs suffer two major drawbacks: they require kernel modifications or root privileges, and they are too complicated to be given to ordinary users. In this paper we show alternative methods are possible. The recent rise of dynamic linking provides a new way for users to develop their own file systems: by interposing a layer of user code between a program and the system call interface, a user can alter or extend a file system's behavior.

For greatest power and reliability, such changes to file system behavior must be managed systematically. We present two user-extensible file systems that are loosely modeled on intensional logic. IFS0 is simple, and supports only extended pathname interpretation for files: it permits certain shell-like expressions as pathnames. To this, IFS1 adds directory lookup and an escape mechanism for interpreting pathnames that can be modified by the user at any point. These file systems operate by modifying the semantics of Unix system calls that take pathname arguments.

With IFS1 a user can develop a wide range of useful file systems without writing a line of C code. We have developed a variety of sample file systems with IFS1, including tar image navigation and a software development file system resembling 3DFS. IFS1 can thus be thought of as a simple user-programmable file system toolkit.

S-T. Huang, D. Stott Parker, ``Structure and Order Preservation: A Unifying Framework using Dual-M-Sets'', Technical Report CSD-930004, UCLA Computer Science Department, March 1993.

Structure and order are key aspects of computational objects. Structure-preserving morphisms are emphasized in the study of abstract data types, and have been analyzed with universal algebra or category theory. Order-preserving morphisms are traditionally the major focus of program semantics, in particular domain theory and denotational semantics, which typically capture order as increase in information. However, the interaction between structure preservation and order preservation has been studied less. While the two share many similarities, and while in the most interesting cases, such as lattices, they can be defined in terms of each other, their interaction is not trivial: even in lattices, structure preservation is not the same as order preservation.
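
To fix terminology with the standard definitions (not the report's notation): for monoids (M, \cdot, e) and (M', \cdot, e'), a structure-preserving map is a homomorphism, while between ordered sets an order-preserving map is merely monotone:

\[
h(x \cdot y) = h(x) \cdot h(y), \quad h(e) = e'
\qquad\mbox{versus}\qquad
x \le y \;\Rightarrow\; h(x) \le h(y) .
\]

Even between lattices a monotone map need not preserve joins and meets, which is the lattice-theoretic instance of the distinction drawn above.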

This interaction is important in a variety of situations. Our particular interest comes from database processing, especially processing of large collections of structured (``complex object'') data. In such applications, both data structure and the morphisms defined on that data are of central importance, as in the SVP model. Structure preservation and order preservation translate directly to performance benefits in parallel processing, as they permit problems to be partitioned over multiple computers in a direct and natural way. Ordering also permits lazy evaluation techniques, and in particular permits pipeline processing.

In this article, we introduce the notion of dual-M-sets, an algebraic structure capturing dualities in the category of monoids. In special cases this exposes the interaction between structure preservation and order preservation, by simultaneously considering acceptable morphisms and structured objects. As applications, we show that several useful formal systems can be reconstructed as dual-M-sets, including the Bird-Meertens functional calculus for computable functions on lists, and certain monads (Wadler's list comprehensions). The analysis by categorical constructions also points to relevance for database applications with complex objects, in particular structured objects and queries defined in terms of structure preservation, and to possible extensions.

We also show that the Lambek Calculus with type judgement can be modeled as a dual-M-set. The Lambek Calculus has recently enjoyed a revival in formal linguistics as a mathematical foundation of Categorial Grammar. It is a rich grammatical formalism that can serve as the basis for powerful parsing mechanisms, like Narrowing Grammar. Again, with a suitable ordering we can relate structure and order preservation. The Lambek Calculus also translates directly to a minimal propositional fragment of Linear Logic. The translation leads to a minimal propositional Linear Logic without the Exchange structural rule, and supports the claim that linear logic, which is resource-conscious about formulas, may also be location-conscious, that is, sensitive to the order of formulas.

The framework induced by dual-M-sets with natural orderings ties together several interesting concepts, including structure and order preservation, the duality of category and monoid, and list, grammatical, and logical formalisms. It therefore appears both powerful and useful.

M. Coleman, Aesthetics-based Graph Layout for Human Consumption, M.S. Thesis, Technical Report CSD-930004, UCLA Computer Science Department, March 1993.

M. Coleman, D.S. Parker, ``Aesthetics-based Graph Layout for Human Consumption'', UCLA Computer Science Department, June 1993. Submitted to Software -- Practice and Experience. Abstract will appear in the Proceedings of Graph Drawing '93, ALCOM International Workshop on Graph Drawing and Topological Graph Algorithms, Paris, September 26--29, 1993.

Automatic graph layout is an important and long-studied problem in computer science. The basic straight-edge graph layout problem is: Given a graph, position the vertices in a way which maximizes some measure of desirability. When graph layout is intended for human consumption, we call this measure of desirability an aesthetic. We seek an algorithm which produces graph layouts of high aesthetic quality, and which handles trees, directed acyclic graphs, and general graphs.

Our approach is to model graph layout as a multiobjective optimization problem, where the value of a layout is determined by a user-controlled set of layout aesthetics. We justify this model theoretically, and describe our (AGLO) algorithm and its implementation, the aglo library.
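
One common way to make such a multiobjective formulation concrete -- an assumed illustration, not necessarily the paper's exact cost function -- is to score a layout L by a user-weighted sum of aesthetic terms and search for a minimizer:

\[
\mathrm{cost}(L) \;=\; \sum_{i} w_i \, a_i(L),
\qquad L^{*} \;=\; \arg\min_{L} \mathrm{cost}(L),
\]

where each a_i penalizes, for example, edge crossings, node overlaps, or uneven edge lengths, and the weights w_i are under user control.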

The AGLO algorithm combines the power and flexibility of the simulated annealing approach of Davidson and Harel (1989) with the relative speed of the method of Fruchterman and Reingold (1991), and provides a better theoretical foundation for these methods. In addition, we have developed several new layout aesthetics to support new layout styles. Using these aesthetics, we are able to produce pleasing displays for graphs on which these other methods flounder.

Prasad Ram, ``A New Understanding of Greed'', Ph.D. Dissertation, UCLA Computer Science Department, June 1993.

Anyone who has read traditional presentations of what greedy-solvable problems are comes away with a sense of confusion, disbelief, or both. The pervasiveness of greedy algorithms is due to their simplicity, so it is hard to accept that a complex theory (e.g., matroids, greedoids) underlies the class of optimization problems that can be solved using greedy algorithms. It seems impossible to reconcile one's intuitive understanding with matroids, polymatroids, greedoids, submodularity, etc.

Our chief contribution is to provide a new and simple characterization of greed. This is an important problem with applications throughout computer science, and befitting its prominence, it has been worked on by many researchers. Our characterization of greed is a radical departure from the other characterizations based on matroids, polymatroids, greedoids, and submodular functions, and subsumes all these existing formalisms. In this dissertation we propose a novel solution to an important problem.

Greedy algorithms for optimization problems are typically iterative and generate a sequence as an answer. Each step of the iteration uses the prefix sequence generated up to the previous iteration to compute the next element of the sequence. We define a very useful partial order relation on sequences called majorization and immediately see the intimate relationship between greed and majorization. We define lower triangular stochastic matrices (LTSMs) and relate them to majorization. We show that a sequence x is majorized by y if y = xL where L is a lower triangular stochastic matrix. An LTSM can also be viewed as a transformer: one can start with a feasible answer and obtain better answers by repeated application of LTSMs. Order-preserving maps on sequences called Schur-convex functions are defined, and necessary and sufficient conditions for a function to be Schur-convex are identified. The work on defining and developing concepts related to majorization is important in itself, beyond its applicability to characterizing greed.
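
In symbols, restating the abstract's definitions under the usual row-stochastic convention (which may differ in detail from the dissertation's): L is a lower triangular stochastic matrix when its entries are nonnegative, vanish above the diagonal, and sum to one along each row; a sequence x, viewed as a row vector, is then majorized by y when

\[
y = xL, \qquad L_{ij} \ge 0, \qquad L_{ij} = 0 \ \mbox{for } j > i, \qquad \sum_{j} L_{ij} = 1 .
\]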

An optimization problem is greedy solvable if the objective function is Schur-convex on the set of feasible answers when the majorization relation is a topped partial order. This is the first comprehensive definition of greedy solvability. The classical properties of greedy solvable problems (viz. the greedy choice property and the optimal substructure property) can be interpreted in terms of majorization and Schur-convex functions, which helps reconcile our intuitive understanding of greed with our characterization. This definition of greedy solvability creates exciting new possibilities for studying optimization problems in other contexts, such as `greedy heuristics' and `dynamic programming problems'.

C. Amador, ``An Implementation of the SVP Database Model'', Technical Report, UCLA Computer Science Department, July 1993.

We describe an implementation of the SVP data model. SVP is a database model intended to model both set and stream data and to capture parallelism in bulk data processing.

SVP models collections, which include sets and streams as special cases. Collections are represented as ordered tree structures, and divide-and-conquer mappings are easily defined on these structures. We show that many useful database mappings (queries) have a divide-and-conquer format when specified using collections, and that this specification exposes parallelism.

We describe a multithreaded implementation written in C. This implementation is parametric in control schemes for forking threads; that is, one can easily modify the control scheme used for generating new threads of execution while processing queries. It has been run successfully on an Encore Multimax at INRIA Rocquencourt. While performance has not been a key focus of this implementation, we feel the implementation shows what is possible.

Xin Wang, D. Stott Parker, ``Moebius Inversion, Boolean Differentiation, and Incremental Computation'', Technical Report, UCLA Computer Science Department, July 1993.

This paper studies Moebius inversion on Boolean algebras and its relation to generalized Boolean differentials on Boolean rings. A motivation for this study stems from an interest in improving our understanding of the Moebius inversion formula as an analog of the fundamental theorem of calculus, and in pursuing concepts and tools for incremental computation of functions on ordered algebras.
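
For concreteness, the standard statement on the Boolean algebra of subsets of a finite set (not the paper's more general development) relates a function to its summed form, in analogy with differentiation undoing integration:

\[
g(S) \;=\; \sum_{T \subseteq S} f(T)
\qquad\Longleftrightarrow\qquad
f(S) \;=\; \sum_{T \subseteq S} (-1)^{|S \setminus T|}\, g(T) .
\]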

M.H.M. Cheng, M. van Emden, D.S. Parker, ``Applicative Term Rewriting Systems and Prolog Technology'', Proceedings of the HOA'93 International Workshop on Higher Order Algebra, Logic and Term Rewriting, CWI, Amsterdam, September 1993.

We explore Applicative Term Rewriting Systems, first-order term-rewriting systems that axiomatize function application. These systems are similar in flavor to Klop's higher order combinatory reduction systems, but involve only first-order terms.

Applicative term-rewriting systems have several aspects that are interesting to us. First, they are simple, and their rules can be integrated directly with declarative, first-order axiomatizations of the equality relation. Second, they are easily implemented with Prolog, thereby providing a basis for a Prolog Technology Term Rewriter in the sense of Stickel's ``Prolog Technology Theorem Prover'', and offering both easy implementation and efficient execution. Third, they permit specification of computationally useful fragments of higher order equality theories, by allowing selective omission of equality axioms. Finally, in this approach innermost and outermost reduction, as well as many combinations of these, are obtained by merely rearranging the clauses that express equality axioms.
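
A minimal sketch of the flavor in ordinary Prolog (not the paper's system): application is reified as the first-order functor ap/2, single-step root reduction is axiomatized by clauses -- here for the S and K combinators -- and the clause ordering in the normalizer determines the reduction strategy:

    % Application as a first-order term; one-step root reductions as clauses.
    reduce(ap(ap(k, X), _Y),       X).
    reduce(ap(ap(ap(s, X), Y), Z), ap(ap(X, Z), ap(Y, Z))).

    % An outermost normalizer: try to reduce at the root first, then recurse
    % into subterms; reordering these clauses changes the strategy.
    normalize(T, N) :-
        reduce(T, T1), !, normalize(T1, N).
    normalize(ap(F, A), N) :- !,
        normalize(F, NF), normalize(A, NA),
        ( reduce(ap(NF, NA), T1) -> normalize(T1, N) ; N = ap(NF, NA) ).
    normalize(T, T).

    % ?- normalize(ap(ap(ap(s, k), k), x), N).   % N = x, since S K K = I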

A number of higher order term rewriting systems proposed recently rely on simplified lambda-term unification algorithms, implementing fragments of equality theories for lambda-terms. Our approach provides both a way to implement such schemes, and an alternative equality theory when the fragment required is limited (as is often the case).

S-T. Huang, ``Dual M-Sets: The Theoretical Framework and Applications'', Ph.D. Dissertation, Technical Report, UCLA Computer Science Department, December 1993.

A new framework, the dual-M-set, is proposed both as a means for structuring certain kinds of programs, and as a means for reasoning about aspects of software practice such as programming by patterns.

The framework is based upon the `rigid body' metaphor in physics, an outlook which models information as the result of operator applications (measurements) on immutable objects (rigid bodies). This framework takes monoids as points and vectors, and structure-preserving homomorphisms as linear operators. Order-preserving endomorphisms (in particular, induction morphisms) are also included as basic elements. The conceptual framework treats both the objects and their functions as first-class citizens. Instead of taking a conventional approach using category theory, we emphasize the local, though universally uniform, analysis of the construction and interaction of subalgebraic closures, homomorphisms, and order-preserving morphisms.

We apply the framework to model, comprehensively, two different systems: the Theory of Lists (also known as the Bird-Meertens formalism), for specification and systematic derivation of programs; and the Lambek Calculus, a deductive system and formal foundation for linguistic applications in Categorial Grammars and formal semantics. Beyond the modeling and new proofs of old results, we also develop extensions inspired by the perspectives that accompany the dual-M-set modeling.

As the rigid body metaphor is very basic to the modeling of information, monoid structures are prevalent in current mathematical program structures. The framework of dual-M-sets aspires to a mathematical program development methodology exploiting this basic metaphor, in keeping with the common long-term aim of developing programs as mathematical objects susceptible to mathematical analysis and manipulation. With the encouraging results from its application to the two examples, we believe that the framework of dual-M-sets is a powerful and promising tool with wide applicability.

Shen-Tzay Huang, D. Stott Parker, ``Kahn's Networks of Processes as discrete analogues of Picard's Systems of Differential Equations'', Technical Report, UCLA Computer Science Department, February 1994.

Anyone who has studied parallel programming with Kahn's networks of processes and also systems theory or control theory has noticed a strong similarity between the two. For one thing, both use similar diagrams, often with feedback, to express computations formed as compositions of incremental transforms. What is a mathematical basis for reconciling the two?

Picard's theorem is a classical result in the theory of differential equations and integral equations. Kahn's theorem has similar renown in the theory of parallel programming, networks of processes, and stream processing. In this paper we clarify how these results are related, and show that Kahn's theorem and Picard's theorem are both special cases of a more general result. That we can do this may be surprising, since the foundations of the two theorems are quite different: metric function spaces and the fixed point theorem for contraction mappings in Picard's case, and complete partially ordered sets and the fixed point theorem for continuous mappings in Kahn's.

Our strategy is to translate the foundations for Kahn's theorem into appropriate metric space concepts. We accomplish this by establishing discrete versions of Picard's Theorem for the metric space of streams. We define contraction mappings over the set of functions of streams, and show that Kahn's Theorem is a variant of Picard's Theorem in this setting.
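
For readers who want the two settings side by side (standard definitions, not the report's notation): on streams one can take the prefix metric, under which Picard-style arguments require F to be a contraction, whereas Kahn's setting orders streams by prefix and requires F to be continuous:

\[
d(x, y) \;=\; 2^{-n} \ \ (d(x,x) = 0), \quad n = \mbox{length of the longest common prefix of } x \mbox{ and } y;
\qquad
d(F(x), F(y)) \;\le\; c\, d(x, y), \ \ c < 1 .
\]

Banach's fixed-point theorem then yields a unique fixed point of F, just as the continuity assumption yields Kahn's least fixed point; the report shows how the two results arise as instances of one more general theorem.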

We explore a number of representative examples showing applications of this perspective: Mealy machines (showing feedback-free solution), stream prefix accumulators and the `powerset' function (showing both simple and complex feedback loops), recursively-defined fractal objects (such as Cantor's set), and formal power series or generating functions. The metric perspective seems particularly useful for stream computations.

The result here, that Kahn's theorem is related to Picard's theorem, is not just a technical result. It gives a formal connection between stream processing and differential equations, a connection between continuous and discrete mathematical models. This connection gives new ways to formalize incremental computation, deal with infinite objects like streams, understand the role of feedback, and adapt tools in the theory of differential equations to stream processing.

Ping-Yu Hsu, D. Stott Parker, ``Improving SQL with Generalized Quantifiers'', Proc. Data Engineering 1995, to appear.

Generalized quantifiers are a particular kind of operator on sets. Recently coming under increasing attention from linguists and logicians, they correspond to many useful natural language phrases, including phrases like: three, Chamberlin's three, more than three, fewer than three, at most three, all but three, no more than three, not more than half the, at least two and not more than three, no student's, most male and all female, etc.

Reasoning about quantifiers is a source of recurring problems for most SQL users, and leads to both confusion and incorrect expression of queries. By adopting a more modern and natural model of quantification these problems can be alleviated. In this paper we show how generalized quantifiers can be used to improve the SQL interface.
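
As a concrete instance of the underlying set semantics -- standard generalized-quantifier readings, not examples taken from the paper -- a quantified phrase denotes a relation between the restriction set A and the predicate set B:

\[
\mbox{``at least three } A \mbox{ are } B\mbox{''} \iff |A \cap B| \ge 3,
\qquad
\mbox{``most } A \mbox{ are } B\mbox{''} \iff |A \cap B| > |A \setminus B| .
\]

Cardinality conditions of this kind are what the paper proposes to make directly expressible in the SQL interface.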

D. Stott Parker (stott@cs.ucla.edu)
Fri Sep 29 13:48:18 PDT 1995