\documentstyle[twocolumn]{article}
\setlength{\textwidth}{6.0in}
\setlength{\oddsidemargin}{0.25in}
\setlength{\evensidemargin}{0.25in}
\addtolength{\textheight}{1.8in}
\addtolength{\topmargin}{-0.9in}
%% include and place a tpic figure
\newcommand{\glofig}[1]{\input{#1}\centerline{\box\graph}}
\newcommand{\glot}[1]{(\input{#1.time}s)}
\newcommand{\gtabline}[6]{#1 & #2 & #3 & #4 & #5 & {\tt\footnotesize #6} \\}
\newcommand{\glop}[1]{
\begin{quote}
{\tt gloss \input{#1.param}}
\end{quote}
}
\newcommand{\SVP}{{SVP}}
\newcommand{\turnstile}{{\vdash}}
\newcommand{\turnstileStar}{{\stackrel{*}{\vdash}}}
\newcommand{\derives}{{\Rightarrow}}
\newcommand{\derivesStar}{{\stackrel{*}{\Rightarrow}}}
\newcommand{\sizeOfString}[1]{{\mid\! #1 \!\mid}}
\newcommand{\sizeOfSet}[1]{{\mid\!\mid\! #1 \!\mid\!\mid}}
\newcommand{\pointDistance}[1]{{\mid\!\mid\! #1 \!\mid\!\mid}}
\newcommand{\boldvector}[1]{{\bf #1 }}
%% tuples
\newcommand{\tuple}[1]{{( #1 )}}
%% streams
\newcommand{\stream}[1]{{[ #1 ]}}
\newcommand{\eof}{{[\,]}}
\newcommand{\Cdot}{{\; \cdot \;}}
\newcommand{\append}{{\, \bullet \,}}
\newcommand{\Append}{{\; \bullet \;}}
%% collections
\newcommand{\collection}[1]{{\langle #1 \rangle}}
\newcommand{\eoc}{{\langle \rangle}}
\newcommand{\oppendR}{{\, \diamond^R \,}}
\newcommand{\oppend}{{\, \diamond \,}}
\newcommand{\Oppend}{{\; \diamond \;}}
\newcommand{\collprefix}{{\preceq_{\diamond}}}
\newcommand{\collseq}{{\preceq_{\diamond seq}}}
\newcommand{\collstrictseq}{{\prec_{\diamond seq}}}
\newcommand{\cappend}{{\, \star \,}}
\newcommand{\cAppend}{{\; \star \;}}
%% arrays
\newcommand{\quadtree}[4]{{
{
\left(
{
\begin{array}{cc}
#1 & #2 \\
#3 & #4 \\
\end{array}
}
\right)
}
}}
\newenvironment{aligneddefs}{\[
\begin{array}{lcl}
}{\end{array}
\]%
}
\newenvironment{citations}{
\begin{quote}
\sf
}{
\end{quote}
}
\input psfig
%%%\pagestyle{empty} % numbering?
\begin{document}
%%%\psdraft % omit postscript includes?
\title{{\large \bf Stream Databases}}
\author{D. Stott Parker \\
Computer Science Department \\
University of California, Los Angeles, CA 90024-1596}
\date{Final Report for NSF Grant IRI 89-17907}
\maketitle
\begin{abstract}
This project was originally motivated by the need to develop
a foundation for constructing database systems that manipulate
streams. Such a project is inherently more complex than one for
ordinary database systems, because a comprehensive model of
stream data processing requires explicit handling of
computation. What may seem a simple job at first glance becomes
quite involved as one gets down to specifics.
Much of the burden of mastering any vast information base
lies simply in becoming familiar with it.
Information filtering and exploration tools are necessary.
A common approach is to impose only a limited structure on the information base,
and to develop a `query language' for browsing the resulting structure.
We have concentrated on studying fundamental issues for systems handling
streams and sets, but we have also considered graphs and arrays.
The results of this work range from query languages to system architectures
incorporating evaluation of expressions as a primitive.
\end{abstract}
\section{Information Structures}
Over the course of this project,
we investigated several kinds of useful structures:
streams, sets, graphs, and arrays.
The primary focus of the investigation was on streams,
as they are in a sense the most basic structure.
\subsection{Stream Structures}
A {\em stream} is an ordered sequence of data items.
The stream structure is natural for representing textual objects
(such as programs and program traces).
Imposing a stream structure on information reduces
the problem of its analysis to the problem of manipulating streams.
The stream data analysis paradigm has proven itself in both statistical
and scientific areas, and has been adopted in a large number of tools,
including visualization tools (AVS and apE),
statistical analysis tools (S, etc.),
and performance analysis tools.
However, these tools are often difficult to understand or extend,
and are based upon limited graphical or textual models.
We have concentrated on studying fundamental issues in stream processing,
in the interest of developing general models and languages for
analyzing streams.
In stream data analysis, the data to be studied are cast in the form of streams
and analysis tools are cast in the form of transducers.
Just as in a functional programming environment,
these transducers are viewed as basic building blocks by data analysts.
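This building-block view maps naturally onto lazy generator pipelines. As a rough illustration (a Python sketch of the idea; the transducer names are ours, not Bop code):

```python
def scale(stream, factor):
    """Transducer: multiply each item of the input stream."""
    for x in stream:
        yield x * factor

def clip(stream, lo, hi):
    """Transducer: drop items outside [lo, hi]."""
    for x in stream:
        if lo <= x <= hi:
            yield x

# Transducers compose like Unix pipeline stages.
source = iter([1, 5, 12, 7])
result = list(clip(scale(source, 2), 0, 15))  # [2, 10, 14]
```

Each stage consumes its input lazily, so a pipeline of such transducers can process unbounded streams in constant space.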
Below we summarize work on several languages for stream analysis.
\subsection{Other Structures}
Understanding language issues for streams
permits generalization to other useful structures.
For example, {\em arrays} of data can be treated like (2-dimensional) streams,
and many matrix algebra operations also fit the stream data analysis paradigm.
%\footnote{
%Note that arrays have been implemented elegantly in
%the Haskell functional language \cite{Hudak1989}.
%}
This is sometimes cited as a basis for the success of
the APL programming language and the Unix operating system.
Another critical structure is the {\em set},
the foundation of most database systems.
Sets are easily modeled by streams, and in fact most database systems
implement sets as streams anyway.
A great deal of useful perspective is obtained for set languages
by first understanding the issues in stream languages well.
We have spent a considerable amount of time on graph structures as well.
As graph structures are difficult to present visually, we developed
a general graph layout system (AGLO) for this purpose.
It will be described below.
\section{Bop}
At UCLA we have developed several stream-processing languages
over the past five years.
The first significant language, called {\em Bop},
combined features of lazy functional languages and logic programming.
In addition, it incorporated features for object-oriented program structuring
and modularization.
Although it is not feasible to explain the system in detail
within the space limitations here, we can illustrate the flavor of the system.
In Bop, stream mappings (transducers) are expressed using conditional
rewrite rules.
These rules have a functional flavor,
and the result is similar in many ways to Miranda and ML,
two members of the most recent generation of functional programming languages.
However, the system retains various expressiveness advantages
inherited from Prolog, in which it was implemented,
along with features for modularization and typing.
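The flavor of such rules can be suggested by a small example: a transducer that reduces a stream to its maximum, sketched here in Python with an explicit accumulator (an analogue of the corresponding Bop rewrite rules, not Bop syntax):

```python
def stream_max(stream):
    """Reduce a nonempty stream to its maximum, one item at a time."""
    it = iter(stream)
    best = next(it)          # like max([X|S]) => maxAccum(S, X)
    for x in it:             # like maxAccum([X|S], Max) => maxAccum(S, XMax)
        best = max(best, x)
    return best              # like maxAccum([], Max) => Max

stream_max([3, 5, 8])  # 8
```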
% Some Bop transducers are presented below
% that compute statistical aggregates and parse program text.
% To begin with, let us just consider a transducer that computes the maximum
% number in a stream:
% {\small
% \begin{verbatim}
% max([X|S]) => maxAccum(S, X).
%
% maxAccum([], Max) => Max.
% maxAccum([X|S],Max) => maxAccum(S, XMax) :-
% XMax is max(Max,X).
% \end{verbatim}
% }
% This program reads a stream, which is either \verb"[]" -- the empty stream --
% or \verb"[X|S]" -- the number \verb"X" followed by a subsequent stream \verb"S".
% The program works by repeatedly comparing
% the number \verb"X" to its running maximum.
%
% A similar transducer incrementally computes the first three moments of a stream:
% {\footnotesize
% \begin{verbatim}
% moments3([X|S]) => moments3(S, 1, X, 0.0, 0.0).
%
% moments3([], N,XN,M2_N,M3_N) => [].
% moments3([X|S],N,XN,M2_N,M3_N) =>
% [(N,XN,M2_N,M3_N)|moments3(S,N1,XN1,M2_N1,M3_N1)] :-
% N1 is N+1,
% DX is (X-XN)/N1,
% DXsq is DX*DX,
% XN1 is XN + DX,
% M2_N1 is M2_N + N1*N * DXsq,
% M3_N1 is M3_N - DX * (3*M2_N - N1*N*(N-1) * DXsq).
% \end{verbatim}
% }
% This gives an incremental algorithm for computing the moments
% \(
% M_r^n = \sum_{i=1}^{n} (x_i - \bar{x}_n)^r
% \)
% for $r = 2, 3$.
% When run on a file containing the integers from 1 to 1000, for example,
% we obtain a stream of quadruples
% \begin{verbatim}
% [ (1, 1.0, 0.0, 0.0),
% ... ,
% (1000, 500.0, 83333250.0, 0.0) ].
% \end{verbatim}
% This transducer is textually very much like Fortran code, but dispenses
% with the I/O and/or array manipulation code needed to access
% the sequence of input values.
Bop is a language in the Unix tradition of stream processing,
but strives for more general applicability by virtue of its
greater expressive power
(such as its ability to support pattern matching and parsing directly),
extensibility through composition,
and support for switching among threads in stream computations.
It is theoretically possible to perform stream processing in any
Turing-complete programming language;
however, Bop is geared specifically to stream-processing tasks.
Essentially, Bop provides the programmer the ability to ignore the details
of stream I/O
(opening and closing of streams,
reading and writing of potentially complex structures, etc.)
and switching control among stream-processing threads,
as well as to exploit stream pattern matching.
This makes it well suited
to sophisticated analysis of program text and event traces.
The Bop system is an extension of Prolog supporting general rewriting,
constraint processing, and in particular stream processing.
It consists of an interpreter, a library with several hundred
predefined functions, and development tools.
\begin{citations}
\noindent
D.S. Parker, Stream Data Analysis in Prolog.
in: {\em The Practice of Prolog}, Leon Sterling, Editor,
Cambridge, MA: MIT Press (1990).
\noindent
A. Bostani,
{\em CFC: An Efficient Stream-Processing Environment},
M.S. Thesis,
UCLA Computer Science Department, March 1992.
\noindent
D. Stott Parker,
{\em Bop 0.2 Manual},
Technical Report CSD-9200027,
UCLA Computer Science Department, June 1992.
\end{citations}
\section{SVP}
Following on the experience with Bop,
we investigated generalizations for other structures, particularly sets.
The result was {\em SVP},
a database model intended both for modeling set and stream data
and for modeling parallelism in bulk data processing.
SVP models {\em collections},
which include sets and streams as special cases.
Collections are represented as binary tree structures,
and divide-and-conquer mappings are easily defined by recursions
on these structures.
Many database mappings (queries) can be defined with
divide-and-conquer mappings over structures.
For example, the `maximum' of a set of values can be defined
with such a mapping:
\begin{aligneddefs}
max(\{\}) & = & \bot \\
max(\{x\}) & = & x \\
max(S_1 ~\cup~ S_2) & = & max(S_1) ~\sqcup~ max(S_2)
\end{aligneddefs}
These equations define a set mapping (in this case an aggregate) recursively,
where `$\sqcup$' is the binary maximum operator.
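Read as a recursion over a binary-tree representation of sets, this definition becomes a short divide-and-conquer program. A hedged Python sketch (the tuple encoding and the restriction to numeric leaves are ours):

```python
# A set is encoded as None (empty), a numeric leaf, or a (left, right) pair
# of nonempty subtrees -- mirroring the three equations for max.
def dc_max(s):
    if s is None:                 # max({}) = bottom
        return None
    if isinstance(s, tuple):      # max(S1 u S2) = max(S1) |_| max(S2)
        return max(dc_max(s[0]), dc_max(s[1]))
    return s                      # max({x}) = x

dc_max(((3, 5), 8))  # 8
```

The two recursive calls on `s[0]` and `s[1]` are independent, which is exactly the parallelism the recursive specification exposes.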
% This definition reflects parallelism that can
% be obtained by computing cardinalities of subsets independently.
% For example, in the computation
% \begin{aligneddefs}
% max(\{3,5,8\})
% & = & max(\{3,5\}) ~\sqcup~ max(\{8\}) \\
% & = & (max(\{3\}) ~\sqcup~ max(\{5\})) ~\sqcup~ 8 \\
% & = & (3 ~\sqcup~ 5) ~\sqcup~ 8 \\
% & = & 5 ~\sqcup~ 8 \\
% & = & 8
% \end{aligneddefs}
% we have ultimately three independent parallel threads that are
% `fanned-in' to an aggregate.
\medskip
The stream mapping
\begin{aligneddefs}
{\em diffs}( \eof ) & = & \eof \\
{\em diffs}(x \cdot \eof ) & = & \eof \\
{\em diffs}(x \cdot y \cdot S) & = & (y - x) \Cdot {\em diffs}(y \cdot S)
\end{aligneddefs}
yields a stream of the differences
between adjacent elements in the input stream.
% For example:
% \begin{eqnarray*}
% \lefteqn{{\em diffs}( 98 \cdot 99 \cdot 97 \cdot 97 \cdot 99 \cdot 96 \cdot \
% eof )} \\
% & = & +1 \Cdot {\em diffs}( 99 \cdot 97 \cdot 97 \cdot 99 \cdot 96 \cdot \eof
% ) \\
% & = & +1 \Cdot -2 \Cdot {\em diffs}( 97 \cdot 97 \cdot 99 \cdot 96 \cdot \eof
% ) \\
% & = & +1 \Cdot -2 \Cdot ~0 \Cdot {\em diffs}( 97 \cdot 99 \cdot 96 \cdot \eof
% ) \\
% & = & +1 \Cdot -2 \Cdot ~0 \Cdot +2 \Cdot {\em diffs}( 99 \cdot 96 \cdot \eof
% ) \\
% & = & +1 \Cdot -2 \Cdot ~0 \Cdot +2 \Cdot -3 \Cdot {\em diffs}( 96 \cdot \eof
% ) \\
% & = & +1 \Cdot -2 \Cdot ~0 \Cdot +2 \Cdot -3 \Cdot \eof .
% \end{eqnarray*}
This mapping implements a transducer that scans the stream of values
and translates it to a stream of pairwise differences.
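Transcribed directly from the three equations, a list-based Python sketch (an illustration, not SVP notation):

```python
def diffs(s):
    """Stream of differences between adjacent elements, per the equations."""
    if len(s) < 2:                       # diffs([]) = diffs(x.[]) = []
        return []
    x, y, rest = s[0], s[1], s[2:]
    return [y - x] + diffs([y] + rest)   # (y - x) . diffs(y . S)

diffs([98, 99, 97, 97, 99, 96])  # [1, -2, 0, 2, -3]
```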
In SVP, collections are defined as binary trees.
% In SVP, collections are recursively defined as follows:
% \begin{itemize}
% \item $\eoc$ is the empty collection.
% \item $\collection{v}$ is a unit collection, if $v$ is a SVP value.
% \item $S_1 \Oppend S_2$ is a collection
% if $S_1$ and $S_2$ are {\em nonempty\/} SVP collections.
% (Collections are forbidden to properly contain the empty collection.)
% \end{itemize}
This definition allows SVP collections to model many structures of interest,
including sets, streams, sequences, indices (balanced trees), etc.
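The point is that one collection of values can take several shapes. A hedged Python sketch, encoding a unit collection $\collection{v}$ as a leaf and $S_1 \Oppend S_2$ as a pair (the encoding is ours):

```python
# Leaves are unit collections; pairs apply the binary combinator.
set_1234    = ((1, 2), (3, 4))    # balanced tree: a set
stream_1234 = (1, (2, (3, 4)))    # right-linear tree: a sequence/stream

def elements(c):
    """Flatten any collection shape back to its element list."""
    if isinstance(c, tuple):
        return elements(c[0]) + elements(c[1])
    return [c]

elements(set_1234) == elements(stream_1234)  # True: same contents, different shape
```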
% including:
% \begin{itemize}
% \item {\em sets} \\
% The SVP-collection
% \(
% ( \collection{1} \Oppend \collection{2})
% \Oppend
% ( \collection{3} \Oppend \collection{4})
% \)
% represents the set \{1,2,3,4\} as a balanced binary tree.
% \item {\em streams} \\
% The SVP-collection
% \(
% \collection{1} \Oppend
% ( \collection{2} \Oppend
% ( \collection{3} \Oppend
% ( \collection{4} \Oppend \eof )))
% \)
% represents the stream {[1,2,3,4]} as a linear list-like structure.
% \item {\em sequences} \\
% A sequence is a right-linear tree, such as the SVP-collection
% \(
% \collection{1} \Oppend
% ( \collection{2} \Oppend
% ( \collection{3} \Oppend
% \collection{4} )).
% \)
% A sequence is a non-$\eof$-terminated stream.
% In many situations the stream terminator $\eof$ is not significant,
% and can be omitted.
% \end{itemize}
%
% Divide-and-conquer mappings are easily defined on collection structures.
% We call these divide-and-conquer mappings {\em SVP-transducers}.
% Examples include:
% \begin{enumerate}
%
% \item
% The set mapping $max$ above is expressible as the SVP transducer
% \begin{aligneddefs}
% {\sf max}( \eoc ) & = & \eoc \\
% {\sf max}( \collection{x} ) & = & x \\
% {\sf max}(S_1 \Oppend S_2) & = & {\sf max}(S_1) \sqcup {\sf max}(S_2).
% \end{aligneddefs}
%
% \item
% The {\em diffs} transducer above
% can be easily implemented as an SVP transducer,
% assuming that the input collection is a sequence (a right-linear tree),
% that is not terminated with $\eof$:
% \begin{aligneddefs}
% {\sf diffs}( S ) & = & {\sf diffs1}( \eoc , S) \\[6pt]
% {\sf diffs1}(Q, \eoc ) & = & \eoc \\
% {\sf diffs1}(Q, \collection{x} ) & = &
% \mbox{{\bf if} $Q = \eoc$ {\bf then} $\eoc$
% {\bf else} $\collection{x \, - \, {\sf unitcollectionvalue}(Q)}$} \\
% {\sf diffs1}(Q, S_1 \Oppend S_2) & = & {\sf diffs1}(Q,S_1) \Oppend {\sf diffs1}
% ( S_1, \, S_2).
% \end{aligneddefs}
% In fact this transducer can be generalized to handle
% arbitrary SVP collections, by introducing `restructuring'
% transducers that coerce the input into the assumed sequence format.
%
% \end{enumerate}
On these collections, SVP-transducers are defined as generalizations
of the mappings shown above.
SVP-transducers generalize aggregates, set mappings,
stream transductions, and scan computations.
A further interesting property of the SVP model is that the recursive
specification of queries exposes {\em parallelism}.
At the same time, SVP-transducers have a rigorous semantics
(based on continuity with respect to collection orderings)
that permits polymorphic typing
and implicit specification
of both independent and pipeline parallelism.
To evaluate its promise, we developed a parallel implementation of SVP.
The main objective was to produce a system that would support
execution of SVP queries on a system supporting little more than
multiple threads, like the C Threads interface in Mach.
This implementation showed that the critical issue is
the scheduling policy used in deciding when
to fork threads while executing an expression in parallel.
\begin{citations}
\noindent
D. Stott Parker, Eric Simon, and Patrick Valduriez,
``SVP -- a Model Capturing Sets, Streams, and Parallelism'',
Technical Report CSD-9200020, 48pp,
UCLA Computer Science Department, April 1992.
Submitted to {\em ACM TODS}.
\noindent
D. Stott Parker, Eric Simon, and Patrick Valduriez,
``SVP -- a Model Capturing Sets, Streams, and Parallelism'',
{\em Proceedings of the 18th VLDB Conference}, pp. 115--126,
Vancouver, British Columbia, Canada, August 1992.
\noindent
S-T. Huang, D. Stott Parker,
``Types in Lazy Logic Programs: Computation and Infinite Objects'',
Technical Report CSD-9200029,
UCLA Computer Science Department, June 1992.
\noindent
C. Amador,
``An Implementation of the SVP Database Model'',
Technical Report,
UCLA Computer Science Department, July 1993.
\end{citations}
\section{Incremental Processing and Term Rewriting}
Database systems require a variety of `small' computations to be applied
to a `large' structure.
For example, an insertion adds a tuple to a relation,
and an aggregate runs a simple numeric computation over a database.
With complex structures like streams,
such operations are prevalent, and have a particularly `incremental' flavor.
This has raised our interest in incremental processing for database systems.
Recently, there has been a great deal of interest in
`incremental algorithms', since they arise in many important situations:
pattern matching,
reevaluation of expressions that have changed slightly,
handling of constraints,
computing fixed points of functions,
evaluation in lazy functional languages,
and greedy algorithms to name a few.
After some experimentation, we feel incremental processing is
best understood in terms of rewriting,
an abstract form of expression evaluation defined both by rewrite rules
and by the policies used in selecting and applying these rules.
The Bop and SVP systems mentioned above have elegant definitions in terms of
term rewriting, and this is likely to be the case in any system definable
with rewrite rules.
Furthermore,
as term rewriting is basic to a number of programming methods, such as
functional programming, symbolic mathematics, theorem proving, and
constraint solving, its efficient implementation is important.
We have developed relatively efficient implementations by
exploiting the combination of high level and
efficient implementation of Prolog-like languages.
A grammar is itself a system of rewrite rules.
From our earlier work on Bop we were interested in pattern matching,
and we were able to develop general grammars as Bop programs.
The implementation of Bop allows it to double
as an implementation of {\em Narrowing Grammar},
an extremely powerful kind of attribute grammar.
Recently we have studied specializations of Narrowing Grammar
and term rewriting with a simpler semantics,
which will lead to expressive grammars that are efficient and elegant.
A result of this work was a reconstruction
of term rewriting in first-order logic, showing that term rewriting
(a theory riddled with arcane terminology) can be both summarized
elegantly in logic, and executed very efficiently with state-of-the-art
logic programming systems.
We have also shown that incremental processing is a subject
that is amenable to algebraic techniques in surprising ways.
Specifically,
it is possible to define symbolic analogues of traditionally
numeric concepts such as `signal', `exchange', and `differential'.
One interesting result is a new algebraic characterization of greedy algorithms,
incremental optimization procedures that process inputs as a stream
using the best-first ordering.
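Such best-first stream processing can be sketched as follows (a hypothetical Python illustration using a heap; the feasibility predicate and the small numeric instance are ours):

```python
import heapq

def greedy(items, feasible, key):
    """Consume items best-first; keep each one if the choice stays feasible."""
    heap = [(-key(x), i, x) for i, x in enumerate(items)]
    heapq.heapify(heap)                  # best-first ordering of the input
    chosen = []
    while heap:
        _, _, x = heapq.heappop(heap)
        if feasible(chosen + [x]):       # greedy: commit and never backtrack
            chosen.append(x)
    return chosen

# Pick numbers summing to at most 10, largest first.
greedy([4, 7, 2, 5], feasible=lambda c: sum(c) <= 10, key=lambda x: x)  # [7, 2]
```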
% We have formalized such algorithms in terms of the majorization
% on sequences, showing that they yield optimal results when the
% desired objective is a Schur convex function of the sequence.
\begin{citations}
\noindent
S-T. Huang, D. Stott Parker,
``A Study of Variants of the Lambek Calculus
and an Implementation in Prolog'',
Technical Report CSD-9200028,
UCLA Computer Science Department, June 1992.
\noindent
X. Wang, D. Stott Parker,
``Computing Least Fixed Points by Asynchronous Iterations'',
Technical Report CSD-9200030,
UCLA Computer Science Department, June 1992.
\noindent
H.L. Chau, D.S. Parker,
``Narrowing Grammar: Theory, Implementation, and Applications'',
{\em J. Logic Programming} {\bf 14}, 253--286,
November 1992.
\noindent
S-T. Huang, D. Stott Parker,
``Structure and Order Preservation:
A Unifying Framework using Dual-$\cal M$-Sets'',
Technical Report CSD-930004,
UCLA Computer Science Department, March 1993.
\noindent
Prasad Ram,
``A New Understanding of Greed'',
Ph.D. Dissertation, UCLA Computer Science Department,
June 1993.
\noindent
M.H.M. Cheng, M. van Emden, D.S. Parker,
``A Prolog Technology Term Rewriter'',
Technical Report, Computer Science Department,
University of Victoria, Victoria B.C., Canada, July 1993.
\noindent
M.H.M. Cheng, M. van Emden, D.S. Parker,
``Applicative Term Rewriting Systems and Prolog Technology'',
Technical Report, Computer Science Department,
University of Victoria, Victoria B.C., Canada, July 1993.
\noindent
X. Wang, D. Stott Parker,
``M\"{o}bius Inversion, Boolean Differentiation, and Incremental Computation'',
Technical Report, UCLA Computer Science Department, July 1993.
\end{citations}
\section{Graph Layout}
Graph structures (collections of vertices and edges)
arise frequently in database systems: E-R diagrams and semantic data
models, CAD, complex objects, part-subpart and inheritance hierarchies, etc.
As an initial part of this project,
we developed a general tool for visualization of graphs.
AGLO is a graph layout system that provides both a library
of user-selectable aesthetics for graph rendering,
and an engine for laying out graphs by constrained optimization of
aesthetic objectives.
The engine uses a hybrid algorithm that combines the best of
simulated annealing and force-directed placement.
Generally, AGLO improves on existing graph layout schemes
in its generality-performance tradeoff; it permits specification of
aesthetic layouts in a way not previously possible with reasonable
execution time.
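The hybrid can be suggested by a small sketch (a hypothetical Python illustration, not the AGLO engine): repulsive forces between all vertex pairs, attractive forces along edges, and an annealing-style temperature that caps and gradually shrinks each move.

```python
import math, random

def layout(nodes, edges, iters=200):
    """Force-directed moves with an annealing-style cooling schedule."""
    pos = {v: (random.random(), random.random()) for v in nodes}
    temp = 0.1                                   # max move size ('temperature')
    for _ in range(iters):
        for v in nodes:
            fx = fy = 0.0
            for u in nodes:                      # repulsion between all pairs
                if u == v:
                    continue
                dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
                d2 = dx * dx + dy * dy + 1e-9
                fx, fy = fx + dx / d2, fy + dy / d2
            for a, b in edges:                   # attraction along edges
                if v in (a, b):
                    u = b if v == a else a
                    fx -= pos[v][0] - pos[u][0]
                    fy -= pos[v][1] - pos[u][1]
            n = math.hypot(fx, fy) + 1e-9
            step = min(temp, n)
            pos[v] = (pos[v][0] + fx / n * step, pos[v][1] + fy / n * step)
        temp *= 0.97                             # cool: smaller moves over time
    return pos
```

Early iterations make large, annealing-like jumps out of poor configurations; later iterations refine the layout as the temperature bounds each move.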
\begin{citations}
\noindent
M. Coleman,
{\em Aesthetics-based Graph Layout for Human Consumption},
M.S. Thesis,
Technical Report CSD-930004,
UCLA Computer Science Department, March 1993.
\noindent
M. Coleman,
D.S. Parker,
``Aesthetics-based Graph Layout for Human Consumption'',
UCLA Computer Science Department, June 1993.
Submitted to {\em Software---Practice \& Experience}.
Abstract will appear in the
{\em Proceedings of Graph Drawing '93,
ALCOM International Workshop on Graph Drawing and Topological Graph Algorithms},
Paris, September 26--29, 1993.
\end{citations}
\section{Intensionality}
Out of the work on incremental processing, we arrived at an unanticipated
result: that with Unix, we could overload the concept of `file'
with the concept of `program that yields a file incrementally'.
This idea gave us a way to generalize the
hundreds of ad hoc file systems developed in the past few years,
which run the gamut from AFS (Andrew File System) to ZFS (Zebra File System).
The Unix community has long needed user-extensible file systems, but
most previously proposed mechanisms suffer two major drawbacks: they
require kernel modifications or root privileges, and are too
complicated for ordinary users.
We propose intensional file systems as a simple, easy-to-explain
abstraction for implementing one file system on top of another, an
abstraction that provides motivating design principles for file system
designers.
In an intensional file system, a file system is stored
as a description of how to name or produce files, instead of as the
file names or contents themselves. The term ``intensional'' comes from
intensional logic, which concentrates on the problem of ascribing
meaning to a logical expression or name. The fundamental design problem
of an intensional file system is to specify the rules and mechanism
for ``extensionalizing'' the files of an intensional file system,
i.e. for computing the traditional files that the IFS represents.
We have developed two user-extensible filesystems that
are loosely modeled on intensional logic.
IFS0 is very simple to describe and use, but supports only
regular read-only files. IFS1 combines a user space kernel with an
external extensionalizer specified by the user at run time; our goal
with IFS1 is to let the user program a wide range of intensional
filesystems, including IFS0.
IFS1 can be thought of as a simple user-programmable file system toolkit.
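The idea can be conveyed in a few lines (a hypothetical Python sketch, not IFS0 or IFS1): the file system stores producers rather than contents, and reading a name `extensionalizes' it on demand.

```python
class IntensionalFS:
    """Names map to producer functions; contents are computed on read."""
    def __init__(self):
        self.producers = {}

    def install(self, name, producer):
        self.producers[name] = producer

    def read(self, name):
        # 'Extensionalize': run the description to obtain traditional contents.
        return self.producers[name]()

ifs = IntensionalFS()
ifs.install('motd', lambda: 'hello\n')
ifs.install('motd.upper', lambda: ifs.read('motd').upper())
ifs.read('motd.upper')  # 'HELLO\n'
```

Because producers may read other intensional files, one file system layers naturally on top of another, which is the design principle the abstraction is meant to expose.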
\begin{citations}
\noindent
Paul R. Eggert, D. Stott Parker, ``An Intensional File System'',
{\em Proceedings of the First USENIX File Systems Workshop},
Ann Arbor, MI, May 1992.
\noindent
Paul R. Eggert, D. Stott Parker, ``File Systems in User Space'',
{\em Proceedings of the Winter 1993 USENIX Conference},
pp. 229--239,
San Diego, CA, January 1993.
\end{citations}
\end{document}