User-Defined Aggregates
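A new aggregate, here called newaggr, is defined by rules of the form single(newaggr, Y, NV) and multi(newaggr, Y, OV, NV), where Y is the new input value, OV is the old value in the accumulator, and NV is the new value to be stored in the accumulator. For instance, max can be defined as follows: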
single(max, Y, Y).
multi(max, Y, MO, MN) <- Y > MO, MN=Y.
multi(max, Y, MO, MN) <- Y <= MO, MN=MO.
Likewise, in LDL++, count and sum could have been defined as shown by the two pairs of rules below:
single(count, Y, 1).
multi(count, Y, Old, New) <- New= Old+1.
single(sum, Y, Y).
multi(sum, Y, Old, New) <- New= Old+Y.
The count and sum so defined behave as count_all and sum_all, since these rules accumulate the Old value with the new Y without checking whether the same Y value has already occurred.
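The way a single/multi pair folds a stream of values can be emulated outside LDL++. The following is a minimal Python sketch, not LDL++ code: the helper `aggregate` and the names `agg_max`, `agg_count`, and `agg_sum` are hypothetical, and returning `None` on empty input is an assumption of the sketch.

```python
def aggregate(single, multi, values):
    """Fold values the way a single/multi rule pair does."""
    acc = None
    first = True
    for y in values:
        if first:
            acc = single(y)        # single(aggr, Y, NV)
            first = False
        else:
            acc = multi(y, acc)    # multi(aggr, Y, OV, NV)
    return acc                     # None on empty input (assumption of this sketch)


def agg_max(values):
    # multi keeps Y when Y > MO, otherwise keeps MO
    return aggregate(lambda y: y, lambda y, old: y if y > old else old, values)


def agg_count(values):
    # New = Old + 1, with no duplicate check
    return aggregate(lambda y: 1, lambda y, old: old + 1, values)


def agg_sum(values):
    # New = Old + Y, with no duplicate check
    return aggregate(lambda y: y, lambda y, old: old + y, values)
```

Because neither multi rule checks for repeated values, `agg_count([5, 5, 7])` counts the duplicate 5 twice, which mirrors the count_all behavior described above.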
User-defined aggregates can also be called by means of aggr goals. In this case, when the aggregate is applied to the empty set, the compiler searches for an empty rule defining the behavior of that particular aggregate on an empty set. For instance, our built-in aggregates behave as if they were defined by the following rules:
empty(sum, 0).
empty(count, 0).
empty(max, 0) <- false.
On the empty set, aggr returns 0 for sum and count, and fails for max (it also fails for min and avg).
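The effect of the empty rules can be pictured as a lookup that either yields one answer or none. This is a small Python sketch under that reading; `FAIL`, `EMPTY_RULES`, and `aggr_on_empty` are hypothetical names, and a failing rule is modeled as an empty answer list.

```python
FAIL = object()  # hypothetical marker for a rule whose body is false

# empty(sum, 0). empty(count, 0). empty(max, 0) <- false.
EMPTY_RULES = {"sum": 0, "count": 0, "max": FAIL}


def aggr_on_empty(name):
    """Return the list of answers an aggr goal yields on the empty set."""
    value = EMPTY_RULES[name]
    return [] if value is FAIL else [value]
```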
Several new aggregates can be defined using the single and multi rules. For instance, in SQL, after the maximum is found, a second sub-query is needed to return all the values associated with the maximum. In LDL++, if sppp denotes a supplier-part-price relation, then to find, for each supplier, its most expensive items and the common price of these items, we can write:
findmax(S, mymax<(Itm,Pric)>) <- sppp(S, Itm, Pric).
single(mymax, (Item, Pr), (Item, Pr)).
multi(mymax, (Sit,Sp),(Oit,Op), (Sit, Sp)) <- Sp >= Op.
multi(mymax, (Sit,Sp),(Oit,Op), (Oit, Op)) <- Sp < Op.
This example illustrates that an aggregate can be applied to a tuple of values, here (Itm,Pric), enclosed in parentheses.
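As an illustration of how the mymax rules treat (item, price) pairs, here is a Python sketch that follows the two multi rules exactly as written (`>=` keeps the new pair, `<` keeps the old one). The function name `findmax` is reused for readability only; the sample data in the usage note is invented, and processing tuples in list order is an assumption of the sketch.

```python
def findmax(sppp):
    """Per supplier, keep the (item, price) pair chosen by the mymax rules."""
    best = {}
    for sup, item, price in sppp:
        if sup not in best:
            best[sup] = (item, price)        # single(mymax, (Item, Pr), (Item, Pr))
        else:
            _old_item, old_price = best[sup]
            if price >= old_price:           # first multi rule: Sp >= Op
                best[sup] = (item, price)
            # second multi rule (Sp < Op): the old pair stays in the accumulator
    return best
```

For example, `findmax([("s1", "bolt", 5), ("s1", "nut", 9), ("s1", "gear", 7)])` keeps `("nut", 9)` for supplier `s1`.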
Return Rules
ereturn(newaggr, NewY, OldV, VR) <- ...
While in multi the last argument is stored in the accumulator, in ereturn it is returned as a (partial) result. This value can be computed by a user-defined rule from the new value in the input and the old value in the accumulator, just as in the multi rule. When no return rule is given for an aggregate being defined, the last argument of multi is returned at the end of the computation, for compatibility with previous versions of LDL++.
select(Sup) <- allcounts(Sup, CC), CC>7 .
allcounts(Sup, cntol<Itm>) <- sppp(Sup, Itm, Price).
where cntol can be defined as follows:
single(cntol, _, 1).
multi(cntol, S, Old, New) <- New= Old+1.
ereturn(cntol, S, Old, Value) <- Old ~= nil, Value=Old+1.
The return rule is applied for each new value processed by either the single or the multi rule. When the single rule applies, however, the value of Old is nil. The condition Old ~= nil is therefore needed to avoid the type error that would follow from computing Old+1. The following example illustrates the use of nil to emulate the choice construct. Under the following definition of mychoice:
single(mychoice, Y, Y).
multi(mychoice,Y, nil, nil) <- fail.
ereturn(mychoice, Y, nil, Y).
The following two rules are equivalent:
p(X, Y) <- q(X, Y), choice((X), Y).
p(X, mychoice<Y>) <- q(X, Y).
At the end of the dataset the value of the next input is set to nil.
The computation of average can be performed by computing the sum and the count and then returning their ratio at the end of the dataset.
single(avg, X, (X,1)).
multi(avg, X, (OS,OC), (NS,NC)) <- NS=OS+X, NC=OC+1.
freturn(avg, nil,(OS,OC), Avg) <- Avg= OS/OC.
The value nil in place of the new input value in the return rule denotes that we have reached the end of the dataset and the current value of the input is therefore null.
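The interplay between the accumulator and a final return rule can be emulated in a few lines. This Python sketch mirrors the avg rules: the accumulator is a (sum, count) pair and the ratio is produced once, after the last input. The function name `avg` is illustrative, and returning `None` for an empty input is an assumption of the sketch.

```python
def avg(values):
    """Emulate single/multi/freturn for avg over a finite stream."""
    acc = None                                 # None plays the role of nil
    for x in values:
        if acc is None:
            acc = (x, 1)                       # single(avg, X, (X,1))
        else:
            acc = (acc[0] + x, acc[1] + 1)     # multi: NS=OS+X, NC=OC+1
    if acc is None:
        return None                            # empty input (assumption of this sketch)
    total, count = acc
    return total / count                       # freturn(avg, nil, (OS,OC), Avg)
```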
Using ereturn rules, an assortment of very useful aggregates, e.g., those used for data mining applications, can be defined. Moving-window aggregates, for instance, are commonly used in time-series analysis.
Example. Moving time window aggregation: Average the prices of IBM stocks over the last five days.
p(mw5avg<A>) <- stock-closing('IBM',A).
single(mw5avg, X, [X]).
multi(mw5avg, X, OL, NL) <- if(OL= [X1, X2, X3, X4, X5] then L = [X1, X2, X3, X4] else L= OL), NL = [X | L].
ereturn(mw5avg, _,[X5,X4,X3,X2,X1],Avg) <- Avg= (X1+X2+X3+X4+X5)/5.
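To see which averages the mw5avg rules produce, the following Python sketch emulates them step by step. It is an illustration, not LDL++ code: the generator `mw5avg` takes a plain list of prices, and it follows the rules literally, so each early return averages the five values held in the old accumulator before the new price is added.

```python
def mw5avg(prices):
    """Yield the early returns of mw5avg for a stream of closing prices."""
    window = None                          # accumulator, newest first; None = nil
    for x in prices:
        if window is not None and len(window) == 5:
            yield sum(window) / 5          # ereturn: average of the OLD window
        if window is None:
            window = [x]                   # single(mw5avg, X, [X])
        else:
            kept = window[:4] if len(window) == 5 else window
            window = [x] + kept            # multi: NL = [X | L]
```

With seven prices 1 through 7, the sketch returns the averages of days 1-5 and 2-6; the window ending on day 7 is stored in the accumulator but not returned, since no further input arrives.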
findmax(S, maxtwo<(Itm,Pric)>) <- sppp(S, Itm, Pric).
single(maxtwo, (Item, Pr), (Item, Pr)).
multi(maxtwo, (Sit,Sp),(Oit,Op), (Sit, Sp)) <- Sp >= Op.
multi(maxtwo, (Sit,Sp),(Oit,Op), (Oit, Op)) <- Sp <= Op.
freturn(maxtwo, nil, (Sit, Sp), Sit, Sp).
The first three arguments in the head of a return rule denote the name of the aggregate, the new value in the stream, and the old value in the accumulator. Any additional argument is returned in a separate column. Thus, in our case, maxtwo returns Sit (the max-priced item) and Sp (its price) in two separate columns, and findmax is now treated as a ternary predicate. The occurrence of nil in the last rule denotes that we are defining a return that takes place when all the input values have been visited. When the head of a return rule has only three arguments, the aggregate is a boolean aggregate, which produces no argument in the output. For instance, the following aggregate determines whether the count exceeds the value of 7.
single(count7, _, 1).
multi(count7, _, Old, New) <- Old<7, New=Old+1.
ereturn(count7, _, Old) <- Old=7.
Thus, to find suppliers who supply more than 7 items, we can write the following rule:
select(Sup, count7<Itm>) <- sppp(Sup, Itm, Price).
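The boolean aggregate count7 can be emulated as a function that succeeds as soon as the eighth item arrives. This is a Python sketch under stated assumptions: `count7` and `select` are illustrative names, and grouping the sppp tuples by supplier stands in for the grouping that the rule head performs.

```python
def count7(items):
    """Return True if the stream holds more than 7 items."""
    old = None                                # None plays the role of nil
    for _ in items:
        if old == 7:                          # ereturn(count7, _, Old) <- Old=7.
            return True                       # succeeds when the 8th item arrives
        old = 1 if old is None else old + 1   # single, or multi with Old < 7
    return False


def select(sppp):
    """Suppliers with more than 7 items, grouping sppp tuples by supplier."""
    by_sup = {}
    for sup, itm, _price in sppp:
        by_sup.setdefault(sup, []).append(itm)
    return {sup for sup, items in by_sup.items() if count7(items)}
```

The scan stops at the eighth item, so a supplier with thousands of items costs no more than one with eight.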
p(K1,K2,...,Km, aggr1<A1>, aggr2<A2>, ..., aggrN<AN>) <- Rule Body.
under the following conventions:
Monotone Aggregation
Aggregates that have early return rules and no final return rules are monotonic. These aggregates can be used in recursive programs without restrictions, which leads to a simpler expression of complex algorithms.
Suppose we define a count-like predicate mcount as follows:
single(mcount, Y,1).
multi(mcount, Y, Old, New) <- New=Old+1.
ereturn(mcount, Y, Old, New) <- if(Old=nil then New=1 else New=Old+1).
Since the return rule operates on the new value of the input and the old value of the accumulator, the situation Old=nil defines the value returned for the first input, here 1.
Given a set of n facts for p, the rule
q(mcount<X>) <- p(X).
returns In= { q(1), q(2), ..., q(n) }. If the original set of facts is increased to a new set of cardinality m > n, then our rule returns Im= {q(1), q(2), ..., q(m)}, where Im is a superset of In. Therefore, mcount is monotonic with respect to set containment.
Join the Party: Some people will come to the party no matter what, and their names are stored in a sure(Person) relation. But many other persons will join only after they know that at least K of their friends will be there. Here, friend(A, B) denotes that A views B as a friend.
willcome(P)<- sure(P).
willcome(P)<- c_friends(P, K), K >= 3.
c_friends(P, mcount<F>) <- willcome(F), friend(P, F).
Here, we have set K=3 as the number of friends required for a person to come to the party.
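The fixpoint that these rules compute can be illustrated with a naive Python emulation. This sketch recounts friends on every round instead of using the incremental mcount stream, so it shows the result of the program rather than its evaluation strategy; the function `party` and the names in the usage note are invented for the example.

```python
def party(sure, friend, k=3):
    """Return everyone who comes, given (A, B) pairs meaning A views B as a friend."""
    coming = set(sure)                            # willcome(P) <- sure(P).
    changed = True
    while changed:
        changed = False
        undecided = {a for a, _ in friend} - coming
        for p in undecided:
            # c_friends(P, mcount<F>) <- willcome(F), friend(P, F).
            n = sum(1 for a, b in friend if a == p and b in coming)
            if n >= k:                            # willcome(P) <- c_friends(P, K), K >= 3.
                coming.add(p)
                changed = True
    return coming
```

For instance, if `dee` has three sure friends and `eve` has two sure friends plus `dee`, then `dee` joins in the first round and `eve` in the second.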
By specializing the count aggregate, we can further improve the efficiency of the computation. Let us define an aggregate kcount as follows:
single(kcount,(K,Y),1).
multi(kcount,(K,Y),Old,New) <- Old<K, New=Old+1.
ereturn(kcount,(K,Y),K1,yes) <- K1+1=K.
Thus, the early return rule, ereturn, succeeds (producing a yes) only when the count reaches the value of K. Since we assume that K>1, we do not need to return the values produced by single. Also, the computation of multi fails after we return the value. Thus, the computation of party goers becomes:
wllcm(F,yes) <- sure(F).
wllcm(X,kcount<(3,F)>) <- wllcm(F,_), friend(X,F).
Unlike in the previous formulation, where a new c_friends tuple is produced every time a new friend is found, a new wllcm tuple is produced here only when the threshold of 3 is crossed. Rather than returning yes, we could have programmed our aggregate to return no argument, i.e., to act as a boolean predicate. Then our program simplifies as follows:
single(zcount, (K,X), 1).
multi(zcount, (K,X), Old, New) <- Old < K, New=Old+1.
ereturn(zcount, (K,X), K1) <- K1~=nil, K=K1+1.
wllcom(F) <- sure(F).
wllcom(X, zcount<(3,F)>) <- wllcom(F), friend(X, F).
Next, we define msum and mmin, which provide monotone extensions for sum and min.
For msum we have:
single(msum, Y, Y).
multi(msum, Y, Old, New) <- New = Old + Y.
ereturn(msum, Y, Old, New) <- if(Old = nil then New=Y else New=Old+Y).
For mmin, we will return the last value if this is a new min.
single(mmin, Y,Y).
multi(mmin, Y, Old,New) <- if(Y < Old then New=Y else New=Old).
ereturn(mmin, Y, Old, Y) <- if(Old ~= nil then Y < Old).
Least-Distance Connections: Given a graph g(X,Y, C) where C is the cost of an edge from node X to node Y, the least-cost distance between any two nodes can be computed as follows:
ld(X, Y, mmin<C>) <- g(X,Y, C).
ld(X, Y, mmin<C>) <- ld(X,Z, C1), ld(Z, Y, C2), C= C1+C2.
least_dist(X, Y, min<C>) <- ld(X, Y, C).
This transitive-closure-like computation adds a new arc ld(X, Y, C) provided that this then becomes the new least-cost arc between the nodes X and Y. The arcs so produced are then used in the next step of the seminaive computation. At the end of this fixpoint computation, the least_dist rule is used to select the least-distance arc between these two nodes, out of the succession of arcs of decreasing C values produced in the computation. For a given graph, the values obtained during the computation of ld can vary depending on the order in which the arcs are considered. The final values in least_dist, however, are always the same (a nondeterministic computation producing a deterministic answer).
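The least-distance fixpoint can be emulated with a naive loop that keeps, for each node pair, only the smallest cost seen so far. This Python sketch is an illustration rather than the seminaive evaluation described above: the function name `least_dist` is reused for readability, edge costs are assumed to be positive so that the loop terminates on cyclic graphs, and the graph in the usage note is invented.

```python
def least_dist(edges):
    """Return {(x, y): least cost} for a list of (x, y, cost) edges."""
    inf = float("inf")
    dist = {}
    for x, y, c in edges:                      # ld(X, Y, mmin<C>) <- g(X, Y, C).
        if c < dist.get((x, y), inf):
            dist[(x, y)] = c
    changed = True
    while changed:                             # repeat until no arc improves
        changed = False
        for (x, z), c1 in list(dist.items()):
            for (z2, y), c2 in list(dist.items()):
                # ld(X, Y, mmin<C>) <- ld(X, Z, C1), ld(Z, Y, C2), C = C1+C2.
                if z2 == z and c1 + c2 < dist.get((x, y), inf):
                    dist[(x, y)] = c1 + c2     # a new least-cost arc
                    changed = True
    return dist
```

On the cyclic graph a→b (1), b→c (2), a→c (5), c→a (1), the direct arc a→c of cost 5 is replaced by the path through b of cost 3.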
Company Control: Another interesting example is transitive ownership and control of corporations. Say that owns(C1, C2, Per) denotes that corporation C1 owns a percentage Per of the shares of corporation C2. Then, C1 controls C2 if it owns more than, say, 49% of its shares. In general, to decide whether C1 controls C3 we must also add the shares owned by corporations such as C2 that are controlled by C1. This yields the transitive control predicate defined as follows:
control(C, C) <- owns(C, _, _).
control(C1, C2) <- twons(C1, C2, Per), Per>49.
twons(C1, C3, msum<Per>) <- control(C1, C2), owns(C2, C3, Per).
Thus, every company controls itself, and a company C1 that has transitive ownership of more than 49% of C2's shares controls C2. In the last rule, twons computes transitive ownership with the help of msum, which adds up the shares of controlling companies. Observe that any pair (C1,C2) is added at most once to control; thus the contribution of C2 to C1's transitive ownership of C3 is counted only once. To further simplify the program and expedite the computation, we can introduce a boolean aggregate as follows:
single(sum49, Y, Y).
multi(sum49, Y, Old, Z) <- Old<49, Z= Old+Y.
ereturn(sum49, Y, Old) <- if(Old=nil then Y>49 else Old+Y>49).
Then the recursive rules become:
cntrl(C1, C2) <- owns(C1, C2, Per), Per >49.
cntrl(C1, C3,sum49<Per>) <- cntrl(C1,C2), owns(C2,C3,Per).
Thus, sum49 succeeds only when the 49% threshold is crossed during the summation. Here, the value of 49 was cast into the very definition of our aggregate. Alternatively, this value could be given as a parameter, as in the case of kcount.
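The transitive control relation can be emulated with a naive fixpoint that re-adds shares on every round. This Python sketch follows the control/twons formulation, in which every owning company controls itself, and it replaces the incremental msum with a full recomputation; the function `control`, the `threshold` parameter, and the sample ownership figures in the usage note are assumptions made for the illustration.

```python
def control(owns, threshold=49):
    """Return the set of (C1, C2) pairs such that C1 controls C2.

    owns maps (owner, company) to the percentage of shares held.
    """
    ctrl = {(c1, c1) for (c1, _c2) in owns}        # control(C, C) <- owns(C, _, _).
    changed = True
    while changed:
        changed = False
        total = {}
        for (c1, c2) in ctrl:                      # shares held via controlled C2
            for (owner, c3), per in owns.items():
                if owner == c2:
                    total[(c1, c3)] = total.get((c1, c3), 0) + per
        for pair, per in total.items():
            if per > threshold and pair not in ctrl:
                ctrl.add(pair)                     # the threshold has been crossed
                changed = True
    return ctrl
```

For example, if `a` owns 60% of `b` and 30% of `c`, and `b` owns 25% of `c`, then `a` controls `c` through the combined 55%.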
Bill-of-Materials (BoM) Applications: BoM applications represent an important application area that requires aggregates in recursive rules. Say, for instance, that psb(P1, P2, QT) denotes that P1 contains part P2 in quantity QT. We also have elementary parts that are purchasable for a price and will be delivered in a certain number of days: these are described by the relation basic(P, Price, Days). Then, the following program computes the cost of a part as the sum of the costs of the basic parts it contains.
part_cost(Part, 0, Cst) <- basic(Part, Cst).
part_cost(Part, mcount<Sb>, msum<MCst>) <- part_cost(Sb,ChC,Cst), prolfc(Sb,ChC), psb(Part,Sb,Mult), MCst=Cst*Mult.
Thus, the key condition in the body of the second rule is that a subpart Sb is counted in part_cost only when all of Sb's children have been counted. This occurs when the number of Sb's children counted so far by mcount is equal to its total number of children in the psb graph. This last number is kept in the prolificity table, prolfc, which can be computed as follows:
prolfc(P1, 0) <- basic(P1, _).
prolfc(P1, count<P2>)<- psb(P1, P2, _).
Also, this BoM computation can be simplified and made more efficient using the zcount aggregate, yielding:
pcost(Part, Cost) <- basic(Part, Cost).
pcost(Part, zcount<(K,Sb)>, msum<MCst>) <- pcost(Sb, Cst), psb(Part, Sb, Mult), prolfc(Part, K), MCst=Cst*Mult.
Observe that the prolfc relation is now used to qualify Part in the rule head, rather than its subparts in the body. The technique of counting the children could also be used with the least_dist problem above, if the underlying graph is acyclic. For cyclic graphs we must use the current formulation, which exploits the property that extrema are unaffected by duplicates (idempotence).
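The child-counting technique can be illustrated with a Python emulation in which a part is costed only once all of its subparts have been costed. This is a sketch under stated assumptions, not the LDL++ evaluation: the function `part_costs` is hypothetical, basic parts are given as a part-to-price mapping, the psb graph is assumed to be acyclic, and the bicycle data in the usage note is invented.

```python
def part_costs(basic, psb):
    """Return {part: cost} given basic prices and (part, subpart, qty) triples."""
    children = {}
    for part, sub, qty in psb:
        children.setdefault(part, []).append((sub, qty))
    cost = dict(basic)                         # pcost(Part, Cost) <- basic(Part, Cost).
    progress = True
    while progress:
        progress = False
        for part, subs in children.items():
            # A part is costed only when every child has a cost, which is
            # the condition that counting children up to prolfc enforces.
            if part not in cost and all(sub in cost for sub, _ in subs):
                cost[part] = sum(cost[sub] * qty for sub, qty in subs)
                progress = True
    return cost
```

With a wheel made of one rim (8) and 20 spokes (1 each), and a bike made of two wheels and one frame (50), the sketch costs the wheel at 28 and the bike at 106.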