Ruth Liow(*) and Jacques J. Vidal(**)

(*) National Computer Board, 71 Science Park Drive, Singapore 0511

(**) UCLA Computer Science Dept., University of California, Los Angeles, CA 90024



This paper describes a methodology for Expert Systems that combines the learning and inferencing capabilities of artificial neural networks with a user-friendly interactive interface. The network architecture is a dual, three-layered feedforward network of continuous value units, fully-interconnected between layers. The network is trained from raw data, avoiding the need for specific knowledge acquisition from domain experts.

The system generates probability estimates for each hypothesis (output, diagnosis) as well as probabilistic pointers toward missed symptoms, and it supports continuous user interaction.



Classical Expert Systems [1] are based on if-then rules, sometimes associated with a probability estimate. During execution, a procedural "inference engine" searches the rule base, attempting to match rules with current conditions in order to select those rules eligible for "firing". All the rules must be obtained from domain experts and entered explicitly during the design phase. This extraction of knowledge from experts is often the hardest task involved in building such systems.

Connectionist networks can alleviate some of the difficulties associated with expert system generation:

1) Neural networks can build rules automatically from examples rather than depend on experts to provide explicit a priori rules.

2) The distributed representation inherent to multilayered networks provides a basis for generalization between observed categories even when no specific examples have been shown.

3) Neural networks offer inherently parallel execution and, with proper implementation, could be much more effective for inferencing than the serial rule search of conventional AI.

We describe a novel hybrid methodology in which neural network technology is combined with interactive procedures in order to provide a user-friendly environment. The network architecture is a dual, three-layered feedforward network, fully interconnected between layers, which uses continuous activation and output values. Input conditions are translated into continuous activation levels in the network’s input units. The network is trained from raw data, avoiding the need for specific knowledge acquisition from a domain expert. Activations in output units represent levels of confidence in the conclusion.


Earlier Work

Expert systems can be viewed from three different perspectives: data acquisition, inference and analysis. The possibility of using neural networks for some or all of these three aspects has been recognized for some time, and several authors have attempted to combine neural networks with conventional procedures. Hillman [2], for instance, adopts a procedural methodology for acquisition and analysis and confines the role of the network to that of inference engine. In Caudill's "Fuzzy Cognitive Maps" [3], the primitives of the domain and the strengths of the relations are derived from expert queries and mapped into network connections. The actual connection strengths, the weight values, are then determined from fuzzy set theory.

In the system proposed by Gallant [4], the weights are learned from examples but, as in [3], the network topology must be handcrafted heuristically. Specifically, individual links are created between two nodes wherever field knowledge would predict a significant causal relation between the corresponding concepts. The nodes are threshold logic units and their outputs are therefore inherently restricted to binary values representing true or false. Input ports can also take a third, neutral value to signal an unknown input value. A learning procedure known as the "pocket algorithm", developed earlier by the author, is used to learn the weights from data.

Several shortcomings remain.

1) Units are constrained to have binary inputs and outputs. Concepts that allow a range of discrete or continuous values cannot be represented directly.

2) Confidence measures must be computed separately.

3) The dependency framework must be known in advance to determine the topology of the network. Hence, one of the most appealing aspects of multilayered networks, the ability to automatically develop effective dependency representations, is not exploited. For instance, hidden units would be needed wherever some input-output relations are not linearly separable, but their proper placement in the network would require a complete understanding of each sub-function; otherwise, hidden units would have to be added empirically whenever a training failure occurs.

4) To feed uncertainty information back to the user, confidence measures are computed through a tedious backward chaining procedure, performed one output unit at a time.

Still, this scheme represents major methodological progress in that it uses the learning capability of the network to replace expert interviewing in the process of rule acquisition.



Our model addresses the aforementioned shortcomings. To set up the network, one needs to identify each of the conditions and hypotheses that appear in the data and to provide one input unit for each condition and one output unit for each hypothesis. All activations, inputs and outputs are represented by continuous real values in the [0,1] range, to be interpreted as probability measures. The intent is to embody in the final state of the network weights the probability distribution that is implicitly expressed in the training set [5].

Figure 1

The network is a three-layer structure that is traversed alternately in the forward and backward directions in order to perform forward and backward chaining in turn (Figure 1). The hidden units are divided into two separate groups, isolated from each other, but both fully connected to the input and output layers. As a consequence, the input units must be actual processing units, rather than simple input ports as would be the case with a strictly feedforward network.

This dual network is trained iteratively from case data. Since all activation, input and output values are continuous, weights can be learned by classical error backpropagation. Training occurs in both directions, alternately affecting the connections to and from each group of hidden units. This procedure not only sidesteps the need for explicit knowledge acquisition from domain experts, but also allows the network itself to provide user feedback.
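The dual structure can be pictured with a minimal sketch. Everything named here (`DualNetwork`, `make_layer`, `run_layer`, the sigmoid activation, the bias weights, the initial weight range and the layer sizes) is an assumption made for illustration; the paper specifies only continuous-valued units, two isolated hidden groups, and full interconnection between layers.

```python
import math
import random

def sigmoid(z):
    """Continuous activation in (0, 1); the choice of sigmoid is an assumption."""
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """Small random weights, one row per unit, with one extra bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def run_layer(weights, activations):
    """Fully connected layer of continuous-valued units."""
    extended = activations + [1.0]  # constant bias input
    return [sigmoid(sum(w * a for w, a in zip(row, extended))) for row in weights]

class DualNetwork:
    """Two isolated hidden groups sharing the same input and output units."""

    def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
        rng = random.Random(seed)
        # Forward path: inputs -> forward hidden group -> outputs.
        self.fwd_hidden = make_layer(n_inputs, n_hidden, rng)
        self.fwd_output = make_layer(n_hidden, n_outputs, rng)
        # Backward path: outputs -> backward hidden group -> inputs.
        self.bwd_hidden = make_layer(n_outputs, n_hidden, rng)
        self.bwd_input = make_layer(n_hidden, n_inputs, rng)

    def forward(self, inputs):
        """Forward chaining: conditions to hypotheses."""
        return run_layer(self.fwd_output, run_layer(self.fwd_hidden, inputs))

    def backward(self, outputs):
        """Backward chaining: hypotheses to expected conditions."""
        return run_layer(self.bwd_input, run_layer(self.bwd_hidden, outputs))

# Illustrative sizes only: 10 conditions, 20 hidden units per group, 1 hypothesis.
net = DualNetwork(n_inputs=10, n_hidden=20, n_outputs=1)
diagnosis = net.forward([0.5] * 10)
symptoms = net.backward(diagnosis)
```

Because the two hidden groups share no weights, each path can be adjusted during training without disturbing the other.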

In exploiting the fully trained network, the backward reasoning propagation will pinpoint the need for additional information or testing. The user will then have the opportunity to interact in real-time with the system, provide additional information and conduct "what if" inquiries.



To illustrate the method, we consider a medical application: the diagnosis of heart disease from a combination of angiographic and general patient data. Each input unit represents a symptom level and the single output node represents the diagnosis, i.e., the most probable level of heart disease, defined here as artery narrowing.

Let us first assume that the network has been set up and trained on a sufficiently large set of case histories to competently represent the relevant probability distributions. During the subsequent exploitation, sets of symptoms (i.e., input patterns) are entered, and the diagnosis is initially unknown. To enter a given symptom pattern, the input units for those symptoms that have definite values in the pattern have their activation levels clamped to these values, normalized to the [0,1] range. Other input units, which correspond to conditions or symptoms not specified in the case under scrutiny, are left floating. The activations are then propagated through the forward hidden units, establishing a pattern of output activity. The resulting pattern is then propagated backward to all the inputs through the backward hidden layer. At that point, formerly "unknown" units will have acquired specific values. The process is repeated with this modified input, and this alternation of forward and backward propagation is pursued until the process stabilizes.

The activations at the end of one bidirectional pass through the network can then be interpreted as probabilities of the diagnosis based on the symptoms, as well as the likelihood of symptoms that have not yet been detected or that may develop. These predicted conditions can then be confirmed or denied by the user, who has the opportunity to update the appropriate input units before performing the next pass. The user controls the process at each step and may set previously unknown values, perhaps after prescribing additional laboratory tests. This cycle can be repeated until the hypothesis under scrutiny reaches a stable activation level.
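The clamping and alternation procedure might be sketched as follows. The function `settle`, its parameters, the neutral starting value of 0.5 for floating units, and the stopping tolerance are all assumptions; `toy_forward` and `toy_backward` are hypothetical stand-ins for the two trained paths, chosen only so that the loop can be run.

```python
def settle(forward, backward, known, n_inputs, start=0.5, tol=1e-4, max_passes=100):
    """Alternate forward and backward passes until the output stabilizes.

    known maps an input index to its clamped value in [0, 1]; every other
    input unit floats, starting from the assumed neutral value `start`.
    """
    inputs = [known.get(i, start) for i in range(n_inputs)]
    outputs = forward(inputs)
    for _ in range(max_passes):
        estimate = backward(outputs)
        # Clamped units keep their values; floating units take the estimate.
        inputs = [known.get(i, estimate[i]) for i in range(n_inputs)]
        new_outputs = forward(inputs)
        change = max(abs(a - b) for a, b in zip(new_outputs, outputs))
        outputs = new_outputs
        if change < tol:
            break
    return inputs, outputs

# Hypothetical stand-ins for a trained network, for illustration only.
def toy_forward(inputs):
    return [sum(inputs) / len(inputs)]

def toy_backward(outputs):
    return [outputs[0]] * 3

# Symptoms 0 and 2 are known; symptom 1 is left floating.
inputs, outputs = settle(toy_forward, toy_backward, known={0: 0.9, 2: 0.3}, n_inputs=3)
```

In an interactive session, the user would inspect the floating entries of `inputs` after a pass, move any confirmed or denied values into `known`, and call `settle` again.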

The training of the network, from a dataset that contains both symptoms and confirmed diagnoses, reflects the dual organization. In the forward pass, the conditions or symptoms specified in each training pattern are again clamped. The output activation is then compared with the desired result to provide an error measure. Iterative weight modification in this phase is performed only on the forward group of hidden units.

After convergence, the backward pass is performed. The diagnosis is clamped and the resulting input activations are compared with the inputs as they were presented in the forward pass. Weight modifications in this phase are limited to the backward path. In practice, only the connections to input units that had specified activation levels can be adjusted and these units vary from pattern to pattern. The backward mapping is also one-to-many. Each output activation value can legally be associated with a number of different combinations of activations over the input layer. As such, the backward network can only converge to the average activation values corresponding to each diagnosis.
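The two training phases could be sketched along the following lines. This is an illustrative reading and not the authors' code: the helper names, the per-pattern update without momentum, the mask representation, the value 0.5 used for unspecified symptoms during the forward phase, and the toy cases are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def run_layer(weights, acts):
    ext = acts + [1.0]  # constant bias input
    return [sigmoid(sum(w * a for w, a in zip(row, ext))) for row in weights]

def train_path(hidden_w, out_w, cases, rate=0.25, epochs=2000):
    """Backpropagation on one path; cases are (source, target, mask) triples.

    mask[k] is 1.0 where target[k] is specified and 0.0 where it is unknown,
    so unspecified units contribute no error and drive no weight change.
    """
    for _ in range(epochs):
        for source, target, mask in cases:
            hidden = run_layer(hidden_w, source)
            out = run_layer(out_w, hidden)
            # Output deltas for squared error with sigmoid units.
            d_out = [m * (t - o) * o * (1.0 - o) for o, t, m in zip(out, target, mask)]
            # Hidden deltas, computed before the output weights are changed.
            d_hid = [h * (1.0 - h) * sum(d * out_w[k][j] for k, d in enumerate(d_out))
                     for j, h in enumerate(hidden)]
            for k, d in enumerate(d_out):
                for j, a in enumerate(hidden + [1.0]):
                    out_w[k][j] += rate * d * a
            for j, d in enumerate(d_hid):
                for i, a in enumerate(source + [1.0]):
                    hidden_w[j][i] += rate * d * a

# Hypothetical toy cases: (symptoms, diagnosis); None marks an unspecified symptom.
cases = [([0.9, 0.8, None], [1.0]), ([0.1, None, 0.2], [0.0]),
         ([0.8, 0.9, 0.7], [1.0]), ([0.2, 0.1, 0.1], [0.0])]

rng = random.Random(0)
n_in, n_hid, n_out = 3, 4, 1
fwd = (make_layer(n_in, n_hid, rng), make_layer(n_hid, n_out, rng))
bwd = (make_layer(n_out, n_hid, rng), make_layer(n_hid, n_in, rng))

# Assumption: unspecified symptoms are presented at a neutral 0.5.
filled = [[0.5 if s is None else s for s in sym] for sym, _ in cases]
# Phase 1: forward group only, symptoms -> diagnosis.
train_path(fwd[0], fwd[1], [(x, d, [1.0] * n_out) for x, (_, d) in zip(filled, cases)])
# Phase 2: backward group only, diagnosis -> symptoms, unknown symptoms masked.
train_path(bwd[0], bwd[1], [(d, x, [0.0 if s is None else 1.0 for s in sym])
                            for x, (sym, d) in zip(filled, cases)])

def predict(symptoms):
    return run_layer(fwd[1], run_layer(fwd[0], symptoms))[0]

def recall(diagnosis):
    return run_layer(bwd[1], run_layer(bwd[0], diagnosis))
```

In this toy set, the diagnosis 1.0 is paired with two different values of the first symptom (0.9 and 0.8), so the backward path cannot reproduce both; under squared error it would be expected to settle near their average, which is the one-to-many behavior described above.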

We tested the concept with angiographic data structured as follows:


Chest pain type: {typical, atypical, non-anginal, asymptomatic}

Resting blood pressure on admission: {pressure in mm Hg}

Serum cholesterol: {concentration in mg/dl}

Fasting blood sugar > 120 mg/dl: {true, false}

Resting electrocardiogram: {normal, ST-T wave abnormality*, left ventricular hypertrophy**}

*T wave inversions and/or ST elevation or depression of > 0.05 mV

**probable or definite left ventricular hypertrophy by Estes' criteria

Maximum heart rate achieved: {heart rate in bpm}

Exercise-induced angina: {yes, no}

Slope of the peak exercise ST segment*: {upsloping, flat, downsloping}

*ST depression induced by exercise relative to rest

Number of major vessels colored by fluoroscopy: {0, 1, 2, 3}

Thal: {normal, reversible defect, fixed defect}

Diagnosis (angiographic disease status), % diameter narrowing: {< 25%, 25% to 50%, about 50%, 50% to 75%, > 75%}


Quantitative entries have been mapped to the [0,1] range using the raw data bounds. Non-quantified entries are set to heuristically chosen values, graded according to their perceived bearing on the diagnosis.
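A minimal sketch of this encoding is given below. The function `scale_to_unit`, the sample cholesterol readings, and in particular the grades in `CHEST_PAIN_GRADE` are hypothetical; the values actually used in the experiments are not listed in the paper.

```python
def scale_to_unit(values):
    """Min-max scale raw measurements to [0, 1] using the bounds found in the data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a column with a single value maps to mid-range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical grades and ordering, for illustration only.
CHEST_PAIN_GRADE = {
    "asymptomatic": 0.0,
    "non-anginal": 0.33,
    "atypical": 0.67,
    "typical": 1.0,
}

def encode_chest_pain(label):
    """Map a non-quantified entry to its heuristic grade."""
    return CHEST_PAIN_GRADE[label]

# Made-up serum cholesterol readings in mg/dl.
cholesterol = [126, 240, 564]
scaled = scale_to_unit(cholesterol)
```

The lowest reading in the data maps to 0 and the highest to 1, so a value outside the training bounds would need clipping or rescaling before being clamped on an input unit.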

The diagnosis statistics from a batch of 200 cases* are shown in the two tables below. Each cycle involved the presentation of all of the training patterns, and convergence required large numbers of iterations. However, the results in Table I show that a tight representation of the database had been achieved.


Table I. Percentage of exact matches and number of iterations, by size of hidden layers.


Table II. Mean deviation at input and number of iterations, by size of hidden layers.


A network with 20 hidden units in each hidden sub-network was used. The learning rate and momentum parameters used in the backpropagation algorithm were somewhat critical. A higher learning rate facilitates faster convergence but can cause oscillations. For these particular experiments, the best compromise was found to be a learning rate of about 0.25.

The momentum parameter sped up convergence and inhibited oscillation, but a relatively low momentum rate (0.2 to 0.3) worked best, probably because of the relatively unstructured nature of the data.
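For reference, the roles of the two parameters can be seen in the usual gradient-descent update with momentum, sketched here for a single weight. The function `momentum_step` and the quadratic error used to exercise it are assumptions; only the parameter values (a learning rate of about 0.25 and a momentum within the 0.2 to 0.3 range) come from the text.

```python
def momentum_step(weight, gradient, previous_change, rate=0.25, momentum=0.25):
    """One weight update: a gradient step plus a fraction of the previous change."""
    change = -rate * gradient + momentum * previous_change
    return weight + change, change

# Toy error E(w) = (w - 1)^2 / 2, whose gradient is (w - 1); the minimum is at w = 1.
weight, change = 0.0, 0.0
for _ in range(50):
    weight, change = momentum_step(weight, weight - 1.0, change)
```

The momentum term reuses part of the previous change, which smooths successive updates; a large value lets stale directions persist, which may explain why low values worked best on loosely structured data.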

To facilitate generalization and the ability to correctly handle novel inputs, one must avoid the "overfitting" of the network to the training set. Overfitting can result from an excess of hidden units. Connections will develop weights that handle the training patterns very well, but are too specific to handle new cases that have slightly different profiles. When fewer hidden units are used, the network is forced to discriminate between the patterns using fewer of the more salient characteristics. For instance, experiments with different numbers of hidden units showed that networks with 20 or fewer hidden units per layer resulted in a larger pool of novel inputs being accepted in the initial descriptor than with larger networks. Overfitting problems arise when the relations between attributes are complex and the data sparse, as in this example. These problems could be alleviated by breaking down the hypotheses (allocating multiple output units) and using winner-take-all or mutually exclusive decision nodes.
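One way to read the multiple-output suggestion is to give each diagnosis level its own output unit and keep only the most active one. The sketch below assumes that reading; `winner_take_all`, the `LEVELS` list and the sample activations are illustrative and were not part of the reported experiments.

```python
# One output unit per diagnosis level (assumed layout).
LEVELS = ["< 25%", "25% to 50%", "about 50%", "50% to 75%", "> 75%"]

def winner_take_all(activations):
    """Keep only the most active output unit; ties go to the first such unit."""
    winner = max(range(len(activations)), key=activations.__getitem__)
    return [1.0 if i == winner else 0.0 for i in range(len(activations))]

# Made-up output activations for a single case.
activations = [0.10, 0.70, 0.30, 0.20, 0.05]
decision = winner_take_all(activations)
chosen_level = LEVELS[decision.index(1.0)]
```

The raw activations remain available as confidence levels, while the winner-take-all step yields a single mutually exclusive decision.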



This paper describes a methodology for Expert Systems that incorporates the best features of artificial neural networks as well as a user-friendly interactive interface. The architecture is a dual, three-layered feedforward network, fully interconnected between layers, which uses continuous value units. The network can be trained from raw data, bypassing the need for specific knowledge acquisition from domain experts.



[1] D. A. Waterman, A Guide to Expert Systems, 1986

[2] D.V. Hillman, "Integrating Neural Nets and Expert Systems", AI Expert, June 1990

[3] M. Caudill, "Using Neural Nets: Fuzzy Cognitive Maps", AI Expert, June 1990

[4] S. I. Gallant, "Connectionist Expert Systems", Comm. ACM, 31, February 1988

[5] R. Golden, "Probabilistic Characterization of Neural Model Computations", in Neural Information Processing Systems, D. Z. Anderson (Ed.), 1988

[6] L. A. Becker and J. Peng, "Network Processing of Hierarchical Knowledge for Classification and Diagnosis", IEEE International Conference on Neural Networks, 2, 1987