Experimental Validation of Learning Accomplishment

Allen Klinger

Professor, University of California, Los Angeles

Summary

This paper reports on educational assessment: measures to validate that a subject has been learned. The material presented derives from UCLA Computer Science courses, but the approach is independent of subject matter: the bibliography leads to an extensive literature on its use in diverse subjects and settings, including distance learning and elementary schools. The data summarized indicate that the method clearly distinguishes individuals with subject mastery from the others tested. The figures here show two things: first, a new way to help teachers and students understand and apply this form of testing and grading; second, a useful visual representation of test differences. The method rests on probability and information, though neither concept need be understood to test this way; the references enable further exploration. An initial figure displays scoring based on logarithmic weighting, the basis of the approach. Thirteen possible responses to questions that can be completed with any of three candidate statements permit the expression of partial knowledge. While the additional answers can readily be understood in subjective-probability terms, the related statistical concepts (risk, loss) are not needed by testing participants. The thirteen responses convert three-alternative statement-completion questions into a means of evaluating whether complex material has been absorbed. That this applies to most subjects, at most grade levels, is clear from the bibliography.

Introduction

Assessment methodology is usually thought of as divided between thorough methods that require time-consuming evaluation of student essays and rapidly scored multiple-choice tests. However, logarithmic weighting [1-3], applied in the statistical context of risk and loss [4, pp. 14-15], converts multiple-choice responses into a subtle tool for evaluating how well diverse material has been absorbed. Background in probability, statistics, information theory, and decision theory is useful but not necessary either to design or to take tests that use the procedures explained below. A summary of the key weights appears here as Figure 1. Although in use for many years [5], the method is especially relevant to today's instructional-delivery innovations: video, remote learning, the internet and web, etc. It was used in a wide variety of subjects at Rand under the Plato system more than twenty years ago, and commercial systems continuing that work in distance learning exist today. The classes tested here showed accurate measurement of accomplishment for both undergraduate and graduate students at the university.

This paper extends in two ways the analytical presentation of logarithmic examination scoring in [3], where a cogent operations-research analysis argues in favor of logarithmic weighting. First, although space does not permit full inclusion here, the score values in Figure 1 show that the possibility of loss makes it advantageous to admit an inability to distinguish between the three completions (the no-knowledge state, the m response), or to describe accurately which statement-completions might be correct by selecting from the other options (i.e., from d, e, f, g, h, i, j, k, l). Second, there is an extensive literature on subjective probability applied to educational testing, surveyed in the bibliography below; for material on the core issues of information and rational decision-making, see [6-9].

Experience

My previous teaching of UCLA courses confirmed that the multiple-response method discriminates between those with firm, some, and meager understanding of a subject. It seemed worthwhile to show the idea to others experimentally, and to create a single-page summary, Figure 1 below. (My past UCLA teaching using the method included undergraduate data structures; introductory Pascal programming; a general-students' computer seminar; a mathematics-for-non-majors lecture, where I was not the instructor; and continuing/adult education courses.) The experimental results that follow come from two courses taught in Fall '96 in the engineering school's computer science curriculum, one graduate, the other for seniors. The graduate students had some exposure to loss and risk; the undergraduates had minimal probability, most none at all. Both groups were equally able to work with uncertainty-measuring multiple responses after less than fifteen minutes of class discussion.

Initiating Multiple-Response Assessments

The fifteen-minute introduction emphasizes that a multiple-response test enables partial credit. To gain the greatest benefit, responses should reflect knowledge and avoid guessing. Responding requires either choosing one of several given statement-completions or selecting from a range of intermediate possibilities. Rather than probability implications, the discussion describes these selections in non-mathematical terms, such as eliminating an obviously incorrect statement-completion.

The inadvisability of guessing is emphasized by indicating that a given-completion (a, b, or c) response is desirable only when one is more than 95% sure of its truth. This is reinforced by presenting a single question to the class, one students know nothing about, and scoring the overall class responses. The one-question experience is followed by a trial quiz, such as five different statements, each with three completions. (Such a quiz is used in these experiments.) Students quickly learn that the most serious negative impact on overall score comes from an incorrect choice: selection of a wrong offered alternative rather than either a letter indicating a partial-knowledge state or the default no-knowledge m response, as the sketch below illustrates.
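As a minimal illustration, the following Python sketch contrasts a hypothetical student who guesses wrongly on one item of a five-question quiz with one who answers m on that item; the per-question fractions (1.000 for a sure, correct choice, 0.769 for m, 0 for a sure but wrong choice) are those quoted in the worked example later in this paper, and the scenario itself is illustrative only.

def quiz_percent(weights):
    """Average of the per-question fractions, expressed as a percentage."""
    return 100 * sum(weights) / len(weights)

# Hypothetical student, firm and correct on four of five questions.
sure_on_four = [1.000, 1.000, 1.000, 1.000]

guessed_wrong = sure_on_four + [0.000]   # guessed on item 5 and missed
admitted_m    = sure_on_four + [0.769]   # chose the no-knowledge m response

print(f"guessed wrong on item 5: {quiz_percent(guessed_wrong):.1f}%")   # 80.0%
print(f"answered m on item 5:    {quiz_percent(admitted_m):.1f}%")      # 95.4%
# The confident error forfeits the 0.769 fraction that the m response
# preserves, which is the lesson students draw from the trial quiz.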

Two of the five initial quiz items from the graduate course follow (a similar trial quiz was administered to the undergraduate computer project design course).

1. A pattern is:

a. A textured image.

b. An element from a set of objects that can in some useful way be treated alike.

c. An immediately recognizable waveform or alphanumeric symbol.

2. A feature is:

a. An aspect of a pattern that can be used to aid a recognition decision process.

b. A real number that characterizes an aspect of a pattern.

c. The primary thing one observes in inspecting a pattern.

The less-technical undergraduate course included this comparable question:

3. Flaws in software (glitches, bugs)

a. Are of consequence only to the programmer.

b. Can have economic, strategic (defense), and social implications.

c. Make it essential to concentrate on hardware-based designs.

Questions 1 and 2 each involve a definition stated in class lectures, presented with two other sentence-completions, one near-right, the other clearly wrong; item 3 was similarly constructed from recitation discussions. Many such questions, aggregated, are an effective way to measure learning when coupled to the multiple-response system. Composing questions this way is easy for the instructor: begin by recording the key points taught each week. Quizzes introduce students to the multiple-response method, and expanding them with revised question sets leads rapidly to longer examinations capable of clearly separating students by their actual learning.

Overview of Results

The primary issue one seeks to measure is: did the students actually acquire the technical knowledge we presume was transmitted (by classroom interaction, assigned reading, and homework)? The numerical quantification of learning, the index, is the sum of the Figure 1 second-column fractions, the scores obtained on all the questions. This measure validates accomplishment in two ways.

First, the methodology differentiates between obtaining general understanding and acquiring exact knowledge. There is overwhelming evidence in the experiments reported here that the measure places individuals into two distinct groups. In the better-performing group in one trial, an individual showed a near-perfect response pattern: a reasonable outcome, since the questions are simple to one who knows the material. The inferior-performance group contains those befuddled by one or more fundamentals, and many fall into that group. Misunderstandings identified from low scores produce questions, interaction, and discussion that support active learning. That is the main benefit of using the multiple-response method.

Second, the experiments compared multiple-response measures with traditional means, i.e., open-ended problem-solving examination questions. This was done both at mid-term and at end of quarter. Homework and term reports reinforced the results of that comparison: students' work was rated similarly under the different systems. The graduate course midterm also had an open-ended question (chosen from two alternatives) to be done on a take-home basis over a five-day period. Grades on this item were almost identical to scores from thirty multiple-response questions answered within three hours. Again, the traditional means of evaluating students confirmed the multiple-response indices. The experiments in the two classes demonstrated consistent results between the multiple-response numerical index and open-ended essay/problem work.

A simple view of the index compares the aggregate value to that obtained by reporting no knowledge. Consistently declaring no knowledge on every posed question receives a numerical score of 76.9%. Examining whether a score falls above or below that value determines whether the sought learning occurred. The presentation material, graphs and tables summarizing administration of the assessment procedure, gives detail beyond the division based on 76.9%: see the Figure 3 scatter plot. The scatter-plot sample statistics permit another type of visual display, showing individuals compared to their classmates or peers.
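A small Python sketch of this above-or-below reading follows, using the graduate trial-quiz percentages that appear in Table 1 below; the verdict wording is illustrative, not the paper's X/Y/F/E grouping.

# Reading an index score against the 76.9% baseline obtained by answering
# m (no knowledge) on every question. Scores are the graduate trial-quiz
# percentages from Table 1.

NO_KNOWLEDGE_BASELINE = 76.9

trial_quiz_scores = [56.9, 60.0, 67.7, 72.3, 76.9, 80.0, 80.0,
                     93.8, 93.8, 93.8, 95.4, 96.9]

for score in trial_quiz_scores:
    if score > NO_KNOWLEDGE_BASELINE:
        verdict = "above baseline: evidence of the sought learning"
    elif score < NO_KNOWLEDGE_BASELINE:
        verdict = "below baseline: worse than declaring no knowledge"
    else:
        verdict = "at baseline: equivalent to uniform no-knowledge responses"
    print(f"{score:5.1f}%  {verdict}")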

Numerical Assessment Data

The experiments involved student scores for initiation (trial quiz) and midterm progress in two Fall 1996 classes. In the graduate-class trial quiz there were four main groups: see Table 1, where the data appears. The groups are visually distinct in a histogram plot of these scores, given as Figure 2, and the advanced and marginal students stand out. The trial quiz differentiates four under-achieving students (the lowest two groups, labeled X and Y in the tabulated data) and five superior-knowledge students (labeled E) within a population of twelve. Those within the two lowest categories of achievement received aggregate scores below what could have been attained by uniformly responding with the option (under the thirteen-alternative multiple-response method) that signifies knowing nothing about the topic. The category labeled F gained only slightly above the level corresponding to answering know-nothing on every question. Finally, the five in category E saw the statements as obvious.

Score (percent)   Wrong (number)   No-Knowledge (number)   Group
     56.9                2                  0                X
     60.0                2                  0                X
     67.7                1                  0                Y
     72.3                1                  1                Y
     76.9                1                  0                F
     80.0                1                  0                F
     80.0                1                  0                F
     93.8                0                  0                E
     93.8                0                  1                E
     93.8                0                  0                E
     95.4                0                  1                E
     96.9                0                  0                E

m    80.6                0.8                0.3
s    13.7                0.7                0.4

(Five definition-questions; m = sample mean, s = standard deviation)

Table 1. Graduate Course Quiz Data

Space limits prevent including visuals comparing midterm time-constrained multiple-response scores to take-home essay-problems; still, the thirty questions correlated almost perfectly with essay-problem grades. That corresponds to past experience. The method differentiated accomplished individuals from those minimally participating in undergraduate, graduate, and life-long learning courses the author has taught to diverse student types since Fall 1989. Over more than a decade I frequently employed fifty-question final and thirty-question midterm multiple-response examinations (data structures, pattern analysis, and other classes). Similar results, clear differentiation of student accomplishment, were found in Plato evaluations at Rand [5] on nontechnical topics. The inescapable conclusion from using the method is that numerical multiple-response assessments achieve stable and useful student classification. The only thing needed is simple statements involving fundamentals covered in the course material. The multiple-response, partial-credit, logarithmic-weighting tool can be applied in multimedia and remote-learning contexts. That is being done, both in the web site for the undergraduate course mentioned here, CS 190 Design Project, and in a commercial continuation of the original Plato system.

Course Completion

This section concerns how the multiple-response measures indicate the summary, or overall, level of accomplishment. The framework was a ten-week term of lecture/recitation classes and an eleventh-week final examination meeting. The primary data is a table derived from a spreadsheet of the results. (Examples will be shown and made available at the meeting in hard copy.) Slightly less detail, from the midterm, is shown here in Table 2 and Figure 3.

Final tables start with a row of average values of eight quantities: overall score, followed by the numbers of answers that were correct, incorrect, no choice (m), preferred a correct answer, equally comfortable with correct and wrong answers (e, h, or k), preferred an incorrect answer, and finally the sum of the correct and prefer-correct counts. The left-most column simply lists the students' scores. These tables avoid being mere presentations of all the data: italics mark values outside the one-standard-deviation interval on either side of the sample mean (over all students in the course), highlighting unusual student accomplishment. Table 2 shows that both high and low achievers stand out from such a display of the multiple-response scores.
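The italicizing rule is easy to automate. The following Python sketch flags scores lying more than one standard deviation from the class mean, using the Table 2 midterm percentages as data; the population standard deviation (statistics.pstdev) reproduces the tabulated s value of about 10.

import statistics

# Midterm percentages from the Table 2 score column.
midterm_scores = [66.9, 70.0, 72.6, 79.5, 86.9, 90.8,
                  92.0, 92.3, 92.6, 93.3, 95.4]

mean = statistics.mean(midterm_scores)    # about 84.8, matching the m row
sdev = statistics.pstdev(midterm_scores)  # about 10.0, matching the s row

for score in midterm_scores:
    note = ""
    if score < mean - sdev:
        note = "  <-- unusually low (would be italicized)"
    elif score > mean + sdev:
        note = "  <-- unusually high (would be italicized)"
    print(f"{score:5.1f}{note}")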

Other figures to be presented in the talk show the relationship between end-of-quarter multiple-response student-performance evaluations and activity and achievement in class. The data confirms only a general value to testing: some students who performed poorly over the term were able to complete the twenty basic questions successfully. Nevertheless, the history of multiple-response measures shows the validity of the approach for the overall test population. In the aggregate, the higher grades go to those who tested well both on the final and during the term via quizzes. This leads to the conclusion that the logarithmic, multiple-response, information-based measures successfully measure learning. It seems clear from these experiments that this method has value above and beyond that associated with right-wrong, all-or-nothing (linear) scoring (see [3]) and that it is an efficient and engaging way to organize student and instructor interaction. Other material that cannot be included because of space limits compares multiple-response scores with undergraduate students' open-ended work. Since the undergraduates worked on team projects, that data is less conclusive. However, as in the graduate course, there appears to be a relationship between project grades, grades on other open-ended individual work, and multiple-response scores.

Score (%)   Correct   Near Correct   Correct & Near Correct   Wrong   Didn't Know
   66.9        16           1                  17                9          3
   70.0        21           0                  21                9          0
   72.6        16           2                  18                7          2
   79.5        19           0                  19                5          3
   86.9        22           1                  23                3          3
   90.8        17           6                  23                1          3
   92.0        17           1                  18                0          6
   92.3        19           5                  24                1          2
   92.6        15           8                  23                0          3
   93.3        28           0                  28                2          0
   95.4        27           0                  27                1          1

m  84.8        19.7         2.2                21.9              3.5        2.4
s  10.0         4.2         2.7                 3.5              3.3        1.6

(Thirty questions; m = sample mean, s = standard deviation)

Table 2. Graduate Course Multiple-Response Midterm Detail

Both from my own experience and that of others, it seems clear that thirty to fifty questions on a multiple-response test provide scores sufficiently accurate to gauge relative accomplishment. These scores can be used in final examinations or any situation summarizing learning accomplished. Significant value also comes from the method's stimulation of learning. In a class where multiple-response testing takes place, it initiates dialog, fosters rapid feedback from instructor to student, and enables follow-up discussion of fine points that solidifies learning. These tests often foster challenging discussions.

Administering Tests

The basic material needed to administer a multiple-response test is the diagram portion of Figure 1, which displays the thirteen possible responses associated with three alternative statement-completions. Once the choices are made, giving out the correct-letter answers and having individuals score each other's papers quickly leads to the assessment index values. Students themselves total the per-question values and compute exam scores and percentages, particularly since hand calculators make this rapid.

To see the method in use, consider first five, then seven items scored with Figure 1. Suppose a is the correct answer to question one but the response was g (being uncertain of a, the student assigned it 3/4 weight compared to 1/4 for b), a 0.923 fraction score. If questions two through five had c, b, c, and b as correct completions but received c, h, h, and m responses, the scores 1.000, 0.846, 0, and 0.769 yield a total of 3.538 on the first five: an average score of 0.7076, or 70.76%. With the same first five, a wrong guess on question six (e.g., a correct but b chosen) and a poor choice on a seventh item would reduce this. If the seventh item placed 1/4 weight on the correct completion (e.g., c correct but f response), the five-question total of 3.538 becomes 4.230 for seven (0 for the error on the sixth; 0.692 for the poor choice on the seventh). That yields 0.6043, or 60.43%, a low D, down a grade from the low C on five.
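The same bookkeeping can be written as a short Python sketch; the per-question fractions are those quoted above, with brief comments recording which response produced each weight.

# Per-question fractions for the five-question example, as quoted above.
first_five = [0.923,   # Q1: a correct, g response (3/4 weight on a)
              1.000,   # Q2: c correct, c response
              0.846,   # Q3: b correct, h response (equal weight includes b)
              0.000,   # Q4: c correct, h response (equal weight excludes c)
              0.769]   # Q5: b correct, m (no-knowledge) response

print(round(sum(first_five), 3))                          # 3.538
print(round(100 * sum(first_five) / len(first_five), 2))  # 70.76 -> low C

# Extending to seven questions with one wrong guess and one poor choice.
seven = first_five + [0.000,   # Q6: a correct, b response (wrong guess)
                      0.692]   # Q7: c correct, f response (1/4 weight on c)

print(round(sum(seven), 3))                               # 4.23
print(round(100 * sum(seven) / len(seven), 2))            # 60.43 -> low D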

Conclusion

The explosive growth in information, both in production and in availability, goes along with learning occurring by informal means, e.g., via world-wide-web sources of varied quality and accuracy. Hence assessing what is known is meaningful for many purposes. That has been done over a long period by means of subjective-probability-based multiple-response methodologies. Although space does not permit a complete review, items in the bibliography substantiate the validity of the method in virtually any subject matter, at levels and student ages ranging from the early primary grades to certification of adult competence in technical domains. Furthermore, there is activity today extending the method via computer networks. Nevertheless, this paper has shown multiple-response assessment applied with low technology, paper and pencil, in the classroom. The modest restrictions involved were: specifying a discrete choice of responses (Plato at Rand in the '70s allowed continuous choice on or within the Figure 1 triangle via light-pen input to a computer video display); restriction to thirteen alternatives (a contribution of J. Bruno, Prof., Education, UCLA); and a limit of three main completions. [Bruno and Shuford offer expertise where more completions may be desirable.] One way to use the results is to view them as displaying under- and superior-achievers. Another considers assessments as aids in adapting instruction to the learners, i.e., as indicators of which transmitted information students actually receive. In other words, a multiple-response assessment is another useful instructional tool.

References

[1] Hartley, R. V. L., "Transmission of Information," Bell System Technical Journal, 7, July 1928, p. 535.

[2] Shannon, C. E., Weaver, W., The Mathematical Theory of Communication, Urbana, IL: University of Illinois Press, 1949.

[3] Brown, T. A., "A Theory of How External Incentives Affect, and Are Affected by, Computer-aided Admissible Probability Testing," Rand Corp., Santa Monica, CA, 1974; presented at the 59th Annual Meeting, American Educational Research Association, Chicago, IL, April 1974.

[4] Duda, R. O., Hart, P. E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[5] Landa, S., "CAAPM: Computer-Aided Admissible Probability Measurement on Plato IV," Rand Corp., Santa Monica, CA, R-1721-ARPA, 1976.

[6] Brown, T. A., Shuford, E., "Quantifying Uncertainty Into Numerical Probabilities for the Reporting of Intelligence," Rand Corp., Santa Monica, CA, R-1185-ARPA, 1973.

[7] Savage, L. J., "Elicitation of Personal Probabilities and Expectations," J. Amer. Statistical Association, 66, 1971, pp. 783-801.

[8] Good, I. J., "Rational Decisions," J. Royal Statistical Society, Ser. B, 14, 1952, pp. 107-114.

[9] McCarthy, J., "Measures of the Value of Information," Proc. Nat. Acad. Sci., 42, 1956, pp. 654-655.

Bibliography

(RM, R signify Rand Corp. Report)

Eisenberg, E., Gale, D., "Consensus of Subjective Probabilities: The Pari-Mutuel Method," Annals of Mathematical Statistics, 30 (1), March 1959, pp. 165-168.

Epstein, E. S., "A Scoring System for Probability Forecasts of Ranked Categories," J. Applied Meteorology, 8 (6), December 1969, pp. 985-987.

Winkler, R. L., "Scoring Rules and the Evaluation of Probability Assessors," J. Amer. Statistical Association, 64, 1969, pp. 1073-1078.

Shuford, E., Albert, A., Massengill, H. E., "Admissible Probability Measurement Procedures," Psychometrika, 31, 1966, pp. 125-145.

Shuford, E., Brown, T. A., "Elicitation of Personal Probabilities and Their Assessment," Instructional Science, 4, 1975, pp. 137-188.

Brown, T. A., Probabilistic Forecasts and Reproducing Scoring Systems, RM-6299-ARPA, June 1970.

Brown, T. A., An Experiment in Probabilistic Forecasting, R-944-ARPA, July 1973.

Brown, T. A., Shuford, E., Quantifying Uncertainty Into Numerical Probabilities for the Reporting of Intelligence, R-1185-ARPA, 1973.

Sibley, W. L., A Prototype Computer Program for Interactive Computer Administered Admissible Probability Measurement, R-1258-ARPA, 1973.

Bruno, J. E., "Using Computers for Instructional Delivery and Diagnosis of Student Learning in Elementary School," J. Computers in the Schools, 4 (2), 1987, pp. 117-134.

Bruno, J. E., "Admissible Probability Measurement in Instructional Management," J. Computer Based Instruction, 14 (2), 1987.

Bruno, J. E., "Computer Assisted Formative Evaluation Procedures to Monitor Basic Skills Attainment in Elementary Schools," J. Computing in Childhood Education, 2 (2), W'90/9