Project 1. DNA fragment analyzer

Introduction

In this assignment you will write a simple pattern-matcher generator for DNA fragments. Your generator will take a pattern and produce a pattern-matcher. This is akin to the POSIX regcomp regular expression compiler. However, unlike regcomp your generator will produce a procedure that can be called directly, not data to be passed to a separate regexec interpreter.

It should be said right up front that this assignment is just a toy: real DNA analyzers are much more complicated than this, and don't rely solely on the simple kinds of backtracking used here.

The key notion of this assignment is that of a ``matcher''. A matcher is a procedure that inspects a given list to find a match for a list prefix that corresponds to a pattern, and then checks whether the match is acceptable by testing whether a given ``acceptor'' procedure succeeds on the corresponding list suffix. For example, a matcher for the pattern (or c a) will succeed only on a list whose car is c or a, and whose cdr is acceptable to the acceptor.

An acceptor can be any procedure that accepts a DNA fragment as an argument. For example, the acceptor (lambda (frag) (and (pair? frag) (eq? (car frag) 'g))) accepts only suffixes beginning with g. Such an acceptor would cause the example matcher to succeed on (a g) but fail on (c t).

As you can see by mentally executing the (a g) example, matchers sometimes need to try multiple alternatives and to backtrack to a later alternative if an earlier one is a blind alley.

Definitions

nucleotide
one of the four symbols a, c, g, and t, which stand for adenine, thymine, cytosine, and guanine, respectively. Please see Richard B. Hallick's Introduction to DNA Structure for more about nucleotides in biochemistry.
DNA fragment
a list of nucleotides.
DNA pattern
an object that represents a set of DNA fragments. It is said to match each such fragment. It takes one of the following forms:
a, c, g, t
A nucleotide nt matches the singleton list (nt). For example, a matches (a).
(junk k)
matches up to k nucleotides. k must be a nonnegative integer. If there are several possible matches, it uses the shortest match. For example, (junk 0) matches only the empty list (), and (junk 2) matches (), (a), (c), (g), (t), (a a), (a c), (a g), (a t), (c a), (c c), (c g), (c t), (g a), (g c), (g g), (g t), (t a), (t c), (t g), and (t t).
(or pat…)
matches any DNA fragment that any DNA pattern pat matches. If more than one pat matches, it uses the first match. For example, (or) does not match any list, and (or a g t) matches (a), (g), and (t).
(list pat…)
matches any concatenation of DNA fragments that are matched by each pat, respectively. If there are several possible matches, (list pat1 pat2 ) uses the first match for pat1 that is acceptable to (list pat2 ). For example, (list), which has zero patterns, matches only the empty list (), and (list a (junk 1) (or a t)) matches (a a), (a t), (a a a), (a a t), (a c a), (a c t), (a g a), (a g t), (a t a), and (a t t).
(* pat)
matches any concatenation of nonnull DNA fragments that are each matched by pat. If the matched concatenation is of two or more fragments, this pattern is equivalent to (list pat (* pat)). If there are several possible concatenations, it uses the concatenation of the least number of fragments. For example, (* (or g (list a t))) matches (), (g), (a t), (g g), (g a t), (a t g), (a t a t), ….

A DNA pattern is not a procedure call. For example, you do not need to define a procedure called junk. The symbol junk is part of the DNA pattern (junk 1), but junk does not need to be defined as a procedure.

acceptor
a procedure with one argument: a DNA fragment frag. It returns true (i.e., some object other than #f) if frag is acceptable, and #f otherwise.
matcher
a procedure with two arguments: a DNA fragment frag and an acceptor accept. A matcher matches a prefix of frag of length k such that (accept (list-tail frag k)) returns true. If there is such a match, the matcher returns whatever accept returns; otherwise it returns #f.

Assignment

Write a procedure (make-matcher pat) that returns a matcher for the DNA pattern pat. When applied to a DNA fragment frag and an acceptor accept, the matcher should return the first acceptable match of a prefix of frag, using the definition of ``first'' given by the DNA pattern rules described above; this is not necessarily the shortest nor the longest acceptable match. A match is considered to be acceptable if accept returns true when given the suffix DNA fragment that immediately follows the matching prefix.

Your implementation of make-matcher must be free of side affects and should avoid using unnecessary storage. Tail recursion is desirable but is not required, and you will probably find it necessary to use non-tail-recursive procedures both in make-matcher and in the matchers that make-matcher returns.

There are extra constraints on the matchers that make-matcher returns. These matchers must use only the following Scheme primitives:

along with the following procedures and special forms: <, <=, =, >, >=, -, +, and, car, cond, cdr, eq?, if, lambda, let, letrec, not, null?, or, pair?, quote, and symbol?.

In all cases it is an error if procedure arguments are not of the proper form. For example, it is an error if the argument to make-matcher is not a DNA pattern, if the first argument to a matcher or acceptor is not a DNA fragment, or if the second argument to a matcher is not an acceptor. Your can assume that make-matcher and the matchers it returns are given valid arguments by the test program; however, you must insure that the matchers do not cause any errors when given valid arguments.

To turn in your assignment, submit a file pr1.ss that contains a definition of make-matcher along with any auxiliary definitions that make-matcher needs. Do not submit any definitions of acceptors, as the test program will supply them. The first line of pr1.ss should be a comment containing your name and student ID.

Examples

The following examples show how make-matcher might be used in a test program. Please do not submit them as part of pr1.ss.

(define frag0 '())
(define frag1 '(a t g c t a))
;
; Scheme does not care about the newlines in the definition of frag2.
; From Scheme's point of view, they are merely extra white space that
; is ignored.  The newlines are present only to help humans understand
; how the patterns defined below are matched against frag2.
(define frag2 '(c c c g a t a a a a a a g t g t c g t
                a
                a g t a t a t g g a t a
                t a
                a g t a t a t g g a t a
                c g a t c c c t c g a t c t a))

; Most of the uses of "list" in the following pattern definitions
; are as a symbol that is part of a pattern, not as a procedure.
; However, there are two exceptions, one when defining pat3
; and the other when defining pat4.
(define pat1 '(list a t g c))
(define pat2 '(or
               (list a g t a t a t g g a t a)
               (list g t a g g c c g t)
               (list c c c g a t a a a a a a g t g t c g t)
               (list c g a t c c c (junk 1) c g a t c t a)))
(define pat3 (list 'list pat2 '(junk 2)))
(define pat4 (list '* pat3))

; For each pattern defined above, use "make-matcher" to create a
; matcher that matches the pattern.
(define matcher1 (make-matcher pat1))
(define matcher2 (make-matcher pat2))
(define matcher3 (make-matcher pat3))
(define matcher4 (make-matcher pat4))


; The following acceptors are examples that you can use as part of
; your own test cases.

; Always fail, i.e., never accept a match.
(define fail
  (lambda (x)
    #f))

; Always succeed.  This acceptor returns the suffix
; containing all the nucleotides that were not matched.
; It causes the matcher to return the unmatched suffix.
(define identity
  (lambda (x) x))

; Write the unmatched suffix, then fail.
; This backtracks through all matches, and can be used to debug.
(define write-then-fail
  (lambda (x)
    (begin
      (write x)
      (newline)
      #f)))


; Use the acceptors to test the matchers.  Most of these
; test cases use the acceptor "identity", so they return the
; unmatched suffix that corresponds to the first match, or
; #f if there was no match.
(matcher1 frag0 identity)       ===> #f
(matcher1 frag1 fail)           ===> #f

; A match must always match an entire prefix of a fragment.
; So, even though matcher1 finds a match in frag1,
; it does not find the match in (cons 'a frag1).
(matcher1 frag1 identity)       ===> (t a)
(matcher1 (cons 'a frag1) identity)
                                ===> #f

(matcher2 frag1 identity)       ===> #f
(matcher2 frag2 identity)       ===> (a
                                      a g t a t a t g g a t a
                                      t a
                                      a g t a t a t g g a t a
                                      c g a t c c c t c g a t c t a)

; These matcher calls match the same prefix,
; so they return unmatched suffixes that are eq?.
(eq? (matcher2 frag2 identity)
     (matcher3 frag2 identity)) ===> #t

; matcher4 is lazy: it matches the empty fragment first,
; but you can force it to backtrack by insisting on progress.
(eq? (matcher4 frag2 identity)
     frag2)                     ===> #t
(eq? (matcher2 frag2 identity)
     (matcher4 frag2 (lambda (frag) (if (eq? frag frag2) #f frag))))
                                ===> #t

; Here null? is being used as an acceptor.
; It accepts only the empty unmatched suffix,
; so it forces matcher4 to backtrack until all of frag2 is matched.
(matcher4 frag2 null?)          ===> #t

Hints

A matcher for (list) is (lambda (frag accept) (accept frag)).

The following expression returns a matcher for (list pat1 pat2):

(let ((matcher1 (make-matcher pat1))
      (matcher2 (make-patcher pat2)))
  (lambda (frag accept)
    (matcher1 frag
      (lambda (frag1) (matcher2 frag1 accept)))))

The following expression returns a matcher for (or pat1 pat2):

(let ((matcher1 (make-matcher pat1))
      (matcher2 (make-patcher pat2)))
  (lambda (frag accept)
    (or (matcher1 frag accept)
        (matcher2 frag accept))))

The pattern (* pat) is equivalent to the pattern (or (list) (list pat (* pat))), except that the latter doesn't require that its first pat matches a nonempty instance of the pattern.

A solution

(define match-junk
  (lambda (k frag accept)
    (or (accept frag)
	(and (< 0 k)
	     (pair? frag)
	     (match-junk (- k 1) (cdr frag) accept)))))

(define match-*
  (lambda (matcher frag accept)
    (or (accept frag)
	(matcher frag
		 (lambda (frag1)
		   (and (not (eq? frag frag1))
			(match-* matcher frag1 accept)))))))

(define make-matcher
  (lambda (pat)
    (cond

     ((symbol? pat)
      (lambda (frag accept)
	(and (pair? frag)
	     (eq? pat (car frag))
	     (accept (cdr frag)))))

     ((eq? 'or (car pat))
      (let make-or-matcher ((pats (cdr pat)))
	(if (null? pats)
	    (lambda (frag accept) #f)
	    (let ((head-matcher (make-matcher (car pats)))
		  (tail-matcher (make-or-matcher (cdr pats))))
	      (lambda (frag accept)
		(or (head-matcher frag accept)
		    (tail-matcher frag accept)))))))

     ((eq? 'list (car pat))
      (let make-list-matcher ((pats (cdr pat)))
	(if (null? pats)
	    (lambda (frag accept) (accept frag))
	    (let ((head-matcher (make-matcher (car pats)))
		  (tail-matcher (make-list-matcher (cdr pats))))
	      (lambda (frag accept)
		(head-matcher frag
			      (lambda (frag1)
				(tail-matcher frag1 accept))))))))

     ((eq? 'junk (car pat))
      (let ((k (cadr pat)))
	(lambda (frag accept)
	  (match-junk k frag accept))))

     ((eq? '* (car pat))
      (let ((matcher (make-matcher (cadr pat))))
	(lambda (frag accept)
	  (match-* matcher frag accept)))))))

© 2003 Paul Eggert. See copying rules.
$Id: pr1.html,v 1.12 2003/02/11 19:30:11 eggert Exp $