Winter 2019 CS 32

Programming Assignment 4
Gee-nomics

Due 11:00 PM Thursday, March 14

The Project 4 specification document has been posted.

Updates:

3/12 11:40 pm: A paragraph on p. 21 has been updated to read: Your GenomeMatcher implementation must use your Trie class template in the implementation of all data structures that hold DNA sequences. It may use any other STL container classes so long as these are not used to hold DNA sequence data (i.e., strings of As, Cs, Ts, Gs, or Ns), except that it may maintain a list or vector of Genomes so that the Trie may contain pointers to or indexes of the Genomes in that vector or list instead of the Trie containing the Genomes itself. Your GenomeMatcher methods may also use any functions from <algorithm>.

This relaxation should help those of you who have been having issues with your program using huge amounts of memory with our provided data files. Also, remember the lessons of problem 5 of the Project 3 warmup when it comes to pointers into growing vectors.

Also, in the test harness we provided, the two functions that call findRelatedGenomes pass it 2*minlength as the fragmentMatchLength. That was an arbitrary choice; if you like, you can change the test harness to prompt you for the fragmentMatchLength to use and pass that.

3/11 1:30 am: Not an update, but a note: With our implementation of Project 4, on the SEASnet Linux server, using our test harness to load the eight provided data files takes 93 seconds when the program is built using g32, but only 18 seconds when built using g32fast. Presumably, building in Release mode for Xcode or Visual C++ (see problem 6 of Homework 4) will also produce speedups, maybe even more dramatic in Visual C++.

3/10 9:00 pm: The spec has been updated to include not only the earlier updates below, but a revised description and example of the test harness (pp. 28-29), a main.cpp you can use to test some aspects of your overall Gee-nomics implementation.

3/10 1:00 am: Sigh, on p. 15, the big-O requirement for Trie::find() has been corrected to say O(LC+V) if exactMatch is true, or O(L2C2+V) if exactMatch is false, where V is the size of the returned vector.

3/8 8:45 pm: On p. 15, the big-O requirement for Trie::find() has been corrected to say O(LC) if exactMatch is true, O(L2C2) if exactMatch is false. 3/8 4:15 pm: A few examples for findGenomesWithThisDNA() on p. 26 were corrected.

3/7 2:00 pm: Your Trie class template must work for key strings containing any characters, not just A, C, G, T, and N. (Of course, the Trie your GenomeMatcher implementation uses may well have no keys containing any character that is not one of those five.)

3/6 10:45 pm: In the skeleton file, provided.h and Genome.cpp were changed to add a copy constructor and assignment operator for Genome.

The posted skeleton code does not yet include a main routine you can use to test the whole program, although there are skeletons for the Trie class template and the Genome and GenomeMatcher classes.