CS 32 Project 4, Spring 2024

Spring 2024 CS 32

Project 4
Precision Revision

Time due: 11:00 PM Tuesday, June 11

Introduction

Mallmart, one of the world's largest retailers, wants to hire you to help them manage their inventory. A centralized file contains information about every item Mallmart sells. This file is updated each day. All around the world, there are devices that need to receive a copy of the file each day: computers and cash registers in the stores, vending machines, handheld devices, etc., so each day, Mallmart sends a copy of the inventory file to all these devices. There are 10 million such devices, and the inventory file is about 100 kilobytes long, so about 1 terabyte (i.e., 1000 gigabytes = 100KB * 10 million) is sent each day. Given that their internet provider charges Mallmart roughly $1 per gigabyte, it costs about $1,000 per day to provide up-to-date inventory information to the devices. In addition to the inventory files, Mallmart occasionally sends updates of other files to the devices as well (for example, new versions of applications as they are updated).

One of Mallmart's managers recently had an idea based on the recognition that the file Mallmart sends each day changes very little from the file it sent the day before. For example, consider these two sample inventory files sent by Mallmart to the devices on April 10 and April 11. Each file contains multiple records; each record has an item number, an item name, and the number of those items in the inventory:

April 10 version of the inventory file:

81609,Feather Duster,198,92246,Lawn Chair Set,50,03854,Carrano C++ book,183,27408,Toilet paper 9-roll,89

April 11 version of the inventory file:

66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair Set,50,03490,Bedspread,87,27408,Toilet paper 9-roll,89,40411,Hair Spray,380

As you can see, both the earlier and later version of the inventory file have many similarities. There are some new additions in the April 11 version of the file, such as the entry

03490,Bedspread,87

There are also some changes to existing records between the two versions:

81609,Feather Duster,198

changes to

81609,Feather Duster,195

And some of the entries in the original file have been removed (e.g., it looks like Mallmart sold all of the Carrano C++ books in one day, since there is no entry for it in the second file).

Given the significant similarities between successive versions of the inventory file, Mallmart could save hundreds of dollars every day. How? Since a given day's inventory file is so similar to the previous day's inventory file, it doesn't make sense to ship the entire 100K inventory file to each salesperson every day. Instead, Mallmart could just send a smaller file containing the list of changes (called a revision file) to each device to indicate what has changed from yesterday's inventory file to today's file. Each device could then run an incremental updater program to use this revision file to convert their existing version of the inventory file to the new version of the inventory file.

Let's number each character in both of inventory files and then see an example.

April 10 version of the inventory file:

              1         2         3         4
    01234567890123456789012345678901234567890123456789
  0 81609,Feather Duster,198,92246,Lawn Chair Set,50,0
 50 3854,Carrano C++ book,183,27408,Toilet paper 9-rol
100 l,89

April 11 version of the inventory file:

              1         2         3         4
    01234567890123456789012345678901234567890123456789
  0 66284,Screwdriver,1000,81609,Feather Duster,195,92
 50 246,Lawn Chair Set,50,03490,Bedspread,87,27408,Toi
100 let aper 9-roll,89,40411,Hair Spray,380

Given these two versions of the inventory file, let's see what a revision file might look like. Remember, the revision file contains a set of instructions on how to convert an older version of the file (e.g., the April 10 version of the file) into a newer version of the file (e.g., the April 11 version). This revision file would normally be created at Mallmart headquarters and then sent to each of the 10 million devices, which already have the full April 10 version of the inventory file:

Create a new, empty output file
Add 23 characters to the newer file: 66284,Screwdriver,1000,
Copy 23 characters starting from offset 0 in the older file to the newer file.
Add 1 character to the newer file: 5
Copy 27 characters starting from offset 24 in the older file to the newer file.
Add 16 characters to the newer file: 490,Bedspread,87
Copy 29 characters starting from offset 75 in the older file to the newer file.
Add 21 characters to the newer file: ,40411,Hair Spray,380

Given this set of instructions, an incremental updater program on a device could follow the above set of instructions and convert the April 10 version of the file (which is already on their computer) into the April 11 version of the file. How would this work?

After following the first instruction, the incremental updater program would create a new empty file.

After following the second instruction, that newer file would contain

66284,Screwdriver,1000,

After following the third instruction, the newer file would contain

66284,Screwdriver,1000,81609,Feather Duster,19

After following the fourth instruction, the newer file would contain

66284,Screwdriver,1000,81609,Feather Duster,195

After following the fifth instruction, the newer file would contain

66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair Set,50,03

And so on.

To perform this conversion, only two different instructions are required: a Copy instruction and an Add instruction. The Copy instruction specifies that starting from a particular offset in the older version of the file, a certain number of characters should be copied to the end of the newer file. The Add instruction is used to add entirely new content to the newer file when this content cannot be located and copied from the older file.

Of course, the eight revision instructions shown above are in a human-readable format that is quite wordy. We can make our revision file much smaller by removing all of the human-readable text and using a more compact encoding. Shown below is a 90-character revision file containing all of the instructions above; note that this revision file is 35% smaller than the full April 11 version of the file:

+/66284,Screwdriver,1000,/#0,23+/5/#24,27+/490,Bedspread,87/#75,29+/,40411,Hair Spray,380/

In the example above, each add instruction is represented by a +, followed by a delimiter character (which can be any character; we used a / in this example), followed by characters ending with the next occurrence of the delimiter character. The characters between the delimiter characters are the actual characters to append to the newer file. Each copy instruction is represented by a # followed by the offset in the original file from which to start copying characters, a comma, and the number of characters to copy. It is possible to convert any file to any other file using just these two types of instructions.

So to recap, each day, in its corporate office, Mallmart can create the latest version of the full inventory file (e.g., the April 11 version). Then, Mallmart can use a special tool called a revision creator to create a revision file (made up of Add and Copy instructions) that contains the instructions for converting yesterday's version (the April 10 version) of the file to today's version (the April 11 version) of the file. Then, Mallmart can send this revision file to the 10 million devices instead of sending the full (much larger) April 11 inventory file.

Upon receiving the revision file, each of the 10 million devices then runs a program called an incremental updater that takes yesterday's full April 10 inventory file (which is already on the device, produced yesterday) and the just-received revision file, and follows the instructions in the revision file to create today's April 11 full inventory file. On April 12, the device will receive a revision file that it will use along with this full April 11 file to produce a full April 12 inventory file. Since the revision files are just a fraction of the full inventory files' sizes, this saves Mallmart considerable networking costs.

Of course, this revision approach can be used to update all types of files, not just inventory files. For instance, consider the following two files A and A', where A' is a derived version of A.

              1         2         3   
    01234567890123456789012345678901234
A : ABCDEFGHIJBLAHPQRSTUVPQRSTUV
A': XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK

Here's a revision file to convert A into A'. Verify for yourself that it works:

+/XY/#0,12+/ETC/#13,13+:QQ/OK:

Notice that for the last Add command, we chose to use : as the delimiter character. Choosing / when we wanted to add the text QQ/OK would be wrong, since +/QQ/OK/ is an Add command to add QQ (+/QQ/) followed by an O, which is not a valid character to start a Copy or Add command. The whole reason for our designing the Add command syntax to allow for a choice of the delimiter character instead of requiring that it always be, say, / is so that the text to be added can contain a /. In this example, we chose : as the delimiter character, but we could have chosen any character (letter, punctuation, digit, space, newline, etc.) other than Q, /, O, or K, since those were characters in the text we wanted to add.

Now it may not be obvious, but there are actually many possible correct revision files that can be created to convert a first version of a file A into a second version A'. For example, here's another correct revision file for the example above:

+/XYA/#1,9+?BLETCH?#14,12+ZQQ/OKZ

Notice we chose to add the first A instead of copying it, we chose in the second Add command to use ? as the delimiter character (for no particular reason), and for the last Add command, we chose Z (again, for no particular reason).

And here's another correct, but not very useful, revision file:

+=XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK=

This revision file simply contains the entire contents of A' and instructs the incremental updater to write all 35 characters of this data out to the output file. It's a bad solution because it's actually larger in size than A'! It would be a little cheaper to just send A' instead of this revision file.

Your Assignment

You have been hired by the Chief Frugality Officer (CFO) of Mallmart to create two functions:

createRevision: This function takes the contents of two files, A and A', and produces a revision file containing instructions for converting A into A'. Each day, Mallmart will use this function at their corporate headquarters to create a new revision file that they will then send to all the devices.
revise: This function takes the content of a file A and a revision file, and will apply the instructions in the revision file to produce a new file A'. Each day, every device will use this function to update the previous day's inventory file.

The createRevision function has the following interface:

    void createRevision(istream& fold, istream& fnew, ostream& frevision);

You may name the parameters something else, but there must be three parameters of the indicated types in the indicated order. (The File I/O tutorial explains istream and ostream.) The parameters are

an already-opened input source (for yesterday's full file, say)
an already-opened input source (for today's full file, say)
an already-opened output destination to which you will write the instructions for converting the first input to the second.

The revise function has the following interface:

    bool revise(istream& fold, istream& frevision, ostream& fnew);

You may name the parameters something else, but there must be three parameters of the indicated types in the indicated order. The parameters are

an already-opened input source (for yesterday's full file, say)
an already-opened input source (the revision file)
an already-opened output destination to which you will write the file resulting from applying the revision file to the first input.

The revise function returns true if the operation succeeds. If it fails because the revision file is malformed (e.g., there's a character other than an + or # where an instruction is expected, or an offset or length is invalid), the function returns false. If the function returns false, the caller can make no assumption about what may or may not have been written to the output destination (so you're free to, for example, just return false as soon as you detect an error, without regard to what you may have already written).

Except for the output required to be written to its third parameter, each of the two required functions must not cause anything to be written to cout or any file. They may write to cerr anything they want (presumably for debugging purposes); our testing program will ignore anything you write to cerr.

Your functions must be able to create revision files and apply revision files for files up to 100 kilobytes (102,400 bytes) in length, and if you wish, may be able to handle longer files as well.

The revision file that your implementation of createRevision produces must be in this format: A revision file is a sequence of zero or more instructions, where an instruction is either

an Add instruction of the form +ΔtextΔ in which Δ represents a single character (the delimiter character), text represents the text to add: a sequence of zero or more characters not containing the delimiter character.
a Copy instruction of the form #offset,len in which offset is a sequence of one or more digits representing the (zero-based) offset in the original file from which to start copying, and len is a sequence of one or more digits representing the number of characters to copy.
a Do-nothing instruction consisting of either a single newline character ('\n') or a single carriage return character ('\r').

An Add instruction with no characters as the text (such as +//) or a Copy instruction with a len of 0 (such as, for an old file with at least 13 characters, #12,0) is valid and successful when executed by revise, even though it has no effect since no characters are to be added or copied. Of course, it would be foolish for createRevision to produce such an instruction, since it needlessly makes the revision file longer. If a Copy instruction specifies a length of 0 and an offset beyond the last character in the old file (e.g., #12,0 for an old file with 12 or fewer characters), you may either consider it valid (and have it do nothing) or you may consider it invalid (and have revise return false), your choice. Notice that an instruction like #007,00001 has the same effect as #7,1).

The Do-nothing instructions exist solely to make working with revision files easier if you create or examine them under Windows or macOS or Linux, whose text editors may or may not append a newline or carriage return-linefeed at the end of the file. To execute a Do-nothing instruction, revise does nothing. For example, a five-character revision file containing #3,6• (where • in this sentence represents a newline character) causes the same effect that a four-character revision file containing #3,6 would. Of course, it would be foolish for createRevision to produce a Do-nothing instruction, since it needlessly makes the revision file longer.

Your program must build and run successfully under g32 and either Visual C++ or Xcode.

When run under g32fast, your createRevision function must not run for longer than 3 seconds for any file of up to 10 kilobytes and no more than 60 seconds for any file of up to 100 kilobytes. Your revise function must not run for longer than 5 seconds for any file of up to 100 kilobytes.

Here's an example of a main routine that performs a test of the functions, checking that the revision file produced by createRevision from an older file and a newer file can be used by revise with the older file to produce a file identical to the newer file.

	#include <iostream>
	#include <fstream>
	#include <string>
	#include <algorithm>
	#include <iterator>
	#include <cassert>
	using namespace std;

	bool runtest(string oldName, string newName, string revisionName, string newName2)
	{
	    if (revisionName == oldName  ||  revisionName == newName  ||
	        newName2 == oldName  ||  newName2 == revisionName  ||
	            newName2 == newName)
	    {
	        cerr << "Files used for output must have names distinct from other files" << endl;
	        return false;
	    }
	    ifstream oldFile(oldName, ios::binary);
	    if (!oldFile)
	    {
	        cerr << "Cannot open " << oldName << endl;
	        return false;
	    }
	    ifstream newFile(newName, ios::binary);
	    if (!newFile)
	    {
	        cerr << "Cannot open " << newName << endl;
	        return false;
	    }
	    ofstream revisionFile(revisionName, ios::binary);
	    if (!revisionFile)
	    {
	        cerr << "Cannot create " << revisionName << endl;
	        return false;
	    }
	    createRevision(oldFile, newFile, revisionFile);
	    revisionFile.close();

	    oldFile.clear();   // clear the end of file condition
	    oldFile.seekg(0);  // reset back to beginning of the file
	    ifstream revisionFile2(revisionName, ios::binary);
	    if (!revisionFile2)
	    {
	        cerr << "Cannot read the " << revisionName << " that was just created!" << endl;
	        return false;
	    }
	    ofstream newFile2(newName2, ios::binary);
	    if (!newFile2)
	    {
	        cerr << "Cannot create " << newName2 << endl;
	        return false;
	    }
	    assert(revise(oldFile, revisionFile2, newFile2));
	    newFile2.close();

	    newFile.clear();
	    newFile.seekg(0);
	    ifstream newFile3(newName2, ios::binary);
	    if (!newFile)
	    {
	        cerr << "Cannot open " << newName2 << endl;
	        return false;
	    }
	    if ( ! equal(istreambuf_iterator<char>(newFile), istreambuf_iterator<char>(), 
	                 istreambuf_iterator<char>(newFile3), istreambuf_iterator<char>()))
	    {
	        cerr << newName2 << " is not identical to " << newName
	                 << "; test FAILED" << endl;
	        return false;
	    }
	    return true;
	}

	int main()
	{
	    assert(runtest("myoldfile.txt", "mynewfile.txt", "myrevisionfile.txt", "mynewfile2.txt"));
	    cerr << "Test PASSED" << endl;
	}

(Note: We create the streams in binary mode (ios::binary), which is harmless under macOS and Linux, but necessary under Windows to make the offsets work correctly, since Windows represents line ends for text files with two characters even though it delivers just one newline character to and from your C++ program.)

(Note: We close the output streams to ensure that all output to the file is completed before we open it for input. Because createRevision reached the end of file on the oldFile input stream, we need to clear the end of file condition on the stream and reset the stream so that we read from the beginning again. We do the same for newFile2 before comparing the files.)

Here's another main routine that tests the functions. It uses the types istringstream (an input source where the characters read come from a string instead of a file), and ostringstream (an output destination where the characters written are later available as a string).

	#include <iostream>
	#include <sstream>  // for istringstream and ostringstream
	#include <string>
	#include <cassert>
	using namespace std;

	void runtest(string oldtext, string newtext)
	{
	    istringstream oldFile(oldtext);
	    istringstream newFile(newtext);
	    ostringstream revisionFile;
	    createRevision(oldFile, newFile, revisionFile);
	    string result = revisionFile.str();
	    cout << "The revision file length is " << result.size()
	         << " and its text is " << endl;
	    cout << result << endl;

	    oldFile.clear();   // clear the end of file condition
	    oldFile.seekg(0);  // reset back to beginning of the stream
	    istringstream revisionFile2(result);
	    ostringstream newFile2;
	    assert(revise(oldFile, revisionFile2, newFile2));
	    assert(newtext == newFile2.str());
	}

	int main()
	{
	    runtest("There's a bathroom on the right.",
	            "There's a bad moon on the rise.");
	    runtest("ABCDEFGHIJBLAHPQRSTUVPQRSTUV",
	            "XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK");
	    cout << "All tests passed" << endl;
	}

Neither of the main routines above check that the revision file is validly formatted, so do not rely on them alone to demonstrate your code's correctness. You might have some misunderstanding that results in your createRevision function incorrectly creating an improperly formatted revision file that won't work with our revise function even though it works with yours because that misunderstanding also caused you to write an incorrect revise function that accepted that incorrect revision file.

To check that a revision file is validly formatted, you can supply an old, a new, and a revision file to the Project 4 Revision Checker.

We will provide you with several different old and new files for you to test your program with, in a Windows version and a Mac and Linux version. For each of pair of old and new files, your program must create revision files that are at least 5% smaller than the new version of the file. We will test your program on other files as well, so it can't work just for these sample files. Below we show how big our solution's revision file is for each of these sample pairs of files we're providing:

	Windows file sizes in bytes			Mac and Linux file sizes in bytes
	old	new	our revision file	old	new	our revision file
Mallmart inventory	106	142	92	105	141	91
Weird Al's Dr. Seuss	533	606	65	510	579	65
War and Peace	77285	77333	382	76634	76682	376
Some strange files	100764	100764	8913	100440	100440	8913

Algorithm Requirements

You may be wondering how to write a function to create a revision file. This is definitely a non-trivial problem, and most senior job interview candidates can't come up with a viable algorithm (i.e., a conceptual approach) within a one-hour interview.

Here's a general, high-level algorithm that can be used to build a revision file. It's not ideal, however, and you'll have to come up with improvements of your own to make it work well:

Read in the entire contents of the old file into a string. Read the entire contents of the new file into another string.
For all consecutive N-character sequences in the old file's string, insert that N-character sequence and the offset F where it was found in the old file's string, into a table (e.g. hash table or binary search tree). You might use 8 for N, or maybe 16.
Once you have filled up your table with all N-byte sequences from the source file, start processing the new file's string, starting from offset j=0, until j reaches the end of the string.
1. Look up the N-byte sequence which starts at offset j ([j,j+N-1]) in the new file's string in the table you created from the old file's string.
2. If you find this sequence in the table, you know that that sequence is present in both the old and new versions of the file.
  1. Determine how long the match goes on (it might be just N bytes long, or it might extend past the first N matching bytes to be many bytes longer).
  2. Once you have determined how long the match is (call this L), write a Copy instruction to the revision file to copy L bytes from offset F from the source file.
  3. Go back to step 3a, continuing at offset j = j + L in the new file's string.
3. If you don't find the current sequence (new file's string [j,j+N-1]) in the table you created, you know that the first version of the file doesn't contain the current N byte sequence.

Of course, this is a simple version of the algorithm, and many improvements are possible. For example, this version of the algorithm will result in only single-character add instructions: If the new file contains the new text BLAH (not present in the old text), then the above algorithm would produce four Add instructions (e.g., +/B/+/L/+/A/+/H/) instead of a single (and more compact) Add instruction that adds all four new characters at once (+/BLAH/).

To obtain the highest score possible and create the smallest revision files, you'll have to improve on the algorithm substantially. Be creative! In addition, this naïve algorithm also has troubles with certain types of files (such as the strange*.txt sample files that we provide). You'll have to figure out why and make sure this doesn't cause problems in your solution.

For this project, we are constraining your implementation:

You are required to use either a hash table or a binary search tree (your choice of which) for your table data structure.
The only containers from the C++ standard library you may use are vector, list, stack, and queue (and string). You'll thus have to implement your BST or hash table yourself instead of using a map or unordered_map, but you can, say, use the list type to help you implement a hash table. Although we're limiting your use of containers from the library, you are free to use library algorithms (e.g., sort from <algorithm>). You are also free to use std::pair from <utility> and std::hash from <functional> if you wish.

Compared to creating a revision file, the algorithm for applying a revision file is much easier. You may use the following functions to aid you in extracting instructions from the revision file if you wish:

	bool getInt(istream& inf, int& n)
	{
	    char ch;
	    if (!inf.get(ch)  ||  !isascii(ch)  ||  !isdigit(ch))
	        return false;
	    inf.unget();
	    inf >> n;
	    return true;
	}

	bool getCommand(istream& inf, char& cmd, char& delim, int& length, int& offset)
	{
	    if (!inf.get(cmd))
	    {
	        cmd = 'x';  // signals end of file
	        return true;
	    }
	    switch (cmd)
	    {
	      case '+':
	        return inf.get(delim).good();
	      case '#':
	        {
	            char ch;
	            return getInt(inf, offset) && inf.get(ch) && ch == ',' && getInt(inf, length);
	        }
	      case '\r':
	      case '\n':
	        return true;
	    }
	    return false;
	}

Turn it in

What you will turn in for this assignment is a zip file containing these files and nothing more:

One or more source files that contain the code for the two required functions and any supporting types and functions you use to implement them. Your source code should have helpful comments that explain any non-obvious code. You may include a main routine if you wish; if you do, we will rename it to something harmless, never call it, and instead use our own main routine.

How nice! You don't have to turn in a report file this time.

Grading

Your revise function is worth 22% of the points for this project.

The correctness of your createRevision function is worth 54% of the points, subject to these caveats: You will receive no points for any test case that produces a revision file that is not at least 5% smaller than the file that is the second argument to createRevision, and you will lose half of the correctness points you earn if you use any library containers other than those we allowed. (In other words, do not use library containers that are implemented using trees or hash tables, such as set, map, unordered_set, unordered_map, etc.)

The size of the revision files that createRevision produces is worth 24% of the points. The smaller your revision files, the more points you'll earn. For each test case for which you produce a correct revision file, of the revision file size points possible for that case:

You earn 25% of those points if the revision file you produce is at least 5% smaller that the file that is the second argument to createRevision.
You earn 50% of those points if the revision file you produce is more than 5% and no more than 20% larger than the revision file that our solution produces.
You earn 75% of those points if the revision file you produce is larger, but no more than 5% larger, than the revision file that our solution produces.
You earn 100% of those points if the revision file you produce is not larger than the revision file that our solution produces.

Project 4 Precision Revision