Time due: 11:00 PM Tuesday, June 11
Mallmart, one of the world's largest retailers, wants to hire you to help them manage their inventory. A centralized file contains information about every item Mallmart sells. This file is updated each day. All around the world, there are devices that need to receive a copy of the file each day: computers and cash registers in the stores, vending machines, handheld devices, etc., so each day, Mallmart sends a copy of the inventory file to all these devices. There are 10 million such devices, and the inventory file is about 100 kilobytes long, so about 1 terabyte (i.e., 1000 gigabytes = 100KB * 10 million) is sent each day. Given that their internet provider charges Mallmart roughly $1 per gigabyte, it costs about $1,000 per day to provide up-to-date inventory information to the devices. In addition to the inventory files, Mallmart occasionally sends updates of other files to the devices as well (for example, new versions of applications as they are updated).
One of Mallmart's managers recently had an idea based on the recognition that the file Mallmart sends each day changes very little from the file it sent the day before. For example, consider these two sample inventory files sent by Mallmart to the devices on April 10 and April 11. Each file contains multiple records; each record has an item number, an item name, and the number of those items in the inventory:
April 10 version of the inventory file:
81609,Feather Duster,198,92246,Lawn Chair Set,50,03854,Carrano C++ book,183,27408,Toilet paper 9-roll,89
April 11 version of the inventory file:
66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair Set,50,03490,Bedspread,87,27408,Toilet paper 9-roll,89,40411,Hair Spray,380
As you can see, both the earlier and later version of the inventory file have many similarities. There are some new additions in the April 11 version of the file, such as the entry
03490,Bedspread,87
There are also some changes to existing records between the two versions:
81609,Feather Duster,198
changes to
81609,Feather Duster,195
And some of the entries in the original file have been removed (e.g., it looks like Mallmart sold all of the Carrano C++ books in one day, since there is no entry for it in the second file).
Given the significant similarities between successive versions of the inventory file, Mallmart could save hundreds of dollars every day. How? Since a given day's inventory file is so similar to the previous day's inventory file, it doesn't make sense to ship the entire 100K inventory file to each salesperson every day. Instead, Mallmart could just send a smaller file containing the list of changes (called a revision file) to each device to indicate what has changed from yesterday's inventory file to today's file. Each device could then run an incremental updater program to use this revision file to convert their existing version of the inventory file to the new version of the inventory file.
Let's number each character in both of inventory files and then see an example.
April 10 version of the inventory file:
1 2 3 4 01234567890123456789012345678901234567890123456789 0 81609,Feather Duster,198,92246,Lawn Chair Set,50,0 50 3854,Carrano C++ book,183,27408,Toilet paper 9-rol 100 l,89
April 11 version of the inventory file:
1 2 3 4 01234567890123456789012345678901234567890123456789 0 66284,Screwdriver,1000,81609,Feather Duster,195,92 50 246,Lawn Chair Set,50,03490,Bedspread,87,27408,Toi 100 let aper 9-roll,89,40411,Hair Spray,380
Given these two versions of the inventory file, let's see what a revision file might look like. Remember, the revision file contains a set of instructions on how to convert an older version of the file (e.g., the April 10 version of the file) into a newer version of the file (e.g., the April 11 version). This revision file would normally be created at Mallmart headquarters and then sent to each of the 10 million devices, which already have the full April 10 version of the inventory file:
66284,Screwdriver,1000,
5
490,Bedspread,87
,40411,Hair Spray,380
Given this set of instructions, an incremental updater program on a device could follow the above set of instructions and convert the April 10 version of the file (which is already on their computer) into the April 11 version of the file. How would this work?
After following the first instruction, the incremental updater program would create a new empty file.
After following the second instruction, that newer file would contain
66284,Screwdriver,1000,
After following the third instruction, the newer file would contain
66284,Screwdriver,1000,81609,Feather Duster,19
After following the fourth instruction, the newer file would contain
66284,Screwdriver,1000,81609,Feather Duster,195
After following the fifth instruction, the newer file would contain
66284,Screwdriver,1000,81609,Feather Duster,195,92246,Lawn Chair Set,50,03
And so on.
To perform this conversion, only two different instructions are required: a Copy instruction and an Add instruction. The Copy instruction specifies that starting from a particular offset in the older version of the file, a certain number of characters should be copied to the end of the newer file. The Add instruction is used to add entirely new content to the newer file when this content cannot be located and copied from the older file.
Of course, the eight revision instructions shown above are in a human-readable format that is quite wordy. We can make our revision file much smaller by removing all of the human-readable text and using a more compact encoding. Shown below is a 90-character revision file containing all of the instructions above; note that this revision file is 35% smaller than the full April 11 version of the file:
+/66284,Screwdriver,1000,/#0,23+/5/#24,27+/490,Bedspread,87/#75,29+/,40411,Hair Spray,380/
In the example above, each add instruction is represented by a +
,
followed by a delimiter character (which can be any character; we
used a /
in this example), followed by characters ending with the
next occurrence of the delimiter character. The characters between the
delimiter characters are the actual characters to append to the newer file.
Each copy instruction is represented by a #
followed by the
offset in the original file from which to start copying characters, a comma,
and the number of characters to copy. It is possible to convert any file to
any other file using just these two types of instructions.
So to recap, each day, in its corporate office, Mallmart can create the latest version of the full inventory file (e.g., the April 11 version). Then, Mallmart can use a special tool called a revision creator to create a revision file (made up of Add and Copy instructions) that contains the instructions for converting yesterday's version (the April 10 version) of the file to today's version (the April 11 version) of the file. Then, Mallmart can send this revision file to the 10 million devices instead of sending the full (much larger) April 11 inventory file.
Upon receiving the revision file, each of the 10 million devices then runs a program called an incremental updater that takes yesterday's full April 10 inventory file (which is already on the device, produced yesterday) and the just-received revision file, and follows the instructions in the revision file to create today's April 11 full inventory file. On April 12, the device will receive a revision file that it will use along with this full April 11 file to produce a full April 12 inventory file. Since the revision files are just a fraction of the full inventory files' sizes, this saves Mallmart considerable networking costs.
Of course, this revision approach can be used to update all types of files, not just inventory files. For instance, consider the following two files A and A', where A' is a derived version of A.
1 2 3 01234567890123456789012345678901234 A : ABCDEFGHIJBLAHPQRSTUVPQRSTUV A': XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK
Here's a revision file to convert A into A'. Verify for yourself that it works:
+/XY/#0,12+/ETC/#13,13+:QQ/OK:
Notice that for the last Add command, we chose to use :
as the
delimiter character. Choosing /
when we wanted to add the text
QQ/OK
would be wrong, since +/QQ/OK/
is an Add
command to add QQ
(+/QQ/
) followed by an
O
, which is not a valid character to start a Copy or Add command.
The whole reason for our designing the Add command syntax to allow for a choice
of the delimiter character instead of requiring that it always be, say,
/
is so that the text to be added can contain a /
.
In this example, we chose :
as the delimiter character, but we
could have chosen any character (letter, punctuation, digit, space,
newline, etc.) other than Q
, /
, O
, or
K
, since those were characters in the text we wanted to add.
Now it may not be obvious, but there are actually many possible correct revision files that can be created to convert a first version of a file A into a second version A'. For example, here's another correct revision file for the example above:
+/XYA/#1,9+?BLETCH?#14,12+ZQQ/OKZ
Notice we chose to add the first A instead of copying it, we chose in the
second Add command to use ?
as the delimiter character (for no
particular reason), and for the last Add command, we chose Z
(again, for no particular reason).
And here's another correct, but not very useful, revision file:
+=XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK=
This revision file simply contains the entire contents of A' and instructs the incremental updater to write all 35 characters of this data out to the output file. It's a bad solution because it's actually larger in size than A'! It would be a little cheaper to just send A' instead of this revision file.
You have been hired by the Chief Frugality Officer (CFO) of Mallmart to create two functions:
createRevision
: This function takes the contents of two files,
A and A', and produces a revision file containing instructions for converting A
into A'. Each day, Mallmart will use this function at their corporate
headquarters to create a new revision file that they will then send
to all the devices.
revise
: This function takes the content of a file A and
a revision file, and will apply the instructions in the revision file to produce
a new file A'. Each day, every device will use this function to update
the previous day's inventory file.
The createRevision
function has the following interface:
void createRevision(istream& fold, istream& fnew, ostream& frevision);
You may name the parameters something else, but there must be three
parameters of the indicated types in the indicated order. (The
File I/O tutorial explains istream
and ostream
.) The parameters are
The revise
function has the following interface:
bool revise(istream& fold, istream& frevision, ostream& fnew);
You may name the parameters something else, but there must be three parameters of the indicated types in the indicated order. The parameters are
The revise
function returns true if the operation succeeds. If
it fails because the revision file is malformed (e.g., there's a character
other than an +
or #
where an instruction is
expected, or an offset or length is invalid), the function returns false.
If the function returns false, the caller can make no assumption about
what may or may not have been written to the output destination (so you're
free to, for example, just return false as soon as you detect an error,
without regard to what you may have already written).
Except for the output required to be written to its third parameter, each of
the two required functions must not cause anything to be written to
cout
or any file. They may write to cerr
anything
they want (presumably for debugging purposes); our testing program will
ignore anything you write to cerr
.
Your functions must be able to create revision files and apply revision files for files up to 100 kilobytes (102,400 bytes) in length, and if you wish, may be able to handle longer files as well.
The revision file that your implementation of createRevision
produces
must be in this format: A revision file is a sequence of zero or more
instructions, where an instruction is either
+ΔtextΔ
in which Δ represents a single character (the delimiter
character), text represents the text to add: a sequence of zero or
more characters not containing the delimiter character.
#offset,len
in which offset is a
sequence of one or more digits representing the (zero-based) offset in the
original file from which to start copying, and len is a
sequence of one or more digits representing the number of characters to
copy.
'\n'
) or a single carriage return character
('\r'
).
+//
)
or a Copy instruction with a len of 0 (such as, for an old file with
at least 13 characters, #12,0
) is valid and successful when
executed by revise
, even though it has no effect since no
characters are to be added or copied. Of course, it would be foolish for
createRevision
to produce such an instruction, since it
needlessly makes the revision file longer. If a Copy instruction specifies a
length of 0 and an offset beyond the last character in the old file (e.g.,
#12,0
for an old file with 12 or fewer characters), you may
either consider it valid (and have it do nothing) or you may consider it
invalid (and have revise
return false), your choice. Notice
that an instruction like #007,00001
has the same effect as
#7,1
).
The Do-nothing instructions exist solely to make working with revision files
easier if you create or examine them under Windows or macOS or Linux, whose
text editors may or may not append a newline or carriage return-linefeed at
the end of the file. To execute a Do-nothing instruction,
revise
does nothing. For example, a five-character revision
file containing #3,6•
(where • in this sentence
represents a newline character) causes the same effect that a four-character
revision file containing #3,6
would. Of course, it would be
foolish for createRevision
to produce a Do-nothing instruction,
since it needlessly makes the revision file longer.
Your program must build and run successfully under g32 and either Visual C++ or Xcode.
When run under g32fast, your createRevision
function must not run
for longer than 3 seconds for any file of up to 10 kilobytes and no more than
60 seconds for any file of up to 100 kilobytes. Your revise
function must not run for longer than 5 seconds for any file of up to 100
kilobytes.
Here's an example of a main routine that performs a test of the functions,
checking that the revision file produced by createRevision
from an older
file and a newer file can be used by revise
with the older
file to produce a file identical to the newer file.
#include <iostream> #include <fstream> #include <string> #include <algorithm> #include <iterator> #include <cassert> using namespace std; bool runtest(string oldName, string newName, string revisionName, string newName2) { if (revisionName == oldName || revisionName == newName || newName2 == oldName || newName2 == revisionName || newName2 == newName) { cerr << "Files used for output must have names distinct from other files" << endl; return false; } ifstream oldFile(oldName, ios::binary); if (!oldFile) { cerr << "Cannot open " << oldName << endl; return false; } ifstream newFile(newName, ios::binary); if (!newFile) { cerr << "Cannot open " << newName << endl; return false; } ofstream revisionFile(revisionName, ios::binary); if (!revisionFile) { cerr << "Cannot create " << revisionName << endl; return false; } createRevision(oldFile, newFile, revisionFile); revisionFile.close(); oldFile.clear(); // clear the end of file condition oldFile.seekg(0); // reset back to beginning of the file ifstream revisionFile2(revisionName, ios::binary); if (!revisionFile2) { cerr << "Cannot read the " << revisionName << " that was just created!" << endl; return false; } ofstream newFile2(newName2, ios::binary); if (!newFile2) { cerr << "Cannot create " << newName2 << endl; return false; } assert(revise(oldFile, revisionFile2, newFile2)); newFile2.close(); newFile.clear(); newFile.seekg(0); ifstream newFile3(newName2, ios::binary); if (!newFile) { cerr << "Cannot open " << newName2 << endl; return false; } if ( ! equal(istreambuf_iterator<char>(newFile), istreambuf_iterator<char>(), istreambuf_iterator<char>(newFile3), istreambuf_iterator<char>())) { cerr << newName2 << " is not identical to " << newName << "; test FAILED" << endl; return false; } return true; } int main() { assert(runtest("myoldfile.txt", "mynewfile.txt", "myrevisionfile.txt", "mynewfile2.txt")); cerr << "Test PASSED" << endl; }
(Note: We create the streams in binary mode (ios::binary
), which
is harmless under macOS and Linux, but necessary under Windows to make the
offsets work correctly, since Windows represents line ends for text files
with two characters even though it delivers just one newline character to
and from your C++ program.)
(Note: We close the output streams to ensure that all output to the file
is completed before we open it for input. Because createRevision
reached the end of file on the oldFile
input stream, we need
to clear the end of file condition on the stream and reset the stream so
that we read from the beginning again. We do the same for
newFile2
before comparing the files.)
Here's another main routine that tests the functions. It uses the types
istringstream
(an input source where the characters read come
from a string instead of a file), and ostringstream
(an output
destination where the characters written are later available as a string).
#include <iostream> #include <sstream> // for istringstream and ostringstream #include <string> #include <cassert> using namespace std; void runtest(string oldtext, string newtext) { istringstream oldFile(oldtext); istringstream newFile(newtext); ostringstream revisionFile; createRevision(oldFile, newFile, revisionFile); string result = revisionFile.str(); cout << "The revision file length is " << result.size() << " and its text is " << endl; cout << result << endl; oldFile.clear(); // clear the end of file condition oldFile.seekg(0); // reset back to beginning of the stream istringstream revisionFile2(result); ostringstream newFile2; assert(revise(oldFile, revisionFile2, newFile2)); assert(newtext == newFile2.str()); } int main() { runtest("There's a bathroom on the right.", "There's a bad moon on the rise."); runtest("ABCDEFGHIJBLAHPQRSTUVPQRSTUV", "XYABCDEFGHIJBLETCHPQRSTUVPQRSTQQ/OK"); cout << "All tests passed" << endl; }
Neither of the main routines above check that the revision file is validly
formatted, so do not rely on them alone to demonstrate your code's
correctness. You might have some misunderstanding that results in your
createRevision
function incorrectly creating an improperly
formatted revision file that won't work with our revise
function even though it works with yours because that misunderstanding also
caused you to write an incorrect revise
function that
accepted that incorrect revision file.
To check that a revision file is validly formatted, you can supply an old, a new, and a revision file to the Project 4 Revision Checker.
We will provide you with several different old and new files for you to test your program with, in a Windows version and a Mac and Linux version. For each of pair of old and new files, your program must create revision files that are at least 5% smaller than the new version of the file. We will test your program on other files as well, so it can't work just for these sample files. Below we show how big our solution's revision file is for each of these sample pairs of files we're providing:
Windows file sizes in bytes | Mac and Linux file sizes in bytes | |||||
old | new | our revision file | old | new | our revision file | |
Mallmart inventory | 106 | 142 | 92 | 105 | 141 | 91 |
Weird Al's Dr. Seuss | 533 | 606 | 65 | 510 | 579 | 65 |
War and Peace | 77285 | 77333 | 382 | 76634 | 76682 | 376 |
Some strange files | 100764 | 100764 | 8913 | 100440 | 100440 | 8913 |
You may be wondering how to write a function to create a revision file. This is definitely a non-trivial problem, and most senior job interview candidates can't come up with a viable algorithm (i.e., a conceptual approach) within a one-hour interview.
Here's a general, high-level algorithm that can be used to build a revision file. It's not ideal, however, and you'll have to come up with improvements of your own to make it work well:
Of course, this is a simple version of the algorithm, and many
improvements are possible. For example, this version of the algorithm
will result in only single-character add instructions: If the new file
contains the new text BLAH
(not present in the old text),
then the above algorithm would produce four Add instructions
(e.g., +/B/+/L/+/A/+/H/
) instead of a single (and more compact) Add
instruction that adds all four new characters at once (+/BLAH/
).
To obtain the highest score possible and create the smallest revision files, you'll have to improve on the algorithm substantially. Be creative! In addition, this naïve algorithm also has troubles with certain types of files (such as the strange*.txt sample files that we provide). You'll have to figure out why and make sure this doesn't cause problems in your solution.
For this project, we are constraining your implementation:
vector
, list
, stack
, and
queue
(and string
). You'll thus have to implement
your BST or hash table yourself instead of using a map
or
unordered_map
, but you can, say, use the list
type to help you implement a hash table. Although we're limiting
your use of containers from the library, you are free to use
library algorithms (e.g., sort
from
<algorithm>
). You are also free to use std::pair
from <utility>
and std::hash
from
<functional>
if you wish.
Compared to creating a revision file, the algorithm for applying a revision file is much easier. You may use the following functions to aid you in extracting instructions from the revision file if you wish:
bool getInt(istream& inf, int& n) { char ch; if (!inf.get(ch) || !isascii(ch) || !isdigit(ch)) return false; inf.unget(); inf >> n; return true; } bool getCommand(istream& inf, char& cmd, char& delim, int& length, int& offset) { if (!inf.get(cmd)) { cmd = 'x'; // signals end of file return true; } switch (cmd) { case '+': return inf.get(delim).good(); case '#': { char ch; return getInt(inf, offset) && inf.get(ch) && ch == ',' && getInt(inf, length); } case '\r': case '\n': return true; } return false; }
What you will turn in for this assignment is a zip file containing these files and nothing more:
One or more source files that contain the code for the two required functions and any supporting types and functions you use to implement them. Your source code should have helpful comments that explain any non-obvious code. You may include a main routine if you wish; if you do, we will rename it to something harmless, never call it, and instead use our own main routine.
How nice! You don't have to turn in a report file this time.
Your revise
function is worth 22% of the points for this
project.
The correctness of your createRevision
function is worth 54% of
the points, subject to these caveats: You will receive no points for any
test case that produces a revision file that is not at least 5% smaller than
the file that is the second argument to createRevision
, and you
will lose half of the correctness points you earn if you use any library
containers other than those we allowed. (In other words, do not use
library containers that are implemented using trees or hash tables, such as
set
, map
, unordered_set
,
unordered_map
, etc.)
The size of the revision files that createRevision
produces is worth
24% of the points. The smaller your revision files, the more points you'll earn.
For each test case for which you produce a correct revision file, of the
revision file size points possible for that case:
createRevision
.