Project 2. Web page comparison server

Background

Your boss at Jerry-built Computer Networks Corp. (JCN) complains that JCN is maintaining millions of web pages, many of which are nearly copies, and that JCN should attempt to find the web pages that are nearly copies and work to unify them. Your overall job will be to write an application that looks through JCN's web pages, and finds pages that are similar. Your initial assignment is to write a simple comparison server that looks for RURLs (as defined in Homework 3) common to two web pages. Your boss also wants you to think of ways to improve your server so that it is useful even to nontechnical users.

Assignment

Write a web server that repeatedly does the following:

  1. Accept a query that uses a URL of the form <http://host:port/compare-rurls.rpy?old=OLDURL&new=NEWURL>.
  2. For each such query, fetch the resources OLDRESOURCE and NEWRESOURCE located by OLDURL and NEWURL, respectively.
  3. Generate a text/plain (not text/html) response that contains three output sections, separated by empty lines. The response should contain exactly two empty lines, which separate the three output sections. Each section should be free of duplicate lines, and each section's lines should be sorted in ascending ASCII order. If a resource cannot be retrieved, the corresponding section should be a single line containing the HTTP status code as a decimal integer (e.g., 404 for resource not found), and the last section should be empty.
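
The formatting rules in step 3 can be sketched in plain Python, independently of Twisted. This is only an illustration under assumptions: the helper names format_section and format_response are hypothetical, the sample lines are made up, and the sketch assumes that every line (including the last) ends with a newline, which the assignment text does not spell out.

```python
def format_section(lines):
    """Drop duplicate lines and sort the rest in ascending ASCII order."""
    # Python compares str values by code point, which matches ASCII order
    # for ASCII-only lines (so "Z" sorts before "a").
    return sorted(set(lines))


def format_response(sections):
    """Join the sections so that one empty line separates each adjacent pair."""
    # Assumption: every line is terminated by a newline.
    blocks = ["".join(line + "\n" for line in format_section(s))
              for s in sections]
    return "\n".join(blocks)


# Hypothetical data: the second resource could not be retrieved (status 404),
# so its section is just the status code and the last section is empty.
example = format_response([["b/c.html", "a.html", "b/c.html"], ["404"], []])
print(repr(example))
```

With three sections, joining on a single newline always yields exactly two empty lines, even when a section is empty.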

Implement your web server with Twisted, an event-driven networking framework written in Python. Twisted documentation is available. You can find a copy of the Twisted 1.1.0 source code and documentation in ~cs131ta/src/Twisted-1.1.0 on SEASnet. A compressed tar image is also available in ~cs131ta/src/tarpit/Twisted-1.1.0.tar.gz and at Twisted Matrix Laboratories.

You may find it necessary to modify Twisted. If so, minimize the number of changes to the existing source code. For example, instead of modifying an existing class, use a new subclass of your own instead, whenever possible.

An .rpy file is a resource script. It is like a normal Python .py file, with one extra restriction: it must define a global variable named resource whose value is an instance of a subclass of twisted.web.resource.Resource. By default, the Twisted web server renders that resource in response to a URL naming the resource script. Please see the twisted.web.resource.Resource.render API for a few more details, but the full story is best discovered by reading the Twisted source code mentioned above. A simple example .rpy file can be found in ~cs131ta/twisted/public-html/test.rpy on SEASnet.

OLDURL and NEWURL may need to be escaped so that the overall query is a valid URI. For example, if either URL contains #, that character must be represented using the standard escape sequence %23, since an unescaped # would be interpreted as the start of a fragment following the overall query. Please see Internet RFC 1808 §2.2 for details.
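
When writing test queries by hand, it may help to let Python do the escaping. The following is a minimal sketch using the standard library's urllib.parse.quote; the helper name build_query, the example.com URLs, and the choice to leave : and / unescaped are assumptions made for illustration, not part of the assignment.

```python
from urllib.parse import quote


def build_query(old_url, new_url, base="http://host:port/compare-rurls.rpy"):
    """Build a compare-rurls query, escaping characters such as # ? & =."""
    # safe=":/" leaves ':' and '/' readable; everything else that is not an
    # unreserved character (for example '#', '?', '&', '=') is %-escaped.
    return "%s?old=%s&new=%s" % (
        base,
        quote(old_url, safe=":/"),
        quote(new_url, safe=":/"),
    )


# Hypothetical URLs, one of which contains a fragment.
print(build_query("http://example.com/a.html#top", "http://example.com/b.html"))
```

The # in the first URL comes out as %23, so it is no longer mistaken for the start of a fragment of the overall query.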

Escapes can also be used to represent arbitrary characters within URLs. For example, to compare http://www.cs.ucla.edu/ to http://www.seas.ucla.edu/seasnet/, you might use either of the following queries:

http://host:port/compare-rurls.rpy?old=http://www.cs.ucla.edu/&new=http://www.seas.ucla.edu/seasnet/
http://host:port/compare-rurls.rpy?old=http%3a%2f%2Fwww.cs.ucla.edu%2f&new=%68ttp%3A/%2fwww.seas.ucla.edu%2fseasnet%2f
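
To check that the two queries above are equivalent, one can decode their query strings with the standard library. This sketch uses urllib.parse.parse_qs; the helper name decode_query is hypothetical, and only the part of each URL after the ? is examined.

```python
from urllib.parse import parse_qs

# The query strings (the text after '?') of the two example URLs.
plain = "old=http://www.cs.ucla.edu/&new=http://www.seas.ucla.edu/seasnet/"
escaped = ("old=http%3a%2f%2Fwww.cs.ucla.edu%2f"
           "&new=%68ttp%3A/%2fwww.seas.ucla.edu%2fseasnet%2f")


def decode_query(query):
    """Return the decoded (old, new) URL pair from a query string."""
    args = parse_qs(query)  # maps each name to a list of decoded values
    return args["old"][0], args["new"][0]


print(decode_query(plain))
print(decode_query(plain) == decode_query(escaped))
```

Both forms decode to the same pair of URLs, so a server should treat them identically.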

Suggestions for testing

Make sure that your web server outputs exactly the correct text, with no leading or trailing white space on any line, as your server will be tested by a persnickety script that considers spaces to be important.

To figure out which SEASnet host you're running on, type the shell command hostname. If it outputs the string landfair, you are running on landfair.seas.ucla.edu. To simplify exposition, the rest of this section assumes that you're running on landfair.

To avoid clashes with other people's web servers, use the port number 14000 + (your student ID modulo 424). For example, if your student ID is 123456789, use port 14285, because 123456789 modulo 424 is 285 (as you can easily confirm by typing 14000 + 123456789%424 into an interactive Python session). Specify the port with the --port=14285 option when generating your web server with mktap as described below.
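
The port arithmetic can be checked with a couple of lines of Python. The function name port_for is hypothetical; the constants 14000 and 424 and the sample student ID come from the paragraph above.

```python
def port_for(student_id):
    """Return the port: 14000 plus the student ID modulo 424."""
    # '%' binds more tightly than '+', so no parentheses are needed.
    return 14000 + student_id % 424


print(port_for(123456789))  # 14285
```

Because the remainder is always between 0 and 423, the resulting port always falls in the range 14000 through 14423.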

Here's a recipe for getting started with Twisted. Execute the following shell commands:

   # Set environment variables appropriately.
   setenv PYTHONPATH /u/cs/class/cs131/cs131ta/src/Twisted-1.1.0
   setenv PATH /u/cs/class/cs131/cs131ta/src/Twisted-1.1.0/bin:$PATH

   # Make a new directory and build a web server that will run in it.
   mkdir twisted
   cd twisted
   mktap web --path=public-html --port=14285

   # Put your web pages in this directory.  Here is a sample:
   mkdir public-html
   cp ~cs131ta/twisted/public-html/test.rpy public-html/test.rpy

   # Run your web server.  Always use "-n", for debug mode.
   twistd -n -f web.tap
   # You should see some log messages starting with "Log opened"
   # and ending with "set uid/gid".

   # Now you can test your web server.
   # You should see more log messages as you test.

   # Type Control-C to exit your web server.

To test whether your server is working with the sample web pages mentioned above, use a browser to visit <http://landfair.seas.ucla.edu:14285/test.rpy?old=http://www.cs.ucla.edu/&new=http://www.seas.ucla.edu/seasnet/>. You should get a Test Web Page that lists its URL, URI, and query arguments, among other things.

Submitting your work

Submit a file named pr2.tgz. It should be a gzipped tar file containing all the source files that are needed to build and run your project. One of these files must be called written-by.txt and must contain your name and student ID. For example, you might use the following command to create pr2.tgz:

   cd $HOME/twisted
   tar cf - written-by.txt *.py public-html | gzip -9 >pr2.tgz

This causes tar to generate a single output stream containing the named files; the | is the Unix pipe symbol, which causes tar's output to be sent as input to gzip. Don't forget the - (separated by spaces) after the cf.

Before submitting pr2.tgz, test it by running the following commands:

   setenv PATH /u/cs/class/cs131/cs131ta/src/Twisted-1.1.0/bin:$PATH

   mkdir testdirectory
   cd testdirectory
   # Substitute your own port for "14285".
   mktap web --path=public-html --port=14285
   gunzip <../pr2.tgz | tar xf -
   setenv PYTHONPATH `pwd`:/u/cs/class/cs131/cs131ta/src/Twisted-1.1.0
   twistd -n -f web.tap

   # [Test your web server here.]

   # When testing is done, clean up as follows.
   # First, type Control-C.
   # Then, remove your test copy as follows:
   cd ..
   rm -fr testdirectory

Optional extra work

For extra credit, you can do some extra work for this project. If you'd like to do this, please send email to Michael proposing the extra work that you'd like to do. Your proposal should be in the form of ASCII text (not HTML or PostScript or anything fancy; just ASCII) and should be in Michael's mailbox by noon Friday, 2003-11-21. Your proposal should be succinct but should have enough detail so that we'll know how to test your extension. Michael will respond with his estimate of how many extra points your proposal will be worth. You may submit up to three proposals and may implement any combination of the accepted proposals, so long as their combined value is at least 20 points extra.

If you submit nontrivial work for extra credit, you may work together with one other person in this class (from either section). This is an exception to the ordinary class rule that all work submitted must be your own. The extra work is defined to be "nontrivial" if we evaluate your extra-work proposal(s) to be worth at least 25 points extra.

In your joint submission you must identify which part each coworker was primarily responsible for; one way to do this is to split your work by source files, and to put a comment "# Written by YOUR NAME." near the start of each source file. The file written-by.txt should list both coworkers' names and student IDs.

Here are some sketches of ideas for extra work, along with our guess of how many extra points will be awarded for each idea, over and above the 100 points awarded for the base implementation (your mileage may vary depending on your exact proposal):


© 2003 Paul Eggert. See copying rules.
$Id: pr2.html,v 1.11 2003/11/18 19:32:57 eggert Exp $