Project. Twisted Twitter/BitTorrent herd

Background

Wikipedia and its related sites are based on the Wikimedia Architecture, which uses a LAMP platform based on GNU/Linux, Apache, MySQL, and PHP, using multiple, redundant web servers behind a load-balancing virtual router for reliability and performance. For a brief introduction to the Wikimedia Architecture, please see Mark Bergsma, Wikimedia architecture (2007). For a more extensive discussion, please see Domas Mituzas, Wikipedia: Site internals, configuration, code examples and management issues (the workbook), MySQL Users Conference 2007.

While LAMP works fairly well for Wikipedia, let's assume that we are building a new Wikimedia-style service designed for news, where (1) updates to articles will happen far more often, (2) access will be required via various protocols, not just HTTP, and (3) clients will tend to be more mobile. In this new service the application server looks like it will be a bottleneck. From a software point of view our application will turn into too much of a pain to add newer servers (e.g., for access via cell phones, where the cell phones are frequently broadcasting their GPS locations). From a systems point of view the response time looks like it will too slow because the Wikimedia application server is a central bottleneck.

Your team has been asked to look into a different architecture called a "BitTorrent herd", where the multiple clients and servers communicate directly to each other as well as servers communicating via the core database and caches. In this approach, clients and applications servers communicate via the BitTorrent protocol, plain HTTP, and perhaps other protocols. These client and application server communications protocols are designed for rapidly-evolving data (ranging from small stuff such as GPS-based locations to larger stuff such as ephemeral video data) whereas the database server will still be used for more stable data that requires transactional semantics. In this model, clients and application servers are not necessarily fully connected. For example, you might have three application servers A, B, C such that A talks with B and C but B and C do not talk to each other. However, the idea is that if a user's cell phone posts its GPS location to any one of the application servers then the other servers will learn of the location after one or two interserver transmissions, without having to talk to the database.

You've been delegated to look into the Twisted event-driven networking framework as a candidate for replacing part or all of the Wikimedia platform for your application. Your boss thinks that this might be a good match for the problem, since Twisted's event-driven nature should allow an update to be processed and forwarded rapidly to other clients and servers in the herd, and since Twisted is already proven as an underlying technology for the BitTorrent client Deluge. However, he doesn't know how well Twisted will really work in practice for the proposed application. In particular, he wants to know how easy is it to write applications using Twisted, how maintainable and reliable those applications will be, and how well one can glue together new applications to existing ones such as Deluge; he's worried that Python's implementation of type checking, memory management, and multithreading may cause problems for larger applications. (The first BitTorrent software was written in Python, but BitTorrent moved to C++ starting with version 6.0, and he's worried that something similar might happen to this project.) Your boss wants you to dig beyond the hype and really understand the pros and cons of using Twisted. He suggests that as an exercise you write a simple and parallelizable proxy for the Twitter API, and think about how to connect that proxy with Deluge.

Twitter is a microblogging service that lets users send and receive short messages, called tweets. It is written in a combination of Ruby on Rails and Scala, but interfaces to it are available from many other languages.

Assignment

Do some research on Twisted as a potential framework for this kind of application, and on Deluge as potential use of that framework. Your research should include an examination of the Twisted and Deluge source code and documentation, and a small prototype or example code of your own that demonstrates whether Twisted would be an effective way to implement a Twitter/BitTorrent herd. Please base your research on Twisted 11.0.0 (dated 2011-04-03) and Deluge 1.3.3 (dated 2011-07-22), even if newer versions come out before the due date; that way we'll all be on the same page. We suggest using Python 2.7.2, as Twisted 11.0.0 is designed for Python 2.7.x and it's not likely to work with Python 3.

Your prototype should consist of a server that accepts TCP connections from clients that emulate mobile devices with IP addresses and DNS names. A client should be able to send its location to the server by sending a message like this:

AT kiwi.cs.ucla.edu 1321166164.561291399 +34.0650-118.4434

The first field AT is name of the command where the client tells the server where it is. Its operands are the client ID (in this case, kiwi.cs.ucla.edu), the client's idea of when it sent the message, and the client's idea of its location. A client ID may be any string of non-white-space characters. (A white space character is space, tab, carriage return, newline, formfeed, or vertical tab.) The time stamp is expressed in Unix time, which consists of seconds and nanoseconds since 1970-01-01 00:00:00 UTC, ignoring leap seconds; for example, 1321166164.561291399 stands for 2011-11-13 06:36:04.561291399 UTC. The location is the latitude and longitude in decimal degrees using ISO 6709 notation. Fields are separated by one or more white space characters and do not contain white space; ignore any leading or trailing white space on the line.

The server should respond to clients with a message like this:

AT Jones +0.563873386 kiwi.cs.ucla.edu 1321166164.561291399 +34.0650-118.4434

where AT is the name of the response, Jones is the ID of the server that got the message from the client (using the same syntax as for client IDs), +0.563873386 is the difference between the server's idea of when it got the message from the client and the client's time stamp, and the remaining fields are a copy of the AT data. In this example (the normal case), the server's time stamp is greater than the client's so the difference is positive, but it might be negative if there was enough clock skew in that direction.

Clients can also query for tweets near other clients' locations, with a query like this:

WHATSAT kiwi.cs.ucla.edu 3 2

The arguments to a WHATSAT message are the name of another client (e.g., kiwi.cs.ucla.edu), a radius (in kilometers) from the client (e.g., 3), and an upper bound on the number of tweets to receive from Twitter senders within that radius of the client (e.g., 2). The upper bound must be at most 100, since that's all that Twitter supports conveniently.

The server responds with a AT message in the same format as before, giving the most recent time and location reported by the client. Following the AT message is a JSON-format message, exactly in the same format that Twitter gives, followed by a newline. Here is an example (this output is line wrapped to print, but the actual output contains no newlines, except for the newline after the AT line, and the newline at the end):

AT Jones +0.563873386 kiwi.cs.ucla.edu 1321166164.561291399 +34.0650-118.4434
{"completed_in":0.069,"max_id":135817493641035776,"max_id_str":"135817493641035776","next_page":"?page=2&max_id=135817493641035776&q=&geocode=%2034.0650%2C-118.4434%2C3km&rpp=2","page":1,"query":"","refresh_url":"?since_id=135817493641035776&q=&geocode=%2034.0650%2C-118.4434%2C3km","results":[{"created_at":"Sun, 13 Nov 2011 20:33:08 +0000","from_user":"dylanschoenbaum","from_user_id":177233394,"from_user_id_str":"177233394","geo":null,"location":"Venice, CA","id":135817493641035776,"id_str":"135817493641035776","iso_language_code":"en","metadata":{"result_type":"recent"},"profile_image_url":"http://a2.twimg.com/profile_images/1488869070/Photo_68_normal.jpg","source":"<a href="http://twitter.com/#!/download/iphone" rel="nofollow">Twitter for iPhone</a>","text":"RT @kassemg: I wonder what Morgan Freeman's face says in Braille.","to_user_id":null,"to_user_id_str":null},{"created_at":"Sun, 13 Nov 2011 20:33:08 +0000","from_user":"daniel1179","from_user_id":22851979,"from_user_id_str":"22851979","geo":null,"location":"Sherman Oaks","id":135817492542136320,"id_str":"135817492542136320","iso_language_code":"en","metadata":{"result_type":"recent"},"profile_image_url":"http://a3.twimg.com/profile_images/802179815/n1471245031_30096443_4920_normal.jpg","source":"<a href="http://twitter.com/devices" rel="nofollow">txt</a>","text":"Thank u!","to_user_id":null,"to_user_id_str":null}],"results_per_page":2,"since_id":0,"since_id_str":"0"}

Servers should be able to handle multiple clients at the same time. A single client should be able to connect to the same server with multiple connections, and these connections should be able to operate simultaneously and independently.

Servers should respond to invalid commands with a line that contains a question mark (?), a space, and then a copy of the invalid command.

The server should log its input and output into a file, using a format of your design. The logs should also contain notices of new and dropped connections from clients. You can use the logs' data in your reports.

Investigate how hard it would be to integrate your prototype with Deluge. Look at Deluge's source code and find the place or the places that you'd use for your integration. If possible, do the integration; if not, sketch how you'd go about it, using examples from both your code and from Deluge's code (and if necessary, from Twisted's code).

Write a report that summarizes your research, recommends whether Twisted is a suitable framework for this kind of application, and justifies your recommendation. Describe any problems you ran into. Your report should directly address your boss's worries about Python's type checking, memory management, and multithreading, compared to a Java-based approach to this problem.

Prepare and give a brief oral presentation of your research, and present it in discussion section. Please make sure your talk fits in the time allotted, as we don't want to cut you off abruptly.

Your research, report, and presentation should focus on language-related issues. For example, how easy is it to write Twisted-based programs that run and exploit server herds? What are the performance implications of using Twisted? Don't worry about nontechnical issues like licensing, or about management issues like software support and retraining programmers.

Style issues

Your report and talk are an important part of this assignment: they are not mere afterthoughts. Please see Resources for oral presentations and written reports for advice on generating high-quality reports and presentations.

Your report should use standard technical academic style, and should have a title, abstract, introduction, body, recommendations/conclusions, references, and any other sections necessary. References should contain active DOIs (preferable; see Resources) or URLs for the cited sources if available. Limit your report to at most ten pages. Use the USENIX style, which uses a two-column format with 10-point font for most of the text, on an 8½"×11" page; an example of the output format and an example student paper are available.

Your paper is not expected to be just like that example student paper! That was written by a graduate student, she was writing a conference paper describing months of full-time research, and the paper is too long for us. It's merely an example of technical style and layout.

Research mechanics

If you need to run servers on SEASnet to do your research, please let the TA know how many TCP ports you need, and we will allocate them for you. Please do not use TCP ports at random, as that might collide with other students' uses.

You can grab a copy of the Twisted and Deluge source code from the respective web sites. If these site are down you can also get copies from the directory ~eggert/src/tarpit/ on SEASnet. For casual inspection, untarred copies are under ~eggert/src/.

You can run Twisted on SEASnet by prepending /usr/local/cs/bin to your PATH in the usual way. For example, you can run the chat server (taken from the Twisted code examples) as follows:

# Replace '12000' with your TCP port number as allocated by the TA.
sed 's/1025/12000/' /u/cs/fac/eggert/src/Twisted-11.0.0/doc/core/examples/chatserver.py >chatserver.py

# We recommend the -n option to forestall runaway servers.
twistd -n -y chatserver.py

# Now you can telnet to port 12000, from another session
# on the same host, to exercise the chat server.
# A log should appear on your screen.

# Type control-C to terminate the chat server.

If you can think of some similar project you'd like to do, that resembles the current project but is cooler, ask the T.A. for permission to do the similar project, and then go for it.

Submitting your work

Submit a file named report.pdf containing a copy of your paper in PDF form.

Submit a file named talk.pdf containing a copy of your presentation's visual aids in PDF form. Also submit a file containing the original format of the aids (talk.odp, talk.ppt, etc.), so that we can have them ready to go for your talk all using the same laptop, and not have to waste time connecting your laptop to the projector. Please consult with the TA well before the talk if your visual aids require special software (e.g., if you have a prototype server that you wish to demonstrate).

Submit any other supporting work (e.g., your source code) in a gzipped tar file named project.tgz.

Reference

In the following references, the O'Reilly sources are Safari online books, so their contents should be viewable by anybody on the UCLA campus.


© 2008–2011 Paul Eggert. See copying rules.
$Id: pr.html,v 1.36 2011/11/14 00:22:16 eggert Exp $