CS 143
Project 1
The CS143 project may be completed individually or in teams of two, the choice is up to each student. Partnership rules consist of the following:
Please make sure to submit your work jointly - do not turn your project in twice, once for each partner.
In order to access the seas machine, lnxsrv03.seas.ucla.edu, you need a secure shell client. All SEASnet machines already have a secure shell client installed, so you simply need to run the client. If you need to access lnxsrv03.seas.ucla.edu from a machine that does not have a secure shell client, you can download a windows secure shell client from ftp://ftp.ssh.com/pub/ssh/SSHSecureShellClient-3.2.9.exe. Mac OS X or Unix machines have a secure shell client preinstalled. Simply type "ssh -l <userid> lnxsrv03.seas.ucla.edu" within your command line interface, where <userid> is your user id on lnxsrv03.seas.ucla.edu. You should receive your userid and password via email, if you have not please contact your TA.
If you find this first part of the project a snap, you're not overlooking something, and don't worry. It will get much more interesting as time goes by. The primary purpose of this first "warm-up" part is for us to provide you with a whole bunch of basic information, and to get everyone up-to-speed on our computing systems and the languages and tools we will be using. Those of you who have done some Web programming before, especially anyone who has used a DBMS behind a Web site, will find this project part nearly trivial. Those of you who haven't will find it merely straightforward. With all that said, please don't start at the last minute as with all programming and systems work, everything takes a bit of time, and unforeseen snafus do crop up.
In this portion of the project you will experiment with MySQL’s command-line interface and its bulk loader. You will create two database tables to contain information about actor information, load the table with a small amount of (real) data we are providing, and run a few commands to get familiar with MySQL.
In this part of the project you will experiment with three features of MySQL: constraints , indexes and triggers. To experiment with these features, you need to load a large dataset that we have prepared. The dataset has been downloaded from the Internet Movie Database (http://imdb.com) and contains information about roughly 30,000 actors and actresses and 4,000 recent movies.
Note that a lot of students will load, create and modify large tables concurrently for this project, so the system can be very slow. Please bear this in mind during your project. We strongly encourage you to start early on this project and finish your project before the system is too heavily loaded. We will not extend the project deadline simply because the system becomes very slow.
For Projects 1, 2 and 3, you will have to play with the following dataset:
The Movie table: It describes information regarding movies in the database. It specifies an identification number unique to each movie, the title of the movie, the year the movie was released, the MPAA rating given to the movie, and the production company that produced the movie. The schema of the Movie table is given as follow:
Movie
(
id:INT, //
Movie ID
title:VARCHAR(100), //
Movie title
year:INT, //
Release year
rating:VARCHAR(10), //
MPAA rating
company:VARCHAR(50) //
Production company
)
The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/movie.del".
The Actor table: It describes information regarding actors and actresses of movies. It specifies an identification number unique to all people (which is shared between actors and directors), the last name of the person, the first name of the person, the sex of the person, the date of birth of the person, and the date of death of the person if applicable. The schema of the Actor table is given as follow:
Actor
(
id:INT, //
Actor ID
last:VARCHAR(20), //
Last name
first:VARCHAR(20), //
First name
sex:VARCHAR(6), //
Sex of the actor
dob:DATE, //
Date of birth
dod:DATE //
Date of death
)
The load files for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/actor1.del", "/w/class.1/cs/cs143v/cs143vta/proj/data/actor2.del" and "/w/class.1/cs/cs143v/cs143vta/proj/data/actor3.del". Load all of the files once.
The Director table: It describes information regarding directors of movies. It specifies an identification number of the director, the last name of the director, the first name of the director, the date of birth of the director, and the date of death to the director if applicable. The schema of the Director table is given as follow:
Director
(
id:INT, //
Director ID
last:VARCHAR(20), //
Last name
first:VARCHAR(20), //
First name
dob:DATE, //
Date of birth
dod:DATE //
Date of death
)
Note that the id is unique to all people (which is shared
between actors and directors). That is, if a person is both an actor and a
director, the person will have the same id both in the Actor and the Director
table.
The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/director.del".
The MovieGenre table: It describes information regarding the genre of movies. It specifies the identification number of a movie, and the genre of that movie. The schema of the MovieGenre table is given as follow:
MovieGenre
(
mid:INT, //
Movie ID
genre:VARCHAR(20) //
Movie genre
)
The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/moviegenre.del".
The MovieDirector table: It describes the information regarding the movie and the director of that movie. It specifies the identification number of a movie, and the identification number of the director of that movie. The schema of the MovieDirector table is given as follow:
MovieDirector
(
mid:INT, //
Movie ID
did:INT //
Director ID
)
The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/moviedirector.del".
The MovieActor table: It describes information regarding the movie and the actor/actress of that movie. It specifies the identification number of a movie, and the identification number of the actor/actress of that movie. The schema of the MovieActor table is given as follow:
MovieActor
(
mid:INT, //
Movie ID
aid:INT, //
Actor ID
role:VARCHAR(50) //
Actor role in movie
)
The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/movieactor1.del" and "/w/class.1/cs/cs143v/cs143vta/proj/data/movieactor2.del". Load both files once.
The Review table: Later in Project 3, you will create a Web interface where the users of your system can add “reviews on a movie (similarly to Amazon product reviews). The Review table stores the reviews added in by the users in the following schema:
Review
(
name:VARCHAR(20), //
Reviewer name
time:TIMESTAMP, //
Review time
mid:INT, //
Movie ID
rating:INT, //
Review rating
comment:VARCHAR(500) //
Reviewer comment
)
Each tuple specifies the name of the reviewer, the timestamp of the review, the movie id, the rating that the reviewer gave the movie (i.e., x out of 5), and additional comments the reviewer gave about the movie.
Since this data will be added by your users, there is no load file.
The MaxPersonID and MaxMovieID Tables: Again, later in Project 3, you will also construct a Web interface where users can add new actor, director or movie information to the database. Once a user adds a new actor/director, your system should assign a new id to the actor/director and insert a tuple to the Actor/Director table. Similarly, your system should assign a new id to a new movie.
In order to assign a new id to, say, an actor/director, your system has to remember what was the largest id that it assigned to a person in the last insertion. The MaxPersonID table is used for this purpose, which has the following schema:
MaxPersonID
(
id:INT //
Max ID assigned to all persons
)
It is a one-tuple-one-attribute table and maintains the largest id number that the system has assigned to a person so far. Whenever a new actor/director is inserted, the system looks up this table, increases the id value of the tuple by one, and assigns the increased id value to the new actor/director. You may consider this MaxPersonID table as a “persistent variable that remembers its value even after your program stops.
The MaxMovieID is used similarly to assign a new id to a new movie inserted by the user. As a new movie is added to the database, the entry in this table is incremented and assigned to the new movie. The schema of the MaxMovieID table is given as follow:
MaxMovieID
(
id:INT //
Max ID assigned to movies
)
Create will have to create these two tables, and insert the tuple (69000) to MaxPersonID table and the tuple (4750) into the MaxMovieID table.
You will want to read or review the following article for this project part:
If the data in your MovieDatabase system at a given point in time represents a correct state of the real world, a number of constraints are expected to hold. Here are a few possible examples, some of which depend on a particular schema:
For this part of the project you should implement two constraints.
Note, these constraints must make sense in the real world, e.g. a movie id should be unique. For each constraint, write down its English description as a comment within your MySQL script, and implement the constraint. For each constraint that you define, demonstrate at least one database modifications that violates the constraint by attempting to run such an operation, and explain why the operation violates the constraint.
As discussed in class and in the textbook, an important technique for improving the performance of queries is to create indexes. An index on an attribute A of relation R allows the DBMS to quickly find all tuples in R matching a given value or range of values for attribute A (useful when evaluating selection or join conditions involving attribute A). An index can be created on any attribute of any relation, or on several attributes combined.
For this part of the project, you are given two queries, and need to create one more "interesting" query. You need to suggest at least two useful indexes for each query, and explain the reasoning behind your suggestions. Compare the running time of the query on each index by implementing one index at a time, and point out the index with the best performance, and the reason.
SELECT COUNT(*)
FROM Movie M, Actor A,
MovieActor MA
WHERE M.id = MA.mid AND
MA.aid = A.id AND
M.year = 2001 AND
A.last = 'Hanks'
SELECT COUNT(*)
FROM Actor
WHERE sex = 'Female'
AND
dob > '1992-01-01'
There are a few ways you can measure the running time of your query. One way is to capture MySQL timing information at the end of query execution. Another way is to use the UNIX time timing utility. Yet another way to measure the running time of your query is to use a Java application to call your query, and measure the time with the Java Timestamp class (You will learn more about JDBC in the next project).
Note that the query performance may vary depending on the system load. In order to get a reliable performance estimate, you may want to execute your query multiple times (at different times) and take an average.
Because our dataset is relatively small in today’s standard, and the machines are time-shared by many different processes, the results from your experiment may not match your initial expectation. You will gain full credit on this part of the project if you provide a reasonable explanation of why your suggested indexes will be helpful, and show that you have actually created and tested those indexes in our dataset. The result does not necessarily have to match with your initial expectation as long as your explanation is reasonable.
Finally, note that MySQL creates an index for every PRIMARY KEY or UNIQUE constraint in your table. If an attribute (or a set of attributes) is defined as PRIMARY KEY or UNIQUE, you do not need to create an index on the attribute.
In this part of the assignment we will create simple triggers. You will want to read or review the following article for this part of the project:
Triggers in MySQL, NOTE: Use an AFTER trigger and not BEFORE trigger for this part
Suppose that you want to keep track of updates that on several tables such as movie, actor etc. For this purpose you are to create a table called updates that will store name of the table that was updated and date when that table was updated. The schema of updates table is as follows:
updates
(
tname:VARCHAR(30), //
Name of the table updated
time:DATE, //
When was the table updated
)
For this excercise create a simple trigger on table movie and actor, such that when a new movie or actor is inserted, the updates table is updated with corresponding tuples.
To accommodate the emergencies that students may encounter, each team has 3-day grace period for late submission. The grace period can be used for any part of the project in the unit of one day. For example, a student may use 1-day grace period for part 1 and 2-day grace period for part 3. Note that even if a team submits a project 12 hours late, they would need to use a full day grace period to avoid late penalty. If your project is submitted late, we will automatically use the available days in your grace period unless you specifically mention otherwise in the README file.
Please remember that a project can be submitted at most 5 times and the total size of all files being submitted should not exceed 2MB. Steps for submitting your project electronically is as follows: