CS 143

Project 1

Partners

The CS143 project may be completed individually or in teams of two, the choice is up to each student. Partnership rules consist of the following:

An identical amount of work is expected and the same grading scale is used for individual and team projects.
If you work in a team, choose your partner carefully. Teams are permitted to "divorce" at any time during the course, and individual students may choose to team up as the project progresses, however students from divorced teams may not form new teams or join other teams.
Both partners in a team will receive exactly the same grade for each project part submitted jointly.
Note: Students are encouraged to install MySQL locally on their laptops to avoid possible downtime with SEAS and other issues.

Please make sure to submit your work jointly - do not turn your project in twice, once for each partner.

Logging in to ugrad.seas.ucla.edu

In order to access the seas machine, lnxsrv03.seas.ucla.edu, you need a secure shell client. All SEASnet machines already have a secure shell client installed, so you simply need to run the client. If you need to access lnxsrv03.seas.ucla.edu from a machine that does not have a secure shell client, you can download a windows secure shell client from ftp://ftp.ssh.com/pub/ssh/SSHSecureShellClient-3.2.9.exe. Mac OS X or Unix machines have a secure shell client preinstalled. Simply type "ssh -l <userid> lnxsrv03.seas.ucla.edu" within your command line interface, where <userid> is your user id on lnxsrv03.seas.ucla.edu. You should receive your userid and password via email, if you have not please contact your TA.

Scope

If you find this first part of the project a snap, you're not overlooking something, and don't worry. It will get much more interesting as time goes by. The primary purpose of this first "warm-up" part is for us to provide you with a whole bunch of basic information, and to get everyone up-to-speed on our computing systems and the languages and tools we will be using. Those of you who have done some Web programming before, especially anyone who has used a DBMS behind a Web site, will find this project part nearly trivial. Those of you who haven't will find it merely straightforward. With all that said, please don't start at the last minute as with all programming and systems work, everything takes a bit of time, and unforeseen snafus do crop up.

Part A

In this portion of the project you will experiment with MySQL’s command-line interface and its bulk loader. You will create two database tables to contain information about actor information, load the table with a small amount of (real) data we are providing, and run a few commands to get familiar with MySQL.

Step 0: A class account at class has been assigned to you. Please see your TA if you do not have an account yet. Once you have your class account, log onto lnxsrv03.seas.ucla.edu. If you are relatively new to Unix, you may first want to read the documents Unix Tutorial . Once you feel comfortable with the Unix environment, read the document Introduction to MySQL . Try some of the examples in the document, and experiment with the help command.
Step 1: Two comma separated value (.csv) files actors-short.csv and actresses-short.csv with actor and actress information are located at /u/cs/class/cs143v/cs143vta/proj1/data/actors-short.csv and /u/cs/class/cs143v/cs143vta/proj1/data/actresses-short.csv. Based on these files, create tables called Actors and Actresses in MySQL into which you will load the data. When you load the files, please load from the provided file directly. Please do not copy the file to your local directory. Both tables for actors and actresses should include attributes of Name (VARCHAR), Movie (VARCHAR), Year (INTEGER), and Role (VARCHAR). Remember to specify the number of characters in your VARCHAR declaration. Load the data files into your MySQL tables.
Step 2: Retrieve some loaded data from the file actresses-short.csv. Write a query that returns the answer to this question: “Give me the names of all the actresses in the movie ‘Chicago’.” Feel free to experiment with other interesting queries.
Step 3: Learn to use joins to find information spanned across different tables. Write a query that returns the answer to this question: “Give me all the actors and actresses who have acted in more than one movie.” The basic approach to this is to join the Actors/Actresses table with itself and taking the union of actors and actresses. The AS statement can be useful to differentiate the instances of the Actors/Actresses table in this query.
Step 4: More SQL queries. Write a query that returns the answer to this question: “Give me all the actresses and the number of movies they acted in, in the decreasing order of the number of movies.”
Step 5: Create a script named P1A that shows every one of the steps in this part of the project. You can use the ‘--' tag to make comments within your SQL script. Make sure you give adequate comments documenting each part of each of the steps. Execute the script and save all outputs in a file. Information on creating scripts and saving its output is provided in MySQL under Executing SQL From a File and Recording a MySQL Session In a File Make sure the name of you script file is P1A.

Part B

In this part of the project you will experiment with three features of MySQL: constraints , indexes and triggers. To experiment with these features, you need to load a large dataset that we have prepared. The dataset has been downloaded from the Internet Movie Database (http://imdb.com) and contains information about roughly 30,000 actors and actresses and 4,000 recent movies.

Note that a lot of students will load, create and modify large tables concurrently for this project, so the system can be very slow. Please bear this in mind during your project. We strongly encourage you to start early on this project and finish your project before the system is too heavily loaded. We will not extend the project deadline simply because the system becomes very slow.

Part 0: Description of New Dataset and Related Tables

For Projects 1, 2 and 3, you will have to play with the following dataset:

The Movie table: It describes information regarding movies in the database. It specifies an identification number unique to each movie, the title of the movie, the year the movie was released, the MPAA rating given to the movie, and the production company that produced the movie. The schema of the Movie table is given as follow:

Movie

(

id:INT, // Movie ID

title:VARCHAR(100), // Movie title

year:INT, // Release year

rating:VARCHAR(10), // MPAA rating

company:VARCHAR(50) // Production company

)

The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/movie.del".

The Actor table: It describes information regarding actors and actresses of movies. It specifies an identification number unique to all people (which is shared between actors and directors), the last name of the person, the first name of the person, the sex of the person, the date of birth of the person, and the date of death of the person if applicable. The schema of the Actor table is given as follow:

Actor

(

id:INT, // Actor ID

last:VARCHAR(20), // Last name

first:VARCHAR(20), // First name

sex:VARCHAR(6), // Sex of the actor

dob:DATE, // Date of birth

dod:DATE // Date of death

)

The load files for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/actor1.del", "/w/class.1/cs/cs143v/cs143vta/proj/data/actor2.del" and "/w/class.1/cs/cs143v/cs143vta/proj/data/actor3.del". Load all of the files once.

The Director table: It describes information regarding directors of movies. It specifies an identification number of the director, the last name of the director, the first name of the director, the date of birth of the director, and the date of death to the director if applicable. The schema of the Director table is given as follow:

Director

(

id:INT, // Director ID

last:VARCHAR(20), // Last name

first:VARCHAR(20), // First name

dob:DATE, // Date of birth

dod:DATE // Date of death

)

Note that the id is unique to all people (which is shared between actors and directors). That is, if a person is both an actor and a director, the person will have the same id both in the Actor and the Director table.

The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/director.del".

The MovieGenre table: It describes information regarding the genre of movies. It specifies the identification number of a movie, and the genre of that movie. The schema of the MovieGenre table is given as follow:

MovieGenre

(

mid:INT, // Movie ID

genre:VARCHAR(20) // Movie genre

)

The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/moviegenre.del".

The MovieDirector table: It describes the information regarding the movie and the director of that movie. It specifies the identification number of a movie, and the identification number of the director of that movie. The schema of the MovieDirector table is given as follow:

MovieDirector

(

mid:INT, // Movie ID

did:INT // Director ID

)

The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/moviedirector.del".

The MovieActor table: It describes information regarding the movie and the actor/actress of that movie. It specifies the identification number of a movie, and the identification number of the actor/actress of that movie. The schema of the MovieActor table is given as follow:

MovieActor

(

mid:INT, // Movie ID

aid:INT, // Actor ID

role:VARCHAR(50) // Actor role in movie

)

The load file for the table is located at "/w/class.1/cs/cs143v/cs143vta/proj/data/movieactor1.del" and "/w/class.1/cs/cs143v/cs143vta/proj/data/movieactor2.del". Load both files once.

The Review table: Later in Project 3, you will create a Web interface where the users of your system can add “reviews on a movie (similarly to Amazon product reviews). The Review table stores the reviews added in by the users in the following schema:

Review

(

name:VARCHAR(20), // Reviewer name

time:TIMESTAMP, // Review time

mid:INT, // Movie ID

rating:INT, // Review rating

comment:VARCHAR(500) // Reviewer comment

)

Each tuple specifies the name of the reviewer, the timestamp of the review, the movie id, the rating that the reviewer gave the movie (i.e., x out of 5), and additional comments the reviewer gave about the movie.

Since this data will be added by your users, there is no load file.

The MaxPersonID and MaxMovieID Tables: Again, later in Project 3, you will also construct a Web interface where users can add new actor, director or movie information to the database. Once a user adds a new actor/director, your system should assign a new id to the actor/director and insert a tuple to the Actor/Director table. Similarly, your system should assign a new id to a new movie.

In order to assign a new id to, say, an actor/director, your system has to remember what was the largest id that it assigned to a person in the last insertion. The MaxPersonID table is used for this purpose, which has the following schema:

MaxPersonID

(

id:INT // Max ID assigned to all persons

)

It is a one-tuple-one-attribute table and maintains the largest id number that the system has assigned to a person so far. Whenever a new actor/director is inserted, the system looks up this table, increases the id value of the tuple by one, and assigns the increased id value to the new actor/director. You may consider this MaxPersonID table as a “persistent variable that remembers its value even after your program stops.

The MaxMovieID is used similarly to assign a new id to a new movie inserted by the user. As a new movie is added to the database, the entry in this table is incremented and assigned to the new movie. The schema of the MaxMovieID table is given as follow:

MaxMovieID

(

id:INT // Max ID assigned to movies

)

Create will have to create these two tables, and insert the tuple (69000) to MaxPersonID table and the tuple (4750) into the MaxMovieID table.

Constraints

You will want to read or review the following article for this project part:

Constraints in MySQL

If the data in your MovieDatabase system at a given point in time represents a correct state of the real world, a number of constraints are expected to hold. Here are a few possible examples, some of which depend on a particular schema:

Every movie has a unique identification number.
Every movie must have a title.
Every actor must have a date of birth and it should be less than the date of death or date of death should be unspecified.

For this part of the project you should implement two constraints.

A primary key constraint and
A referential integrity constraint

Note, these constraints must make sense in the real world, e.g. a movie id should be unique. For each constraint, write down its English description as a comment within your MySQL script, and implement the constraint. For each constraint that you define, demonstrate at least one database modifications that violates the constraint by attempting to run such an operation, and explain why the operation violates the constraint.

Indexes

As discussed in class and in the textbook, an important technique for improving the performance of queries is to create indexes. An index on an attribute A of relation R allows the DBMS to quickly find all tuples in R matching a given value or range of values for attribute A (useful when evaluating selection or join conditions involving attribute A). An index can be created on any attribute of any relation, or on several attributes combined.

For this part of the project, you are given two queries, and need to create one more "interesting" query. You need to suggest at least two useful indexes for each query, and explain the reasoning behind your suggestions. Compare the running time of the query on each index by implementing one index at a time, and point out the index with the best performance, and the reason.

SELECT COUNT(*)

FROM Movie M, Actor A, MovieActor MA

WHERE M.id = MA.mid AND

MA.aid = A.id AND

M.year = 2001 AND

A.last = 'Hanks'

SELECT COUNT(*)

FROM Actor

WHERE sex = 'Female' AND

dob > '1992-01-01'

There are a few ways you can measure the running time of your query. One way is to capture MySQL timing information at the end of query execution. Another way is to use the UNIX time timing utility. Yet another way to measure the running time of your query is to use a Java application to call your query, and measure the time with the Java Timestamp class (You will learn more about JDBC in the next project).

Note that the query performance may vary depending on the system load. In order to get a reliable performance estimate, you may want to execute your query multiple times (at different times) and take an average.

Because our dataset is relatively small in today’s standard, and the machines are time-shared by many different processes, the results from your experiment may not match your initial expectation. You will gain full credit on this part of the project if you provide a reasonable explanation of why your suggested indexes will be helpful, and show that you have actually created and tested those indexes in our dataset. The result does not necessarily have to match with your initial expectation as long as your explanation is reasonable.

Finally, note that MySQL creates an index for every PRIMARY KEY or UNIQUE constraint in your table. If an attribute (or a set of attributes) is defined as PRIMARY KEY or UNIQUE, you do not need to create an index on the attribute.

Triggers

In this part of the assignment we will create simple triggers. You will want to read or review the following article for this part of the project:

Triggers in MySQL, NOTE: Use an AFTER trigger and not BEFORE trigger for this part

Suppose that you want to keep track of updates that on several tables such as movie, actor etc. For this purpose you are to create a table called updates that will store name of the table that was updated and date when that table was updated. The schema of updates table is as follows:

updates

(

tname:VARCHAR(30), // Name of the table updated

time:DATE, // When was the table updated

)

For this excercise create a simple trigger on table movie and actor, such that when a new movie or actor is inserted, the updates table is updated with corresponding tuples.

Late Submission Policy

To accommodate the emergencies that students may encounter, each team has 3-day grace period for late submission. The grace period can be used for any part of the project in the unit of one day. For example, a student may use 1-day grace period for part 1 and 2-day grace period for part 3. Note that even if a team submits a project 12 hours late, they would need to use a full day grace period to avoid late penalty. If your project is submitted late, we will automatically use the available days in your grace period unless you specifically mention otherwise in the README file.

Electronic Submission of Projects

Please remember that a project can be submitted at most 5 times and the total size of all files being submitted should not exceed 2MB. Steps for submitting your project electronically is as follows:

Log into your account at ugrad.seas.ucla.edu (NOT the MySQL host that you were using. Please use the same login credentials for ugrads.seas.ucla.edu as you were using for lnxsrv03.seas.ucla.edu
Create a directory that will contain the files to be submitted. We refer to this directory as the submission directory.
Copy all submission files (MySQL script files, MySQL output, screen capture images and Java source codes) to the submission directory. In addition, create a README file in your submission directory. The README file should contain you and your partner's name, sid, email and a brief description of each file that you submit.
Run the submit script by issuing "submit" The script will walk you through a few questions. After you answer all the questions, it'll copy your work to TA for grading.
If all goes well, you get the "Submit success!" message. If you get a message that says something else, follow its directions. If you get stuck, send mail to the TA.
If you need to resubmit something, just redo these directions. The script will notice if you are attempting to resubmit and allow you to overwrite your previous entry. Only the very last submission will be graded.

What to submit (due Monday, October 19th, 2pm)

Part A: You will submit the electronic copy of your script file using the submit program along with the other files requested in Part B below by 2pm. From your class account, do the following: “submit cs143 <file1> [<file2> <file3> ...]” Also submit your MySQL script and your output in paper format, in class at 4pm on the same day, illustrating your work for Part A. The script should include your MySQL commands and the output file should include all the output after you executed your script.
Part B: You will need to submit

constraints.txt: A file that lists the English description of the constraints that you identified from the tables. For each constraint, write the relevant portion of CREATE TABLE statement that implements the constraint.
load.script: The MySQL script that you used to generate the tables with the appropriate constraints and load the data into the tables.
violations.txt: The list of SQL modification statements (INSERT/DELETE/UPDATE) that violate each constraint. For each statement, write a short explanation on which constraint it violates and why. Also, after the explanation, "cut-and-paste" the output that you get from MySQL when the statement is executed.

index.txt: A file that contains 1) one "interesting" query that you came up with and 2) the list of indexes that you will create and a short explanation on why they are useful.
index.script: The MySQL script that you used to create the indexes in your database.
trigger.script: The MySQL script that you used to define triggers.
timer.extension: The source code that you generated the running time from. The extension depends on your choice of method. e.g. .sh, .java, .c, etc. (except .out)
timer.out: The output for the running time of each of the query that you ran.
update.out: The content of updates table after some insertions are made on the movie and actor tables.

You will also need to include a README file for your electronic submission, that includes you and your partner's name, sid, email, and any other information you think is useful.