Due Friday, January 30, 2015, 11:00PM
Submission deadline: Programming work is submitted electronically and must be submitted by Friday at 11:00PM. However, we recognize that there might be last-minute difficulties during the submission process, so as long as you started your submission process before 11:00PM, you have until 11:55PM to completely upload your submission. After 11:55PM, you will have to use the grace period as follows.
Late Policy: Programming work submitted after the deadline but less than 24 hours late (i.e., by Saturday 11:00 PM) will be accepted but penalized 25%, and programming work submitted more than 24 hours but less than 48 hours late (i.e., by Sunday 11:00 PM) will be penalized 50%. No programming work will be accepted more than 48 hours late. Since emergencies do arise, each student is allowed a total of four unpenalized late days (four periods up to 24 hours each) for programming work, but no single assignment may be more than two days late.
Honor Code reminder: For more detailed discussion of the Honor Code as it pertains to CS144, please see the Assigned Work page under Honor Code. In summary: You must indicate on all of your submitted work any assistance (human or otherwise) that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding and being able to explain on your own all material that you submit.
By completing this project you will learn how to design a relational schema for existing XML data, write a parser that transforms the data into MySQL load files, and create, populate, and query a MySQL database. To get started, download and unpack the eBay auction data set:
$ wget http://www.cs.ucla.edu/classes/winter15/cs144/projects/project2/ebay-data.zip
$ unzip -d $EBAY_DATA ebay-data.zip
$ rm ebay-data.zip
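These commands assume that the environment variable EBAY_DATA already points to the directory where you want the data files extracted. A minimal sketch, assuming you keep the data under your home directory (the exact path is up to you):

$ mkdir -p ~/ebay-data           # hypothetical location for the data files
$ export EBAY_DATA=~/ebay-data   # make the variable visible to unzip, Ant, etc.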
As a small data set for initial experimentation and debugging, we suggest you use just one file, items-0.xml. It contains 500 auctions, comprising about 900KB of plain-text data. Your system must also work on the large data set, which consists of all 40 files items-*.xml, for * = 0..39. There are a total of about 20,000 auctions, comprising about 38 megabytes of plain-text data.
The structure of the XML data is described by the DTD file items.dtd.
Your first task is to examine the DTD and the XML files to completely understand the data you will be starting with. You will be translating this data into relations and loading it into the MySQL server on our VM. Please read the auction data help file items.txt. The provided data was captured at a single point in time. (Very specifically, it represents the point in time December 20th, 2001, one second after midnight.) It contains items that have been auctioned off in the past and items that are "currently" up for auction.
Note that UserIDs of sellers and bidders uniquely identify a user. Whenever one user appears in the data, his/her Rating, Location, and Country information are the same. For more details, read items.txt in the data set.
List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.
List all completely nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys.
Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.
Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.
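As a purely hypothetical illustration of the kind of reasoning involved (this relation is made up and is not part of a required design): suppose you proposed

    UserLoc(UserID, Location, Country)      key: UserID

and the functional dependency Location -> Country happened to hold in your data. Its left side is not a key, so UserLoc would violate BCNF; the standard fix would be to decompose it into

    UserLoc2(UserID, Location)              key: UserID
    LocCountry(Location, Country)           key: Location

Whether such a dependency actually holds is for you to determine from the data and items.txt.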
To help you get started, we are providing sample parser skeleton code, MyParser.java and MyParserPrint.java, in the src directory of the project2.zip file.
Note that the two Java classes are defined as part of the edu.ucla.cs.cs144 package, which is why the MyParser.java and MyParserPrint.java files are located in the src/edu/ucla/cs/cs144 directory (src is the base directory of all Java code, and the rest corresponds to the package name hierarchy). If you are not clear about what a package is and how a package name is related to the source directory structure, read Section 1.8 of A Crash Course from C++ to Java.
The two provided Java files are almost identical, except that MyParserPrint.java has additional code to print the parsed XML data to the screen. In the sample code, all of the parsing is done for you by invoking the JDK's XML parser (JAXP). You need to fill in code that processes the internal representation of the XML tree and produces MySQL load files according to the relational schema you designed in Part B. Detailed information is provided as comments in the skeleton code file MyParser.java in the zip file.
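To give a flavor of the code you will write, here is a rough sketch of emitting one load-file line per <Item> element. The element and attribute names follow items.dtd, but the helper method, column choice, and separator are illustrative assumptions that you must adapt to your own schema and to the skeleton's structure:

import java.io.PrintWriter;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the transformation step: emit one pipe-separated
// load-file line per <Item> element. Element/attribute names follow items.dtd;
// the column order and separator are illustrative and must match your schema.
class ItemWriter {

    // Writes ItemID|Name|Currently for one <Item> element.
    static void writeItemRecord(Element item, PrintWriter out) {
        String itemID = item.getAttribute("ItemID");
        String name = childText(item, "Name");
        // Strip '$' and ',' so "$1,500.00" loads into DECIMAL(8,2) as 1500.00.
        String currently = childText(item, "Currently").replaceAll("[$,]", "");
        out.println(itemID + "|" + name + "|" + currently);
    }

    // Text content of the first descendant element with the given tag, or "".
    // Caution: getElementsByTagName is recursive, so for tags that also occur
    // in nested elements (e.g., Location appears under both Item and Bidder)
    // you will need a non-recursive lookup instead.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
    }
}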
The zip file also contains a sample Ant build script, build.xml. Ant is a "build tool" similar to make, but it is specifically designed for cross-platform Java development. You must read the following Ant tutorial to learn how to use Ant and how to write and modify an Ant script file (we will use this tool throughout the rest of the quarter):
Since Ant is preinstalled in our VM, you will be able to compile your Java code simply by typing ant on your command line (inside the directory where build.xml is). Note that the provided Ant script has targets "run" and "run-all". The target "run" executes the parser class edu.ucla.cs.cs144.MyParser on the single data file $EBAY_DATA/items-0.xml, and the target "run-all" executes the class on all files at $EBAY_DATA/items-*.xml. You can use one of these two targets to test your code on the provided eBay data. We strongly suggest that you fully debug your program on the small data set before unleashing it on the large one.
Notes on MySQL Datatypes
We note that some of the Description elements in the XML data are quite long. If any text is longer than 4000 characters, you must truncate it at 4000. In particular, we suggest that you use the MySQL data type VARCHAR(4000) to store descriptions, and during data transformation simply truncate any descriptions that are too long to fit. For attributes representing dates/times, we suggest that you use MySQL's built-in TIMESTAMP type. For attributes representing money, we suggest you use DECIMAL(8,2), which specifies a numeric type with 2 digits after the decimal point.
Note that in creating MySQL load files, you may have to reformat the input data into the format that MySQL expects. Information on the format of the SQL TIMESTAMP type is available on the MySQL Dates and Times page. You can use the Java class java.text.SimpleDateFormat to parse a date/time string in a particular format and to output it in a different format. See the example code on this page to learn how to use the SimpleDateFormat.parse() and SimpleDateFormat.format() methods to parse and format date/time strings in Java.
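A minimal sketch of such a conversion, assuming the date strings in the XML look like "Dec-20-01 00:00:01" (verify the actual pattern against the data files and items.txt before relying on it):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class DateConverter {
    // Converts an auction date string to MySQL TIMESTAMP format.
    // The input pattern "MMM-dd-yy HH:mm:ss" is an assumption; check it
    // against the actual strings in the data files.
    static String toMySqlTimestamp(String xmlDate) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("MMM-dd-yy HH:mm:ss", Locale.US);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US);
        return out.format(in.parse(xmlDate)); // "Dec-20-01 00:00:01" -> "2001-12-20 00:00:01"
    }
}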
Duplicate elimination
When transforming the XML data to relational tuples you may discover that you generate certain tuples multiple times but only want to retain one copy. There are different ways to eliminate duplicates, including:
Coding duplicate-elimination as part of your transformation program.
Using Unix text utilities (e.g., sort and uniq) directly on the generated load files to eliminate duplicate lines. These tools are preinstalled in our VM.
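For instance, a single command along these lines (the file name is illustrative) sorts a load file in place and keeps one copy of each line:

$ sort -u AuctionUser.dat -o AuctionUser.dat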
Notes on the CR/LF issue: If your host OS is Windows, you need to pay particular attention to how each line of a text file ends. By convention, Windows uses a pair of CR (carriage return) and LF (line feed) characters to terminate lines, while Unix (including Linux, Mac OS X, and tools in cygwin) uses only an LF character. Therefore, problems arise when you feed a text file generated by a Windows program to a Unix tool (such as sort, uniq, or mysql): since the line endings of the input file differ from what the tool expects, you may encounter unexpected behavior. If you encounter this problem, run the dos2unix command in the VM on your Windows-generated text file; it converts the CR/LF at the end of each line of the input file to just LF. Type dos2unix --help to learn how to use this command.
Now create and populate your table(s) in MySQL inside the CS144 database. We suggest that you first debug the schema creation and loading process interactively using the MySQL command-line interface. When you are satisfied that everything is working, follow the instructions to set up for batch loading (see Section D.3 below), which allows a database to be conveniently recreated from scratch with one command.
Issue "CREATE TABLE" commands for all of the relations in your schema. Some suggestions:
For efficiency, we suggest that you specify a PRIMARY KEY for each table that has at least one key.
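As a purely illustrative example that pulls these datatype and key suggestions together (the table and its columns are hypothetical, not a required design):

-- Hypothetical table; your schema will differ.
CREATE TABLE Item (
    ItemID      INT NOT NULL,
    Name        VARCHAR(100) NOT NULL,
    Currently   DECIMAL(8,2),          -- money: 2 digits after the decimal point
    Started     TIMESTAMP,             -- date/time stored as MySQL TIMESTAMP
    Description VARCHAR(4000),         -- truncate to 4000 characters when transforming
    PRIMARY KEY (ItemID)
);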
Once the tables are created, load your data in MySQL.
Your MySQL databases are probably not backed up, so anything you need long-term should be saved on the file system as well.
As you start modifying data in a database, you will undoubtedly want the ability to get a "fresh start" easily from your original data.
To remove a table T and all of its contents, you can issue the command:

DROP TABLE T;

If you want to get rid of all tuples in T without deleting the table itself, you can issue the command:

DELETE FROM T;
Now create a MySQL command file called create.sql that includes the SQL commands for database and table creation that you debugged interactively (recall Section D.1). This file will look something like:

CREATE TABLE Item ( ... );
CREATE TABLE AuctionUser ( ... );
...

Similarly, create a second command file called drop.sql that destroys all your tables. This file will look something like:

DROP TABLE Item;
DROP TABLE AuctionUser;
...

Lastly, create a command file called load.sql that contains the appropriate MySQL commands to load data from a file into a table. The file will look something like:

LOAD DATA LOCAL INFILE 'Item.dat' INTO TABLE Item ...
LOAD DATA LOCAL INFILE 'AuctionUser.dat' INTO TABLE AuctionUser ...
...
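A filled-in version of one such command, assuming your parser writes pipe-separated load files with one tuple per line (the delimiters are your choice and must match what your parser emits):

LOAD DATA LOCAL INFILE 'Item.dat' INTO TABLE Item
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '\n';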
To automate the entire process, create a bash shell script called runLoad.sh. If you do not know what a shell script is or how to create one, first read our Tutorial on Shell Script. Your script should do the following things:
Drop any existing relevant tables in the CS144 database, using IF EXISTS so the script does not fail when the tables are absent.
Create all the relevant tables in CS144 database.
Build and run your parser to generate fresh load files.
Load the data into MySQL.
Delete all temporary files (including your load files) that your parser generated.
#!/bin/bash

# Run the drop.sql batch file to drop existing tables.
# Inside drop.sql, you should check whether each table exists
# and drop it ONLY if it exists.
mysql CS144 < drop.sql

# Run the create.sql batch file to create the database and tables
mysql CS144 < create.sql

# Compile and run the parser to generate the appropriate load files
ant
ant run-all
...

# If the Java code does not handle duplicate removal, do it now
sort ...
...

# Run the load.sql batch file to load the data
mysql CS144 < load.sql

# Remove all temporary files
rm ...
...
In your final submission, please make sure that the Ant target "run-all" takes all the data at $EBAY_DATA/items-*.xml and produces the output. The value of $EBAY_DATA should not be hard-coded in your Ant script; it should be obtained from the environment variable. Our provided build.xml follows this requirement.
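For reference, the usual Ant idiom for reading an environment variable is shown below; the target body is illustrative, and our provided build.xml already does the equivalent, so you only need something like this if you restructure the script:

<!-- Expose environment variables as Ant properties under the "env." prefix. -->
<property environment="env"/>

<!-- Illustrative target; the "compile" dependency, classpath, and program
     arguments are assumptions about how your build is structured. -->
<target name="run-all" depends="compile">
    <java classname="edu.ucla.cs.cs144.MyParser" classpath="bin" fork="true">
        <arg value="${env.EBAY_DATA}"/>
    </java>
</target>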
To run this script, make sure permissions are set properly by executing "chmod u+x runLoad.sh", then simply type "./runLoad.sh" at the shell prompt to run it. Again, pay attention to the CR/LF issue if you modify your scripts using a Windows program; run dos2unix in the VM on the scripts if necessary. Otherwise, you may get unexpected errors when you run your shell script.
Find the number of users in the database.
Find the number of items in "New York" (i.e., items whose location is exactly the string "New York"). Pay special attention to case sensitivity: you should match items in "New York" but not in "new york".
Find the number of auctions belonging to exactly four categories.
Find the ID(s) of the current (unsold) auction(s) with the highest bid. Remember that the data was captured at the point in time December 20th, 2001, one second after midnight, so you can use this time point to decide which auction(s) are current. Pay special attention to current auctions without any bid.
Find the number of sellers whose rating is higher than 1000.
Find the number of users who are both sellers and bidders.
Find the number of categories that include at least one item with a bid of more than $100.
Correctness check: Your answers to the above seven queries over the large data set should be (in order): 13422, 103, 8365, 1046740686, 3130, 6717, 150
Put all seven queries in a MySQL batch script file called queries.sql. It should look like:

SELECT ... FROM ... ;
SELECT ... FROM ... ;
...

Make sure that the order of your queries in the file is exactly the same as the order specified above.
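You can test the batch file the same way the grading script runs it, by feeding it to the mysql command-line client:

$ mysql CS144 < queries.sql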
project2.zip
|
+- build.xml
|
+- create.sql
|
+- drop.sql
|
+- load.sql
|
+- queries.sql
|
+- runLoad.sh
|
+- README.txt
|
+- src
|  +- java source codes for parser (with your own naming/structure)
|
+- lib
   +- external java libraries used

Each file or directory is as follows:
create.sql, drop.sql, load.sql, and queries.sql. Do not submit the output produced by running these files.
README.txt, containing your relational schema design, answers to the 3 items listed in Part B, as well as anything else you'd like to document.
build.xml script. Your build.xml script MUST meet the following requirements:
runLoad.sh script developed in Section D.3 above. Your runLoad.sh script MUST meet the following requirements:
src directory: your parser source code file(s). Do not submit the binary version of your parser.
lib directory: any external Java library files (.jar files) that you use and that are not available in our VM.
Important Notes:
cs144@class-vm:~$ ./p2_test project2.zip

(If your project2.zip file is not located in the current directory, you need to add the path to the zip file before project2.zip. You may need to run "chmod +x p2_test" if there is a permission error.)
You MUST test your submission using the script before your final submission to minimize the chance of an unexpected error during grading. When everything runs properly, you will see output similar to the following from the grading script:
Running your runLoad.sh script...
... <output from your runLoad.sh> ...
Running your query.sql script...
COUNT(*)
13422
COUNT(*)
103
COUNT(*)
8365
ITEMID
1046740686
COUNT(*)
3130
COUNT(*)
6717
COUNT(DISTINCT CATEGORY)
150
Finished running your queries.
Please make sure that you got correct results from your queries
Once your submission zip file is properly created and passes the grading script, submit it via our submission page at CCLE. You may resubmit as many times as you like; however, only the latest submission will be saved, and that is what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.