Big Data Analytics and Management

Implement several MapReduce design patterns to derive statistics from IMDB movie data using the Hadoop framework.

Coursework CS 6301
The licence and readme are included inside the folder.

- ratings.dat: UserID::MovieID::Rating::Timestamp
- users.dat: UserID::Gender::Age::Occupation::Zip-code
- movies.dat: MovieID::Title::Genres

Map Reduce Design
- Given an input zipcode, find all the user IDs that belong to that zipcode. The zipcode must be taken from the command line.
- Find the top 10 movies by average rating, in descending order of rating.
- Find all the user IDs who have rated at least n movies.
- Given some movie titles in CSV format, find all the genres of the movies.
- Find the top 10 zipcodes based on the average age of the users belonging to that zipcode, in ascending order of the average age.

Map Reduce Joins
- Given a movieID as input, find the number of male users who have rated that movie, using a map-side join.
- Find the top 10 movie names by average rating, in descending order of rating, using a reduce-side join.
- Find the top 10 users (userID, age, gender) who have rated the most movies, in descending order of the counts.

Pig (assume the dataset is available on HDFS)
- List the unique user IDs of female users aged between 20 and 35 who have rated the highest-rated Action AND War movies. (Consider all movies that have both Action AND War in their genre list.) Print only users whose zip starts with 1.
- Implement the cogroup command on UserID for the datasets ratings_new and users_new. Print the first 11 rows.
- Repeat the above (implement join) with cogroup commands. Print the first 11 rows.
- Using a Pig Latin script, use the FORMAT_GENRE function on the movies_new dataset and print the movie name with its genre(s). Write a UDF (user-defined function) FORMAT_GENRE in Pig which formats the genre in movies_new in the following way:
Before formatting: Children's
After formatting: Children's

Before formatting: Animation|Children's
After formatting: Children's & Animation

Before formatting: Children's|Adventure|Animation
After formatting: Children's, Adventure & Animation
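The formatting rule above can be sketched in plain Python to pin down the expected strings before writing the Pig UDF itself. This is a minimal sketch, not the UDF: the function name `format_genre` is hypothetical, and it assumes genres keep their input order, which matches the one-genre and three-genre examples. The two-genre example shows the genres swapped, so the intended ordering rule is an assumption here.

```python
def format_genre(genres: str) -> str:
    """Join '|'-separated genres with ', ' and a final ' & ' (assumed rule)."""
    # Split on the pipe delimiter used in movies.dat and drop empty pieces.
    parts = [g.strip() for g in genres.split("|") if g.strip()]
    if len(parts) <= 1:
        # A single genre (or an empty string) is returned unchanged.
        return "".join(parts)
    # All but the last genre are comma-separated; the last is joined with '&'.
    return ", ".join(parts[:-1]) + " & " + parts[-1]


print(format_genre("Children's|Adventure|Animation"))
```

The same logic could be ported to a Pig UDF once the ordering question is settled.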
- Using a Hive script, find the top 11 "Action" movies by average rating, in descending order of rating. (Show the create table command, the load from local, and the Hive query.)
- Using a Hive script, list all the movies with their genres where the movie genre is Action or Drama, the average movie rating is between 4.4 and 4.9, and only male users rated the movie. (Show the create table command, the load from local, and the Hive query.)
- The dataset has three files (2009, 2010, 2011). Using a Hive script, create one table partitioned by year. (Show the one create table command, the three load from local commands, and one Hive query that selects all columns from the table for the virtual column year of 2009.)
- Create three tables that have three columns each (MovieID, MovieName, Genre). Each table will represent a year. The three years are 2009, 2010 and 2011. Using Hive multi-table insert, insert values from the table you created in Q4 into these three tables (each table, e.g. movies_2009, should hold the names of movies for the specified year).
- Using a Hive script, use the FORMAT_GENRE function on the movies_new dataset and print the movie name with its genre(s). Write a UDF (user-defined function) FORMAT_GENRE in Hive which formats the genre in movies_new in the following way:
Before formatting: Children's
After formatting: Children's

Before formatting: Animation|Children's
After formatting: Children's, & Animation

Before formatting: Children's|Adventure|Animation
After formatting: Children's, Adventure, & Animation
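The Hive variant differs from a plain "A, B & C" join in that a comma precedes the ampersand, even with only two genres. A minimal Python sketch of that rule follows; it is not the Hive UDF, the name `format_genre_hive` is hypothetical, and keeping genres in input order is an assumption (the two-genre example shows them swapped).

```python
def format_genre_hive(genres: str) -> str:
    """Join '|'-separated genres with ', ' and a final ', & ' (assumed rule)."""
    # Split on the pipe delimiter and ignore empty pieces.
    parts = [g.strip() for g in genres.split("|") if g.strip()]
    if len(parts) <= 1:
        # One genre (or none) needs no separator.
        return "".join(parts)
    # A comma precedes the ampersand, as in the examples above.
    return ", ".join(parts[:-1]) + ", & " + parts[-1]


print(format_genre_hive("Children's|Adventure|Animation"))
```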
Using Cassandra CLI, write commands to do the following.

1. Create a COLUMN FAMILY for this dataset.
2. Insert the following into the column family created in step 1. Use MovieID as the key.
   "70#From Dusk Till Dawn (1996)#Action|Comedy|Crime|Horror|Thriller"
   "83#Once Upon a Time When We Were Colored (1995)#Drama"
   "112#Escape from New York (1981)#Action|Adventure|Sci-Fi|Thriller" with time to live (ttl) clause after 300 seconds

Show the following:

1. Get the movie name and genre for movie id 70.
2. Retrieve all rows and columns.
3. Delete column Genres for movie id 83.
4. Drop the column family.
5. Use the describe keyspace command with your netid and show the content.
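Before writing the CLI commands, it can help to see how each "#"-delimited record maps onto a row key and its columns. The sketch below models this in plain Python only; it does not talk to Cassandra. The helper name `to_row` is hypothetical, and the column names Title and Genres are taken from the movies.dat layout.

```python
RECORDS = [
    "70#From Dusk Till Dawn (1996)#Action|Comedy|Crime|Horror|Thriller",
    "83#Once Upon a Time When We Were Colored (1995)#Drama",
    "112#Escape from New York (1981)#Action|Adventure|Sci-Fi|Thriller",
]


def to_row(record: str):
    """Split one record into (row key, columns), with MovieID as the key."""
    movie_id, title, genres = record.split("#", 2)
    return movie_id, {"Title": title, "Genres": genres}


# Model the column family as a dict of row key -> column dict.
rows = dict(to_row(r) for r in RECORDS)

# Equivalent of "get the movie name and genre for movie id 70".
print(rows["70"]["Title"], "|", rows["70"]["Genres"])

# Equivalent of "delete column Genres for movie id 83".
del rows["83"]["Genres"]
```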
Using Cassandra CQL3, write commands to do the following.

1. Create a table for this dataset. Use (MovieID) as the Primary Key.
2. Load all records in the dataset into this table.
3. Insert the record "1162#New Comedy Movie#Comedy" into the table.
4. Select the tuple which has movie id 1150.
5. Delete all rows in the table.
6. Drop the table.

Part III
Run the nodetool command and determine how unbalanced the cluster is.