i-Hive-Practice-1 影评练习

现有如此三份数据：
1、users.dat 数据格式为： 2::M::56::16::70072
对应字段为：UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释：用户id，性别，年龄，职业，邮政编码

2、movies.dat 数据格式为： 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为：MovieID BigInt, Title String, Genres String
对应字段中文解释：电影ID，电影名字，电影类型

3、ratings.dat 数据格式为： 1::1193::5::978300760
对应字段为：UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释：用户ID，电影ID，评分，评分时间戳

题目要求：

数据要求：
（1）写shell脚本清洗数据。（hive不支持解析多字节的分隔符，也就是说hive只能解析’:’, 不支持解析’::’，所以用普通方式建表来使用是行不通的，要求对数据做一次简单清洗）
（2）使用Hive能解析的方式进行

Hive要求：
（1）正确建表，导入数据（三张表，三份数据），并验证是否正确

（2）求被评分次数最多的10部电影，并给出评分次数（电影名，评分次数）

思路:
1. 分组
select movieid, count(movieid) rateCount from ratings 
group by movieid limit 10;
2. 排序
select movieid, count(movieid) rateCount from ratings 
group by movieid order by rateCount desc limit 10;
3.join
select a.movieid mvid, b.Title mvtitle, a.rateCount mvratecount from 
(select movieid, count(movieid) rateCount from ratings 
group by movieid order by rateCount desc limit 10) a
join movies b where a.movieid=b.movieid;
==============================

 完整的
------------
select a.movieid mvid, b.Title mvtitle, a.rateCount mvratecount from 
(select movieid, count(movieid) rateCount from ratings 
group by movieid order by rateCount desc limit 10) a
join movies b where a.movieid=b.movieid;

--结果---
+-------+----------------------------------------------------+--------------+
| mvid  |                      mvtitle                       | mvratecount  |
+-------+----------------------------------------------------+--------------+
| 2858  | American Beauty (1999)                             | 3428         |
| 260   | Star Wars: Episode IV - A New Hope (1977)          | 2991         |
| 1196  | Star Wars: Episode V - The Empire Strikes Back (1980) | 2990         |
| 1210  | Star Wars: Episode VI - Return of the Jedi (1983)  | 2883         |
| 480   | Jurassic Park (1993)                               | 2672         |
| 2028  | Saving Private Ryan (1998)                         | 2653         |
| 589   | Terminator 2: Judgment Day (1991)                  | 2649         |
| 2571  | Matrix, The (1999)                                 | 2590         |
| 1270  | Back to the Future (1985)                          | 2583         |
| 593   | Silence of the Lambs, The (1991)                   | 2578         |
+-------+----------------------------------------------------+--------------+

（3）分别求男性，女性当中评分最高的10部电影（性别，电影名，影评分）

# 注意,这里的 avg(r.rating) 是因为group by了 gender,title, 也就是
# 说有 gender,title 分组, 此时可能对应此组的有多个值,
# 这里就是 相同性别评价相同电影的, 评分有多个, 此时就要使用聚合函数,得出唯一值
# 因此就使用了 avg(r.rating)
select u.gender, m.title, avg(r.rating) rr
from ratings r 
join users u on r.userid=u.userid 
join movies m on r.movieid=m.movieid
where u.gender = 'M'
group by gender,title
order by rr desc
limit 10;

+-----------+--------------------------------------------+------+
| u.gender  |                  m.title                   |  rr  |
+-----------+--------------------------------------------+------+
| M         | Schlafes Bruder (Brother of Sleep) (1995)  | 5.0  |
| M         | Small Wonders (1996)                       | 5.0  |
| M         | Lured (1947)                               | 5.0  |
| M         | Bells, The (1926)                          | 5.0  |
| M         | Dangerous Game (1993)                      | 5.0  |
| M         | Baby, The (1973)                           | 5.0  |
| M         | Gate of Heavenly Peace, The (1995)         | 5.0  |
| M         | Follow the Bitch (1998)                    | 5.0  |
| M         | Ulysses (Ulisse) (1954)                    | 5.0  |
| M         | Angela (1995)                              | 5.0  |
+-----------+--------------------------------------------+------+

（4）求movieid = 2116这部电影各年龄段（因为年龄就只有7个，就按这个7个分就好了）的平均影评（年龄段，影评分）

# 注意: 没有歧义的字段, 可以不用指明是谁的属性
select age, avg(r.rating) avgrating from ratings r 
join users u on r.userid=u.userid
where movieid=2116
group by age;

（5）求最喜欢看电影（影评次数最多）的那位女性评最高分的10部电影的平均影评分（观影者，电影名，影评分）

# 1.先拿到 影评次数最多的女性(id)
0: jdbc:hive2://cs2:10000> select UserID, count(UserID) count from ratings
. . . . . . . . . . . . .> group by UserID
. . . . . . . . . . . . .> order by count desc
. . . . . . . . . . . . .> limit 1;
+---------+--------+
| userid  | count  |
+---------+--------+
| 4169    | 2314   |
+---------+--------+

# 2.拿到此人评分最高的10部电影, 此人 userid, 电影 id, 此人评分
##  一定要去重

select distinct(MovieID) MovieID, Rating from ratings where UserID=4169 
order by Rating desc
limit 10;

+----------+---------+
| movieid  | rating  |
+----------+---------+
| 78       | 5.0     |
| 73       | 5.0     |
| 72       | 5.0     |
| 58       | 5.0     |
| 55       | 5.0     |
| 50       | 5.0     |
| 41       | 5.0     |
| 36       | 5.0     |
| 25       | 5.0     |
| 17       | 5.0     |
+----------+---------+

3.这10部电影的(观影者，电影名，平均影评分)
观影者还没写

select t.MovieID, m.Title, avg(r.rating) avgrating from topten t 
join ratings r on t.MovieID=r.MovieID
join movies m on t.MovieID=m.MovieID
group by t.MovieID,m.Title;

+------------+-----------------------------------------------+---------------------+
| t.movieid  |                    m.title                    |      avgrating      |
+------------+-----------------------------------------------+---------------------+
| 3849       | Spiral Staircase, The (1946)                  | 4.046511627906977   |
| 3870       | Our Town (1940)                               | 3.857142857142857   |
| 3871       | Shane (1953)                                  | 3.839344262295082   |
| 3893       | Nurse Betty (2000)                            | 3.5026833631484795  |
| 3897       | Almost Famous (2000)                          | 4.22635814889336    |
| 3910       | Dancer in the Dark (2000)                     | 3.82                |
| 3927       | Fantastic Voyage (1966)                       | 3.5804597701149423  |
| 3928       | Abbott and Costello Meet Frankenstein (1948)  | 3.441747572815534   |
| 3929       | Bank Dick, The (1940)                         | 3.993197278911565   |
| 3932       | Invisible Man, The (1933)                     | 3.75                |
+------------+-----------------------------------------------+---------------------+

（6）求好片（评分>=4.0）最多的那个年份的最好看的10部电影
（7）求1997年上映的电影中，评分最高的10部Comedy类电影
（8）该影评库中各种类型电影中评价最高的5部电影（类型，电影名，平均影评分）
（9）各年评分最高的电影类型（年份，类型，影评分）
（10）每个地区最高评分的电影名，把结果存入HDFS（地区，电影名，影评分）