i-Hive-Practice-1 影评练习

现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码

2、movies.dat 数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型

3、ratings.dat 数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳

题目要求:

数据要求:
(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)
(2)使用Hive能解析的方式进行

Hive要求:
(1)正确建表,导入数据(三张表,三份数据),并验证是否正确

(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
思路:
1. 分组
select movieid, count(movieid) rateCount from ratings
group by movieid limit 10;
2. 排序
select movieid, count(movieid) rateCount from ratings
group by movieid order by rateCount desc limit 10;
3.join
select a.movieid mvid, b.Title mvtitle, a.rateCount mvratecount from
(select movieid, count(movieid) rateCount from ratings
group by movieid order by rateCount desc limit 10) a
join movies b where a.movieid=b.movieid;
==============================

 完整的
------------
select a.movieid mvid, b.Title mvtitle, a.rateCount mvratecount from
(select movieid, count(movieid) rateCount from ratings
group by movieid order by rateCount desc limit 10) a
join movies b where a.movieid=b.movieid;

--结果---
+-------+----------------------------------------------------+--------------+
| mvid | mvtitle | mvratecount |
+-------+----------------------------------------------------+--------------+
| 2858 | American Beauty (1999) | 3428 |
| 260 | Star Wars: Episode IV - A New Hope (1977) | 2991 |
| 1196 | Star Wars: Episode V - The Empire Strikes Back (1980) | 2990 |
| 1210 | Star Wars: Episode VI - Return of the Jedi (1983) | 2883 |
| 480 | Jurassic Park (1993) | 2672 |
| 2028 | Saving Private Ryan (1998) | 2653 |
| 589 | Terminator 2: Judgment Day (1991) | 2649 |
| 2571 | Matrix, The (1999) | 2590 |
| 1270 | Back to the Future (1985) | 2583 |
| 593 | Silence of the Lambs, The (1991) | 2578 |
+-------+----------------------------------------------------+--------------+

(3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# 注意,这里的 avg(r.rating) 是因为group by了 gender,title, 也就是
# 说有 gender,title 分组, 此时可能对应此组的有多个值,
# 这里就是 相同性别评价相同电影的, 评分有多个, 此时就要使用聚合函数,得出唯一值
# 因此就使用了 avg(r.rating)
select u.gender, m.title, avg(r.rating) rr
from ratings r
join users u on r.userid=u.userid
join movies m on r.movieid=m.movieid
where u.gender = 'M'
group by gender,title
order by rr desc
limit 10;

+-----------+--------------------------------------------+------+
| u.gender | m.title | rr |
+-----------+--------------------------------------------+------+
| M | Schlafes Bruder (Brother of Sleep) (1995) | 5.0 |
| M | Small Wonders (1996) | 5.0 |
| M | Lured (1947) | 5.0 |
| M | Bells, The (1926) | 5.0 |
| M | Dangerous Game (1993) | 5.0 |
| M | Baby, The (1973) | 5.0 |
| M | Gate of Heavenly Peace, The (1995) | 5.0 |
| M | Follow the Bitch (1998) | 5.0 |
| M | Ulysses (Ulisse) (1954) | 5.0 |
| M | Angela (1995) | 5.0 |
+-----------+--------------------------------------------+------+

(4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)

1
2
3
4
5
# 注意: 没有歧义的字段, 可以不用指明是谁的属性
select age, avg(r.rating) avgrating from ratings r
join users u on r.userid=u.userid
where movieid=2116
group by age;

(5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# 1.先拿到 影评次数最多的女性(id)
0: jdbc:hive2://cs2:10000> select UserID, count(UserID) count from ratings
. . . . . . . . . . . . .> group by UserID
. . . . . . . . . . . . .> order by count desc
. . . . . . . . . . . . .> limit 1;
+---------+--------+
| userid | count |
+---------+--------+
| 4169 | 2314 |
+---------+--------+

# 2.拿到此人评分最高的10部电影, 此人 userid, 电影 id, 此人评分
## 一定要去重

select distinct(MovieID) MovieID, Rating from ratings where UserID=4169
order by Rating desc
limit 10;

+----------+---------+
| movieid | rating |
+----------+---------+
| 78 | 5.0 |
| 73 | 5.0 |
| 72 | 5.0 |
| 58 | 5.0 |
| 55 | 5.0 |
| 50 | 5.0 |
| 41 | 5.0 |
| 36 | 5.0 |
| 25 | 5.0 |
| 17 | 5.0 |
+----------+---------+

3.这10部电影的(观影者,电影名,平均影评分)
观影者还没写

select t.MovieID, m.Title, avg(r.rating) avgrating from topten t
join ratings r on t.MovieID=r.MovieID
join movies m on t.MovieID=m.MovieID
group by t.MovieID,m.Title;

+------------+-----------------------------------------------+---------------------+
| t.movieid | m.title | avgrating |
+------------+-----------------------------------------------+---------------------+
| 3849 | Spiral Staircase, The (1946) | 4.046511627906977 |
| 3870 | Our Town (1940) | 3.857142857142857 |
| 3871 | Shane (1953) | 3.839344262295082 |
| 3893 | Nurse Betty (2000) | 3.5026833631484795 |
| 3897 | Almost Famous (2000) | 4.22635814889336 |
| 3910 | Dancer in the Dark (2000) | 3.82 |
| 3927 | Fantastic Voyage (1966) | 3.5804597701149423 |
| 3928 | Abbott and Costello Meet Frankenstein (1948) | 3.441747572815534 |
| 3929 | Bank Dick, The (1940) | 3.993197278911565 |
| 3932 | Invisible Man, The (1933) | 3.75 |
+------------+-----------------------------------------------+---------------------+

(6)求好片(评分>=4.0)最多的那个年份的最好看的10部电影
(7)求1997年上映的电影中,评分最高的10部Comedy类电影
(8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
(9)各年评分最高的电影类型(年份,类型,影评分)
(10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)

如果帮到你, 可以给我赞助杯咖啡☕️
0%