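MapReduce: A First Look at Grouping

Preface

While learning about grouping recently, I could not make sense of it and kept feeling lost. After nearly 3 hours of breakpoint debugging and log output, I understand at least the observable behavior, so I am recording it here first.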

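The example is this:

For each course, find the student with the highest average score among the students who sat the exams, and report:

course, name and average score. For details, see part 3 of exercise 2 in the MapReduce notes.

The data format looks like this:

The first field is the course name. There are four courses in total: computer, math, english, algorithm.

The second field is the student's name, followed by the score of each exam.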

math,huangxiaoming,85,75,85,99,66,88,75,91

english,huanglei,85,75,85,99,66,88,75,91
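
To make the format concrete, here is a small sketch in plain Java of what the map side has to do with each line: split it, then average the scores. The names `ScoreLineSketch`, `ScoreRecord` and `parse` are made up for this illustration and are not from the original job.

```java
// Hypothetical helper names; this is not the original mapper code.
class ScoreRecord {
    final String course;
    final String name;
    final double average;

    ScoreRecord(String course, String name, double average) {
        this.course = course;
        this.name = name;
        this.average = average;
    }

    @Override
    public String toString() {
        return course + " " + name + " " + average;
    }
}

class ScoreLineSketch {

    /** Parses "course,name,score1,score2,..." and averages the scores. */
    static ScoreRecord parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected course,name,score...: " + line);
        }
        double sum = 0;
        // Fields 0 and 1 are course and name; everything after is a score.
        for (int i = 2; i < fields.length; i++) {
            sum += Double.parseDouble(fields[i].trim());
        }
        return new ScoreRecord(fields[0], fields[1], sum / (fields.length - 2));
    }

    public static void main(String[] args) {
        // Prints: math huangxiaoming 83.0
        System.out.println(parse("math,huangxiaoming,85,75,85,99,66,88,75,91"));
    }
}
```

The eight scores in the sample line add up to 664, which gives the average of 83.0 that shows up for this student in the logs further down.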



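Conclusions

Execution flow

  • For every line it reads, map writes to the context once, and the records are distributed by the specified key.
  • After map has read all the data, at roughly 67% progress, execution enters CustomBean and runs CustomBean.compareTo(), which compares the records one by one using the rule I wrote.
  • Once all of those comparisons are done, the map phase is over and the reduce phase begins, although the progress shown is still 67%.
  • In the reduce phase, execution goes straight into the custom compare() method in MyGroup.
  • If MyGroup's compare() returns a non-zero value, execution enters the reduce method and writes to the context.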
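
The hand-off in the last two bullets can be imitated without Hadoop. The following is a minimal sketch in plain Java, not Hadoop's actual implementation: `AdjacentGroupingSketch` and `group` are hypothetical names, and plain course strings stand in for the keys. It walks a key stream in the order given and starts a new group (one reduce call) whenever two neighboring keys compare as different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reduce-side grouping; not Hadoop code.
class AdjacentGroupingSketch {

    /** Splits the key stream into runs of equal neighbors; each run is one reduce call. */
    static List<List<String>> group(List<String> keys) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String key : keys) {
            // A non-zero comparison with the previous key closes the current group.
            if (!current.isEmpty() && current.get(current.size() - 1).compareTo(key) != 0) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(key);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Course order taken from the start of the first log (not sorted by course).
        List<String> unsorted = Arrays.asList(
                "computer", "computer", "math", "english", "math", "algorithm", "algorithm");
        System.out.println(group(unsorted).size()); // 5

        // The same keys with equal courses next to each other.
        List<String> sorted = Arrays.asList(
                "algorithm", "algorithm", "computer", "computer", "english", "math", "math");
        System.out.println(group(sorted).size()); // 4
    }
}
```

Only neighbors are compared, so the two separate `math` keys in the unsorted stream land in different groups. That is the same effect as the 28 reduce calls in the first log versus the 4 calls in the second.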

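When MyGroup hands over to reduce

  • In MyGroup, if the result of compare() is not 0, meaning the two keys being compared differ, execution enters reduce and writes to the context.
  • If they are the same, it keeps reading until it reaches a different key, and only then writes what it has read to the context.
  • MyGroup runs in the reduce phase, while compareTo() in CustomBean runs earlier, when the map output is sorted. The ordering in CustomBean therefore has to place records of the same group next to each other already; only then does grouping work properly.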
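
A sort order that satisfies the last bullet could look like the sketch below: course first, then average score in descending order. `CustomBeanSketch` is a hypothetical stand-in for the real CustomBean. The real class would presumably also implement Hadoop's `WritableComparable`, which is left out here so the example runs on a plain JDK. The sample values are taken from the logs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for CustomBean; Hadoop serialization is omitted.
class CustomBeanSketch implements Comparable<CustomBeanSketch> {
    final String course;
    final String name;
    final double average;

    CustomBeanSketch(String course, String name, double average) {
        this.course = course;
        this.name = name;
        this.average = average;
    }

    @Override
    public int compareTo(CustomBeanSketch other) {
        // First by course (lexicographic), so equal courses end up adjacent.
        int byCourse = course.compareTo(other.course);
        if (byCourse != 0) {
            return byCourse;
        }
        // Then by average, descending, so the best student leads each course.
        return Double.compare(other.average, average);
    }

    @Override
    public String toString() {
        return course + " " + name + " " + average;
    }

    public static void main(String[] args) {
        List<CustomBeanSketch> beans = new ArrayList<>();
        beans.add(new CustomBeanSketch("computer", "huangjiaju", 83.2));
        beans.add(new CustomBeanSketch("math", "huangxiaoming", 83.0));
        beans.add(new CustomBeanSketch("computer", "liutao", 83.0));
        beans.add(new CustomBeanSketch("algorithm", "huangjiaju", 82.28571428571429));
        Collections.sort(beans);
        // Order: algorithm, then computer 83.2 before computer 83.0, then math.
        for (CustomBeanSketch bean : beans) {
            System.out.println(bean);
        }
    }
}
```

With this ordering the first record of every course is the one with the highest average, so a reducer that keeps only the first value of each group gets the answer to the exercise.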

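Specifying the grouping class MyGroup vs. not specifying it

"Specifying" here means whether the Driver class contains the line job.setGroupingComparatorClass(MyGrouper.class);.

  • With a grouping class
    • Keys are compared with the custom compare() method of the grouping class. Keys that compare as equal form one group, and each completed group leads to one call of the reduce method.
  • Without a grouping class (still unsure):
    • Is grouping done by key?
    • If the key is a custom class, are keys with equal field values put into one group?
    • If the key is a built-in Hadoop class, is grouping done on that class's value (the String value of Text, the int value of IntWritable, and so on)?
    • Either way, the same grouping logic as above applies: a group's data is read completely before reduce merges it.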
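
On the open question above: as far as I know, when no grouping comparator is set, Hadoop falls back to the comparator used for sorting the keys, so two keys share a reduce call only when that sort comparison returns 0. The sketch below imitates the difference in plain Java under that assumption. `GroupingChoiceSketch`, `Key`, `SORT`, `BY_COURSE` and `countGroups` are hypothetical names, and nothing here calls Hadoop.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of "grouping comparator set" vs. "not set"; not Hadoop code.
class GroupingChoiceSketch {

    static final class Key {
        final String course;
        final double average;

        Key(String course, double average) {
            this.course = course;
            this.average = average;
        }
    }

    // Sort order of the key: course ascending, then average descending.
    // Assumed to double as the grouping rule when no grouping class is set.
    static final Comparator<Key> SORT = (a, b) -> {
        int byCourse = a.course.compareTo(b.course);
        return byCourse != 0 ? byCourse : Double.compare(b.average, a.average);
    };

    // Custom grouping rule: only the course decides whether keys belong together.
    static final Comparator<Key> BY_COURSE = (a, b) -> a.course.compareTo(b.course);

    /** Counts reduce calls for keys that are already in sorted order. */
    static int countGroups(List<Key> sortedKeys, Comparator<Key> grouping) {
        int groups = sortedKeys.isEmpty() ? 0 : 1;
        for (int i = 1; i < sortedKeys.size(); i++) {
            if (grouping.compare(sortedKeys.get(i - 1), sortedKeys.get(i)) != 0) {
                groups++;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("algorithm", 82.28571428571429),
                new Key("algorithm", 82.0),
                new Key("computer", 83.2),
                new Key("computer", 83.0));
        System.out.println(countGroups(keys, SORT));      // 4: every distinct key is its own group
        System.out.println(countGroups(keys, BY_COURSE)); // 2: one group per course
    }
}
```

Under this reading, a custom key whose compareTo() looks at both course and average would give one reduce call per distinct (course, average) pair unless a grouping class narrows the comparison to the course.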



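Log output

The log lines below are kept exactly as the program printed them. "A---MyGroup中比较---B" means A and B were compared in MyGroup, "第N次进入reduce" means reduce was entered for the Nth time, and "离开reduce" means reduce was left.

Log when CustomBean does no grouping and no sorting within groups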
// ================== compare() method in MyGroup =======================
computer---MyGroup中比较---computer
computer---MyGroup中比较---math
math---MyGroup中比较---english
english---MyGroup中比较---math
math---MyGroup中比较---algorithm
algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---computer
computer---MyGroup中比较---english
english---MyGroup中比较---algorithm
algorithm---MyGroup中比较---math
math---MyGroup中比较---algorithm
algorithm---MyGroup中比较---math
math---MyGroup中比较---computer
computer---MyGroup中比较---english
english---MyGroup中比较---math
math---MyGroup中比较---computer
computer---MyGroup中比较---math
math---MyGroup中比较---english
english---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---english
english---MyGroup中比较---computer
computer---MyGroup中比较---algorithm
algorithm---MyGroup中比较---english
english---MyGroup中比较---computer
computer---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---math
math---MyGroup中比较---algorithm
algorithm---MyGroup中比较---computer
computer---MyGroup中比较---english



// ====================== execution log in reduce ==========================

==================第1次进入reduce
computer huangjiaju 83.2
---------in for write------
computer liutao 83.0
---------in for write------
==================第2次进入reduce
math huangxiaoming 83.0
---------in for write------
==================第3次进入reduce
english huanglei 83.0
---------in for write------
==================第4次进入reduce
math huangjiaju 82.28571428571429
---------in for write------
==================第5次进入reduce
algorithm huangjiaju 82.28571428571429
---------in for write------
algorithm liutao 82.0
---------in for write------
==================第6次进入reduce
computer huanglei 74.42857142857143
---------in for write------
==================第7次进入reduce
english liuyifei 74.42857142857143
---------in for write------
==================第8次进入reduce
algorithm huanglei 74.42857142857143
---------in for write------
==================第9次进入reduce
math huanglei 74.42857142857143
---------in for write------
==================第10次进入reduce
algorithm huangzitao 72.75
---------in for write------
==================第11次进入reduce
math liujialing 72.75
---------in for write------
==================第12次进入reduce
computer huangzitao 72.42857142857143
---------in for write------
==================第13次进入reduce
english huangxiaoming 72.42857142857143
---------in for write------
==================第14次进入reduce
math wangbaoqiang 72.42857142857143
---------in for write------
==================第15次进入reduce
computer huangxiaoming 72.42857142857143
---------in for write------
==================第16次进入reduce
math xuzheng 69.28571428571429
---------in for write------
==================第17次进入reduce
english zhaobenshan 69.28571428571429
---------in for write------
==================第18次进入reduce
computer huangbo 65.25
---------in for write------
computer xuzheng 65.0
---------in for write------
==================第19次进入reduce
english zhouqi 64.18181818181819
---------in for write------
==================第20次进入reduce
computer liujialing 64.11111111111111
---------in for write------
==================第21次进入reduce
algorithm liuyifei 62.142857142857146
---------in for write------
==================第22次进入reduce
english liujialing 62.142857142857146
---------in for write------
==================第23次进入reduce
computer liuyifei 62.142857142857146
---------in for write------
==================第24次进入reduce
english liuyifei 59.57142857142857
---------in for write------
english huangdatou 56.0
---------in for write------
==================第25次进入reduce
math liutao 56.0
---------in for write------
==================第26次进入reduce
algorithm huangdatou 56.0
---------in for write------
==================第27次进入reduce
computer huangdatou 56.0
---------in for write------
==================第28次进入reduce
english huangbo 55.0
---------in for write------

Log when CustomBean does grouping and sorting within groups
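// The bean sorts scores in descending order within a group and sorts the groups lexicographically; these are the comparisons then executed in MyGroup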

algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---algorithm
algorithm---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---computer
computer---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---english
english---MyGroup中比较---math
math---MyGroup中比较---math
math---MyGroup中比较---math
math---MyGroup中比较---math
math---MyGroup中比较---math
math---MyGroup中比较---math
math---MyGroup中比较---math

// ====================== executed in reduce ==========================

==================第1次进入reduce
algorithm huangjiaju 82.28571428571429
---------in for write------
algorithm liutao 82.0
---------in for write------
algorithm huanglei 74.42857142857143
---------in for write------
algorithm huangzitao 72.75
---------in for write------
algorithm liuyifei 62.142857142857146
---------in for write------
algorithm huangdatou 56.0
---------in for write------
===================离开reduce
==================第2次进入reduce
computer huangjiaju 83.2
---------in for write------
computer liutao 83.0
---------in for write------
computer huanglei 74.42857142857143
---------in for write------
computer huangzitao 72.42857142857143
---------in for write------
computer huangxiaoming 72.42857142857143
---------in for write------
computer huangbo 65.25
---------in for write------
computer xuzheng 65.0
---------in for write------
computer liujialing 64.11111111111111
---------in for write------
computer liuyifei 62.142857142857146
---------in for write------
computer huangdatou 56.0
---------in for write------
===================离开reduce
==================第3次进入reduce
english huanglei 83.0
---------in for write------
english liuyifei 74.42857142857143
---------in for write------
english huangxiaoming 72.42857142857143
---------in for write------
english zhaobenshan 69.28571428571429
---------in for write------
english zhouqi 64.18181818181819
---------in for write------
english liujialing 62.142857142857146
---------in for write------
english liuyifei 59.57142857142857
---------in for write------
english huangdatou 56.0
---------in for write------
english huangbo 55.0
---------in for write------
===================离开reduce
==================第4次进入reduce
math huangxiaoming 83.0
---------in for write------
math huangjiaju 82.28571428571429
---------in for write------
math huanglei 74.42857142857143
---------in for write------
math liujialing 72.75
---------in for write------
math wangbaoqiang 72.42857142857143
---------in for write------
math xuzheng 69.28571428571429
---------in for write------
math liutao 56.0
---------in for write------
===================离开reduce


// If only the first element of values is taken each time

algorithm huangjiaju 82.28571428571429
==================第1次进入reduce
computer huangjiaju 83.2
==================第2次进入reduce
english huanglei 83.0
==================第3次进入reduce
math huangxiaoming 83.0
==================第4次进入reduce



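Other open questions

  • The log suggests that MyGroup has to read a different key before execution enters reduce to write output;
  • but under breakpoint debugging, what I saw was that as soon as two equal keys were read for the first time, execution went to reduce and wrote output;
  • later reads did not write anything, until the next time two equal keys were read, and then reduce wrote again.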
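
One possible explanation for the mismatch, offered as an assumption rather than a confirmed reading of Hadoop's source: the values handed to reduce are read lazily, and every time a record is consumed it is compared with the record that follows it. The first comparison would then happen before the first reduce call, and the later comparisons would happen inside reduce while it iterates over values. The sketch below models only that ordering in plain Java. `LazyGroupingSketch`, `trace` and `peek` are hypothetical names, and the model assumes reduce iterates over all values of a group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a lazy value iterator; not Hadoop's implementation.
class LazyGroupingSketch {

    /** Returns the order of comparisons and reduce events for a sorted key stream. */
    static List<String> trace(List<String> keys) {
        List<String> log = new ArrayList<>();
        int pos = 0; // index of the next unread record
        while (pos < keys.size()) {
            // Reading the first record of a group already compares it with its successor.
            String current = keys.get(pos++);
            boolean nextIsSame = peek(keys, pos, current, log);
            log.add("enter reduce(" + current + ")");
            log.add("value " + current);
            // Inside reduce: each further value is read, then compared with its successor.
            while (nextIsSame) {
                current = keys.get(pos++);
                nextIsSame = peek(keys, pos, current, log);
                log.add("value " + current);
            }
            log.add("leave reduce");
        }
        return log;
    }

    /** Compares the current key with the next unread one, if there is one. */
    private static boolean peek(List<String> keys, int pos, String current, List<String> log) {
        if (pos >= keys.size()) {
            return false;
        }
        String next = keys.get(pos);
        log.add("compare " + current + " vs " + next);
        return current.equals(next);
    }

    public static void main(String[] args) {
        for (String line : trace(Arrays.asList("algorithm", "algorithm", "computer"))) {
            System.out.println(line);
        }
    }
}
```

In this model a comparison of two equal keys is immediately followed by entering reduce, which matches what the debugger showed, while the comparison that finds a different key is what ends the group, which matches the impression from the log.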
