Hourly PV with MRJob: Python Data Analysis (4)


1.1. Introduction

Here we use MRJob, a MapReduce (M/R) framework for Python, to run the analysis.

1.2. M/R steps

Mapper: parse each log line into the form key=hh (the two-digit hour), value=1.

Shuffle: after the shuffle, each key, in sorted key order, is paired with an iterator over its values.

For example: 09 [1, 1, 1 ... 1, 1]

Reduce: sum the values for each key; for key 09 this gives the page views (PV) for that hour.

Output, for example: 09 sum([1, 1, 1 ... 1, 1])
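To make the three steps concrete, here is a minimal pure-Python sketch that imitates them in memory, without MRJob. The function names and the five sample hours are made up for illustration; a real M/R framework streams and distributes this work rather than sorting one in-memory list.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(hours):
    """Emit (hour, 1) for every request, as in the Mapper step."""
    for hh in hours:
        yield hh, 1


def shuffle_phase(pairs):
    """Sort by key and collect each key's values, imitating the Shuffle step."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Sum the values of each key, as in the Reduce step."""
    for key, values in grouped:
        yield key, sum(values)


# Hypothetical sample: the hour (hh) of five requests.
sample_hours = ["09", "10", "09", "09", "10"]

shuffled = list(shuffle_phase(map_phase(sample_hours)))
result = dict(reduce_phase(shuffled))

print(shuffled)  # [('09', [1, 1, 1]), ('10', [1, 1])]
print(result)    # {'09': 3, '10': 2}
```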

1.3. Code

cat mr_pv_hour.py
# -*- coding: utf-8 -*-
 
from mrjob.job import MRJob
from ng_line_parser import NgLineParser
 
class MRPVHour(MRJob):
 
    ng_line_parser = NgLineParser()
 
    def mapper(self, _, line):
        self.ng_line_parser.parse(line)
        # str(access_time) is assumed to look like 'YYYY-MM-DD HH:MM:SS'
        dy, tm = str(self.ng_line_parser.access_time).split()
        h, m, s = tm.split(':')
        yield h, 1  # count for this hour
        yield 'total', 1  # count across all hours
 
    def reducer(self, key, values):
        yield key, sum(values)
 
def main():
    MRPVHour.run()
 
if __name__ == '__main__':
    main()
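The job imports NgLineParser from ng_line_parser, which is not listed in this part. To show what the mapper expects from it, here is a hedged stand-in for the time-parsing step only. The function name parse_access_time, the regular expression, and the sample log line are assumptions based on Nginx's usual "[24/Sep/2016:09:15:32 +0800]" timestamp layout; this is not the actual NgLineParser implementation.

```python
import re
from datetime import datetime

# Assumed timestamp layout in the access log: [24/Sep/2016:09:15:32 +0800]
TIME_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def parse_access_time(line):
    """Return the request time of one log line as a datetime (hypothetical helper)."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")


# Made-up sample line in the common combined log format.
sample_line = (
    '127.0.0.1 - - [24/Sep/2016:09:15:32 +0800] '
    '"GET / HTTP/1.1" 200 612 "-" "curl/7.29.0"'
)

access_time = parse_access_time(sample_line)

# Same steps as the mapper: split the date from the time, then take the hour.
dy, tm = str(access_time).split()
h, m, s = tm.split(":")
print(h)  # 09
```

If access_time is a datetime as assumed here, access_time.strftime('%H') yields the same two-digit hour without the string splitting.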

Run the job and view the output

python mr_pv_hour.py < www.ttmark.com.access.log
 
No configs found; falling back on auto-configuration
Creating temp directory /tmp/mr_pv_hour.root.20160924.130542.359063
Running step 1 of 1...
reading from STDIN
Streaming final output from /tmp/mr_pv_hour.root.20160924.130542.359063/output...
"00"    31539
"01"    34824
"02"    27895
"03"    29669
"04"    27742
"05"    26797
"06"    29384
"07"    31102
"08"    38257
"09"    43060
"10"    48064
"11"    57923
"12"    56413
"13"    57971
"14"    47260
"15"    46364
"16"    45721
"17"    48884
"18"    49318
"19"    49162
"20"    43641
"21"    42525
"22"    40371
"23"    34953
"total" 988839
Removing temp directory /tmp/mr_pv_hour.root.20160924.130542.359063...
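As a quick sanity check on the result, the sketch below re-reads the output lines above, confirms that the 24 hourly counts add up to the "total" key, and picks the busiest hour. It assumes each line is a JSON-encoded key followed by whitespace and a JSON-encoded value, which matches the output shown; the variable names are illustrative.

```python
import json

# Job output copied from the run above: JSON key, whitespace, JSON value.
raw_output = '''
"00"    31539
"01"    34824
"02"    27895
"03"    29669
"04"    27742
"05"    26797
"06"    29384
"07"    31102
"08"    38257
"09"    43060
"10"    48064
"11"    57923
"12"    56413
"13"    57971
"14"    47260
"15"    46364
"16"    45721
"17"    48884
"18"    49318
"19"    49162
"20"    43641
"21"    42525
"22"    40371
"23"    34953
"total" 988839
'''

counts = {}
for line in raw_output.strip().splitlines():
    key_json, value_json = line.split()
    counts[json.loads(key_json)] = json.loads(value_json)

# Separate the per-hour counts from the overall total.
hourly = {key: value for key, value in counts.items() if key != "total"}
peak_hour = max(hourly, key=hourly.get)

print("hours sum to total:", sum(hourly.values()) == counts["total"])
print("peak hour:", peak_hour, hourly[peak_hour])
```

For this log the hourly counts do add up to 988839, and the busiest hour is 13 with 57971 requests.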

Author: HH

  • Published on 29/10/2016 00:53:01
  • When reposting, please keep the link to this article: https://www.ttlsa.com/python/python-big-data-analysis-point-time-pv-mrjob/
  • mrjob
  • pandas
  • python
  • data analysis
  • hourly PV