無逸: Python 하둡 스트리밍 (Hadoop Streaming) #1

2011년 4월 18일 월요일

# 참조
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

# 실행

./bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \

-file .../mapper.py -mapper .../mapper.py \

-file .../reducer.py -reducer .../reducer.py \

-input input_data \

-output output_data

# 기본 예제 : word count

1. mapper.py

#!/usr/bin/env python

import sys

for line in sys.stdin:

line = line.strip()

words = line.split()

for word in words:

print '%s\t%s' % (word,1)

2. reducer.py

#!/usr/bin/env python

import sys

from operator import itemgetter

word2count = {}

for line in sys.stdin:

line = line.strip()

word, count = line.split('\t', 1)

try:

count = int(count)

word2count[word] = word2count.get(word, 0) + count

except:

pass

# sort

sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

for word, count in sorted_word2count:

print '%s\t%s' % (word, count)

#
기본 예제를 조금만 수정하면,
간단한 처리는 대부분 가능하다.

#
참조 링크에 yield 와 groupby를 이용하는 코드가 나오는데 python 스럽기도 하고,
실제도 돌려보니 속도도 조금 빠르다.

無逸