This link has been bookmarked by 108 people. It was first bookmarked on 09 Oct 2007, by someone privately.
-
27 Sep 11
Jochen Fromm
how to write a simple MapReduce program for Hadoop in Python
python hadoop mapreduce programming distributed parallel cluster
-
18 Nov 10
-
24 Oct 10
-
22 Oct 10
-
15 Oct 10
-
23 Sep 10
Alfred Reich
Describes how to write a simple MapReduce program for Hadoop in the Python programming language.
-
27 Aug 10
-
15 Jul 10
-
The "trick" behind the following Python code is that we will use HadoopStreaming (see also the wiki entry) to help us
-
print('%s\t%s' % (word, 1))
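The highlighted line is the output step of a word-count mapper. A minimal sketch of how it might sit inside a mapper follows, assuming whitespace-separated words and Python 3. The function name `map_words` and the sample input are illustrative and not taken from the tutorial.

```python
def map_words(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            # Same formatting as the highlighted print statement.
            yield '%s\t%s' % (word, 1)

# In-memory stand-in for the lines a streaming job would supply on STDIN.
sample = ['foo bar\n', 'foo\n']
for record in map_words(sample):
    print(record)
```

In a real streaming script, `sys.stdin` would presumably be passed in place of `sample`.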
-
An important note about mapred.map.tasks: Hadoop treats mapred.map.tasks only as a hint and may ignore it, whereas it accepts the user-specified mapred.reduce.tasks unchanged. In other words, you cannot force the number of map tasks, but you can set the number of reduce tasks.
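As a rough illustration of the note above, the sketch below assembles a Hadoop Streaming command line that sets mapred.reduce.tasks through the generic `-D` option, which has to come before the streaming-specific options. The helper name `streaming_args`, the jar file name, and the input and output paths are all placeholders assumed for this example; the script names follow the tutorial's outline.

```python
def streaming_args(streaming_jar, input_path, output_path, num_reducers):
    """Build an argument list for a streaming job with a fixed reducer count."""
    return [
        'hadoop', 'jar', streaming_jar,
        # Generic option: honored as given, unlike mapred.map.tasks.
        '-D', 'mapred.reduce.tasks=%d' % num_reducers,
        # Ship the scripts with the job and name them as mapper and reducer.
        '-file', 'mapper.py', '-mapper', 'mapper.py',
        '-file', 'reducer.py', '-reducer', 'reducer.py',
        '-input', input_path,
        '-output', output_path,
    ]

# Placeholder jar name and paths, for illustration only.
args = streaming_args('hadoop-streaming.jar', 'input-dir', 'output-dir', 16)
print(' '.join(args))
```

The list could be handed to `subprocess.run` on a machine with Hadoop installed; adding a mapred.map.tasks setting in the same way would only act as a hint.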
-
-
30 Jun 10
-
29 May 10
-
26 May 10
-
17 May 10
-
24 Apr 10
-
16 Apr 10
-
22 Feb 10
-
14 Dec 09
-
02 Dec 09
-
30 Nov 09
-
27 Nov 09
-
06 Oct 09
-
02 Oct 09
-
25 Sep 09
-
19 Sep 09
-
20 Aug 09
-
31 Jul 09
-
22 Jul 09
pierre arlais
distributed computing on large data sets
-
08 Jul 09
Nico Coetzee
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++
apache java hadoop hive python programming cluster distributed parallel tutorial
-
24 Jun 09
-
Prerequisites
-
-
30 May 09
-
24 Apr 09
-
17 Apr 09
-
03 Apr 09
-
31 Mar 09
-
30 Jan 09
-
18 Jan 09
-
15 Jan 09
-
10 Dec 08
-
30 Nov 08
-
21 Nov 08
-
07 Oct 08
Andrew Perry
The "trick" behind the following Python code is that we will use HadoopStreaming (see also the wiki entry) to help us pass data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python's sys
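To make the STDIN/STDOUT hand-off concrete, here is a minimal sketch of the Reduce side, assuming the input arrives as 'word<TAB>count' lines already sorted by word, which is how Hadoop delivers mapper output to a reducer. The function name `reduce_counts` and the sample lines are illustrative and not taken from the tutorial.

```python
from itertools import groupby

def reduce_counts(lines):
    """Sum the counts for each run of identical words in key-sorted input."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '%s\t%d' % (word, sum(int(count) for _, count in group))

# In-memory stand-in for the sorted lines a reducer would read from STDIN.
sample = ['bar\t1\n', 'foo\t1\n', 'foo\t1\n']
for record in reduce_counts(sample):
    print(record)
```

Because `groupby` only merges adjacent equal keys, the sketch relies on the sorted order; on unsorted input it would emit the same word more than once.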
-
28 Sep 08
-
23 Sep 08
-
04 Sep 08
s_bergmann
This tutorial describes how to write a simple MapReduce program for Hadoop in Python without using Jython.
-
03 Sep 08
-
29 Aug 08
-
11 Jul 08
-
10 Jul 08
-
06 Jul 08
-
05 Jul 08
-
06 Jun 08
Scott Moody
Writing An Hadoop MapReduce Program In Python
In this tutorial, I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.
Contents
1 Motivation
2 What we want to do
3 Prerequisites
4 Python MapReduce Code
4.1 Map: mapper.py
4.2 Reduce: reducer.py
4.3 Test your code (cat data | map | reduce)
5 Running the Python Code on Hadoop
5.1 Download example input data
5.2 Copy local example data to HDFS
5.3 Run the MapReduce job
6 Improved Mapper and Reducer code: using Python iterators and generators
6.1 mapper.py
6.2 reducer.py
7 Feedback
8 Related Links
Motivation
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that you must translate your Python code into a Java jar file using Jython. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue with the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop: just have a look at the example in <HADOOP_INSTALL>/src/examples/python/WordCount.py and you will see what I mean. I still recommend at least having a look at the Jython approach, and maybe even at the new C++ MapReduce API called Pipes, which is really interesting.
That said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
02 Jun 08
-
31 May 08
Nitin Hayaran
In this tutorial, I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.
-
30 May 08
-
15 May 08
-
01 May 08
Ben Godfrey
A walkthrough of getting into Hadoop MapReduce with Python.
-
17 Apr 08
-
16 Apr 08
-
21 Mar 08
-
01 Mar 08
-
19 Feb 08
-
05 Feb 08
-
15 Dec 07
voidfiles
hadoop looks like a key in an infrastructure
python hadoop mapreduce programming distributed cluster parallel
-
12 Dec 07
-
17 Nov 07
-
01 Nov 07
-
14 Oct 07
-
13 Oct 07
-
12 Oct 07
-
10 Oct 07
-
09 Oct 07
-
25 Sep 07
-
24 Sep 07