Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
This tutorial is a step-by-step demo of how to run a Hadoop MapReduce job on a Hadoop cluster in AWS.
If you have not already done so, first connect to the main node of the cluster.
For this demo, we will use Python and mrjob, a Python package that helps you write and run Hadoop jobs. First, let's install pip on our node. In the Terminal window, type in the following command:
sudo apt install python-pip
Once pip is installed, we can install the mrjob package using:
pip install mrjob
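To quickly confirm that the package was installed correctly, you can ask pip for its details:
pip show mrjob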
If you have not done so already, download the dataset with the following command:
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/2_hdfs_mapreduce/ml-100k/u.data
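The u.data file comes from the MovieLens 100k dataset: each line holds a user id, an item id, a rating, and a timestamp, separated by tabs. You can take a quick look at the first lines with:
head u.data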
We also need to download our Python script:
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/2_hdfs_mapreduce/RatingsUsersMR.py
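The contents of RatingsUsersMR.py are not reproduced here, but a minimal mrjob job over this dataset, counting the number of ratings per user, could look like the sketch below. The actual script may differ; the class name and the per-user counting logic are assumptions for illustration only.

from mrjob.job import MRJob

class RatingsUsersMR(MRJob):

    # Mapper: each u.data line is tab-separated (user id, item id, rating, timestamp);
    # emit (user_id, 1) for every rating encountered
    def mapper(self, _, line):
        user_id, item_id, rating, timestamp = line.split('\t')
        yield user_id, 1

    # Reducer: sum the counts emitted for each user id
    def reducer(self, user_id, counts):
        yield user_id, sum(counts)

if __name__ == '__main__':
    RatingsUsersMR.run()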
Before running the mrjob script on the cluster, we can first run it locally. It is good practice to test and debug your script locally on a smaller dataset before running it on the cluster.
To execute your mrjob script locally, the command will have the following form:
python <mrjob_script> <dataset>
Which in this case will be:
python RatingsUsersMR.py u.data
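The results are printed directly to the Terminal; if you want to keep them, you can redirect them to a file (the file name local_output.txt below is just an example):
python RatingsUsersMR.py u.data > local_output.txt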
If the local run succeeds, we can run the mrjob script on the cluster using the following command:
python RatingsUsersMR.py -r hadoop --hadoop-streaming-jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar u.data --output-dir output
We specify the output directory with the parameter: --output-dir output
We also need to tell mrjob where to find the hadoop-streaming.jar file with the parameter: --hadoop-streaming-jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
You should now see progress in the Terminal. Note that once submitted, the job runs on the cluster as an asynchronous distributed process.
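Note that Hadoop will typically refuse to write to an output directory that already exists, so if you need to re-run the job with the same --output-dir you may first have to remove it:
hadoop fs -rm -r output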
Once the job has completed, we can see a summary and some metrics about the job that was executed.
The output of our job is also displayed in the Terminal.
We can also check the output of the job in our output directory in HDFS:
hadoop fs -ls output
Typically, a MapReduce job writes its results to a target directory in HDFS, with each Reduce task writing its own output file. With the newer MapReduce API these files are named part-r-nnnnn; for a streaming job like this one they appear as part-nnnnn, where nnnnn is the identifier of the Reducer. The first Reducer therefore writes part-00000, the second part-00001, and so on.
To inspect the content of each file, we can use:
hadoop fs -tail output/part-00000
hadoop fs -tail output/part-00001
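To view the full output at once rather than just the tail of each file, you can concatenate all the part files:
hadoop fs -cat output/part-*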
We can also log in to HUE and view the files in the File Browser. In a web browser, type in the URL:
http://server_host:8889
Which can be for example:
http://ec2-52-36-64-79.us-west-2.compute.amazonaws.com:8889
Then enter your login and password.
Click on "Files" in the "Browser" section of the main menu (upper left corner).
We can also view the job summary in a more user-friendly way than through the Terminal. In the upper right corner of the HUE window, click on "Jobs". By default, we will see the list of jobs that ran for the user we are logged in with.
Click on the latest job in the list, and we should see the details of that specific job. We can see, for example, that this MapReduce job had 2 mappers and 2 reducers: