
Setting up a Hadoop Cluster on Mac OS X Mountain


Posted on March 07, 2014 10:02 am | by Krissada Dechokul

I have been working on using the Hadoop platform as an infrastructure for distributed testing of iOS applications, so I had to set up a Hadoop cluster on Mac OS X for my experiment. Mac OS X has its roots in UNIX, so in theory we should be able to set up Hadoop under the Mac OS X environment,... right? This post describes the steps I took to set up such an environment.


Prerequisites

Java

Java is required on each node in the cluster in order to run Hadoop. A working installation of Java SE 6 version 1.6.0_65 for OS X Mountain Lion was used in this experiment (http://support.apple.com/kb/dl1572).
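
You can verify the installation from the terminal with the following command; it should report version 1.6.0_65 (the exact build string may differ depending on which Apple Java update is installed).

java -version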

SSH

Hadoop relies on SSH to communicate between nodes in the cluster and to perform cluster-wide operations. In order to work seamlessly, SSH should be configured to allow key-based, passwordless login for users from machines in the Hadoop cluster.

On Mac OS X, we first need to enable Remote Login by doing the following:

  • Go to System Preferences
  • Go to Sharing
  • Check the Remote Login option
  • Also note the Computer Name shown here; it will be used as a HOST_NAME during the setup (a command for reading it from the terminal is shown after this list)
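
If you prefer the command line, the Computer Name can also be read with the scutil tool that ships with OS X; this is just a convenience and is not required for the setup.

scutil --get ComputerName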

Then we need to set up an RSA public/private key pair to be able to ssh into each node. On each node (both master and slaves), type the following command to generate the node's RSA key pair.

ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa

The private key is stored in the file specified by the -f option, in this case $HOME/.ssh/id_rsa, and the public key is stored in a file with the same name but with a .pub extension appended, in this case $HOME/.ssh/id_rsa.pub.

Then, we need to make sure that the public key is authorized by appending it to $HOME/.ssh/authorized_keys using the following command.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
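
If ssh still prompts for a password after this step, the permissions on the .ssh directory and the authorized_keys file may be too open; the following commands (a general ssh fix, not something specific to Hadoop) tighten them to what sshd expects.

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys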

We need to make sure that the host names in the cluster are configured correctly in the hosts file at /etc/hosts so that each node can be reached by its name rather than its IP address (an example is shown after the ssh commands below). We then ssh to localhost and to the actual host names to make sure that ssh is working correctly. Both the master node and the slave nodes must be able to ssh to each other. This step will also add each host's fingerprint to the known_hosts file. Note that the HOST_NAME of a Mac OS X machine can be found under the Sharing option in System Preferences.

ssh localhost

ssh USERNAME@HOST_NAME
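
For reference, the /etc/hosts entries on each node might look something like the following; the host names and IP addresses are placeholders for illustration, so substitute the actual values for your machines.

192.168.1.10    MASTER_HOST_NAME
192.168.1.11    SLAVE_HOST_NAME_1
192.168.1.12    SLAVE_HOST_NAME_2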

Finally, we need to distribute the public key of the master node to all slave nodes in the cluster by using the following command, which appends the public key to the remote host's authorized_keys file.

cat $HOME/.ssh/id_rsa.pub | ssh USERNAME@HOST_NAME 'cat >> $HOME/.ssh/authorized_keys'

Hadoop Installation

At the time of writing, Hadoop version 1.2.1 is the current stable version and was used in the experiment. The current release of Hadoop can be downloaded from the Apache Hadoop Releases website (http://hadoop.apache.org/releases.html), and the downloaded package can then be unpacked at the location of your choice. A quick installation guide for Hadoop can be found on the Hadoop Wiki (http://wiki.apache.org/hadoop/QuickStart).


However, the easiest way to install Hadoop on Mac OS X is through Homebrew, a package manager for OS X (http://brew.sh). The tool can be obtained on the fly by running the following command in the terminal.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

***Correction 8/11/2014*** Sorry guys, the URL for getting brew has moved from

https://raw.github.com/Homebrew/homebrew/go/install

to

https://raw.githubusercontent.com/Homebrew/install/master/install

After Homebrew has been installed on the machine, Hadoop can be installed with Homebrew simply by using the following command.

brew install hadoop

Alternatively, use the following commands to check the available versions first and install a specific version of Hadoop.

brew search hadoop

brew install homebrew/versions/hadoop121

Homebrew will install Hadoop in /usr/local/Cellar/hadoop/ and will also set $JAVA_HOME using /usr/libexec/java_home.
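
To confirm that the installation succeeded, we can ask Hadoop for its version from the terminal; with the setup described here it should report 1.2.1.

hadoop version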

Hadoop Configuration

There are six configuration files that need to be customized; they are located in /usr/local/Cellar/hadoop/1.2.1/libexec/conf:

  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • masters
  • slaves


hadoop-env.sh

This file sets the environment variables that are used in the scripts that run Hadoop. Homebrew has already done all the work for us during that simple installation, but there is an issue that appears on Mac OS X Lion and Mountain Lion and requires some configuration in this file to resolve (https://issues.apache.org/jira/browse/HADOOP-7489). We need to add the following line to the file.

export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="


core-site.xml

This file sets the configuration for Hadoop Core, such as the I/O settings of the nodes. This file needs to be configured on every node in the cluster.


 
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://[MASTER_HOST_NAME]:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>

The fs.default.name property must point to the master node only, with the correct port that the master node is listening on.
The hadoop.tmp.dir property is the directory for Hadoop to write its temporary working files into.

hdfs-site.xml

This file controls the configuration of the Hadoop Distributed File System processes: the name-node, the secondary name-node, and the data-nodes.


 
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

The dfs.replication property controls the number of replicas created when a file is written to HDFS. If we want to utilize all the computing power from every node, this value should equal the number of nodes available in the cluster.
The dfs.permissions property is set to false to avoid permission issues during the execution of our experiment. This means that any user can do anything to HDFS, but since users need to be able to log in to Mac OS X in the first place, turning this off seems reasonable to get rid of all the problems we might encounter (not recommended in a production environment, though).

mapred-site.xml

This file controls the configuration of the MapReduce processes: the jobtracker and the tasktrackers.


 
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>[MASTER_HOST_NAME]:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.max.split.size</name>
    <value>1000</value>
  </property>
</configuration>

Again, the mapred.job.tracker property must point to the master node only, with the correct port, since only the master node runs the jobtracker in a Hadoop cluster.
The mapred.tasktracker.map.tasks.maximum property controls the number of map tasks running per node. For my experiment, the iOS simulator has the limitation that only one simulator can run at a time, so we cannot have more than one map task running tests on a single node. We therefore set mapred.tasktracker.map.tasks.maximum to 1. Under typical processing conditions, this value might be 4, depending on the computing capability of your machine's CPU.
The mapred.tasktracker.reduce.tasks.maximum property controls the number of reduce tasks running per node. Again, for my experiment, since the job of the Reduce function is trivial (it just collects and merges the test results), mapred.tasktracker.reduce.tasks.maximum is set to 1 as well. Under typical conditions, this value could be 2.
The mapred.max.split.size property directly controls how Hadoop splits and distributes the input file throughout HDFS. Our input file, which is just a list of test execution commands, is not big (probably on the order of a few MB rather than GB or TB), but each test execution is compute-intensive and time-consuming. We need to set this value small enough to force Hadoop to distribute the job to the other nodes as well; otherwise, Hadoop won't even bother distributing the jobs and will just run on a single node. For this experiment, we set this value to 1000 bytes (roughly 1 KB), so, for example, a 100 KB input file is cut into roughly 100 splits and the map tasks end up spread across the cluster. You can also leave this at the default by simply not including this property in the file.

masters

The masters file is a list of host names or IP addresses of machines that each run a secondary name-node (not the machine that runs the master name-node, but the secondary name-node; this could still be the master machine, though). In this experiment, the master node acts not only as the name-node but also as the secondary name-node, which is probably not the best practice since the job of the secondary name-node is to provide checkpoints of the name-node in case the name-node fails. So the masters file contains just the following line. This file only needs to be set on the master node.

MASTER_USERNAME@MASTER_HOST_NAME


slaves

The slaves file is a list of host names or IP addresses of machines that each run a data-node and a tasktracker in the cluster. The master node can also act as a data-node, so the master node can appear in this list as well. This file only needs to be set on the master node.

MASTER_USERNAME@MASTER_HOST_NAME
SLAVE_USERNAME_1@SLAVE_HOST_NAME_1
SLAVE_USERNAME_2@SLAVE_HOST_NAME_2
SLAVE_USERNAME_3@SLAVE_HOST_NAME_3

Test Running Hadoop

Before starting our Hadoop cluster, we need to initialize the HDFS first by using the following command.

hadoop namenode -format

Then to start our Hadoop cluster, we simply need to execute the following command.

/usr/local/Cellar/hadoop/1.2.1/libexec/bin/start-all.sh
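
Once the daemons are up, a quick way to check what is actually running on each node is the jps tool that ships with the JDK; on a node that runs every role you would expect to see processes such as NameNode, SecondaryNameNode, JobTracker, DataNode, and TaskTracker. The NameNode and JobTracker web interfaces (by default at http://MASTER_HOST_NAME:50070 and http://MASTER_HOST_NAME:50030 in Hadoop 1.x) are also handy for checking the cluster status.

jps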

We can test whether our Hadoop cluster is working correctly by running one of the samples provided with the Hadoop installation, for example, with the following command. The command should give us the result "Estimated value of Pi is 3.14800000000000000000".

hadoop jar /usr/local/Cellar/hadoop/1.2.1/libexec/hadoop-examples-*.jar pi 10 100

To stop our Hadoop cluster, we can use the following command.

/usr/local/Cellar/hadoop/1.2.1/libexec/bin/stop-all.sh

And that’s it! Congratulations, you’ve successfully set up your own Hadoop cluster on the Mac!

Conclusion

The instructions in this post probably aren't the best practice for setting up a Hadoop cluster under the Mac OS X environment; they were customized for my own experiment. If I misunderstood something or if you have any suggestions, comments under this post are welcome!

