Saturday, 2 January 2016

Hadoop: Single Node Installation

Step 1: Go to the Hadoop website and download the latest stable Hadoop release.

After downloading, extract the tar file.
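
For example, assuming the downloaded file is hadoop-2.6.0.tar.gz (adjust the name to whichever version you downloaded), something like this works:

$ tar -xzf hadoop-2.6.0.tar.gz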

Step 2: Set the following variables in your ~/.bashrc, ~/.zshrc, or ~/.profile.

export HADOOP_PREFIX="/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0"

“HADOOP_PREFIX” is the directory where you extracted Hadoop.

export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
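
After editing the file, reload it in your current shell and confirm the variable is picked up, for example:

$ source ~/.profile
$ echo $HADOOP_PREFIX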

Step 3: Configure HDFS

HDFS stands for Hadoop Distributed File System. It comprises two kinds of components.

a.   NameNode: Holds all the metadata about the Hadoop cluster, i.e., information about the files stored in the cluster.
b.   DataNodes: The nodes where the data is actually stored.

The HDFS configuration file is located at “$HADOOP_PREFIX/etc/hadoop/hdfs-site.xml”. By default this file is empty, and Hadoop falls back to the default values documented in hdfs-default.xml.



To run Hadoop on a single node, only minimal configuration is required, so add the following to hdfs-site.xml.
<configuration>
 <property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/hdfs/datanode</value>
  <description>Determines where on the local filesystem a DFS data node
   should store its blocks</description>
 </property>

 <property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/hdfs/namenode</value>
  <description>Determines where on the local filesystem the DFS name
   node should store the name table (fsimage)</description>
 </property>

</configuration>

Note
Make sure to replace /Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/ with whatever you set $HADOOP_PREFIX to.
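
It can also help to create the name node and data node directories up front (directories that do not exist are simply ignored by the data node). A quick sketch, assuming $HADOOP_PREFIX is exported as in Step 2 and the layout above:

$ mkdir -p $HADOOP_PREFIX/hdfs/namenode $HADOOP_PREFIX/hdfs/datanode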

The Hadoop documentation describes these properties as follows.

dfs.datanode.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.

dfs.namenode.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

Update “$HADOOP_PREFIX/etc/hadoop/core-site.xml” with the following configuration, so that the Hadoop modules know where the name node is located.
<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost/</value>
  <description>NameNode URI</description>
 </property>
</configuration>
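
To double-check which name node URI Hadoop actually picks up from core-site.xml, you can use the hdfs getconf tool once the hdfs command is on your PATH (Step 5); it should print hdfs://localhost/:

$ hdfs getconf -confKey fs.defaultFS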


Step 4: Configure YARN
To configure YARN, update the “$HADOOP_PREFIX/etc/hadoop/yarn-site.xml” file. By default it is empty, and Hadoop falls back to the default values documented in yarn-default.xml.


Add the following configuration to yarn-site.xml.
<configuration>
 <property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
  <description>The minimum allocation for every container request at the
   Resource manager, in MBs</description>
 </property>
 <property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
  <description>The maximum allocation for every container request at the
   Resource Manager, in MBs.</description>
 </property>
 <property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
  <description>The minimum allocation for every container request at the
   RM, in terms of virtual CPU cores</description>
 </property>
 <property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>2</value>
  <description>The maximum allocation for every container request at the
   RM, in terms of virtual CPU cores</description>
 </property>
 <property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
  <description>Amount of physical memory, in MB, that can be allocated
   for containers.</description>
 </property>
 <property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
  <description>Number of vcores that can be allocated for containers.</description>
 </property>
</configuration>


As per the Hadoop documentation:

yarn.scheduler.minimum-allocation-mb: The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this will throw an InvalidResourceRequestException.

yarn.scheduler.maximum-allocation-mb: The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.

yarn.scheduler.minimum-allocation-vcores: The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this will throw an InvalidResourceRequestException.

yarn.scheduler.maximum-allocation-vcores: The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.

yarn.nodemanager.resource.memory-mb: Amount of physical memory, in MB, that can be allocated for containers.

yarn.nodemanager.resource.cpu-vcores: Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. It is not used to limit the number of physical cores used by YARN containers.

Note:
You can adjust the above values depending on your computer's hardware configuration.
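
For example, with the values above each container is allocated between 128 MB and 2048 MB of memory and 1 or 2 vcores. The node manager advertises 4096 MB and 4 vcores in total, so memory alone would allow up to 32 minimum-sized containers (4096 / 128), but the 4 available vcores cap the node at 4 concurrent containers of 1 vcore each.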

Step 5: Add Hadoop's bin and sbin directories to your system PATH. For example, I extracted Hadoop at “/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/”.

So I added the following statement to my ~/.profile file.

export PATH=$PATH:/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/bin:/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/sbin
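
Since $HADOOP_PREFIX is already exported in Step 2 (and appears earlier in the same file), the same line can be written more compactly; either form works:

export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin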

Step 6: Open a new terminal and format the name node directory (do this only once).

$ hdfs namenode -format
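
If the format succeeds, the name node directory configured in Step 3 should now contain a current/ subdirectory with a VERSION file and an initial fsimage. A quick check, assuming the paths used above:

$ ls $HADOOP_PREFIX/hdfs/namenode/current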

Step 7: Start name node daemon
$ hadoop-daemon.sh start namenode

Step 8: Start data node daemon
$ hadoop-daemon.sh start datanode

Step 9: Start YARN daemons
Start the Resource Manager daemon.
$ yarn-daemon.sh start resourcemanager

Start the Node Manager daemon.
$ yarn-daemon.sh start nodemanager

Run the “jps” command to check whether all daemons are running.

$ jps
1583 ResourceManager
1200 NameNode
1717 Jps
1466 DataNode
1637 NodeManager

If anything goes wrong, go through the error log and rectify it. The error log is located at
“$HADOOP_PREFIX/logs/<daemon with problems>.log”.

Step 10: Test whether your setup is correct.

Run the following command (refer to the Debugging section if you face any problems).

hadoop jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar --shell_command date --num_containers 2 --master_memory 1024

The jar name “hadoop-yarn-applications-distributedshell-2.6.0.jar” appears twice in the command above; update both occurrences as per the Hadoop version installed.
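
To confirm the application actually completed, you can list finished YARN applications (assuming the yarn command is on your PATH):

$ yarn application -list -appStates FINISHED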

Hurray… You have successfully installed Hadoop. It is time to play around with it.
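
When you are done playing, the daemons can be stopped with the matching stop commands:

$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh stop resourcemanager
$ hadoop-daemon.sh stop datanode
$ hadoop-daemon.sh stop namenode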

Debugging
1. If you see “java.net.UnknownHostException” in the error log, make sure the following line is present in /etc/hosts; if it is not, add it.

127.0.0.1 localhost localhost

2. On a Mac, you also have to change the hostname to localhost after adding the above entry.

scutil --set HostName localhost

The above command sets the host name to localhost.

scutil --get HostName

The above command prints the current host name.

3. On a Mac, create a symbolic link to java as follows.
sudo ln -s /usr/bin/java /bin/java
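
You can verify the link works by running:

$ /bin/java -version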
