Step 1: Go to the Hadoop website and download the latest stable Hadoop release. After downloading, extract the tar file.
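For example, you can download and extract the release from the command line (a minimal sketch; the mirror URL, the 2.6.0 version and the target directory are assumptions, so substitute the release and location you actually use):
$ mkdir -p ~/softwares/Hadoop
$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xzf hadoop-2.6.0.tar.gz -C ~/softwares/Hadoop/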
Step 2: Add the following variables to your ~/.bashrc, ~/.zshrc, or ~/.profile. "HADOOP_PREFIX" is the directory location where you unzipped Hadoop.
export HADOOP_PREFIX="/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0"
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
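After saving the file, reload your shell configuration so the variables take effect in the current terminal (assuming you edited ~/.profile; use whichever file you changed):
$ source ~/.profile
$ echo $HADOOP_PREFIX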
Step 3: Configure HDFS
HDFS stands for Hadoop Distributed File System. It comprises two components.
a. Name Node: It holds all the metadata about the Hadoop cluster, i.e. metadata about the files stored in the cluster.
b. Data Nodes: This is where the data is actually stored.
The HDFS configuration file is located at "$HADOOP_PREFIX/etc/hadoop/hdfs-site.xml". By default this file is empty, and Hadoop falls back to its built-in defaults (hdfs-default.xml).
To work with Hadoop on a single node, we require only minimal configuration, so add the following properties to hdfs-site.xml.
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/hdfs/datanode</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/hdfs/namenode</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage)</description>
  </property>
</configuration>
Note: Make sure to replace /Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/ with whatever you set $HADOOP_PREFIX to.
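The datanode and namenode directories referenced above do not exist in a fresh extract. Hadoop generally creates them on its own, but you can create them up front to avoid permission surprises (a small sketch, assuming $HADOOP_PREFIX is set as in Step 2):
$ mkdir -p $HADOOP_PREFIX/hdfs/namenode $HADOOP_PREFIX/hdfs/datanode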
As per the Hadoop documentation:

dfs.datanode.data.dir: Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.

dfs.namenode.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
Update "$HADOOP_PREFIX/etc/hadoop/core-site.xml" with the following, so that the Hadoop modules know where the name node is located.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
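Once the bin directory is on your PATH (Step 5), you can verify that this setting is being picked up by querying the effective configuration; hdfs getconf is part of the standard Hadoop 2.x distribution and should print hdfs://localhost/ here:
$ hdfs getconf -confKey fs.defaultFS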
Step 4: Configure YARN
To configure YARN, update the "$HADOOP_PREFIX/etc/hadoop/yarn-site.xml" file. By default it is empty, and Hadoop falls back to its built-in defaults (yarn-default.xml).
Add the following configurations to yarn-site.xml.
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>The minimum allocation for every container request at the Resource Manager, in MBs</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>The maximum allocation for every container request at the Resource Manager, in MBs</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of vcores that can be allocated for containers.</description>
  </property>
</configuration>
As per the Hadoop documentation:

yarn.scheduler.minimum-allocation-mb: The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this will throw an InvalidResourceRequestException.

yarn.scheduler.maximum-allocation-mb: The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.

yarn.scheduler.minimum-allocation-vcores: The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this will throw an InvalidResourceRequestException.

yarn.scheduler.maximum-allocation-vcores: The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.

yarn.nodemanager.resource.memory-mb: Amount of physical memory, in MB, that can be allocated for containers.

yarn.nodemanager.resource.cpu-vcores: Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of physical cores used by YARN containers.
Note: You can adjust the above configurations depending on your computer's hardware.
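To pick values that match your machine, you can check the physical memory and core count first; on macOS, for example:
$ sysctl -n hw.memsize   # physical memory, in bytes
$ sysctl -n hw.ncpu      # number of logical cores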
Step 5: Add Hadoop's bin and sbin directories to your system PATH. For example, I unzipped Hadoop at "/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/", so I added the following statement to my ~/.profile file.
export PATH=$PATH:/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/bin:/Users/harikrishna_gurram/softwares/Hadoop/hadoop-2.6.0/sbin
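Reload the profile and confirm that the Hadoop commands resolve (hadoop version is a standard subcommand of the hadoop CLI):
$ source ~/.profile
$ which hadoop
$ hadoop version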
Step 6: Open a new terminal and format the name node directory (do this only once).
$ hdfs namenode -format
Step 7: Start the name node daemon.
$ hadoop-daemon.sh start namenode
Step 8: Start the data node daemon.
$ hadoop-daemon.sh start datanode
Step 9: Start the YARN daemons.
Start the resource manager daemon.
$ yarn-daemon.sh start resourcemanager
Start the node manager daemon.
$ yarn-daemon.sh start nodemanager
Run the "jps" command to check whether all the daemons are running.
$ jps
1583 ResourceManager
1200 NameNode
1717 Jps
1466 DataNode
1637 NodeManager
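You can also confirm the daemons through their web UIs; in Hadoop 2.x the NameNode UI listens on port 50070 and the ResourceManager UI on port 8088 by default. On a Mac:
$ open http://localhost:50070   # NameNode web UI
$ open http://localhost:8088    # ResourceManager web UI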
If anything goes wrong, go through the error log and rectify it. The error log is located at "$HADOOP_PREFIX/logs/<daemon with problems>.log".
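For example, to look at the tail of every daemon log at once (the actual file names include your user name and host name, e.g. hadoop-<user>-namenode-<host>.log):
$ tail -n 50 $HADOOP_PREFIX/logs/*.log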
Step 10: Test whether your setup is correct.
Run the following command (refer to the Debugging section if you face any problems).
hadoop jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar --shell_command date --num_containers 2 --master_memory 1024
Note that the jar name "hadoop-yarn-applications-distributedshell-2.6.0.jar" appears twice in the command; update both occurrences to match the Hadoop version you installed.
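Once the command completes, you can optionally confirm that the application finished successfully through YARN's application list (the output, including the application ID, will differ on your machine):
$ yarn application -list -appStates FINISHED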
Hurray… You have successfully installed Hadoop. It is time to play around with it.
Debugging
1. If you see "java.net.UnknownHostException" in the error log, make sure the following entry is present in /etc/hosts; if it is not, add it.
127.0.0.1 localhost localhost
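A quick way to check whether the entry is already there, and to append it if it is missing (editing /etc/hosts requires sudo):
$ grep localhost /etc/hosts
$ echo "127.0.0.1 localhost localhost" | sudo tee -a /etc/hosts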
2. On Mac, you have to change the hostname to localhost after adding the above entry.
$ scutil --set HostName localhost
The above command sets the hostname to localhost.
$ scutil --get HostName
The above command prints the current hostname.
3. On Mac, create a symbolic link to java as follows.
$ sudo ln -s /usr/bin/java /bin/java
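To confirm Java is visible afterwards, you can run:
$ java -version
$ which java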