Install Hadoop 3.x: Single Node Hadoop Cluster

We are going to set up the NameNode, DataNode, ResourceManager and NodeManager all on a single machine.

Step 1: Create user and group:

groupadd hadoop
useradd -g hadoop hadoop

Set up passwordless SSH for the hadoop user; refer to http://www.techguru.my/linux-admin/ssh/passwordless-ssh/
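
For reference, a minimal passwordless-SSH setup for the hadoop user could look like this (assuming the default home directory /home/hadoop):

su - hadoop
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost hostname                          # should not prompt for a password
exit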

Step 2: Install the Java SDK:

yum install java
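
Note that on CentOS/RHEL a plain yum install java pulls in only the JRE. Since the JAVA_HOME used later points to /etc/alternatives/java_sdk_1.8.0, installing the JDK (devel) package is the safer choice (the package name below assumes CentOS/RHEL with OpenJDK 8):

yum install -y java-1.8.0-openjdk-devel
java -version                              # verify the installed Java version
ls -ld /etc/alternatives/java_sdk_1.8.0    # this path should now exist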

Step 3: Download the Hadoop Package.

Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.0/hadoop-3.1.0.tar.gz

Step 4: Extract the Hadoop tar file and set up symlinks

Command:
mv hadoop-3.1.0.tar.gz /usr/local/
cd /usr/local
tar -xvf hadoop-3.1.0.tar.gz
ln -s hadoop-3.1.0 hadoop
ln -s /usr/local/hadoop/etc/hadoop /etc/hadoop
chown -R hadoop:hadoop hadoop-3.1.0

(The mv assumes the tarball was downloaded into the current directory in Step 3. The chown targets the real directory rather than the symlink so that -R takes effect.)
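
A quick sanity check that the extraction, symlinks and ownership came out as expected:

ls -ld /usr/local/hadoop /usr/local/hadoop-3.1.0 /etc/hadoop
ls -l /usr/local/hadoop/bin/hadoop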

Step 5: Add the Hadoop and Java paths

Open the /etc/profile file and add the Hadoop and Java paths as shown below:

export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
export HADOOP_SSH_OPTS="-p 22 -l hadoop"
export PATH=$PATH:/usr/local/hadoop/bin:/etc/alternatives/java_sdk_1.8.0/bin
Then save and close the file, and run: source /etc/profile
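
After running source /etc/profile, a quick check that the environment is picked up correctly:

echo $HADOOP_HOME     # should print /usr/local/hadoop
echo $JAVA_HOME       # should print /etc/alternatives/java_sdk_1.8.0
hadoop version        # should report Hadoop 3.1.0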

Step 6: Edit the Hadoop Configuration files

cd /etc/hadoop
vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
    <final>true</final>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/data</value>
    <description>Determines where on the local filesystem a DFS data node
       should store its blocks.  If this is a comma-delimited
       list of directories, then data will be stored in all named
       directories, typically on different devices.
       Directories that do not exist are ignored.
    </description>
    <final>true</final>
  </property>

  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
    <description>Determines datanode heartbeat interval in seconds.
    </description>
  </property>
  <property>
    <name>dfs.namenode.safemode.threshold-pct</name>
    <value>1.0f</value>
    <description>
        Specifies the percentage of blocks that should satisfy
        the minimal replication requirement defined by dfs.namenode.replication.min.
        Values less than or equal to 0 mean not to start in safe mode.
        Values greater than 1 will make safe mode permanent.
    </description>
  </property>

  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:1004</value>
  </property>

  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:1006</value>
  </property>

  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
    <description>The address and port on which the NameNode web UI listens.
    </description>
    <final>true</final>
  </property>

  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:8025</value>
    <description>
      The datanode ipc server address and port.
      If the port is 0 then the server will start on a free port.
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.permissions.umask-mode</name>
    <value>077</value>
    <description>
      The octal umask used when creating files and directories.
    </description>
  </property>
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>700</value>
    <description>The permissions that should be set on the dfs.datanode.data.dir
      directories. The datanode will not come up if the permissions are
      different on existing dfs.datanode.data.dir directories. If the directories
      don't exist, they will be created with this permission.</description>
  </property>
</configuration>
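
You can create the name and data directories configured above up front, so they end up owned by the hadoop user with the permissions required by dfs.datanode.data.dir.perm (otherwise the daemons create them on first start). For example:

mkdir -p /home/hadoop/name /home/hadoop/data
chown -R hadoop:hadoop /home/hadoop/name /home/hadoop/data
chmod 700 /home/hadoop/data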

Step 7: Edit hadoop-env.sh and add the Java path as shown below:

hadoop-env.sh contains the environment variables used by the Hadoop scripts, such as the Java home path.
Command: vi hadoop-env.sh
export JAVA_HOME=/etc/alternatives/java_sdk_1.8.0

Step 8: Go to the Hadoop home directory and format the NameNode.

Command: cd /usr/local/hadoop
Command: bin/hdfs namenode -format
This formats HDFS via the NameNode. Run this command only once, when the cluster is first set up. Formatting the file system means initializing the directory specified by the dfs.namenode.name.dir property.
Never format an up-and-running Hadoop filesystem: you will lose all the data stored in HDFS.
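
A successful format populates the directory configured in dfs.namenode.name.dir. You can verify with:

ls /home/hadoop/name/current
# expect files such as VERSION, fsimage_0000000000000000000 and seen_txid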

Step 9: Once the NameNode is formatted, go to the hadoop/sbin directory and start/stop all the daemons.

Command: cd /usr/local/hadoop/sbin
You can either start all daemons with a single command or start them individually.
Command: ./start-all.sh, ./stop-all.sh
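
Alternatively, you can start and stop HDFS and YARN separately, which is the usual way in Hadoop 3:

./start-dfs.sh       # starts NameNode, SecondaryNameNode and DataNode
./start-yarn.sh      # starts ResourceManager and NodeManager
./stop-yarn.sh
./stop-dfs.sh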

To start services individually:

Start/Stop NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks the files stored across the cluster. Command: ./hadoop-daemon.sh start/stop namenode

Start/Stop DataNode:

On startup, a DataNode connects to the NameNode and responds to requests from the NameNode for different operations. Command: ./hadoop-daemon.sh start/stop datanode

Start/Stop ResourceManager:

The ResourceManager is the master that arbitrates all available cluster resources and thus helps manage the distributed applications running on YARN. It manages the NodeManagers and each application's ApplicationMaster. Command: ./yarn-daemon.sh start/stop resourcemanager

Start/Stop NodeManager:

The NodeManager is the per-machine agent responsible for managing containers, monitoring their resource usage and reporting it to the ResourceManager. Command: ./yarn-daemon.sh start/stop nodemanager

Start/Stop JobHistoryServer:

The JobHistoryServer is responsible for servicing all job-history-related requests from clients. Command: ./mr-jobhistory-daemon.sh start/stop historyserver
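
For reference, a typical startup sequence using the individual scripts from /usr/local/hadoop/sbin is:

./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start secondarynamenode
./hadoop-daemon.sh start datanode
./yarn-daemon.sh start resourcemanager
./yarn-daemon.sh start nodemanager
./mr-jobhistory-daemon.sh start historyserver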

Step 10: Check if all services are running

prompt> jps
53088 SecondaryNameNode
53940 Jps
53620 NodeManager
52663 NameNode
53211 DataNode

Important: start all services as the root user. When you run jps as root, you will see the following:
62272 NodeManager
61680 SecondaryNameNode
62657 Jps
61333 DataNode
61125 NameNode
34533 ResourceManager

If you log in as the hadoop user and run jps, you will only see:
62272 NodeManager
61680 SecondaryNameNode
61333 DataNode
61125 NameNode
62808 Jps

The JobTracker and TaskTracker from older Hadoop versions have been replaced by the YARN ResourceManager and NodeManager.

Step 11: Check HDFS and the ResourceManager

Run the following at the command prompt to check the HDFS status:
hdfs fsck /
hdfs dfsadmin -report

Visit the URL below to see the ResourceManager status:
http://localhost:8088
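
You can also browse the NameNode web UI on the port configured earlier in hdfs-site.xml (http://localhost:50070) and run a quick HDFS smoke test to confirm reads and writes work (/user/hadoop below is just an example path):

hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put /etc/hosts /user/hadoop/
hdfs dfs -ls /user/hadoop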

NOTE:
If the DataNode is not running, run the command below to start it in the foreground and see the error:

/usr/local/hadoop/bin/hdfs datanode

Error reported:
java.net.SocketException: Call From 0.0.0.0 to null:0 failed on socket exception: java.net.SocketException: Permission denied; For more details see:  http://wiki.apache.org/hadoop/SocketException

This is because the DataNode is configured to run on ports 1004 and 1006, which are privileged ports that a non-root user cannot bind to. It needs to run on higher ports, e.g. 51004 and 51006.
Editing dfs.datanode.address and dfs.datanode.http.address in hdfs-site.xml to use high ports solves the problem.
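
For example, the two DataNode properties in hdfs-site.xml can be changed to the following (51004 and 51006 are just examples of unprivileged ports), after which the DataNode should be restarted:

  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:51004</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>0.0.0.0:51006</value>
  </property>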
