
Hadoop Development Onboarding (Linux, Single Cluster)


The following instructions outline how to set up your Hadoop development environment. They are written to be agnostic of the platform the stack is installed on, so a working knowledge of your GNU/Linux distribution or other Unix-like operating system is assumed.

Prerequisites

Before beginning, please ensure that your machine meets the hardware requirements below and that the listed tools are installed, using your favorite package manager where applicable.

Hardware

  • At least 4 GB of RAM
  • At least 20 GB of free storage space
  • A CPU with 64-bit architecture

Tools and dependency managers

  • java (JDK >= 8 and < 16*)
  • ssh (openssh-server)
  • hadoop (Hadoop version >= 2.7.1) (download the binary release)

NOTE

If you are using jabba to install java, the java binary is located at /home/<username>/.jabba/jdk/<version>/bin/java. For example: /home/username/.jabba/jdk/[email protected]/bin/java
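
If you will point Hadoop at a jabba-managed JDK later in this guide, note that JAVA_HOME must be the JDK directory itself, not the java binary inside it. A minimal sketch, reusing the [email protected] example above (substitute your own version string):

  # JAVA_HOME is the version directory, without the trailing /bin/java
  export JAVA_HOME="$HOME/.jabba/jdk/[email protected]"
  "$JAVA_HOME/bin/java" -version   # should print the version you installed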

Setup

  1. Java setup:
    • Run the `java -version` command in a terminal. If your Java installation is successful, you will see the corresponding version. For example,
      $ java -version
      openjdk version "15.0.2" 2021-01-19
      OpenJDK Runtime Environment (build 15.0.2+7-27)
      OpenJDK 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)
      

      indicates that OpenJDK 15.0.2 is installed on your computer.

    • *NOTE: JDK Version 16 is not supported by Hadoop as of 5 June 2021
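    • The Hadoop configuration later in this guide needs the path to your JDK. One common way to find it on Linux, assuming java is on your PATH and was installed through your package manager, is to resolve the symlink behind the binary:
      readlink -f "$(which java)"
      # prints something like /usr/lib/jvm/java-11-openjdk-amd64/bin/java
      # (the exact path depends on your distribution and JDK version);
      # the JDK home is that path with the trailing /bin/java removed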
  2. ssh setup
    • Check that ssh is installed by querying its version in your terminal (a sample command is sketched below). If you get an output such as,
      1:8.2p1-4
      

      then ssh was successfully installed.
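
      The version string shown above is a package version; assuming a Debian-based distribution, one way to check it is through the package manager, or you can ask the ssh client itself on any distribution:
      # Debian/Ubuntu: installed openssh-server package version
      dpkg -s openssh-server | grep '^Version'
      # works on any distribution: report the OpenSSH client version
      ssh -V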

    • Verify that ssh is working, using:
      sudo service ssh status
      

      The Active field in the output should show active (running).

    • Hadoop requires ssh access to manage its nodes. To generate a passwordless RSA key pair, use:
      ssh-keygen -t rsa -P ""
      

      When prompted for the file in which to save the key, just press Enter to accept the default. The key pair is ready once its randomart image is printed.

    • Append the public key to the authorized_keys file using:
      cat /home/<username>/.ssh/id_rsa.pub >> /home/<username>/.ssh/authorized_keys
      

      Replace <username> with your login name (you can print it by running whoami in a terminal).

    • Restrict the permissions of the .ssh directory and the authorized_keys file using:
      sudo chmod 700 ~/.ssh
      sudo chmod 600 ~/.ssh/authorized_keys
      

      Now your ssh configuration is ready.
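
      As an optional extra check, which mirrors the standard Hadoop single-node setup rather than a step from this wiki, confirm that you can now ssh into localhost without a password prompt:
      ssh localhost
      # the first connection may ask you to confirm the host key; answer yes
      # you should get a shell without a password prompt; type exit to leave it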

  3. Hadoop Setup
    • Go to
      <hadoop_directory>/etc/hadoop
      

      Replace <hadoop_directory> with your Hadoop installation location.

    • Open hadoop-env.sh
      • Set the Java path in the file by uncommenting and completing the following line:
        export JAVA_HOME=
        

        (In Hadoop 3.3.0, this is line #54.) The line appears in the following context within the file (do NOT add these lines to the end of the file):

        # The java implementation to use. By default, this environment
        # variable is REQUIRED on ALL platforms except OS X!
        export JAVA_HOME=<jdk_location>
        

        (In Hadoop 3.3.0, these are lines #52-54.) Replace <jdk_location> with the location of your Java installation, then save the file.
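
        For example, assuming OpenJDK 11 installed from your distribution's packages under /usr/lib/jvm (the exact directory name varies by distribution; a jabba path such as the one in the note above also works), the edited line might look like:
        # point at the JDK directory, not at the java binary inside it
        export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64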

    • Open core-site.xml
      • Replace the configuration tags with:
        <configuration>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>
        

        Save the file.

    • Open hdfs-site.xml
      • Replace the configuration tags with:
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
          <property>
            <name>dfs.name.dir</name>
            <value>file:///{hdfs_location}/hdfs/namenode</value>
          </property>
          <property>
            <name>dfs.data.dir</name>
            <value>file:///{hdfs_location}/hdfs/datanode</value>
          </property>
        </configuration>
        
      • Replace {hdfs_location} with the directory that will hold the HDFS data. For example: {hdfs_location} = /home/{username}/hadoop_tmp, where {username} is your login name. You can create this directory if it does not exist (the HDFS directory steps below do exactly that).
      • Save the file.
    • Open mapred-site.xml
      • Replace the configuration tags with the following:
        <configuration>
          <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
          </property>
          <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME={path_to_hadoop_installation}</value>
          </property>
          <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME={path_to_hadoop_installation}</value>
          </property>
          <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME={path_to_hadoop_installation}</value>
          </property>
        </configuration>
        
      • Replace {path_to_hadoop_installation} with the path of your hadoop installation.
      • Save the file.
    • Open yarn-site.xml
      • Replace the configuration tags with the following:
        <configuration>
          <!-- Site specific YARN configuration properties -->
          <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
          </property>
          <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
          </property>
        </configuration>
        

        Save the file.

    • Create the HDFS directories
      1. Create the directories using:
        sudo mkdir -p {hdfs_directory}/hdfs/namenode
        sudo mkdir -p {hdfs_directory}/hdfs/datanode
        

        Replace {hdfs_directory} with the {hdfs_location} value you used while editing hdfs-site.xml; perform the same replacement in the following steps.

      2. Grant full permissions recursively on {hdfs_directory} using (remember to perform the replacement described above):
        sudo chmod 777 -R {hdfs_directory}
        
      3. Format the namenode using:
        hdfs namenode -format
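
        Putting the three steps together, here is a worked example assuming {hdfs_location} was set to $HOME/hadoop_tmp in hdfs-site.xml (adjust the path to match your own configuration):
        # create the NameNode and DataNode directories
        sudo mkdir -p "$HOME/hadoop_tmp/hdfs/namenode" "$HOME/hadoop_tmp/hdfs/datanode"
        # open up the permissions recursively, as in step 2
        sudo chmod 777 -R "$HOME/hadoop_tmp"
        # format the NameNode (run it as <hadoop_directory>/bin/hdfs if hdfs is not yet on your PATH)
        hdfs namenode -format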
        
    • Finally, add a few configuration lines to your ~/.bashrc file (in your home directory):
      • Open the file with your favorite text editor. Because ~/.bashrc lives in your home directory and is owned by your user, no elevated privileges are needed.
      • Add the following to the end of the file:
        # Hadoop path setting
        export HADOOP_HOME=<hadoop_directory_location>/hadoop
        export HADOOP_CONF_DIR=<hadoop_directory_location>/hadoop/etc/hadoop
        export HADOOP_COMMON_HOME=$HADOOP_HOME
        export HADOOP_MAPRED_HOME=$HADOOP_HOME
        export HADOOP_HDFS_HOME=$HADOOP_HOME
        export HADOOP_YARN_HOME=$HADOOP_HOME
        export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
        export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
        export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
        

        Replace <hadoop_directory_location> with the location of your Hadoop installation.

      • Either run
        source ~/.bashrc
        

        or restart your system.
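
        To confirm that the new variables are in effect in your current shell, print one of them; if nothing is printed, the shell has not re-read ~/.bashrc yet:
        echo "$HADOOP_HOME"
        # should print the Hadoop installation path you configured above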

    • Verify correct setup
      • In your terminal, type hd
      • Press Tab twice to list the available completions
      • If you see entries such as hdfs and hdfs.cmd among the completions (alongside unrelated system tools like hd and hdparm), the Hadoop binaries are on your PATH and your setup is ready!
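      • For a more direct check, assuming the PATH changes from ~/.bashrc are in effect, you can also ask Hadoop to print its version:
        hadoop version
        # prints the installed release, e.g. "Hadoop 3.3.0", followed by build information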