💥 Configuration of Hadoop Using Ansible. 💥

Uditanshu pandey
Apr 4, 2021 · 6 min read

In this article we will be configuring Hadoop using Ansible.

Task Description:

🔰 We will configure Hadoop and start cluster services using Ansible Playbook.

There are some prerequisites for performing this task:

  • Ansible should be preinstalled on the system.

So, let’s first discuss what Big Data and Hadoop are.

What is BIG DATA?

BIG DATA is nothing but data that is very large, and it keeps growing bigger day by day. Big Data is really the name of a problem, not a technology.

Big Data is a collection of data that is huge in volume, yet keeps growing exponentially with time. It is data of such size and complexity that none of the traditional data management tools can store or process it efficiently.

What is HADOOP?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

What is NAMENODE?

The NameNode works as the Master in a Hadoop cluster. Listed below are the main functions performed by the NameNode:

  1. Stores metadata about the actual data, e.g. filename, path, number of data blocks, block IDs, block locations, number of replicas, and slave-related configuration.
  2. Manages the file system namespace.
  3. Regulates client access requests for the actual file data.
  4. Assigns work to the Slaves (DataNodes).
  5. Executes file system namespace operations such as opening/closing files and renaming files and directories.
  6. Since the NameNode keeps metadata in memory for fast retrieval, a large amount of memory is required for its operation. It should be hosted on reliable hardware.

What is DATANODE?

The DataNode works as a Slave in a Hadoop cluster. Listed below are the main functions performed by the DataNode:

  1. Actually stores the business data.
  2. It is the actual worker node, where read/write/data processing is handled.
  3. Upon instruction from the Master, it performs creation/replication/deletion of data blocks.
  4. Since all the business data is stored on the DataNodes, a large amount of storage is required for their operation. Commodity hardware can be used for hosting DataNodes.

Now, let’s jump into the task. We will perform it step by step.

Step-1)

First, we have to configure our inventory file. In the inventory file we define the IP addresses of our NameNode and DataNode.
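
A minimal inventory could look like the sketch below. The group names match the hosts: values used in the playbooks later; the IP addresses (and any connection variables such as the remote user or SSH credentials) are placeholders for your own nodes:

[namenode]
192.168.1.10

[datanode]
192.168.1.11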

Now, we will test whether our hosts are reachable. We can check this using Ansible’s ping module.
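
For example, assuming the inventory above is set as the default inventory, a quick connectivity check would be:

ansible all -m ping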

Here all the hosts are pingable.

Step-2)

Now, we will configure NAMENODE first.

Before installing Hadoop on the NameNode, we have to install the JDK on it.

- hosts: namenode
  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /nn
        state: directory

In this block of code, we copy the JDK and Hadoop RPMs to the NameNode, install them, and create an empty directory (/nn) for the NameNode’s metadata.

After this, we will configure our NameNode. To do this, we have to make some changes to hdfs-site.xml and core-site.xml.

- lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: "<configuration>"
    line: <property>
          <name>dfs.name.dir</name>
          <value>/nn</value>
          </property>
- lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: "<configuration>"
    line: <property>
          <name>fs.default.name</name>
          <value>hdfs://0.0.0.0:9001</value>
          </property>

In these files, we add some lines that tell the node that it should work as the NameNode and control the other nodes.

For adding these lines we will be using the lineinfile module.

After the addition of the lines in hdfs-site.xml and core-site.xml, we will format the directory which we created above, and after formatting, we will finally start our NameNode.

- shell: echo Y | hadoop namenode -format
- shell: hadoop-daemon.sh start namenode

Our NameNode has been successfully started.

Full code for NameNode:
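
Putting the snippets above together, the complete NameNode playbook looks roughly like this (a sketch assembled from the steps above; the file names, paths and port 9001 are the ones used in the snippets, so adjust them to your setup):

- hosts: namenode
  tasks:
    # Copy the JDK and Hadoop RPMs from the controller node to the NameNode
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    # Install both packages
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    # Directory where the NameNode will keep its metadata
    - file:
        path: /nn
        state: directory
    # Point the NameNode at its metadata directory and the HDFS address/port
    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: <property>
              <name>dfs.name.dir</name>
              <value>/nn</value>
              </property>
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:9001</value>
              </property>
    # Format the metadata directory and start the NameNode daemon
    - shell: echo Y | hadoop namenode -format
    - shell: hadoop-daemon.sh start namenode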

Now just run the Playbook and our NameNode will be configured.

To run the playbook:

ansible-playbook <File_Name.yml>

Step-3)

DataNode Configuration

As we have configured the NameNode, we will now configure the DataNode similarly.

To connect the DataNode to the NameNode, we need the NameNode’s IP.

- hosts: datanode
  vars_prompt:
    - name: namenode_IP
      prompt: "Enter NameNode IP Address:"
      private: no

Here we ask the user for the IP of the NameNode and store it in a variable.

  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /dn
        state: directory

This block of code copies the JDK and Hadoop RPMs to the DataNode, installs them, and creates an empty directory (/dn) for the data blocks.

After this, we will configure our DataNode. To do this, we have to make some changes to hdfs-site.xml and core-site.xml.

In these files, we add some lines that tell the node that it should work as a DataNode and who the NameNode is.

For adding these lines we will be using the lineinfile module.

- lineinfile:
    path: /etc/hadoop/hdfs-site.xml
    insertafter: "<configuration>"
    line: <property>
          <name>dfs.data.dir</name>
          <value>/dn</value>
          </property>
- lineinfile:
    path: /etc/hadoop/core-site.xml
    insertafter: "<configuration>"
    line: <property>
          <name>fs.default.name</name>
          <value>hdfs://{{ namenode_IP }}:9001</value>
          </property>

After the addition of the lines in hdfs-site.xml and core-site.xml, we will finally start our DataNode.

- shell: hadoop-daemon.sh start datanode

Full code for DataNode Playbook:
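
Similarly, the complete DataNode playbook, assembled from the snippets above, looks roughly like this:

- hosts: datanode
  vars_prompt:
    # Ask for the NameNode IP at run time and store it in namenode_IP
    - name: namenode_IP
      prompt: "Enter NameNode IP Address:"
      private: no
  tasks:
    # Copy and install the JDK and Hadoop RPMs
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    # Directory where the DataNode will store the data blocks
    - file:
        path: /dn
        state: directory
    # Point the DataNode at its storage directory and at the NameNode
    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: <property>
              <name>dfs.data.dir</name>
              <value>/dn</value>
              </property>
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ namenode_IP }}:9001</value>
              </property>
    # Start the DataNode daemon
    - shell: hadoop-daemon.sh start datanode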

Now just run the Playbook and our DataNode will be configured.

To run the playbook:

ansible-playbook <File_Name.yml>

Let’s check in our system whether it works.
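
For example, on the NameNode we can ask for a cluster report, which lists the DataNodes that have registered (the exact output depends on your setup):

hadoop dfsadmin -report

We can also run jps on each node to confirm that the NameNode and DataNode Java processes are running.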

DataNode is connected to NameNode.

Yes, we have successfully configured the NameNode and DataNode.

Github link to the repo:

Thanks For Reading. 😇
