How to Set up Hadoop 3.2.1 Multi-Node Cluster on Ubuntu 20.04 (Inclusive Terminology)

Pedro Gomes
10 min read · Oct 26, 2020

This article is not the first one on this topic, but it is one of the few that adopts the new inclusive terminology proposed by the Linux community. Instead of Master/Slave, I will use Primary/Secondary. As computer scientists we are agents of progress, and we should reflect that progress in everything we do.

What is Hadoop?

On the Apache website, Hadoop is presented as “open-source software for reliable, scalable, distributed computing”; in other words, it is a framework for distributed processing.

In practice, it is several computers working together to increase processing power and keep operations decentralized.

With Hadoop, it is possible to process large data sets across many computers without complex coding.

This software is used by a large number of companies that need to process enormous datasets (also known as big data). Companies like Facebook, Amazon, and Netflix depend on Hadoop’s processing capacity for efficient data analytics.

How does it work? What are Hadoop’s components?

Image from João Torres

The Hadoop Distributed File System (HDFS) stores data across all the machines in the cluster. Even though it is used much like the Linux file system, its purpose is not the same: HDFS is designed to handle big data on commodity hardware.
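
To get a feel for this, once the cluster from this guide is running, you interact with HDFS through its own shell rather than the usual Linux file commands. A minimal sketch (the directory and file name are just illustrative):

hdfs dfs -mkdir /books
hdfs dfs -put alice.txt /books
hdfs dfs -ls /books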

MapReduce is the other big piece of Hadoop and is what gives the system its scalability: the data is split and processed in parallel, which allows operations to finish faster. The framework is responsible for “scheduling tasks, monitoring them and re-executes the failed tasks.”
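
As a hedged, concrete example, Hadoop ships with sample MapReduce jobs; with the directory layout used later in this guide, a word count over an HDFS folder could be launched like this (the input and output paths are illustrative):

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /books /books-count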

YARN’s purpose is to split the resource-management functionality into separate daemons: a ResourceManager that runs on the primary and a NodeManager that runs on each worker.

What do you need to start the set-up?

I will use VirtualBox and Ubuntu Server 20.04. The recommended disk space on your computer is 24 GB (it can be less, but if you want to use and test the cluster afterwards, 8 GB per machine is ideal).

Ubuntu Server 20.04 ISO download link: https://releases.ubuntu.com/20.04/

Oracle VirtualBox download link: https://www.virtualbox.org/wiki/Downloads

Tutorial on installing Ubuntu in VirtualBox: https://brb.nci.nih.gov/seqtools/installUbuntu.html

If you are ready, let’s start!

Step 1: VirtualBox network settings

In the VirtualBox network settings, make sure that the VM’s network adapter is set to Bridged Adapter.
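
If you prefer the command line, the same setting can be applied with VBoxManage while the VM is powered off (the VM name h-primary and the host interface eth0 are placeholders for your own):

VBoxManage modifyvm "h-primary" --nic1 bridged --bridgeadapter1 eth0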

Step 2: Install ssh

Install ssh with the following command:

sudo apt install ssh

Step 3: Install pdsh

Install pdsh with the following command:

sudo apt install pdsh

Step 4: Set the pdsh remote command type to ssh

Open the .bashrc file with nano:

sudo nano .bashrc

Add to the end of the file:

export PDSH_RCMD_TYPE=ssh

Step 5: Generate an SSH key

Generate an SSH key with the following command:

ssh-keygen -t rsa -P ""

Press Enter when asked to choose the storage file.

Step 6: Copy the key into the authorized_keys file

To authorize the key for logins, append the public key to the authorized_keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Make sure that everything is set up correctly by opening an SSH session to your own machine:

ssh localhost

*Your output may be different

Step 7: Install Java 8

In order to run Hadoop, you need Java 8 installed on your machine. To install it, use the following command:

sudo apt install openjdk-8-jdk

Check that Java is installed with the following command:

java -version

Step 8: It’s time to download and install Hadoop

First download the tar file that contains Hadoop with the following command:

sudo wget -P ~ https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

When the download is finished, extract it:

tar xzf hadoop-3.2.1.tar.gz

And rename the extracted directory to hadoop (a practical convenience; it does not change how the rest of the setup works):

mv hadoop-3.2.1 hadoop

Step 9: Set up Hadoop

Start by configuring the Java path in Hadoop’s environment file:

sudo nano ~/hadoop/etc/hadoop/hadoop-env.sh

Then look for the JAVA_HOME line and replace it with:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

Save the file with Ctrl + O and then exit with Ctrl + X.

Step 10: Move the hadoop directory to /usr/local

Move it with the following command:

sudo mv hadoop /usr/local/hadoop

Step 11: Set up the Hadoop path

To add Hadoop to the machine’s PATH, open the environment file with:

sudo nano /etc/environment

And then replace everything with:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

Save the file with Ctrl + O and then exit with Ctrl + X.

Step 12: Create a specific user for Hadoop

Start by creating a new user (I chose to call it h-user, but feel free to pick another name):

sudo adduser h-user

Now you need to give this user permission to work inside Hadoop’s folder. Create a hadoopuser group first, add the new user to it, and hand over ownership of the Hadoop directory:

sudo addgroup hadoopuser
sudo usermod -aG hadoopuser h-user
sudo chown h-user:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
sudo adduser h-user sudo

Step 13: Clone the primary machine in order to create two secondary machines

In VirtualBox, make a full clone of the primary machine and repeat the operation, so that you end up with two secondary machines.

Now make sure that all the machines have different MAC addresses so that they get different IPs:
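
If one of the clones kept the primary’s MAC address, you can regenerate it from the command line with the VM powered off (the VM name below is just an example):

VBoxManage modifyvm "h-secondary1" --macaddress1 auto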

Step 14: Change hostnames

Start by changing the hostnames (the computer’s name):

sudo nano /etc/hostname

Do the same on the secondary machines:
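
To match the names used in the rest of this guide, /etc/hostname should contain a single line on each machine: h-primary on the primary, h-secondary1 on the first clone, and h-secondary2 on the second.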

Now reboot all machines.

Step 15: Identify each machine’s IP

To find each machine’s IP, use:

ip addr

Write down all the IPs.

Now change the hosts file on all machines:

sudo nano /etc/hosts

And add the other machines’ identification (IP and hostname):
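
As a hedged example, the IPs below are placeholders for the ones you wrote down; every machine’s /etc/hosts should end up with three extra lines like these:

192.168.43.12 h-primary
192.168.43.13 h-secondary1
192.168.43.14 h-secondary2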

*This process could be done before cloning the virtual machines, but doing it this way lets us actually see how the connections between machines happen.

Step 16: Set up SSH on the primary with our user

Start by switching to the new user:

su - h-user

Now generate an SSH key for this user:

ssh-keygen -t rsa

Step 17: Copy the SSH key to the secondary machines

Copy the already generated key to all the machines:

ssh-copy-id h-user@h-primary
ssh-copy-id h-user@h-secondary1
ssh-copy-id h-user@h-secondary2
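
As an optional check, a passwordless login from the primary to one of the secondaries should now work:

ssh h-user@h-secondary1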

Step 18: Configure Hadoop Service Port

Change Hadoop’s port configuration (only on the primary):

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Then add the following inside the file’s <configuration> block:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://h-primary:9000</value>
</property>

Step 19: Configure the HDFS system

Change the HDFS configuration (only on the primary):

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Then add the following inside the file’s <configuration> block:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Step 20: Identify the workers

Add the secondary machines to the workers file (only on the primary):

sudo nano /usr/local/hadoop/etc/hadoop/workers
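
The workers file takes one hostname per line; with the names used in this guide it should contain:

h-secondary1
h-secondary2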

Step 21: Copy configurations into secondary machines

You need to make sure that all the configurations you just changed reach all the machines. To do so, execute the following commands:

scp /usr/local/hadoop/etc/hadoop/* h-secondary1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* h-secondary2:/usr/local/hadoop/etc/hadoop/

Step 22: Format and start the HDFS system (only on the primary)

Start by making sure that all the environment changes are applied:

source /etc/environment

Then format the HDFS filesystem with:

hdfs namenode -format

Make sure that this user’s .bashrc file is configured (you are now working as h-user, so pdsh needs to be pointed at ssh again):

sudo nano .bashrc

And check that the end of the file contains the following line:

export PDSH_RCMD_TYPE=ssh

Update the changes:

source ~/.bashrc

When these operations are done, start the service:

start-dfs.sh

To check that each machine is running the correct processes, use:

jps

The jps output differs between the primary and the secondaries.
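
As a rough, hedged sketch of what to expect, the process names below are the standard HDFS daemons started by start-dfs.sh, and the PIDs (shown as <PID>) will differ on your machines. On the primary, jps should report something like:

<PID> NameNode
<PID> SecondaryNameNode
<PID> Jps

and on each secondary:

<PID> DataNode
<PID> Jps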

Step 23: Node management tool

It’s time to check that everything is working. Open the primary’s IP in your browser on port 9870 (e.g. 192.168.43.12:9870).

Step 24: Yarn configuration

To set up YARN, start by exporting all the Hadoop paths (on the primary). These exports only apply to the current shell session, so you may also want to append them to h-user’s ~/.bashrc:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

Now change YARN’s configuration on both secondaries:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

And then add the following configurations:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>h-primary</value>
</property>

Step 25: Start Yarn

To start the YARN service, use:

start-yarn.sh

To access YARN’s management tool, open the primary’s IP in your browser on port 8088:
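
As an optional command-line check, you can also ask the ResourceManager which NodeManagers have registered; both secondaries should appear with the state RUNNING:

yarn node -list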

Final notes

We have just taken our first steps in big data management using inclusive terminology, because modern technologies demand modern mindsets.
