CODEMONDAY

Installing Hadoop on Ubuntu 20.04


Jul 11, 2023

Programming

Below is a summary of the installation process. This setup is fine for experimenting, NOT for production.

Credit: wikipedia

What you need to do

Install Java
Download Hadoop
Set environment
Edit Hadoop XML
start-dfs.sh
start-yarn.sh

If everything succeeds, you will see:

localhost:8088 → See Hadoop icon screen
localhost:9870 → See cluster status screen

Install Java

Update the package index and search for the available JDK packages.

If you are not familiar with Java, don't worry about the terminology; we only need the JDK.

sudo apt update
sudo apt-cache search openjdk

Java 11 is an LTS release that Hadoop 3.3 supports at runtime, so I will install 11.

sudo apt install openjdk-11-jdk
java -version
javac -version
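To confirm which Java ended up on the PATH, here is a small sketch that pulls the major version out of the `java -version` text. The function name `java_major` is my own and not part of any tool, and it assumes the usual `version "11.0.x"` or legacy `version "1.8.0_x"` format.

```shell
# Print the major version found in `java -version` style text read from stdin.
# Handles "11.0.19" (prints 11) and the legacy "1.8.0_362" form (prints 8).
java_major() {
  sed -n 's/.*version "\([0-9]*\)\.\([0-9]*\).*/\1 \2/p' | head -n 1 | {
    read -r first second || exit 1
    if [ "$first" = "1" ]; then echo "$second"; else echo "$first"; fi
  }
}

# Usage: java -version writes to stderr, so redirect it into the pipe.
# java -version 2>&1 | java_major
```

If this prints something other than 8 or 11, double check which JDK package is active before going on.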

Download Hadoop

Visit the link below. On the command line, use wget <link> to download the archive, then extract it to your home directory.

Choose the newer version (here it is 3.3.1), then choose the tar.gz file.
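A rough sketch of the download step. The URL layout is an assumption on my side: it points at the Apache archive, which keeps older releases such as 3.3.1, and the helper name `hadoop_url` is made up for this example. Use the link from the download page if it differs.

```shell
# Build the download URL for a given Hadoop release.
# Assumption: the release is hosted on the Apache archive under this layout.
hadoop_url() {
  version="$1"
  echo "https://archive.apache.org/dist/hadoop/common/hadoop-${version}/hadoop-${version}.tar.gz"
}

# Usage (the archive is several hundred MB; it unpacks into the home directory):
# wget "$(hadoop_url 3.3.1)"
# tar -xzf hadoop-3.3.1.tar.gz -C "$HOME"
```

After extraction the directory should be `$HOME/hadoop-3.3.1`, which is the path the environment variables in the next step expect.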

Setup Environment

Set the Hadoop variables in .bashrc, and add the Hadoop directories to PATH so the Hadoop commands are convenient to call.

export HADOOP_HOME=/home/hadoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

At this point (after reloading the shell, for example with source ~/.bashrc) you should be able to call the following binaries from anywhere:

hadoop
hdfs
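A quick way to check that both binaries resolve, sketched as a small helper. `missing_commands` is a name I picked for illustration; it only relies on the standard `command -v` lookup.

```shell
# Print every command name from the arguments that is not found on PATH.
missing_commands() {
  for name in "$@"; do
    command -v "$name" >/dev/null 2>&1 || echo "$name"
  done
}

# Usage: prints nothing when both binaries resolve.
# missing_commands hadoop hdfs
```

If a name is printed, the PATH line in .bashrc was probably not loaded in the current shell.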

Edit Hadoop XML and Start

I think the Hadoop documentation is good enough already. Link below.

A quick overview will make it easier to read:

Hadoop will SSH to localhost, so you need to set up an SSH key
You need pseudo-distributed mode
Copy and paste the XML from the Hadoop guide
Start DFS
Start YARN
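For the SSH key step, here is a sketch of authorizing your own public key for logins to localhost. The helper `authorize_key` is hypothetical (my name, not from the Hadoop guide); the `ssh-keygen` line in the usage comment follows the command shown in the guide.

```shell
# Append a public key to an authorized_keys file unless it is already listed.
authorize_key() {
  pub_key_file="$1"
  auth_file="$2"
  touch "$auth_file"
  # -x matches whole lines, -F treats the key as a fixed string, -f reads it from the file
  grep -qxFf "$pub_key_file" "$auth_file" || cat "$pub_key_file" >> "$auth_file"
  chmod 0600 "$auth_file"
}

# Usage, after generating a key without a passphrase:
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# ssh localhost   # should now log in without a password prompt
```

The duplicate check just keeps authorized_keys tidy if you run the step more than once.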

Apache Hadoop 3.3.6 – Hadoop: Setting up a Single Node Cluster.
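As a sketch of the "copy paste XML" step, the helper below writes the two HDFS files for pseudo-distributed mode. The property values (`hdfs://localhost:9000` and a replication of 1) are the ones I remember from the single-node guide, so compare them with the page before relying on them. The function name is made up, it overwrites the existing files, and the YARN files (mapred-site.xml, yarn-site.xml) are left out here.

```shell
# Write the two HDFS config files for pseudo-distributed mode into a directory.
# Warning: this overwrites core-site.xml and hdfs-site.xml in that directory.
write_pseudo_distributed_config() {
  conf_dir="$1"
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
  cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

# Usage:
# write_pseudo_distributed_config "$HADOOP_HOME/etc/hadoop"
# hdfs namenode -format   # first run only, this initializes (and wipes) HDFS metadata
# start-dfs.sh
```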

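Common error: JAVA_HOME not found

JAVA_HOME needs to be set in

etc/hadoop/hadoop-env.sh

NOT in .bashrc! The start scripts launch the daemons over SSH, and that non-interactive shell typically does not pick up the exports from .bashrc.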
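A minimal sketch of that fix: append the export line to hadoop-env.sh. `set_java_home` is a hypothetical helper name, and the JDK path in the usage comment is the OpenJDK 11 path used earlier in this post, so adjust it if your JDK lives elsewhere.

```shell
# Append an export line for JAVA_HOME to the given hadoop-env.sh file.
set_java_home() {
  env_file="$1"
  java_home="$2"
  echo "export JAVA_HOME=${java_home}" >> "$env_file"
}

# Usage:
# set_java_home "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" /usr/lib/jvm/java-11-openjdk-amd64
```

Editing the file by hand works just as well; the point is only that the line lives in hadoop-env.sh.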

Common error 2: Cannot start YARN

Error when running start-yarn.sh:

resourcemanager is running as process 48888. Stop it first and ensure /tmp/hadoop-hadoop-resourcemanager.pid file is empty before retry

The ResourceManager is still running even after stop-dfs.sh, because that script only stops the HDFS daemons. So you need to stop ALL:

stop-all.sh
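If the error comes back, it can help to check whether the PID file from the message still points at a live process. This is only a sketch: `pid_file_alive` is my own helper name, and `kill -0` reports processes owned by other users as not reachable, so run it as the same user that started Hadoop.

```shell
# Succeed when the PID stored in the given file belongs to a live process
# that the current user is allowed to signal.
pid_file_alive() {
  pid_file="$1"
  [ -s "$pid_file" ] || return 1          # missing or empty file: nothing running
  kill -0 "$(cat "$pid_file")" 2>/dev/null
}

# Usage, with the file name from the error message above:
# pid_file_alive /tmp/hadoop-hadoop-resourcemanager.pid && echo "still running"
```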

Note
Just leave the processes running as they are. It seems we do not need to manage them with systemctl or service as we usually do.

Check the status page

Finally, you should see the status pages at these addresses:

localhost:8088

localhost:9870
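Before opening the browser, you can check from the terminal whether the two ports accept connections. A small sketch, assuming bash (it uses bash's `/dev/tcp` redirection, which plain sh does not have); `port_open` is a name I made up.

```shell
# Succeed when a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp redirection, so run it with bash, not plain sh.
port_open() {
  host="$1"
  port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Usage:
# port_open localhost 8088 && echo "YARN UI port is open"
# port_open localhost 9870 && echo "NameNode UI port is open"
```

An open port only tells you something is listening, so still look at the pages themselves.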

A tip if you deploy it on a remote server:

Use SSH port forwarding to bring the ports down to your localhost, then open them in your browser.

ssh -L 9870:localhost:9870 -nNT ubuntu@<your-server-ip>
ssh -L 8088:localhost:8088 -nNT ubuntu@<your-server-ip>
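The two tunnels can also share one SSH session, since ssh accepts several -L options. A sketch with a hypothetical helper, `forward_options`, that builds the options from a list of ports:

```shell
# Print one "-L port:localhost:port" forwarding option per port argument.
forward_options() {
  for port in "$@"; do
    printf '%s\n' "-L ${port}:localhost:${port}"
  done
}

# Usage: forward both UIs through a single SSH session.
# The command substitution is left unquoted on purpose so the options split into words.
# ssh $(forward_options 9870 8088) -nNT ubuntu@<your-server-ip>
```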

Hope this helps!
