Demo HDFS
Overview of HDFS commands.
Local filesystem vs HDFS
HDFS and Linux commands have a lot in common. If you are familiar with Linux commands, HDFS commands will be easy to grasp. We will look at some of the well-known commands used to work with your local file system in Linux and their HDFS counterparts, such as mkdir to create a directory, cp to copy files, and ls to list the contents of a directory.
If not already done, we first need to connect to the main node of our cluster.
All HDFS commands start with hadoop fs. The regular ls command on the root directory lists the files from the root directory of the local file system, whereas hadoop fs -ls / lists the files from the root directory in HDFS.
In the terminal, type in both commands and see what happens:
ls /
hadoop fs -ls /
While the normal ls command lists the directories and files of your local filesystem, the HDFS command gives an overview of the directories and files stored in the Hadoop cluster across all nodes.
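For illustration, the output of hadoop fs -ls / typically looks something like the lines below; the directory names, owners and timestamps depend on your cluster, but on most installations you will at least see /tmp and /user:

drwxrwxrwt   - hdfs supergroup          0 2018-02-07 10:15 /tmp
drwxr-xr-x   - hdfs supergroup          0 2018-02-07 10:20 /user

Each entry shows the permissions, the replication factor (a dash for directories), the owner, the group, the size, the modification date and the path.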
Now let's download two files to the home directory in the local filesystem using the following commands:
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/datasets/ml-100k/u.data
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/datasets/ml-100k/u.item
Note that the home directory in your local file system will be /home/<username>, whereas the home directory in HDFS will be /user/<username>.
Since we are logged in as the ubuntu user, the files will be saved under /home/ubuntu in the local file system.
Type ls to check that the files have been downloaded successfully.
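Assuming the downloads succeeded, ls should now show both files (plus anything else already sitting in your home directory):

u.data  u.item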
Let's connect to another node of our cluster while keeping our initial connection open. Type in the ls commands again:
ls
hadoop fs -ls /
The /home/ubuntu directory is empty: there is no trace of the files we just downloaded, because they only exist on the local file system of the first node. On the other hand, the output of the HDFS command is the same on both nodes.
No matter which node of your cluster you are connected to, the view of the HDFS filesystem is exactly the same. HDFS gives a global overview of your cluster.
Let's go back to our first node and create a new directory in HDFS using the following command:
hadoop fs -mkdir myNewDir
Note that this new folder will be created in the home directory of the ubuntu user in HDFS (not in the local filesystem): /user/ubuntu
We can check that the new directory has been created successfully with:
hadoop fs -ls
If we go back to the second node and run hadoop fs -ls again, we can see the newly created directory there as well.
If we try to look for this newly created directory in our local filesystem (on either node), we won't be able to find it since this directory has been created in HDFS.
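For example, hadoop fs -ls run without a path (so it defaults to /user/ubuntu) should now show an entry along these lines; the owner, group and timestamp here are only illustrative:

drwxr-xr-x   - ubuntu ubuntu          0 2018-02-07 10:30 myNewDir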
Copying files to and from HDFS
To copy files from the local filesystem to HDFS, we can use the copyFromLocal command, and to copy files from HDFS to the local filesystem, we can use the copyToLocal command.
hadoop fs -copyFromLocal <source_location_in_local_filesystem> <destination_location_in_HDFS>
hadoop fs -copyToLocal <source_location_in_HDFS> <destination_location_in_local_filesystem>
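As a side note, hadoop fs -put and hadoop fs -get behave essentially the same way as -copyFromLocal and -copyToLocal, and you will often see them used interchangeably; for example (shown only as an illustration, the steps below stick to -copyFromLocal and -copyToLocal):

hadoop fs -put u.data myNewDir
hadoop fs -get myNewDir/u.data u.data.copy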
Let's connect to the node where we previously downloaded the u.data and u.item files in /home/ubuntu and copy the u.data file from the local file system to the new directory myNewDir in HDFS. In the terminal, type the following command (using a relative path):
hadoop fs -copyFromLocal u.data myNewDir
or type this command (using an absolute path):
hadoop fs -copyFromLocal /home/ubuntu/u.data /user/ubuntu/myNewDir
hadoop fs -ls myNewDir
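To double-check that the contents made it across intact, you can also stream the beginning of the HDFS copy to the terminal; u.data is a plain tab-separated file of user, item, rating and timestamp columns, so the first few lines should be readable:

hadoop fs -cat myNewDir/u.data | head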
The hadoop fs -copyToLocal command works in a similar way:
hadoop fs -copyToLocal myNewDir/u.data u.data.copy
ls
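Your local home directory should now contain the two downloaded files plus the copy retrieved from HDFS, something like:

u.data  u.data.copy  u.item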
We can also use HDFS commands such as hadoop fs -cp or hadoop fs -mv to copy or move files within HDFS.
To copy a file:
hadoop fs -cp <source_location_in_HDFS> <destination_location_in_HDFS>
To move a file:
hadoop fs -mv <source_location_in_HDFS> <destination_location_in_HDFS>
For example, let's create 2 new directories in HDFS:
hadoop fs -mkdir myNewDir2
hadoop fs -mkdir myNewDir3
Copy the file u.data in myNewDir to myNewDir2 using:
hadoop fs -cp myNewDir/u.data myNewDir2
hadoop fs -ls myNewDir2
Move the file u.data in myNewDir to myNewDir3 using:
hadoop fs -mv myNewDir/u.data myNewDir3
hadoop fs -ls myNewDir
hadoop fs -ls myNewDir3
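At this point the first listing should come back empty and the second one should show u.data in myNewDir3. To see the whole layout of your HDFS home directory at once, you can also list it recursively with the -R flag:

hadoop fs -ls -R /user/ubuntu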
Replication factor in HDFS
We have seen that HDFS supports block replication to provide fault tolerance and availability. By default, HDFS replicates each block to 3 nodes.
Open the terminal, connect to the main node and list the content of the myNewDir2 directory using:
hadoop fs -ls myNewDir2
The number 3 displayed right after the permissions is the replication factor of the file u.data. It means that each block of this file is replicated 3 times across the cluster.
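As an illustration, the entry should look roughly like the line below (the size, date, owner and group are placeholders); the 3 in the second column is the replication factor:

-rw-r--r--   3 ubuntu ubuntu    1979173 2018-02-07 10:35 myNewDir2/u.data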
We can change the replication factor using the dfs.replication property. In the terminal, type in the following command:
hadoop fs -Ddfs.replication=2 -cp myNewDir2/u.data myNewDir/u.data.rep2
hadoop fs -ls myNewDir
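The new file u.data.rep2 should show 2 instead of 3 in the replication column. You can also query the replication factor of a single file directly; the %r format specifier of hadoop fs -stat prints it:

hadoop fs -stat "Replication: %r" myNewDir/u.data.rep2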
Changing the dfs.replication property this way only applies to new files (which is why we created a copy of u.data in the previous command). To change the replication factor of an existing file, we can use the following command:
hdfs dfs -setrep -w 2 myNewDir2/u.data
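The -w flag tells the command to wait until the replication has actually been adjusted. Once it returns, listing the directory again should show 2 for u.data:

hadoop fs -ls myNewDir2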
Block storage
When we copy a file to HDFS, the file is divided into blocks and the blocks are stored on individual nodes. HDFS has a global view of the file even though it is spread across the nodes of the cluster, whereas the local filesystem of each node only has a local view of the blocks it stores.
The fsck command gives us more information about how blocks are stored on the different nodes. We need superuser privileges to run this command, which is why we run it as the hdfs user below.
First, let's download a large dataset and upload it into HDFS. Once connected to the main node of your cluster, type in the following commands into the Terminal:
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/datasets/crime_data_la/Crime_Data_from_2010_to_Present.csv
hadoop fs -copyFromLocal Crime_Data_from_2010_to_Present.csv myNewDir
Now that our dataset has been uploaded into HDFS, we can run the following command in the Terminal:
sudo -u hdfs hdfs fsck /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv -files -blocks -locations
We can see some interesting information here.
BP-521960221-172.31.35.231-1517967318079: this is the Block Pool ID. A block pool is a set of blocks that belong to a single namespace. For simplicity, you can say that all the blocks managed by a NameNode belong to the same block pool.
The file is 381 MB; with a block size of 128 MB (the default in recent Hadoop versions), it is divided into 3 blocks:
blk_1073743228_2413
blk_1073743229_2414
blk_1073743230_2415
len: the length of the block, i.e. the number of bytes in the block.
Live_repl=2 means there are 2 live replicas of this block.
DatanodeInfoWithStorage[172.31.14.147:50010,DS-6ac398a5-8621-4f60-bc93-5e61735a5505,DISK]: this gives the IP address of the DataNode holding a replica of the block, along with its storage ID and storage type.
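If you want to confirm the block size configured on your cluster (which determines how many blocks a file is split into), you can query the configuration; dfs.blocksize is the property name used by Hadoop 2.x and later:

hdfs getconf -confKey dfs.blocksize

The value is printed in bytes; 134217728 corresponds to the default of 128 MB.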
Deleting files and directories
To clean up our /user/ubuntu directory in HDFS, we can use the following commands to delete files and directories; the -r flag deletes a directory and its contents recursively:
hadoop fs -rm myNewDir2/u.data
hadoop fs -rm -r myNewDir
hadoop fs -rm -r myNewDir2
hadoop fs -rm -r myNewDir3
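A final listing of the HDFS home directory confirms the cleanup; it should come back empty (apart from anything else you may have stored there):

hadoop fs -ls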