Block storage
When we copy a file to HDFS, the file is divided into blocks and the blocks are stored on individual nodes. HDFS has a global view of the file even though the file is spread across the nodes of the cluster, whereas the local filesystem on each node has only a local view of the blocks it stores.
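The block size is controlled by the dfs.blocksize property and defaults to 128 MB in recent Hadoop releases. If you want to check the value in effect on your cluster, you can query it directly:
hdfs getconf -confKey dfs.blocksize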
The fsck command gives us more information about how the blocks are stored on the different nodes. We need to run it as the hdfs superuser, which is why the command below is prefixed with sudo -u hdfs.
First, let's download a large dataset and upload it to HDFS. Once connected to the main node of your cluster, type the following commands into the Terminal:
wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/datasets/crime_data_la/Crime_Data_from_2010_to_Present.csv
hadoop fs -copyFromLocal Crime_Data_from_2010_to_Present.csv myNewDir
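Note that -copyFromLocal expects the myNewDir directory to already exist in HDFS; if it does not, the dataset would simply be copied to a file named myNewDir. If needed, create the directory first and then verify the upload:
hadoop fs -mkdir myNewDir
hadoop fs -ls myNewDir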
Now that our dataset has been uploaded into HDFS, we can run the following command in the Terminal:
sudo -u hdfs hdfs fsck /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv -files -blocks -locations
We can see some interesting information in the output.
BP-521960221-172.31.35.231-1517967318079: This is the block pool ID. A block pool is a set of blocks that belong to a single namespace. For simplicity, you can say that all the blocks managed by a Name Node are in the same block pool.
The file being 381 MB in size and the default block size being 128 MB, the file is divided into 3 blocks (128 MB + 128 MB + 125 MB):
blk_1073743228_2413
blk_1073743229_2414
blk_1073743230_2415
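On a Data Node, each of these blocks is stored as a plain file on the local filesystem, which is the local view mentioned earlier. If you log into one of the Data Nodes holding a replica, you should be able to locate the underlying block file; the exact path depends on your cluster's dfs.datanode.data.dir setting:
sudo find / -name "blk_1073743228*" 2>/dev/null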
len: The length of the block, i.e. the number of bytes in the block.
Live_repl=2: There are 2 live replicas of this block.
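The replication factor of a file can be checked, and even changed, from the command line. For example, the first command below prints the current replication factor of our file, and the second raises it to 3 (the -w flag waits for re-replication to complete, so it assumes the cluster has at least 3 Data Nodes):
hdfs dfs -stat %r /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv
hadoop fs -setrep -w 3 /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv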
DatanodeInfoWithStorage[172.31.14.147:50010,DS-6ac398a5-8621-4f60-bc93-5e61735a5505,DISK]: For each replica, this gives the IP address and port of the Data Node holding it, the storage ID on that Data Node, and the storage type (DISK).
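To map these IP addresses back to the Data Nodes of the cluster, you can ask for a report listing every live Data Node together with its address and usage statistics:
sudo -u hdfs hdfs dfsadmin -report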