Block storage

When we copy a file to HDFS, the file is divided into blocks, and the blocks are stored on individual nodes. HDFS has a global view of the file even though it is spread across the nodes in the cluster, whereas the local filesystem of each node only has a local view of the blocks it stores.
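The block size used for this split is a cluster-wide setting (128 MB by default in Hadoop 2.x and later). If you want to check the value your own cluster uses, getconf prints it in bytes:

hdfs getconf -confKey dfs.blocksize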

The fsck command gives us more information about how the blocks are stored on the different nodes. It must be run as the HDFS superuser (the hdfs user), which is why we prefix it with sudo below.
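As an aside, fsck can also be pointed at the root of the filesystem to get a health summary of every file in HDFS (this can be slow on a large cluster):

sudo -u hdfs hdfs fsck /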

First, let's download a large dataset and upload it to HDFS. Once connected to the main node of your cluster, type the following commands into the terminal:

wget https://s3-us-west-1.amazonaws.com/julienheck/hadoop/datasets/crime_data_la/Crime_Data_from_2010_to_Present.csv

hadoop fs -copyFromLocal Crime_Data_from_2010_to_Present.csv myNewDir
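If the copy succeeded, listing the directory should show the file (this assumes the myNewDir directory was already created in HDFS, as in the earlier examples):

hadoop fs -ls myNewDir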

Now that our dataset has been uploaded to HDFS, we can run the following command in the terminal:

sudo -u hdfs hdfs fsck /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv -files -blocks -locations

We can see some interesting information here.
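For our file, the relevant part of the output looks roughly like this (abridged and reconstructed from the fields discussed below; values we did not record are replaced with …):

/user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv … bytes, 3 block(s):  OK
0. BP-521960221-172.31.35.231-1517967318079:blk_1073743228_2413 len=… Live_repl=2 [DatanodeInfoWithStorage[172.31.14.147:50010,DS-6ac398a5-8621-4f60-bc93-5e61735a5505,DISK], …]
1. BP-521960221-172.31.35.231-1517967318079:blk_1073743229_2414 len=… Live_repl=2 […]
2. BP-521960221-172.31.35.231-1517967318079:blk_1073743230_2415 len=… Live_repl=2 […]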

  • BP-521960221-172.31.35.231-1517967318079: This is the Block Pool ID. A block pool is a set of blocks that belong to a single namespace. For simplicity, you can think of all the blocks managed by a NameNode as belonging to the same block pool.

  • Since the file is 381 MB and the default block size is 128 MB, it is divided into 3 blocks:

  • blk_1073743228_2413

  • blk_1073743229_2414

  • blk_1073743230_2415

  • len: the length of the block, i.e., the number of bytes it contains.

  • Live_repl=2 means there are 2 replicas of this block (see the example after this list for how to change the replication factor).

  • DatanodeInfoWithStorage[172.31.14.147:50010,DS-6ac398a5-8621-4f60-bc93-5e61735a5505,DISK]: This gives the IP address, storage ID, and storage type of a DataNode holding a replica of this block.
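The replication factor can be changed per file. As a sketch (assuming you want 3 replicas of our dataset and have permission to modify it), setrep asks HDFS to re-replicate the file, and the -w flag waits until the target is reached:

hadoop fs -setrep -w 3 /user/ubuntu/myNewDir/Crime_Data_from_2010_to_Present.csv

Running fsck again afterwards should then report Live_repl=3 for each block.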
