CheatSheet | Hadoop


Introduction

  • Hadoop : A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

  • Modules :

    • Hadoop Distributed File System : A distributed file system that provides high-throughput access to application data

    • Hadoop YARN : A framework for job scheduling and cluster resource management

    • Hadoop MapReduce : A YARN-based system for parallel processing of large data sets.



1. Hadoop Distributed File System (HDFS)

  • Components :

    • Master (Name-node)

      • (1) Managing metadata

      • (2) Monitoring data-nodes : each data-node sends a heartbeat every 3 secs (default) to report its status; block information is sent in periodic block reports

      • (3) Processing client requests

    • Slave (Data-node)
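
The heartbeat interval also determines how long the name-node waits before treating a silent data-node as dead. The sketch below works through that arithmetic. It assumes the stock defaults for `dfs.heartbeat.interval` (3 secs) and `dfs.namenode.heartbeat.recheck-interval` (300000 ms); the variable and function names are hypothetical, and a real cluster may override both settings.

```shell
# Assumed defaults; a cluster may override both in hdfs-site.xml.
HEARTBEAT_INTERVAL_SECS=3      # dfs.heartbeat.interval
RECHECK_INTERVAL_MS=300000     # dfs.namenode.heartbeat.recheck-interval

# Hypothetical helper: seconds without a heartbeat before a data-node counts as dead.
dead_node_timeout_secs() {
  # timeout = 2 * recheck interval + 10 * heartbeat interval
  echo $(( 2 * RECHECK_INTERVAL_MS / 1000 + 10 * HEARTBEAT_INTERVAL_SECS ))
}

dead_node_timeout_secs   # prints 630 (10 min 30 secs) with the assumed defaults
```

On a live cluster the configured value can be read with `hdfs getconf -confKey dfs.heartbeat.interval` instead of relying on the assumed default.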

  • Characteristics:

    • Block File System :

      • Files are split into blocks for storage (128 MB per block by default in Hadoop 2.x)

      • Metadata (file names, directory tree, … ) for these blocks is stored on the name-node
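
To make the block layout concrete, here is a minimal sketch that estimates how many blocks a file of a given size occupies. It assumes the 128 MB default block size; `BLOCK_SIZE_BYTES` and `hdfs_block_count` are hypothetical names used for illustration only, not part of the Hadoop CLI.

```shell
# Assumes the 128 MB default block size; a cluster may configure a different value.
BLOCK_SIZE_BYTES=$((128 * 1024 * 1024))

# Hypothetical helper: number of blocks needed for a file of $1 bytes.
hdfs_block_count() {
  # Ceiling division: the last block may be only partially filled.
  echo $(( ($1 + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES ))
}

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
hdfs_block_count $((300 * 1024 * 1024))   # prints 3
```

The last block holds only the remaining 44 MB, and a zero-byte file needs no blocks at all.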

$ hdfs dfs -ls
$ hdfs dfs -put [LOCAL_FILE_NAME] [DEST]
$ hdfs dfs -get [FILE_NAME or FOLDER_PATH] [DEST_LOCAL]

$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]

# Create a zero-byte file
$ hdfs dfs -touchz ${OUT_PATH}