Introduction
- Hadoop: a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- Based on Java.
- Stores data in a distributed file system modeled on the Google File System (2003) and processes this big data with MapReduce (2004).

Modules:
- Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
- Hadoop YARN: a framework for job scheduling and cluster resource management.
- Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

1. Hadoop Distributed File System (HDFS)

Components:
- Master (Name-node)
  (1) Manages metadata.
  (2) Monitors data-nodes: each data-node sends a heartbeat signal with its status and block information every 3 seconds.
  (3) Processes client requests.
- Slave (Data-node): stores the actual data blocks and serves read/write requests from clients.
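The 3-second heartbeat mentioned above is Hadoop's default interval; it is controlled by the `dfs.heartbeat.interval` property in `hdfs-site.xml`. A minimal sketch (the value shown is simply the default, restated for illustration):

```xml
<!-- hdfs-site.xml: heartbeat interval (illustrative; 3 s is the default) -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value> <!-- seconds between data-node heartbeats -->
  </property>
</configuration>
```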
Characteristics:
- Block file system:
  - Data is saved as split blocks (128 MB per block by default in Hadoop 2.0).
  - The metadata for these blocks (file names, directory tree, …) is stored on the Name-node.
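As a quick sanity check on the block math above, a small shell sketch (the 1 GB file size is hypothetical) computes how many 128 MB blocks a file occupies:

```shell
# How many 128 MB HDFS blocks does a 1 GB file occupy? (file size is hypothetical)
BLOCK_SIZE=$((128 * 1024 * 1024))   # Hadoop 2.0 default block size, in bytes
FILE_SIZE=$((1024 * 1024 * 1024))   # a hypothetical 1 GB file, in bytes
# Ceiling division: a partial final block still counts as one block
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "$BLOCKS blocks"               # prints "8 blocks"
```

Note that HDFS blocks are not padded: the final block only occupies its actual size on disk, but each block still costs one metadata entry on the Name-node, which is why many small files strain the Name-node more than a few large ones.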

Basic shell commands:

$ hdfs dfs -ls                                           # List files in the user's HDFS home directory
$ hdfs dfs -put [LOCAL_FILE_NAME] [DEST]                 # Copy a local file into HDFS
$ hdfs dfs -get [FILE_NAME or FOLDER_PATH] [DEST_LOCAL]  # Copy a file or folder from HDFS to the local file system
$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]   # Delete files or directories
# Create a zero-byte file
$ hdfs dfs -touchz ${OUT_PATH}