Hadoop : A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Based on Java
Inspired by Google's papers : store data in a distributed file system modeled on the Google File System (2003) and process that big data with MapReduce (2004)
Modules :
Hadoop Distributed File System : A distributed file system that provides high-throughput access to application data
Hadoop YARN : A framework for job scheduling and cluster resource management
Hadoop MapReduce : A YARN-based system for parallel processing of large data sets.
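To make the MapReduce data flow concrete, here is a minimal local sketch of the map, shuffle and reduce phases using standard Unix tools. It does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration, and a real job would run these phases in parallel across the cluster.

```shell
# Local emulation of the MapReduce data flow (no Hadoop involved).
# All function names are hypothetical, chosen only for this sketch.

# map: emit one "word<TAB>1" pair per input word
map_phase() {
  tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t" 1 }'
}

# shuffle: sort so that identical keys end up on adjacent lines
shuffle_phase() {
  LC_ALL=C sort
}

# reduce: sum the values of each run of identical keys
# (relies on the sorted order produced by shuffle_phase)
reduce_phase() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
              { sum += $2 }
    END       { if (NR > 0) print key "\t" sum }
  '
}

word_count() {
  map_phase | shuffle_phase | reduce_phase
}
```

For example, `printf 'a b a\nb a\n' | word_count` prints `a` with count 3 and `b` with count 2, one tab-separated pair per line.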
1. Hadoop Distributed File System (HDFS)
Components :
Master (Name-node)
(1) Manages metadata
(2) Monitors Data-nodes : each Data-node sends a heartbeat every 3 seconds (default) to report its status; block information reaches the Name-node through separate, less frequent block reports.
(3) Processes client requests
Slave (Data-node)
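A rough sketch of how heartbeat monitoring could be reasoned about: the Name-node treats a Data-node as dead once no heartbeat has arrived for a timeout period. The numbers below assume stock defaults (3 s heartbeat interval, 5 min recheck interval, timeout = 2 x recheck + 10 x heartbeat); check them against your own cluster configuration. The function names are hypothetical and this is only the arithmetic, not Name-node code.

```shell
# Assumed defaults; real clusters may override both values.
HEARTBEAT_INTERVAL_SEC=3     # Data-node heartbeat period
RECHECK_INTERVAL_SEC=300     # Name-node recheck period (5 minutes)

# Seconds without a heartbeat before a Data-node is considered dead
dead_timeout_sec() {
  echo $(( 2 * RECHECK_INTERVAL_SEC + 10 * HEARTBEAT_INTERVAL_SEC ))
}

# datanode_state LAST_HEARTBEAT_EPOCH NOW_EPOCH
# Prints "dead" if the silence exceeds the timeout, otherwise "live".
datanode_state() {
  if [ $(( $2 - $1 )) -gt "$(dead_timeout_sec)" ]; then
    echo dead
  else
    echo live
  fi
}
```

Under these assumptions the timeout works out to 630 seconds, so a single missed 3-second heartbeat does not mark a node as dead.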
Block File System :
Data is split into blocks for storage (128 MB per block by default in Hadoop 2.x)
Metadata (file name, trees, … ) of these blocks is saved in the Name-node
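The block split above is simple arithmetic, sketched here assuming the 128 MB block size from these notes. The helper names (blocks_needed, last_block_bytes) are hypothetical; the last block of a file holds only the remaining bytes and is not padded to the full block size.

```shell
# Block size assumed from the notes: 128 MB per block
BLOCK_SIZE_BYTES=$(( 128 * 1024 * 1024 ))

# blocks_needed FILE_SIZE_BYTES
# Ceiling division: number of blocks a file of this size occupies
blocks_needed() {
  echo $(( ($1 + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES ))
}

# last_block_bytes FILE_SIZE_BYTES
# Size of the final (possibly partial) block
last_block_bytes() {
  if [ "$1" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( $1 - ($(blocks_needed "$1") - 1) * BLOCK_SIZE_BYTES ))
}
```

For a 300 MB file (314572800 bytes), `blocks_needed 314572800` gives 3 and `last_block_bytes 314572800` gives 46137344, that is two full 128 MB blocks plus one 44 MB block.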
$ hdfs dfs -ls
$ hdfs dfs -put [LOCAL_FILE_NAME] [DEST]
$ hdfs dfs -get [FILE_NAME or FOLDER_PATH] [DEST_LOCAL]
$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]
# Create a 0-byte file
$ hdfs dfs -touchz ${OUT_PATH}
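The commands above can be chained into a small upload / list / download / clean-up round trip. This is a sketch, not a recommended workflow: hdfs_round_trip and the HDFS_CMD variable are hypothetical names introduced here, `-mkdir -p` is an extra `hdfs dfs` subcommand not listed above, and the remote directory is assumed to be a scratch directory that is safe to delete.

```shell
# HDFS_CMD is a hypothetical override for this sketch:
# set it to "echo hdfs" to print the commands instead of running them.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# hdfs_round_trip LOCAL_FILE REMOTE_DIR DOWNLOAD_DIR
# Uploads a file, lists the remote directory, downloads the file again,
# then removes the remote (scratch) directory. Stops at the first failure.
hdfs_round_trip() {
  $HDFS_CMD dfs -mkdir -p "$2" &&
  $HDFS_CMD dfs -put "$1" "$2/" &&
  $HDFS_CMD dfs -ls "$2" &&
  $HDFS_CMD dfs -get "$2/$(basename "$1")" "$3/" &&
  $HDFS_CMD dfs -rm -r -skipTrash "$2"
}
```

A dry run with `HDFS_CMD="echo hdfs"` followed by `hdfs_round_trip ./a.txt /tmp/demo ./out` prints the five `hdfs dfs` commands in order without touching a cluster. Note that `-skipTrash` deletes immediately, so only point this at a directory you created for the test.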