Introduction
- Hadoop: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
- Based on Java
- Inspired by Google's papers on the Google File System (2003) and MapReduce (2004): data is stored in a distributed file system and processed with MapReduce
Modules:
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
- Hadoop YARN: A framework for job scheduling and cluster resource management
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
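The MapReduce model mentioned above can be illustrated with a minimal in-memory word count in plain Java. This is a sketch of the programming model only, not the Hadoop API; the class and method names (`WordCountSketch`, `map`, `reduce`) are illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the MapReduce model: map emits (word, 1) pairs, the framework
// groups pairs by key (shuffle), and reduce sums the counts per word.
public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data big clusters", "big data");
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line)); // map each input split
        Map<String, Integer> counts = reduce(pairs);
        System.out.println(counts.get("big"));  // 3
        System.out.println(counts.get("data")); // 2
    }
}
```

In real Hadoop, the map and reduce phases run on different machines and the shuffle moves data over the network; the grouping-by-key step here stands in for that.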
1. Hadoop Distributed File System (HDFS)
Components:
- Master (NameNode)
  (1) Manages metadata
  (2) Monitors DataNodes: each DataNode sends a heartbeat with its status and block information every 3 seconds
  (3) Processes client requests
- Slave (DataNode): stores the actual file blocks and reports them to the NameNode
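The heartbeat monitoring described in (2) can be sketched as simple bookkeeping: the NameNode records the last heartbeat time per DataNode and treats a node as stale once too many intervals pass without one. The class name and the stale timeout below are assumptions for illustration, not Hadoop's actual classes or defaults.

```java
import java.util.*;

// Sketch of the NameNode's heartbeat bookkeeping (illustrative names):
// DataNodes report every 3 seconds; a node is considered stale once it
// misses heartbeats past an assumed timeout.
public class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000; // DataNodes report every 3 s
    static final long STALE_TIMEOUT_MS = 10 * HEARTBEAT_INTERVAL_MS; // assumed for this sketch

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called when a DataNode's heartbeat (status + block information) arrives.
    void onHeartbeat(String dataNodeId, long nowMs) {
        lastHeartbeat.put(dataNodeId, nowMs);
    }

    // A node is live if its last heartbeat is within the timeout.
    boolean isLive(String dataNodeId, long nowMs) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last != null && nowMs - last <= STALE_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor();
        m.onHeartbeat("dn1", 0);
        System.out.println(m.isLive("dn1", 5_000));  // true: within timeout
        System.out.println(m.isLive("dn1", 60_000)); // false: heartbeats missed
    }
}
```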
Characteristics:
- Block file system:
  - Data is saved as split blocks (128 MB/block by default in Hadoop 2.0)
  - Metadata (file names, directory tree, ...) of these blocks is saved on the NameNode
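The block-splitting rule above amounts to ceiling division of the file size by the block size, with a possibly smaller final block. A small sketch (the class name `BlockSplit` is illustrative):

```java
// Sketch of how a file splits into fixed-size HDFS blocks
// (128 MB per block by default in Hadoop 2.x); the last block may be smaller.
public class BlockSplit {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // Number of blocks needed for a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Size of the last (possibly partial) block.
    static long lastBlockSize(long fileSizeBytes) {
        long r = fileSizeBytes % BLOCK_SIZE;
        return r == 0 ? BLOCK_SIZE : r;
    }

    public static void main(String[] args) {
        long file = 300L * 1024 * 1024; // a 300 MB file
        System.out.println(blockCount(file));                    // 3 (128 + 128 + 44 MB)
        System.out.println(lastBlockSize(file) / (1024 * 1024)); // 44
    }
}
```

Only this block layout is computed by the NameNode; the block contents themselves live on the DataNodes.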
Basic HDFS shell commands:
$ hdfs dfs -ls                                            # list the current HDFS directory
$ hdfs dfs -put [LOCAL_FILE_NAME] [DEST]                  # copy a local file into HDFS
$ hdfs dfs -get [FILE_NAME or FOLDER_PATH] [DEST_LOCAL]   # copy a file or folder from HDFS to local
$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]    # delete files (use -r for directories)