The following steps have been tested only on an Ubuntu 12.04 LTS server.

<< Download These Labs >>

sudo apt-get install git
git clone https://github.com/jazzwang/hadoop_labs

<< Local Mode >>

lab000/hadoop-local-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input

Exercise:
(1) How many Java processes are there in local mode?
(2) What do you see after running "hadoop fs -ls" in local mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?

<< Pseudo-Distributed Mode >>

lab001/hadoop-pseudo-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls
hadoop fs -mkdir tmp
hadoop fs -ls
hadoop fs -put ${HOME}/hadoop/conf input

Exercise:
(1) How many Java processes are there in pseudo-distributed mode?
(2) What do you see after running "hadoop fs -ls" in pseudo-distributed mode?
(3) What happened to your working directory after running "hadoop fs -mkdir tmp"?
(4) What happened to your working directory after running "hadoop fs -put ..."?

<< Fully-Distributed Mode >>

lab002/hadoop-full-mode
jps
netstat -nap | grep java
export PATH=${HOME}/hadoop/bin:$PATH
hadoop fs -ls

Exercise:
(1) How many Java processes are there in fully-distributed mode?
(2) What do you see after running "hadoop fs -ls" in fully-distributed mode?
(3) In the netstat results, what is the difference between fully-distributed mode and pseudo-distributed mode?

<< DEBUG: Shell Script >>

file `which hadoop`
bash -x `which hadoop` fs -ls
bash -x `which hadoop` jar
bash -x `which hadoop` fsck

Exercise:
(1) Which Java class does the "hadoop fs" command call?
(2) Which Java class does the "hadoop jar" command call?
(3) Which Java class does the "hadoop fsck" command call?

<< DEBUG: Log4j >>

export HADOOP_ROOT_LOGGER=INFO,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=WARN,console
hadoop fs -ls
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls
unset HADOOP_ROOT_LOGGER

Exercise:
(1) In the output of "hadoop fs -ls", is there any difference between INFO and WARN?
(2) In the output of "hadoop fs -ls", is there any difference between INFO and DEBUG?

<< DEBUG: Changing Modes >>

export HADOOP_CONF_DIR=~/hadoop/conf.pseudo/
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
unset HADOOP_CONF_DIR
hadoop fs -ls

Exercise:
(1) If you are currently running in fully-distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the pseudo-distributed configuration directory? Why are there errors?
(2) If you are currently running in fully-distributed mode, what is the result of "hadoop fs -ls" after changing HADOOP_CONF_DIR to the local mode configuration directory?

<< DEBUG/Monitoring: jconsole >>

jconsole

Exercise:
(1) If you are currently running in fully-distributed mode, try to connect to the namenode Java process.
(2) If you are currently running in fully-distributed mode, try to connect to the datanode Java process.

<< HDFS: FsShell >>

lab003/FsShell

Exercise:
(1) What is the value of Path.CUR_DIR?
(2) In local mode, which class is the srcFs object?
(3) In fully-distributed mode, which class is the srcFs object?
(4) Which classes are updated in hadoop-core-*.jar, judging from the difference between the two jar files?
(5) Observe the source code layout in ${HOME}/hadoop/src/core/org/apache/hadoop/fs. Which file systems are supported by Hadoop 1.0.4?
(A) HDFS (hdfs://namenode:port)
(B) Amazon S3 (s3:// , s3n://)
(C) KFS (kfs://)
(D) Local File System (file:///)
(E) FTP (ftp://user:passwd@ftp-server:port)
(F) RAMFS (ramfs://)
(G) HAR (Hadoop Archive Filesystem, har://underlyingfsscheme-host:port/archivepath or har:///archivepath)

Reference:
(1) http://answers.oreilly.com/topic/456-get-to-know-hadoop-filesystems/
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/package-tree.html
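To explore questions (1)-(3) yourself, you can resolve the working directory through the FileSystem API directly, much as FsShell does. Below is a minimal sketch (the class name ListCwd is ours, not part of the labs): run it under conf.local and under the fully-distributed configuration and compare which FileSystem subclass it reports.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListCwd {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml etc. from HADOOP_CONF_DIR, so the same code
    // binds to different file systems in different modes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    System.out.println("srcFs class: " + fs.getClass().getName());

    // Path.CUR_DIR is the relative path "."; FsShell resolves it against
    // the file system's working directory, as we do here.
    Path cwd = fs.getWorkingDirectory();
    System.out.println("working directory: " + cwd);
    for (FileStatus stat : fs.listStatus(cwd)) {
      System.out.println(stat.getPath());
    }
  }
}

Package it as a jar and run it with "hadoop jar" so it picks up the same configuration as the shell commands above.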
<< Develop: Eclipse >>

wget https://raw.github.com/mkamithkumar/hadoop-eclipse-plugins/master/hadoop-eclipse-plugin-1.0.4.jar
eclipse

Reference:
(1) https://github.com/mkamithkumar/hadoop-eclipse-plugins

<< HDFS: FileSystem.copyFromLocal >>

cd ~/hadoop_labs/lab004
ant
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
touch test
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -ls
hadoop jar copyFromLocal.jar doc input
hadoop fs -ls
hadoop jar copyFromLocal.jar test file
hadoop fs -ls
unset HADOOP_CONF_DIR
ant clean

Exercise:
(1) In fully-distributed mode, what do you see after running "hadoop fs -ls"?
(2) Change to local mode. What do you see after running "hadoop fs -ls"?

<< HDFS: FileSystem.copyToLocal >>

cd ~/hadoop_labs/lab005
ant
hadoop fs -ls
ls
hadoop jar copyToLocal.jar input input
ls
hadoop jar copyToLocal.jar file file
ls
ant clean

Exercise:
(1) In fully-distributed mode, what do you see after running "ls"?
(2) Switch to local mode and see what the difference is between local mode and fully-distributed mode.

<< HDFS: FileSystem.exists / FileSystem.isFile / FileSystem.isDirectory >>

cd ~/hadoop_labs/lab006
ant
hadoop jar isFile.jar input
hadoop jar isFile.jar file
hadoop jar isFile.jar empty

Exercise:
(1) What are the differences in the results of this lab?

Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/fs/FileSystem.html

<< MapReduce: WordCount (after 0.19) >>

cd ~/hadoop_labs/lab007
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
(open another console)
jps
watch -n 1 jps
hadoop job -list all
(open a browser)
http://localhost:50030
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop fs -put ~/hadoop/conf local-input
hadoop jar WordCount.jar local-input local-output
unset HADOOP_CONF_DIR

Exercise:
(1) In the jps output, which Java process corresponds to the main() function defined in WordCount.java?
(2) Do you see "Child" in the jps output? How many Java processes named "Child" do you see?
(3) Change to local mode. Do you see "mapred.LocalJobRunner" after running "hadoop jar WordCount.jar ..."?
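Before comparing lab007 with the older lab008 below, it helps to have the canonical new-API (org.apache.hadoop.mapreduce) WordCount in front of you. The following sketch follows the stock Hadoop 1.x example; lab007's actual source may differ in details such as class names and logging.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all counts for one word; also reused as the combiner.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");  // the Hadoop 1.x Job constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that main() only configures and submits the Job from a client-side JVM; the map and reduce work happens in separate task JVMs on the cluster, which is why jps shows more than one Java process during the run.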
<< MapReduce: WordCount (before 0.19) >>

cd ~/hadoop_labs/lab008
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output

Exercise:
(1) Compare the two versions of the WordCount example and draw a UML class diagram for each.

<< MapReduce: Inner Class vs. Public Classes >>

cd ~/hadoop_labs/lab009
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-r-00000

Exercise:
This example shows the difference between an inner class and public classes.
(1) How many Java source files are there in the lab009/src folder?
(2) How many class files are there after running ant? Compare the class names between lab007 and lab009, then describe the difference between these two examples.
(3) How many mapper tasks are there in this job? How many reducer tasks? How many task attempts are there in each task?
(4) What is the value of the "mapred.reduce.tasks" property in ${HOME}/hadoop/src/mapred/mapred-default.xml?

<< MapReduce: Job.setNumReduceTasks() >>

cd ~/hadoop_labs/lab010
ant
hadoop fs -rmr input output
hadoop fs -put ~/hadoop/conf input
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part-*

Exercise:
(1) How many reducer tasks are there in this job?
(2) How many output files does this job produce?
(3) Observe the order of the output. Assume the mapper output keys are {A,B,C,D} and that A < B < C < D. With the number of reducers set to 2, what will the output be?
(A) {A, B}, {C, D}
(B) {A, B, C}, {D}
(C) {A, C}, {B, D}
(D) {A, D}, {B, C}

Reference:
(1) org.apache.hadoop.mapreduce.Job.setNumReduceTasks(int)
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)

<< Debug: setNumReduceTasks(0) >>

cd ~/hadoop_labs/lab010
sed -i 's#setNumReduceTasks(2)#setNumReduceTasks(0)#g' ~/hadoop_labs/lab010/src/WordCount.java
ant
hadoop fs -rmr output
hadoop jar WordCount.jar input output
hadoop fs -ls output
hadoop fs -cat output/part*

Exercise:
Modify the Java source in lab010 to set the number of reducers to 0, run "ant" to recompile, and observe the result in output.
(1) How many mapper tasks are there in this job?
(2) How many output files are there in the output?

<< MapReduce: TextInputFormat >>

cd ~/hadoop_labs/lab011
ant
cd ~/hadoop_labs/lab010
mkdir -p my_input
echo "A B C D" > my_input/input1
echo "C D A B" > my_input/input2
hadoop fs -put my_input my_input
sed -i 's#setNumReduceTasks(0)#setNumReduceTasks(1)#g' ~/hadoop_labs/lab010/src/WordCount.java
ant
hadoop jar WordCount.jar my_input my_output
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar my_input my_output
unset HADOOP_CONF_DIR

Exercise:
(1) In fully-distributed mode, how many lines of the "TextInputFormat.isSplitable" message do you see while running the job?
(2) In fully-distributed mode, how many lines of the "TextInputFormat.createRecordReader" message do you see while running the job?
(3) In local mode, how many lines of the "TextInputFormat.isSplitable" message do you see while running the job?
(4) In local mode, how many lines of the "TextInputFormat.createRecordReader" message do you see while running the job?

Reference:
(1) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/InputFormat.html
(2) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
(3) http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
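The stock TextInputFormat prints nothing, so the "TextInputFormat.isSplitable" and "TextInputFormat.createRecordReader" messages must come from logging that lab011's build adds. If you want the same effect without rebuilding Hadoop, an instrumented subclass would look like the sketch below (the class name LoggingTextInputFormat and the message wording are ours); you would wire it in with job.setInputFormatClass(LoggingTextInputFormat.class).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass that traces the two methods the lab011 exercise asks about.
public class LoggingTextInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Called once per input file while input splits are computed.
    System.out.println("TextInputFormat.isSplitable: " + file);
    return super.isSplitable(context, file);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Called once per input split, inside the map task that consumes it.
    System.out.println("TextInputFormat.createRecordReader: " + split);
    return super.createRecordReader(split, context);
  }
}

Note where each message shows up: isSplitable runs in the client JVM during split computation, whereas createRecordReader runs inside each map task, so in fully-distributed mode its output lands in the task logs (web UI) rather than your console. In local mode everything shares one JVM and one console.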
<< MapReduce: KeyValueTextInputFormat >>

cd ~/hadoop_labs/lab012
ant
mkdir -p kv_input
printf "A\t1\n" > kv_input/input1
printf "B\t2\n" >> kv_input/input1
printf "C\t3\n" >> kv_input/input1
printf "A\t1\n" > kv_input/input2
printf "C\t2\n" >> kv_input/input2
printf "B\t1\n" >> kv_input/input2
hadoop fs -put kv_input kv_input
hadoop jar WordCount.jar kv_input kv_output
hadoop fs -ls kv_output
hadoop fs -cat kv_output/part-*
export HADOOP_CONF_DIR=~/hadoop/conf.local/
hadoop jar WordCount.jar kv_input kv_output
ls -al kv_output
cat kv_output/part-*
unset HADOOP_CONF_DIR

Exercise:
(1)

Reference:
(1) org.apache.hadoop.mapreduce.Job.setInputFormatClass(Class)
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setInputFormatClass%28java.lang.Class%29
(2) org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/lib/input/KeyValueTextInputFormat.html

<< MapReduce: Configuration >>

cd ~/hadoop_labs/lab013
ant
hadoop fs -put l13_input l13_input
hadoop jar WordCount.jar l13_input l13_output
hadoop fs -cat l13_output/part-r-00000
hadoop jar WordCount.jar -Dwordcount.case.sensitive=false l13_input l13_output2
hadoop fs -cat l13_output2/part-r-00000

Reference:
(1) org.apache.hadoop.conf.Configuration.setBoolean(String name, boolean value)
(2) org.apache.hadoop.conf.Configuration.getBoolean(String name, boolean defaultValue)
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/conf/Configuration.html
(3) org.apache.hadoop.mapreduce.Mapper.setup(Mapper.Context context)
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Mapper.html#setup%28org.apache.hadoop.mapreduce.Mapper.Context%29
(4) org.apache.hadoop.mapreduce.JobContext.getConfiguration()
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/JobContext.html#getConfiguration%28%29
(5) http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_7_1.html

<< MapReduce: DistributedCache >>

cd ~/hadoop_labs/lab014
ant

Reference:
(1) org.apache.hadoop.filecache.DistributedCache.addCacheFile()
    http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html

<< Query: DBInputFormat >>

<< Pig: Apache Log Analysis >>

Reference:
(1)

<< Pig: DBStorage >>

Reference:
(1)

<< Pig: HBaseStorage >>

Reference:
(1)

<< Other Ideas >>

1. A lab for concurrent reads and concurrent writes:
   - local mode vs. fully-distributed mode
   - local disk 1 to local disk 2
   - RAM disk to local disk
   - multiple RAM disks to fully-distributed HDFS
2. Generate files to show the limitation imposed by the HDFS namenode HEAPSIZE (see the sketch after this list).
3. Use Pig XMLLoader to load XML, and HBaseStorage to store the results into HBase, or DBStorage to store them into MySQL/SQLite.
4. mapper or reducer task running more than 10 minutes. or change
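For idea 2, a minimal sketch of such a generator (the class name SmallFileFlood and the target directory are our assumptions, for illustration only). Every file, directory, and block is an object in the NameNode heap, on the order of 150 bytes each, so a flood of tiny files exhausts HEAPSIZE long before it exhausts disk space.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileFlood {
  public static void main(String[] args) throws Exception {
    int count = Integer.parseInt(args[0]);  // e.g. 1000000
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("small-file-flood");
    fs.mkdirs(dir);
    for (int i = 0; i < count; i++) {
      // One tiny file per iteration: each costs a file object plus a block
      // object in the NameNode heap, regardless of the 1-byte payload.
      FSDataOutputStream out = fs.create(new Path(dir, "f" + i));
      out.write('x');
      out.close();
    }
    System.out.println("created " + count + " files under " + dir);
  }
}

Watch the NameNode heap with jconsole (see the jconsole lab above) while it runs.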