Second Screen Re-targeting: Hadoop 1.0 Quick Start Update

Purpose
Hadoop 1.0 introduces quite a few changes that are not reflected in Hadoop's official setup guide. This document supplements that guide with updates for version 1.0.

Please follow the official Hadoop setup guide first, then check the sections below for the 1.0-specific changes.

Standalone Operations
In Hadoop 1.0, all configuration files have been moved to the etc/hadoop directory.
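For scripts that have to cope with both layouts, a small helper can probe for the configuration directory. This is only a sketch: the function name find_conf_dir is made up for illustration, and it assumes the older layout keeps its files in a conf directory under the Hadoop home.

```shell
# Hypothetical helper: print the configuration directory of a Hadoop install.
# Assumption: the install uses either the 1.0 layout (etc/hadoop) or the
# older layout (conf), both relative to the Hadoop home passed as $1.
find_conf_dir() {
  home="${1:-.}"
  for candidate in "$home/etc/hadoop" "$home/conf"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Neither layout was found; report on stderr and signal failure.
  echo "no configuration directory found under $home" >&2
  return 1
}
```

Run from the unpacked Hadoop directory, `find_conf_dir .` should print ./etc/hadoop on a 1.0 install.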

The following example copies the XML files from the unpacked etc/hadoop directory to use as input, then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
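To sanity-check the result, the same counts can be approximated locally with standard Unix tools. The sketch below is not how Hadoop runs the job; it only mimics the result of the grep example (count each distinct match, most frequent first). The function name local_grep_count is hypothetical, and the -o and -h options assume GNU or BSD grep.

```shell
# Local approximation of the Hadoop grep example using standard Unix tools.
# Prints "<count> <match>" for each distinct match, most frequent first.
# Assumption: GNU or BSD grep (the -o and -h options are not in POSIX).
local_grep_count() {
  dir="$1"
  regex="$2"
  # -o prints only the matched text, -h drops file names,
  # -E enables extended regular expressions so that '+' works.
  grep -ohE "$regex" "$dir"/* | sort | uniq -c | sort -rn
}
```

For example, `local_grep_count input 'dfs[a-z.]+'` scans the same input directory as the job above. The counts should agree with `cat output/*`, although the column spacing differs.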

Execution
Start the Hadoop daemons:
$ sbin/start-all.sh
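Before submitting a job, it can help to confirm that the daemons actually came up. The sketch below reads the output of the JDK's jps tool and lists any daemon that is missing. It assumes the single-node setup from the official guide, where all five Hadoop 1.x daemons run on one machine; the function name missing_daemons is made up for illustration.

```shell
# Hypothetical helper: read `jps` output on stdin and print missing daemons.
# Assumption: a single-node Hadoop 1.x setup, which runs these five daemons.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # jps prints "<pid> <main class>"; compare the class name as a whole
    # field so that SecondaryNameNode is not mistaken for NameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      printf '%s\n' "$daemon"
    fi
  done
}
```

With the daemons running, `jps | missing_daemons` should print nothing. Any name it does print is a daemon whose log is worth checking.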

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put etc/hadoop/ input

Run some of the examples provided:
$ bin/hadoop jar share/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

You should see output similar to this:

12/02/23 16:48:04 INFO mapred.FileInputFormat: Total input paths to process : 16
12/02/23 16:48:05 INFO mapred.JobClient: Running job: job_201202231031_0001
12/02/23 16:48:06 INFO mapred.JobClient: map 0% reduce 0%
12/02/23 16:48:19 INFO mapred.JobClient: map 12% reduce 0%
12/02/23 16:48:28 INFO mapred.JobClient: map 25% reduce 0%
12/02/23 16:48:34 INFO mapred.JobClient: map 25% reduce 4%
12/02/23 16:48:37 INFO mapred.JobClient: map 37% reduce 4%
12/02/23 16:48:40 INFO mapred.JobClient: map 37% reduce 8%
12/02/23 16:48:43 INFO mapred.JobClient: map 50% reduce 8%
12/02/23 16:48:50 INFO mapred.JobClient: map 56% reduce 12%
12/02/23 16:48:53 INFO mapred.JobClient: map 62% reduce 12%
12/02/23 16:48:56 INFO mapred.JobClient: map 68% reduce 18%
12/02/23 16:48:58 INFO mapred.JobClient: map 75% reduce 22%
12/02/23 16:49:01 INFO mapred.JobClient: map 81% reduce 22%
12/02/23 16:49:04 INFO mapred.JobClient: map 87% reduce 22%
12/02/23 16:49:07 INFO mapred.JobClient: map 93% reduce 27%
12/02/23 16:49:10 INFO mapred.JobClient: map 100% reduce 27%
12/02/23 16:49:13 INFO mapred.JobClient: map 100% reduce 29%
12/02/23 16:49:20 INFO mapred.JobClient: map 100% reduce 100%
12/02/23 16:49:25 INFO mapred.JobClient: Job complete: job_201202231031_0001
12/02/23 16:49:25 INFO mapred.JobClient: Counters: 30
12/02/23 16:49:25 INFO mapred.JobClient: Job Counters
12/02/23 16:49:25 INFO mapred.JobClient: Launched reduce tasks=1
12/02/23 16:49:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=99019
12/02/23 16:49:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/02/23 16:49:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/02/23 16:49:25 INFO mapred.JobClient: Launched map tasks=16
12/02/23 16:49:25 INFO mapred.JobClient: Data-local map tasks=16
12/02/23 16:49:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=60471
12/02/23 16:49:25 INFO mapred.JobClient: File Input Format Counters
12/02/23 16:49:25 INFO mapred.JobClient: Bytes Read=26852
12/02/23 16:49:25 INFO mapred.JobClient: File Output Format Counters
12/02/23 16:49:25 INFO mapred.JobClient: Bytes Written=180
12/02/23 16:49:25 INFO mapred.JobClient: FileSystemCounters
12/02/23 16:49:25 INFO mapred.JobClient: FILE_BYTES_READ=82
12/02/23 16:49:25 INFO mapred.JobClient: HDFS_BYTES_READ=28574
12/02/23 16:49:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=367327
12/02/23 16:49:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=180
12/02/23 16:49:25 INFO mapred.JobClient: Map-Reduce Framework
12/02/23 16:49:25 INFO mapred.JobClient: Map output materialized bytes=172
12/02/23 16:49:25 INFO mapred.JobClient: Map input records=758
12/02/23 16:49:25 INFO mapred.JobClient: Reduce shuffle bytes=166
12/02/23 16:49:25 INFO mapred.JobClient: Spilled Records=6
12/02/23 16:49:25 INFO mapred.JobClient: Map output bytes=70
12/02/23 16:49:25 INFO mapred.JobClient: Total committed heap usage (bytes)=2596864000
12/02/23 16:49:25 INFO mapred.JobClient: CPU time spent (ms)=12500
12/02/23 16:49:25 INFO mapred.JobClient: Map input bytes=26852
12/02/23 16:49:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=1722
12/02/23 16:49:25 INFO mapred.JobClient: Combine input records=3
12/02/23 16:49:25 INFO mapred.JobClient: Reduce input records=3
12/02/23 16:49:25 INFO mapred.JobClient: Reduce input groups=3
12/02/23 16:49:25 INFO mapred.JobClient: Combine output records=3
12/02/23 16:49:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=2788790272
12/02/23 16:49:25 INFO mapred.JobClient: Reduce output records=3
12/02/23 16:49:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=9619705856
12/02/23 16:49:25 INFO mapred.JobClient: Map output records=3
12/02/23 16:49:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/23 16:49:25 INFO mapred.FileInputFormat: Total input paths to process : 1
12/02/23 16:49:25 INFO mapred.JobClient: Running job: job_201202231031_0002
12/02/23 16:49:26 INFO mapred.JobClient: map 0% reduce 0%
12/02/23 16:49:41 INFO mapred.JobClient: map 100% reduce 0%
12/02/23 16:49:53 INFO mapred.JobClient: map 100% reduce 100%
12/02/23 16:49:58 INFO mapred.JobClient: Job complete: job_201202231031_0002
12/02/23 16:49:58 INFO mapred.JobClient: Counters: 30
12/02/23 16:49:58 INFO mapred.JobClient: Job Counters
12/02/23 16:49:58 INFO mapred.JobClient: Launched reduce tasks=1
12/02/23 16:49:58 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13874
12/02/23 16:49:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/02/23 16:49:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/02/23 16:49:58 INFO mapred.JobClient: Launched map tasks=1
12/02/23 16:49:58 INFO mapred.JobClient: Data-local map tasks=1
12/02/23 16:49:58 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10575
12/02/23 16:49:58 INFO mapred.JobClient: File Input Format Counters
12/02/23 16:49:58 INFO mapred.JobClient: Bytes Read=180
12/02/23 16:49:58 INFO mapred.JobClient: File Output Format Counters
12/02/23 16:49:58 INFO mapred.JobClient: Bytes Written=52
12/02/23 16:49:58 INFO mapred.JobClient: FileSystemCounters
12/02/23 16:49:58 INFO mapred.JobClient: FILE_BYTES_READ=82
12/02/23 16:49:58 INFO mapred.JobClient: HDFS_BYTES_READ=296
12/02/23 16:49:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42387
12/02/23 16:49:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=52
12/02/23 16:49:58 INFO mapred.JobClient: Map-Reduce Framework
12/02/23 16:49:58 INFO mapred.JobClient: Map output materialized bytes=82
12/02/23 16:49:58 INFO mapred.JobClient: Map input records=3
12/02/23 16:49:58 INFO mapred.JobClient: Reduce shuffle bytes=82
12/02/23 16:49:58 INFO mapred.JobClient: Spilled Records=6
12/02/23 16:49:58 INFO mapred.JobClient: Map output bytes=70
12/02/23 16:49:58 INFO mapred.JobClient: Total committed heap usage (bytes)=220725248
12/02/23 16:49:58 INFO mapred.JobClient: CPU time spent (ms)=1990
12/02/23 16:49:58 INFO mapred.JobClient: Map input bytes=94
12/02/23 16:49:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=116
12/02/23 16:49:58 INFO mapred.JobClient: Combine input records=0
12/02/23 16:49:58 INFO mapred.JobClient: Reduce input records=3
12/02/23 16:49:58 INFO mapred.JobClient: Reduce input groups=1
12/02/23 16:49:58 INFO mapred.JobClient: Combine output records=0
12/02/23 16:49:58 INFO mapred.JobClient: Physical memory (bytes) snapshot=235311104
12/02/23 16:49:58 INFO mapred.JobClient: Reduce output records=3
12/02/23 16:49:58 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1171832832
12/02/23 16:49:58 INFO mapred.JobClient: Map output records=3

Examine the output.

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output

$ cat output/*

View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*

When you're done, stop the daemons with:
$ sbin/stop-all.sh


Friday, February 24, 2012
