
Mass-gzip files inside HDFS using the power of Hadoop

I have a bunch of text files sitting in HDFS that I need to compress. It’s on the order of several hundred files comprising several hundred gigabytes of data. There are several ways to do this.

  1. I could individually copy down each file, compress it, and re-upload it to HDFS. This would take an excessive amount of time.
  2. I could run a hadoop streaming job that turns on mapred.output.compress and mapred.compress.map.output, sets mapred.output.compression.codec to org.apache.hadoop.io.compress.GzipCodec, and then just cats the output.
  3. I could write a shell script, shipped out via a hadoop streaming job, that copies each file down to the local datanode where the task is executing, runs gzip on it, and re-uploads the result.

Option 1 was a no-go from the start. It would have taken days to complete the operation and I would have lost the opportunity to learn something new.

Option 2 was attempted, but the inputs were split along their block boundaries, so each resulting file came out as multiple gzipped parts. This was a no-go because I needed each file reassembled in one piece. Additionally, I still ended up having to run one mapreduce job per file, which was going to take about a day the way I had written it.
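
For reference, the option 2 attempt was roughly the following invocation. This is a sketch rather than the exact command I ran, and the input and output paths are only illustrative:

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
  -Dmapred.reduce.tasks=0 \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -mapper /bin/cat \
  -input /tmp/filedir/somefile.txt \
  -output /tmp/filedir-gzipped/somefile

With the default TextInputFormat, each block of the input becomes its own map task, which is where the multiple gzipped parts came from.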

Option 3 was the winner: it guaranteed that the files would be compressed quickly, in parallel, and would come back out of the mapreduce job as one file per input. That's exactly what I needed.

To do this, I needed to do several things.

First, create the input file list. The map script below expects one HDFS path per line, so trim the ls output down to just the path column:

$ hadoop fs -ls /tmp/filedir/*.txt | awk '{print $NF}' > gzipped.txt
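
Since -input refers to a path in HDFS, the list itself also has to be uploaded before the job can read it. Assuming the job is submitted from my HDFS home directory, that's just:

$ hadoop fs -copyFromLocal gzipped.txt .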

Next, I create a simple shell script for the hadoop streaming job to invoke as the map task. The script looks like this:

#!/bin/sh -e
# -e on the shebang makes the task fail if any command fails.
# -xv traces each command into the mapreduce task logs.
set -xv
while read dummy filename ; do
        echo "Reading $filename"
        # Pull the file out of HDFS into the task's local working directory.
        hadoop fs -copyToLocal "$filename" .
        # We were handed a full HDFS path; gzip works on the local basename.
        base=`basename "$filename"`
        gzip "$base"
        # Push the compressed copy back into HDFS.
        hadoop fs -copyFromLocal "${base}.gz" /tmp/jobvis/"${base}.gz"
done

A short breakdown of what this is doing:

The hadoop streaming job feeds the map task its input on stdin, one record per line. Each line has two columns: the key and the filename. We don't care about the key, so we just ignore it. Next, we copy the file out of HDFS into the task's current working directory, which lives on the datanode's local mapreduce scratch disk; in this case it ended up somewhere under /hdfs/mapred/local/ttprivate/taskTracker. Since we're handed a full file path, we take the basename so gzip can operate on the local copy in that temporary directory. Once gzip completes, we upload the compressed file back into HDFS.

This particular cluster runs simple authentication, so the jobs actually run as the mapred user. Because of this, the files written into the local datanode temporary directory are owned by mapred, and the HDFS directory we upload the results into has to be writable by the mapred user as well. That's why the output directory gets created with wide-open permissions below.

Note: it's probably best to run the shell script with -e so that if any operation fails, the task fails rather than silently continuing. Using set -xv also gives much more useful output in the mapreduce task logs, so you can see what the script is doing during the run.

Next, we create the output directory on HDFS.

$ hadoop fs -mkdir /tmp/jobvis
$ hadoop fs -chmod 777 /tmp/jobvis
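
With the output directory in place, the script can be smoke-tested by hand before handing it to the streaming job. Feed it one line shaped like the streaming input, a key, whitespace, then an HDFS path (the path here is just an example):

$ chmod +x gzipit.sh
$ printf '0\t/tmp/filedir/somefile.txt\n' | ./gzipit.sh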

Once that’s done, we want to run the hadoop streaming job. I ran it like this. There’s a lot of output here. I include it only for reference.

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
  -Dmapred.reduce.tasks=0 \
  -mapper gzipit.sh \
  -input ./gzipped.txt \
  -output /user/hcoyote/gzipped.log \
  -verbose \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -file gzipit.sh

Important to note on this command line: org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It tells the job to feed each map task exactly one line of the input list, which means one file per map task. This lets gzipit.sh run once per file and gain the parallelism of all the available map slots on the cluster. One thing I should have done was turn off speculative execution: since each task writes to a specific output file, I saw some tasks fail because a speculative duplicate had already produced the output.
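
Disabling map-side speculative execution is just one more -D property on the same command line. A sketch of what I'd run next time, otherwise identical to the invocation above:

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
  -Dmapred.reduce.tasks=0 \
  -Dmapred.map.tasks.speculative.execution=false \
  -mapper gzipit.sh \
  -input ./gzipped.txt \
  -output /user/hcoyote/gzipped.log \
  -verbose \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -file gzipit.sh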

The run itself then looks like this:

STREAM: addTaskEnvironment=
STREAM: shippedCanonFiles_=[/home/hcoyote/gzipit.sh]
STREAM: shipped: true /home/hcoyote/gzipit.sh
STREAM: cmd=gzipit.sh
STREAM: cmd=null
STREAM: cmd=null
STREAM: Found runtime classes in: /tmp/hadoop-hcoyote/hadoop-unjar3549927095616185719/
packageJobJar: [gzipit.sh, /tmp/hadoop-hcoyote/hadoop-unjar3549927095616185719/] [] /tmp/streamjob6232490588444082861.jar tmpDir=null
JarBuilder.addNamedStream gzipit.sh
JarBuilder.addNamedStream org/apache/hadoop/streaming/DumpTypedBytes.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/UTF8ByteArrayUtils.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamKeyValUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeReducer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamBaseRecordReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBuilder.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/IdentifierResolver.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesOutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesInputWriter.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MROutputThread.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/AutoInputFormat.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$TaskId.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeCombiner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRunner.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$StreamConsumer.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamXmlRecordReader.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/HadoopStreaming.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MRErrorThread.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapper.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamInputFormat.class
JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritable.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class
JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class
JarBuilder.addNamedStream META-INF/MANIFEST.MF
STREAM: ==== JobConf properties:
STREAM: dfs.access.time.precision=3600000
STREAM: dfs.balance.bandwidthPerSec=943718400
STREAM: dfs.block.access.key.update.interval=600
STREAM: dfs.block.access.token.enable=false
STREAM: dfs.block.access.token.lifetime=600
STREAM: dfs.block.size=67108864
STREAM: dfs.blockreport.initialDelay=0
STREAM: dfs.blockreport.intervalMsec=3600000
STREAM: dfs.client.block.write.retries=3
STREAM: dfs.data.dir=${hadoop.tmp.dir}/dfs/data
STREAM: dfs.datanode.address=0.0.0.0:50010
STREAM: dfs.datanode.data.dir.perm=700
STREAM: dfs.datanode.directoryscan.threads=1
STREAM: dfs.datanode.dns.interface=default
STREAM: dfs.datanode.dns.nameserver=default
STREAM: dfs.datanode.du.reserved=53687091200
STREAM: dfs.datanode.failed.volumes.tolerated=0
STREAM: dfs.datanode.handler.count=3
STREAM: dfs.datanode.http.address=0.0.0.0:50075
STREAM: dfs.datanode.https.address=0.0.0.0:50475
STREAM: dfs.datanode.ipc.address=0.0.0.0:50020
STREAM: dfs.datanode.max.xcievers=4096
STREAM: dfs.datanode.plugins=org.apache.hadoop.thriftfs.DatanodePlugin
STREAM: dfs.default.chunk.view.size=32768
STREAM: dfs.df.interval=60000
STREAM: dfs.heartbeat.interval=3
STREAM: dfs.hosts=/etc/hadoop/conf/hosts.include
STREAM: dfs.hosts.exclude=/etc/hadoop/conf/hosts.exclude
STREAM: dfs.http.address=0.0.0.0:50070
STREAM: dfs.https.address=0.0.0.0:50470
STREAM: dfs.https.client.keystore.resource=ssl-client.xml
STREAM: dfs.https.enable=false
STREAM: dfs.https.need.client.auth=false
STREAM: dfs.https.server.keystore.resource=ssl-server.xml
STREAM: dfs.max-repl-streams=16
STREAM: dfs.max.objects=0
STREAM: dfs.name.dir=/hdfs/01/name,/hdfs/02/name,/mnt/remote_namenode_failsafe/name
STREAM: dfs.name.edits.dir=${dfs.name.dir}
STREAM: dfs.namenode.decommission.interval=30
STREAM: dfs.namenode.decommission.nodes.per.interval=5
STREAM: dfs.namenode.delegation.key.update-interval=86400000
STREAM: dfs.namenode.delegation.token.max-lifetime=604800000
STREAM: dfs.namenode.delegation.token.renew-interval=86400000
STREAM: dfs.namenode.handler.count=10
STREAM: dfs.namenode.logging.level=info
STREAM: dfs.namenode.plugins=org.apache.hadoop.thriftfs.NamenodePlugin
STREAM: dfs.permissions=true
STREAM: dfs.permissions.supergroup=supergroup
STREAM: dfs.replication=3
STREAM: dfs.replication.considerLoad=true
STREAM: dfs.replication.interval=3
STREAM: dfs.replication.max=512
STREAM: dfs.replication.min=1
STREAM: dfs.safemode.extension=30000
STREAM: dfs.safemode.min.datanodes=0
STREAM: dfs.safemode.threshold.pct=0.999f
STREAM: dfs.secondary.http.address=0.0.0.0:50090
STREAM: dfs.support.append=true
STREAM: dfs.thrift.address=0.0.0.0:10090
STREAM: dfs.web.ugi=webuser,webgroup
STREAM: fs.automatic.close=true
STREAM: fs.checkpoint.dir=/hdfs/01/checkpoint,/hdfs/02/checkpoint
STREAM: fs.checkpoint.edits.dir=${fs.checkpoint.dir}
STREAM: fs.checkpoint.period=3600
STREAM: fs.checkpoint.size=67108864
STREAM: fs.default.name=hdfs://namenode.example.net:9000/
STREAM: fs.file.impl=org.apache.hadoop.fs.LocalFileSystem
STREAM: fs.ftp.impl=org.apache.hadoop.fs.ftp.FTPFileSystem
STREAM: fs.har.impl=org.apache.hadoop.fs.HarFileSystem
STREAM: fs.har.impl.disable.cache=true
STREAM: fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
STREAM: fs.hftp.impl=org.apache.hadoop.hdfs.HftpFileSystem
STREAM: fs.hsftp.impl=org.apache.hadoop.hdfs.HsftpFileSystem
STREAM: fs.inmemory.size.mb=192
STREAM: fs.kfs.impl=org.apache.hadoop.fs.kfs.KosmosFileSystem
STREAM: fs.ramfs.impl=org.apache.hadoop.fs.InMemoryFileSystem
STREAM: fs.s3.block.size=67108864
STREAM: fs.s3.buffer.dir=${hadoop.tmp.dir}/s3
STREAM: fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem
STREAM: fs.s3.maxRetries=4
STREAM: fs.s3.sleepTimeSeconds=10
STREAM: fs.s3n.block.size=67108864
STREAM: fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
STREAM: fs.trash.interval=0
STREAM: hadoop.http.authentication.kerberos.keytab=${user.home}/hadoop.keytab
STREAM: hadoop.http.authentication.kerberos.principal=HTTP/localhost@LOCALHOST
STREAM: hadoop.http.authentication.signature.secret.file=${user.home}/hadoop-http-auth-signature-secret
STREAM: hadoop.http.authentication.simple.anonymous.allowed=true
STREAM: hadoop.http.authentication.token.validity=36000
STREAM: hadoop.http.authentication.type=simple
STREAM: hadoop.kerberos.kinit.command=kinit
STREAM: hadoop.logfile.count=10
STREAM: hadoop.logfile.size=10000000
STREAM: hadoop.native.lib=true
STREAM: hadoop.permitted.revisions=03b655719d13929bd68bb2c2f9cee615b389cea9,217a3767c48ad11d4632e19a22897677268c40c4
STREAM: hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.StandardSocketFactory
STREAM: hadoop.security.authentication=simple
STREAM: hadoop.security.authorization=false
STREAM: hadoop.security.group.mapping=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
STREAM: hadoop.security.uid.cache.secs=14400
STREAM: hadoop.tmp.dir=/tmp/hadoop-${user.name}
STREAM: hadoop.util.hash.type=murmur
STREAM: hadoop.workaround.non.threadsafe.getpwuid=false
STREAM: io.bytes.per.checksum=512
STREAM: io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec
STREAM: io.file.buffer.size=131702
STREAM: io.map.index.skip=0
STREAM: io.mapfile.bloom.error.rate=0.005
STREAM: io.mapfile.bloom.size=1048576
STREAM: io.seqfile.compress.blocksize=1000000
STREAM: io.seqfile.lazydecompress=true
STREAM: io.seqfile.sorter.recordlimit=1000000
STREAM: io.serializations=org.apache.hadoop.io.serializer.WritableSerialization
STREAM: io.skip.checksum.errors=false
STREAM: io.sort.factor=10
STREAM: io.sort.mb=100
STREAM: io.sort.record.percent=0.05
STREAM: io.sort.spill.percent=0.80
STREAM: ipc.client.connect.max.retries=10
STREAM: ipc.client.connection.maxidletime=10000
STREAM: ipc.client.idlethreshold=4000
STREAM: ipc.client.kill.max=10
STREAM: ipc.client.tcpnodelay=false
STREAM: ipc.server.listen.queue.size=128
STREAM: ipc.server.tcpnodelay=false
STREAM: job.end.retry.attempts=0
STREAM: job.end.retry.interval=30000
STREAM: jobclient.completion.poll.interval=5000
STREAM: jobclient.output.filter=FAILED
STREAM: jobclient.progress.monitor.poll.interval=1000
STREAM: jobtracker.thrift.address=0.0.0.0:9290
STREAM: keep.failed.task.files=false
STREAM: local.cache.size=10737418240
STREAM: map.sort.class=org.apache.hadoop.util.QuickSort
STREAM: mapred.acls.enabled=false
STREAM: mapred.child.java.opts=-Xmx512m -Xms512m
STREAM: mapred.child.tmp=./tmp
STREAM: mapred.cluster.map.memory.mb=-1
STREAM: mapred.cluster.max.map.memory.mb=-1
STREAM: mapred.cluster.max.reduce.memory.mb=-1
STREAM: mapred.cluster.reduce.memory.mb=-1
STREAM: mapred.compress.map.output=false
STREAM: mapred.create.symlink=yes
STREAM: mapred.disk.healthChecker.interval=60000
STREAM: mapred.fairscheduler.preemption=true
STREAM: mapred.healthChecker.interval=60000
STREAM: mapred.healthChecker.script.timeout=600000
STREAM: mapred.heartbeats.in.second=100
STREAM: mapred.hosts=/etc/hadoop/conf/hosts.include
STREAM: mapred.hosts.exclude=/etc/hadoop/conf/hosts.exclude
STREAM: mapred.inmem.merge.threshold=1000
STREAM: mapred.input.dir=hdfs://namenode.example.net:9000/user/hcoyote/gzipped.txt
STREAM: mapred.input.format.class=org.apache.hadoop.mapred.lib.NLineInputFormat
STREAM: mapred.jar=/tmp/streamjob6232490588444082861.jar
STREAM: mapred.job.map.memory.mb=-1
STREAM: mapred.job.queue.name=default
STREAM: mapred.job.reduce.input.buffer.percent=0.0
STREAM: mapred.job.reduce.memory.mb=-1
STREAM: mapred.job.reuse.jvm.num.tasks=1
STREAM: mapred.job.shuffle.input.buffer.percent=0.70
STREAM: mapred.job.shuffle.merge.percent=0.66
STREAM: mapred.job.tracker=namenode.example.net:54311
STREAM: mapred.job.tracker.handler.count=10
STREAM: mapred.job.tracker.http.address=0.0.0.0:50030
STREAM: mapred.job.tracker.jobhistory.lru.cache.size=5
STREAM: mapred.job.tracker.persist.jobstatus.active=false
STREAM: mapred.job.tracker.persist.jobstatus.dir=/jobtracker/jobsInfo
STREAM: mapred.job.tracker.persist.jobstatus.hours=0
STREAM: mapred.job.tracker.retiredjobs.cache.size=1000
STREAM: mapred.jobtracker.completeuserjobs.maximum=20
STREAM: mapred.jobtracker.instrumentation=org.apache.hadoop.mapred.JobTrackerMetricsInst
STREAM: mapred.jobtracker.job.history.block.size=3145728
STREAM: mapred.jobtracker.maxtasks.per.job=-1
STREAM: mapred.jobtracker.plugins=org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin
STREAM: mapred.jobtracker.restart.recover=false
STREAM: mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler
STREAM: mapred.line.input.format.linespermap=1
STREAM: mapred.local.dir=${hadoop.tmp.dir}/mapred/local
STREAM: mapred.local.dir.minspacekill=0
STREAM: mapred.local.dir.minspacestart=0
STREAM: mapred.map.max.attempts=4
STREAM: mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
STREAM: mapred.map.runner.class=org.apache.hadoop.streaming.PipeMapRunner
STREAM: mapred.map.tasks=2
STREAM: mapred.map.tasks.speculative.execution=true
STREAM: mapred.mapoutput.key.class=org.apache.hadoop.io.Text
STREAM: mapred.mapoutput.value.class=org.apache.hadoop.io.Text
STREAM: mapred.mapper.class=org.apache.hadoop.streaming.PipeMapper
STREAM: mapred.max.tracker.blacklists=4
STREAM: mapred.max.tracker.failures=4
STREAM: mapred.merge.recordsBeforeProgress=10000
STREAM: mapred.min.split.size=0
STREAM: mapred.output.compress=false
STREAM: mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
STREAM: mapred.output.compression.type=RECORD
STREAM: mapred.output.dir=hdfs://namenode.example.net:9000/user/hcoyote/gzipped.log
STREAM: mapred.output.format.class=org.apache.hadoop.mapred.TextOutputFormat
STREAM: mapred.output.key.class=org.apache.hadoop.io.Text
STREAM: mapred.output.value.class=org.apache.hadoop.io.Text
STREAM: mapred.queue.default.state=RUNNING
STREAM: mapred.queue.names=default
STREAM: mapred.reduce.max.attempts=4
STREAM: mapred.reduce.parallel.copies=33
STREAM: mapred.reduce.slowstart.completed.maps=0.75
STREAM: mapred.reduce.tasks=0
STREAM: mapred.reduce.tasks.speculative.execution=false
STREAM: mapred.skip.attempts.to.start.skipping=2
STREAM: mapred.skip.map.auto.incr.proc.count=true
STREAM: mapred.skip.map.max.skip.records=0
STREAM: mapred.skip.reduce.auto.incr.proc.count=true
STREAM: mapred.skip.reduce.max.skip.groups=0
STREAM: mapred.submit.replication=10
STREAM: mapred.system.dir=/mapred/system
STREAM: mapred.task.cache.levels=2
STREAM: mapred.task.profile=false
STREAM: mapred.task.profile.maps=0-2
STREAM: mapred.task.profile.reduces=0-2
STREAM: mapred.task.timeout=600000
STREAM: mapred.task.tracker.http.address=0.0.0.0:50060
STREAM: mapred.task.tracker.report.address=127.0.0.1:0
STREAM: mapred.task.tracker.task-controller=org.apache.hadoop.mapred.DefaultTaskController
STREAM: mapred.tasktracker.dns.interface=default
STREAM: mapred.tasktracker.dns.nameserver=default
STREAM: mapred.tasktracker.expiry.interval=600000
STREAM: mapred.tasktracker.indexcache.mb=10
STREAM: mapred.tasktracker.instrumentation=org.apache.hadoop.mapred.TaskTrackerMetricsInst
STREAM: mapred.tasktracker.map.tasks.maximum=8
STREAM: mapred.tasktracker.reduce.tasks.maximum=8
STREAM: mapred.tasktracker.taskmemorymanager.monitoring-interval=5000
STREAM: mapred.tasktracker.tasks.sleeptime-before-sigkill=5000
STREAM: mapred.temp.dir=${hadoop.tmp.dir}/mapred/temp
STREAM: mapred.used.genericoptionsparser=true
STREAM: mapred.user.jobconf.limit=5242880
STREAM: mapred.userlog.limit.kb=128
STREAM: mapred.userlog.retain.hours=24
STREAM: mapred.working.dir=hdfs://namenode.example.net:9000/user/hcoyote
STREAM: mapreduce.job.acl-modify-job=
STREAM: mapreduce.job.acl-view-job=
STREAM: mapreduce.job.complete.cancel.delegation.tokens=true
STREAM: mapreduce.job.counters.limit=120
STREAM: mapreduce.job.jar.unpack.pattern=(?:classes/|lib/).*|(?:\Qgzipit.sh\E)
STREAM: mapreduce.jobtracker.split.metainfo.maxsize=10000000
STREAM: mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging
STREAM: mapreduce.reduce.input.limit=-1
STREAM: mapreduce.reduce.shuffle.connect.timeout=180000
STREAM: mapreduce.reduce.shuffle.maxfetchfailures=10
STREAM: mapreduce.reduce.shuffle.read.timeout=180000
STREAM: mapreduce.tasktracker.cache.local.numberdirectories=10000
STREAM: mapreduce.tasktracker.outofband.heartbeat=false
STREAM: stream.addenvironment=
STREAM: stream.map.input.writer.class=org.apache.hadoop.streaming.io.TextInputWriter
STREAM: stream.map.output.reader.class=org.apache.hadoop.streaming.io.TextOutputReader
STREAM: stream.map.streamprocessor=gzipit.sh
STREAM: stream.numinputspecs=1
STREAM: stream.reduce.input.writer.class=org.apache.hadoop.streaming.io.TextInputWriter
STREAM: stream.reduce.output.reader.class=org.apache.hadoop.streaming.io.TextOutputReader
STREAM: tasktracker.http.threads=40
STREAM: topology.node.switch.mapping.impl=org.apache.hadoop.net.ScriptBasedMapping
STREAM: topology.script.file.name=/etc/hadoop/conf/rack-topology.sh
STREAM: topology.script.number.args=100
STREAM: webinterface.private.actions=false
STREAM: ====
STREAM: submitting to jobconf: namenode.example.net:54311
13/09/24 17:10:27 INFO mapred.FileInputFormat: Total input paths to process : 1
13/09/24 17:10:27 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hcoyote/mapred/local]
13/09/24 17:10:27 INFO streaming.StreamJob: Running job: job_201307061907_108574
13/09/24 17:10:27 INFO streaming.StreamJob: To kill this job, run:
13/09/24 17:10:27 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=namenode.example.net:54311 -kill job_201307061907_108574
13/09/24 17:10:27 INFO streaming.StreamJob: Tracking URL: http://namenode.example.net:50030/jobdetails.jsp?jobid=job_201307061907_108574
13/09/24 17:10:28 INFO streaming.StreamJob:  map 0%  reduce 0%
13/09/24 17:10:41 INFO streaming.StreamJob:  map 4%  reduce 0%
13/09/24 17:10:42 INFO streaming.StreamJob:  map 37%  reduce 0%
13/09/24 17:10:43 INFO streaming.StreamJob:  map 55%  reduce 0%
13/09/24 17:10:44 INFO streaming.StreamJob:  map 61%  reduce 0%
13/09/24 17:10:45 INFO streaming.StreamJob:  map 66%  reduce 0%
13/09/24 17:10:46 INFO streaming.StreamJob:  map 70%  reduce 0%
13/09/24 17:10:47 INFO streaming.StreamJob:  map 72%  reduce 0%
13/09/24 17:10:48 INFO streaming.StreamJob:  map 73%  reduce 0%
13/09/24 17:10:49 INFO streaming.StreamJob:  map 74%  reduce 0%
13/09/24 17:11:18 INFO streaming.StreamJob:  map 75%  reduce 0%
13/09/24 17:11:27 INFO streaming.StreamJob:  map 76%  reduce 0%
13/09/24 17:11:29 INFO streaming.StreamJob:  map 77%  reduce 0%
13/09/24 17:11:32 INFO streaming.StreamJob:  map 78%  reduce 0%
13/09/24 17:11:34 INFO streaming.StreamJob:  map 79%  reduce 0%
13/09/24 17:11:37 INFO streaming.StreamJob:  map 80%  reduce 0%
13/09/24 17:11:38 INFO streaming.StreamJob:  map 82%  reduce 0%
13/09/24 17:11:40 INFO streaming.StreamJob:  map 83%  reduce 0%
13/09/24 17:11:41 INFO streaming.StreamJob:  map 84%  reduce 0%
13/09/24 17:11:42 INFO streaming.StreamJob:  map 86%  reduce 0%
13/09/24 17:11:43 INFO streaming.StreamJob:  map 88%  reduce 0%
13/09/24 17:11:44 INFO streaming.StreamJob:  map 89%  reduce 0%
13/09/24 17:11:45 INFO streaming.StreamJob:  map 90%  reduce 0%
13/09/24 17:11:46 INFO streaming.StreamJob:  map 91%  reduce 0%
13/09/24 17:11:47 INFO streaming.StreamJob:  map 92%  reduce 0%
13/09/24 17:11:48 INFO streaming.StreamJob:  map 94%  reduce 0%
13/09/24 17:11:49 INFO streaming.StreamJob:  map 95%  reduce 0%
13/09/24 17:11:50 INFO streaming.StreamJob:  map 96%  reduce 0%
13/09/24 17:11:51 INFO streaming.StreamJob:  map 97%  reduce 0%
13/09/24 17:11:52 INFO streaming.StreamJob:  map 99%  reduce 0%
13/09/24 17:11:53 INFO streaming.StreamJob:  map 100%  reduce 0%
13/09/24 17:14:14 INFO streaming.StreamJob:  map 100%  reduce 100%
13/09/24 17:14:14 INFO streaming.StreamJob: Job complete: job_201307061907_108574
13/09/24 17:14:14 INFO streaming.StreamJob: Output: /user/hcoyote/gzipped.log
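
Once the job completes, the compressed files are all sitting in /tmp/jobvis. A quick sanity check, assuming the same paths as above, is to compare the number of .gz files against the number of entries in the input list; the two counts should match:

$ hadoop fs -ls /tmp/jobvis | grep -c '\.gz$'
$ wc -l < gzipped.txt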