Using cobbler with a fast file system creation snippet for Kickstart %post install of Hadoop nodes

I run Hadoop servers with 12 2TB hard drives in them. One of the bottlenecks with this occurs during kickstart when we’re using anaconda to create the filesystems. Previously, I just had a specific partition configuration that was brought in during %pre, but this caused the filesystem formatting section of kickstart to take several hours to complete. With some additional changes that required us to begin hand-signing puppet certificates that were created during %post, this entire process became too unwieldy. I got tired of having to wait hours for systems to get to the %post install, just so I could turn around and sign things.

Instead, what I did was to move the partitioning of the filesystems into a %post snippet for cobbler and tweak a few filesystem creation options (which you can’t do with partition in the ks.cfg). This brought creation time down to about 10 minutes from several hours.

The snippet is below.

I’d like to point a few things out with our configuration.

  • We have 12 2TB drives, as I stated above. Each drive is dedicated to HDFS. Each drive is mounted to /hdfs/XX where XX is 01 through 12.
  • All HDFS drives are given an e2label of the form /hdfs/XX. We use this as a signal within a puppet fact to do configuration of the data node. If we lose a drive, we can just re-run puppet and have the configuration fixed temporarily while we wait for a replacement.
  • We’re going to use ext4 with these systems since we’re using a sufficiently new kernel in the 2.6 line. We chose this based on various recommendations now coming out of places like Cloudera, AMD, and Intel.
  • We disable atime and diratime on all HDFS drives to prevent useless writes that occur with atime updates.
  • We use an FS profile called “largefile” during mkfs.ext4 creation that reduces the number of inodes created on the filesystem. Since we’re generally dealing with large files, this is acceptable to us.
  • We use several ext4 features:
    • dir_index – speeds up directory lookups in large directories
    • extent – use extent-based block allocation which has a benefit for large files, allowing them to be laid out more contiguously on disk
    • sparse_super – create fewer superblock backups which aren’t needed on large filesystems.
  • We have a RAID1 OS drive on /dev/sda. This was done because we wanted to dedicate as much space as possible to HDFS and keep the first data drive from taking an I/O hit due to logging or other non-HDFS activities. The RAID1 volume is presented to the OS with a model of “Virtual disk”, which allows us to detect it and skip operating on it.

Finally, because Cobbler uses Cheetah as its backend templating system, I want to point out that there are some additional escaped dollar signs ($) in the snippet to prevent cobbler from choking on them. Without those escapes, you could use this straight in a shell script.

DIR="/sys/block"

MINSIZE=1000


# list-harddrives doesn't exist in the chroot post install environment. bummer.
for DEV in `cd $DIR ; ls -d sd*`; do
    if [ -d $DIR/$DEV ] ; then
        REMOVABLE=`cat $DIR/$DEV/removable`
        if (( $REMOVABLE == 0 )) ; then
            MODEL=`cat $DIR/$DEV/device/model`
            if [[ "$MODEL" =~ ^Virtual.* ]] ; then
                echo "Found a virtual disk on /dev/$DEV, skipping"
            else
                echo "Found $DEV"
                SIZE=`cat $DIR/$DEV/size`
                GB=$(($SIZE/2**21))
                if [ $GB -gt $MINSIZE ] ; then
                    # we are a non-root drive
                    echo "Found a rightsize drive on $DEV for hadoop"

                    for partition in `parted -s /dev/$DEV print | awk '/^ / {print $1}'`; do
                        # parted's rm takes the partition number, not a device path
                        parted -s /dev/$DEV rm ${partition}
                    done

                    parted -s /dev/$DEV mklabel gpt
                    parted -s /dev/$DEV mkpart -- primary ext4 1 -1
                    partprobe

                    

                    # we are going to map /dev/sdX to /hdfs/YY with this
                    HDFS_PART_ASCII=`echo $DEV | sed -e 's/sd//' | od -N 1 -i | head -1 | tr -s " " | cut -d" " -f 2`
                    HDFS_PART_NUMBER=\$(($HDFS_PART_ASCII - 97))
                    HDFS_LABEL=\$(printf "/hdfs/%02g" $HDFS_PART_NUMBER)


                    mkfs.ext4 -T largefile -m 1 -O dir_index,extent,sparse_super -L $HDFS_LABEL /dev/${DEV}1 

                    eval `blkid -o export /dev/${DEV}1`

                    if [ -n "${LABEL}" ] ; then
                        echo "Creating $LABEL mountpoint"
                        mkdir -p "${LABEL}"
                        echo "Adding $LABEL to /etc/fstab"
                        echo "LABEL=$LABEL $LABEL ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
                        tail -1 /etc/fstab
                    fi 
                fi
            fi
        fi
    fi
done
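
The two bits of arithmetic in the snippet (sector count to gigabytes, and drive letter to /hdfs/XX label) can be tried on their own. This is only a sketch written as plain shell, outside of Cheetah, so nothing is escaped. The function names are made up for illustration, and 3907029168 is an assumed sector count for a 2TB drive, not a value taken from our hardware.

```shell
# Hypothetical helpers mirroring the arithmetic in the snippet above.

# /sys/block/<dev>/size is in 512-byte sectors; 2097152 (2**21) sectors = 1 GiB.
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

# Map sdb -> /hdfs/01, sdc -> /hdfs/02, ... using the ASCII code of the
# first drive letter ('a' is 97). Like the original, only one letter is read.
hdfs_label_for() {
    letter=$(printf '%s' "${1#sd}" | cut -c1)
    ascii=$(printf '%d' "'$letter")
    printf '/hdfs/%02d\n' $(( ascii - 97 ))
}

sectors_to_gib 3907029168   # assumed sector count for a 2TB drive
hdfs_label_for sdb
hdfs_label_for sdm
```

With sda reserved for the OS, the twelve data drives sdb through sdm land on /hdfs/01 through /hdfs/12. A box with more than 26 drives (sdaa and beyond) would need a different mapping, since only the first letter is considered.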

Aw. It’s a wedding.


Seeing RabbitMQ memory usage

While working in our RabbitMQ environment this week, we noticed that there was a large, unexplained amount of memory in use by RabbitMQ that we couldn’t account for by normal queue and message use. One of the first tools we use when poking around Erlang and RabbitMQ is to do a memory dump.


$ sudo rabbitmqctl eval \
  'lists:sublist(lists:reverse( 
     lists:sort([{process_info(Pid, memory), Pid, 
     process_info(Pid)} || Pid <- processes()])), 1).'

This expression is evaluated directly inside the running Erlang VM behind our RabbitMQ and sorts its processes by memory use, largest first. In the above case, we’re telling Erlang to show us the single top user of memory. You can change the 1 to any positive number to get that many of the top processes back.

This produces something like the following:


[{{memory,12344224},
  <4865.28344.41>,
  [{current_function,{gen_server2,process_next_msg,1}},
   {initial_call,{proc_lib,init_p,5}},
   {status,waiting},
   {message_queue_len,0},
   {messages,[]},
   {links,[<4865.28338.41>,<4865.28353.41>,#Port<4865.433154>]},
   {dictionary,[{{credit_to,<4865.5033.42>},28},
                {{#Ref<4865.0.1377.83012>,fhc_handle},
                 {handle,{file_descriptor,prim_file,{#Port<4865.433154>,20}},
                         335872,false,0,infinity,[],true,
                         "/var/lib/rabbitmq/mnesia/rabbit@rabbitmqhost/queues/14AI6IEJDH4TZNL8R69HT5Y6K/journal.jif",
                         [write,binary,raw,read],
                         [{write_buffer,infinity}],
                         true,true,
                         {1351,12777,400557}}},
                {{ch,<4865.28440.41>},
                 {cr,<4865.28440.41>,#Ref<4865.0.1377.84010>,
                     {set,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                     1,
                     {[],[]},
                     {token,<4865.28439.41>,true},
                     false,19}},
                {{credit_to,<4865.4869.42>},15},
                {'$ancestors',[rabbit_amqqueue_sup,rabbit_sup,
                               <4865.28225.41>]},
                {{credit_to,<4865.4228.42>},49},
                {fhc_age_tree,{1,
                               {{1351,12777,400557},
                                #Ref<4865.0.1377.83012>,nil,nil}}},
                {{credit_to,<4865.28474.41>},35},
                {{credit_to,<4865.5066.42>},11},
                {credit_blocked,[]},
                {{credit_to,<4865.28507.41>},3},
                {{credit_to,<4865.28490.41>},33},
                {{"/var/lib/rabbitmq/mnesia/rabbit@rabbitmqhost/queues/14AI6IEJDH4TZNL8R69HT5Y6K/journal.jif",
                  fhc_file},
                 {file,1,true}},
                {{credit_to,<4865.28498.41>},14},
                {{credit_to,<4865.28482.41>},31},
                {guid,{{4270534438,2738637279,2258006136,1881461985},0}},
                {'$initial_call',{gen,init_it,6}}]},
   {trap_exit,true},
   {error_handler,error_handler},
   {priority,normal},
   {group_leader,<4865.28224.41>},
   {total_heap_size,1542687},
   {heap_size,196418},
   {stack_size,7},
   {reductions,332233374},
   {garbage_collection,[{min_bin_vheap_size,46368},
                        {min_heap_size,233},
                        {fullsweep_after,65535},
                        {minor_gcs,20706}]},
   {suspending,[]}]}]
...done.

I don’t know what all of the bits mean yet, but we can point out a few useful things to look at. A good reference on what is being returned can be found in the Erlang documentation for process_info/2.

  • memory – the current amount of memory, in bytes, in use by this erlang process. This directly affects the size of the RabbitMQ process in the OS.
  • <4865.28344.41> – the erlang pid inside the erlang kernel
  • current_function – the function call currently running in the process
  • registered_name – (Note: not seen above.) This is the name of the process associated with this memory. If the process isn’t named, this won’t show up in the output.
  • status – What the process is currently doing. It can be something like exiting, garbage_collecting, waiting, running, runnable, or suspended.
  • messages – messages associated with this process.
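  • dictionary – the process dictionary: key/value pairs the process has stored for itself. In the output above it holds, among other things, the file handle for the queue’s journal.jif and a number of credit_to entries.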

There’s more than this, so you should definitely look at the erlang docs to figure out what’s going on here, but this should get you started in understanding what memory usage is for your RabbitMQ environment.
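
Since the memory figure is in bytes, it helps to convert it to something more readable when comparing against what the OS reports. The following is a small sketch, not part of RabbitMQ: memory_in_mib is a made-up helper name, and it simply scans text on stdin for {memory,N} tuples like the one in the output above.

```shell
# Read `rabbitmqctl eval` output on stdin and print each {memory,N} value in MiB.
memory_in_mib() {
    sed -n 's/.*{memory,\([0-9][0-9]*\)}.*/\1/p' |
        awk '{ printf "%.1f MiB\n", $1 / 1048576 }'
}

# Demo using the first two lines of the output shown above.
printf '[{{memory,12344224},\n  <4865.28344.41>,\n' | memory_in_mib
```

In practice you would pipe the rabbitmqctl eval command into it. As a cross-check, total_heap_size is reported in words rather than bytes; assuming a 64-bit Erlang VM (8-byte words), 1542687 words is 12341496 bytes, which accounts for nearly all of the 12344224 bytes reported for this process.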

Hadoop DataNode logs filling with clienttrace messages

So, you’re probably like me.

You have a shiny, new Cloudera Hadoop cluster. Everything is zooming along smoothly. Until you find that your /var/log/hadoop datanode logs are growing at a rate of a bazillion gigabytes per day. What do you do, hot shot? WHAT DO YOU DO?

Actually, it’s pretty simple.

We were getting alerts on our cluster that /var was filling across various datanodes (where we also happen to have HBase running). We were doing about four gigabytes a day just in the datanode logs alone. That seemed excessive. Peering through the logs, we found that a large percentage of entries looked something like this:

2012-03-26 23:59:57,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.114:44972, bytes: 20548, op: HDFS_READ, cliID: DFSClient_1020093912, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_1530028621327446680_9137244, duration: 309000
2012-03-26 23:59:57,415 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.108:40058, bytes: 27507, op: HDFS_READ, cliID: DFSClient_345923441, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_4340309148758316988_10579186, duration: 909000
2012-03-26 23:59:57,415 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.114:44973, bytes: 23734, op: HDFS_READ, cliID: DFSClient_1020093912, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_4111106876676617467_8649685, duration: 568000
2012-03-26 23:59:57,423 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.119:45662, bytes: 20840, op: HDFS_READ, cliID: DFSClient_-635119343, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_4655242753989962972_6418861, duration: 310000
2012-03-26 23:59:57,434 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.108:40061, bytes: 17728, op: HDFS_READ, cliID: DFSClient_345923441, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_6737196232027719961_7900492, duration: 1071000
2012-03-26 23:59:57,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.1.134:50010, dest: /10.1.1.105:53569, bytes: 19998, op: HDFS_READ, cliID: DFSClient_-1106211444, offset: 0, srvID: DS-1061365566-10.1.1.134-50010-1331849447734, blockid: blk_801375605494816473_8641680, duration: 856000
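
If you want a number rather than an eyeball estimate of "a large percentage", something like the following works. It is a rough sketch: clienttrace_share is a made-up helper name, and the four-line sample file is fabricated purely for the demo, so point it at one of your own datanode logs instead.

```shell
# Report how many lines of a log file come from the clienttrace logger.
clienttrace_share() {
    awk '/DataNode\.clienttrace/ { hits++ }
         END { if (NR > 0) printf "%d of %d lines (%.0f%%)\n", hits, NR, 100 * hits / NR }' "$1"
}

# Demo on a made-up four-line sample; use a real datanode log in practice.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2012-03-26 23:59:57,411 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: op: HDFS_READ
2012-03-26 23:59:57,415 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: op: HDFS_READ
2012-03-26 23:59:57,423 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: op: HDFS_READ
2012-03-26 23:59:58,002 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: some other message
EOF
clienttrace_share "$sample"
rm -f "$sample"
```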

Doing a handy-dandy Google search, we found a thread discussing the very problem we were seeing.  It looks like this is some sort of performance data that the datanode emits for block reads, and the volume is driven by how the HBase META region interacts with HDFS.

No big deal.

The fix?  We’ll just turn down logging on this specific logger.  We achieve this by doing the following.  This changes the log level at runtime, so we don’t have to restart the datanodes right now.

$ sudo -u hdfs hadoop daemonlog -setlevel localhost:50075 org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace WARN
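
The command above only touches the datanode on localhost, so it has to be repeated per node. One way to do that, sketched here under the assumption that every datanode serves its HTTP interface on port 50075 as ours do, is to generate the command for each host. The function name and the example.com hostnames are placeholders, and the function only prints the commands so you can review them first.

```shell
LOGGER="org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace"

# Print (not run) the daemonlog command for each datanode given as an argument.
daemonlog_commands() {
    for host in "$@"; do
        echo "sudo -u hdfs hadoop daemonlog -setlevel ${host}:50075 ${LOGGER} WARN"
    done
}

daemonlog_commands dn01.example.com dn02.example.com
```

Once the list looks right, pipe the output to sh to actually run it.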

Later, we’ll add the following to /etc/hadoop/conf/log4j.properties on all datanodes and restart them.

log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

Hadoop, facter, and the puppet marionette

I’ve been working with puppet a lot lately.  A lot.  It’s part of my job.  We’ve been setting up a new hadoop cluster in our Xen environment.  Nothing big.  It started out with 4 nodes, all configured the same way (3 drives each).  We added an additional 2 nodes with 2 drives each.  This, of course, broke how we were having to manage the hadoop configuration files.  I spent the last few days figuring out how to create a custom Puppet fact (for our Puppet 2.6.4 installation) that would search for the HDFS drives on our nodes and configure the hdfs-site.xml and mapred-site.xml appropriately.

First, in our hadoop module, I created the modules/hadoop/lib/facter path and added the following facter code to hdfs_path.rb in there.

require 'facter'

paths = []

if File.exist?("/sbin/e2label")

    # candidate devices: first partition of each sd disk, plus any HDFS logical volumes
    drives = Dir.glob("/dev/sd?1")
    drives += Dir.glob("/dev/*vg/hdfs*_lv")

    drives.each do |drive|
        if File.exist?(drive)
            output = %x{/sbin/e2label #{drive} 2>/dev/null}.chomp

            # only run this fact if we find hdfs in the label name.
            if output =~ /hdfs/ and output =~ /_/
                device = output.split("_")[1]
                devicenumber = device.split("hdfs")[1]
                path = "/hdfs/" + devicenumber
                paths.push(path)
                Facter.add("hdfs_#{device}") do
                    setcode do
                        path
                    end
                end
            end
        end
    end

    allpaths = paths.join(",")
    Facter.add("hdfs_all_paths") do
        setcode do
            allpaths
        end
    end
end

This looks for drives on the hadoop node that carry a filesystem label of the form hostname_hdfs#, as read by e2label (ours were ext3 filesystems). For example, a drive labeled vhdn01_hdfs1 produces a fact with the value /hdfs/1. The code creates one such fact per device found, named hdfs_ plus the part of the label after the underscore (hdfs_hdfs1 in this example). It also creates a fact called hdfs_all_paths that contains a comma-separated list of every path we found. These are used in the templates for mapred-site.xml and hdfs-site.xml to set up the list of paths we’re using.

In our template, we created

<property>
 <name>dfs.data.dir</name>
 <value><%= hdfs_all_paths.split(",").map { |n| n + "/hdfs"}.join(",") -%></value>
 <final>true</final>
</property>

which creates a comma separated list of paths and tacks on /hdfs to each item. For example, this takes /hdfs/1,/hdfs/2 and turns it into /hdfs/1/hdfs,/hdfs/2/hdfs.
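
To sanity-check labels and the resulting value from the command line without going through facter, the two transformations can be mimicked in shell. This is a loose sketch rather than a port of the Ruby: both function names are made up, and the label pattern here is slightly stricter than the fact's (it wants the underscore directly before hdfs).

```shell
# Label -> mount path, e.g. vhdn01_hdfs1 -> /hdfs/1 (prints nothing for other labels).
hdfs_path_from_label() {
    case "$1" in
        *_hdfs*) echo "/hdfs/${1##*hdfs}" ;;
    esac
}

# Comma-separated paths -> dfs.data.dir value, appending /hdfs to each entry.
dfs_data_dir() {
    [ -n "$1" ] || return 0
    echo "$1" | sed 's|,|/hdfs,|g; s|$|/hdfs|'
}

hdfs_path_from_label vhdn01_hdfs1
dfs_data_dir "/hdfs/1,/hdfs/2"
```

On a real node you could feed it the output of e2label for a given partition, for example hdfs_path_from_label "$(/sbin/e2label /dev/sdb1)", where /dev/sdb1 is just an example device.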

Some things to note:  I run in an agent/master setup.  In order for the fact to be sent down to the agent, we need to enable pluginsync on both ends.  You can do this by adding the option to your puppet.conf.  It needs to go in the [main] section.  For example:

[main]
pluginsync = true

The other caveat I ran into is that in order for this to work, you need to make sure you have the fact on the client system before you use it in your catalog (such as in your templates), otherwise your catalog might not compile when you go to run your agent.

Once I got all these things figured out, I had a much more easily configured hadoop datanode setup.

Playing with the perl RTM client.

I began playing with a perl-based RememberTheMilk command line tool today from http://www.rutschle.net/rtm/.  There aren’t any RPMs of it that I found, so I ended up building some.  This is what I had to do to get it to the stage of at least working.

:;  sudo cpan2rpm WebService::RTMAgent

-- cpan2rpm - Ver: 2.028 --
Upgrade check
Fetch: HTTP

-- module: WebService::RTMAgent --
Using cached URL: http://search.cpan.org//CPAN/authors/id/R/RU/RUTSCHLE/WebService-RTMAgent-0.5_1.tar.gz
Tarball found - not fetching
Metadata retrieval
Tarball extraction: [/home/tcampbell/src/rpm/SOURCES/WebService-RTMAgent-0.5_1.tar.gz]

Can't locate object method "interpolate" via package "Pod::Text" at /usr/bin/cpan2rpm line 525.
cannot remove path when cwd is /tmp/yFnnvY7DA3/WebService-RTMAgent-0.5_1 for /tmp/yFnnvY7DA3:  at /usr/share/perl5/File/Temp.pm line 902
-- Done --

But, I ran into this dumb perl interpolate bug. Turns out that cpan2rpm has an issue with newer versions of perl where Pod::Text no longer has this method. Changing Pod::Text to Pod::Parser in cpan2rpm allowed this to succeed.

sudo cpan2rpm WebService::RTMAgent  --no-sign

-- cpan2rpm - Ver: 2.028 --
Upgrade check
Fetch: HTTP

-- module: WebService::RTMAgent --
Using cached URL: http://search.cpan.org//CPAN/authors/id/R/RU/RUTSCHLE/WebService-RTMAgent-0.5_1.tar.gz
Tarball found - not fetching
Metadata retrieval
Tarball extraction: [/home/tcampbell/src/rpm/SOURCES/WebService-RTMAgent-0.5_1.tar.gz]
Generating spec file
SPEC: /home/tcampbell/src/rpm/SPECS/WebService-RTMAgent.spec
Generating package
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.xs7L3d
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ cd /home/tcampbell/src/rpm/BUILD
+ rm -rf WebService-RTMAgent-0.5_1
+ /usr/bin/gzip -dc /home/tcampbell/src/rpm/SOURCES/WebService-RTMAgent-0.5_1.tar.gz
+ /bin/tar -xf -
+ STATUS=0
+ '[' 0 -ne 0 ']'
+ cd WebService-RTMAgent-0.5_1
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ chmod -R u+w /home/tcampbell/src/rpm/BUILD/WebService-RTMAgent-0.5_1
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.sKaxgX
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ cd WebService-RTMAgent-0.5_1
+ grep -rsl '^#!.*perl' .
+ grep -v '.bak$'
+ xargs --no-run-if-empty /usr/bin/perl -MExtUtils::MakeMaker -e 'MY->fixin(@ARGV)'
+ CFLAGS='-O2 -g -march=i386 -mtune=i686'
++ /usr/bin/perl -MExtUtils::MakeMaker -e ' print qq|PREFIX=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr| if $ExtUtils::MakeMaker::VERSION =~ /5\.9[1-6]|6\.0[0-5]/ '
+ /usr/bin/perl Makefile.PL
Checking if your kit is complete...
Looks good
WARNING: Setting ABSTRACT via file 'lib/WebService/RTMAgent.pm' failed
 at /usr/share/perl5/ExtUtils/MakeMaker.pm line 603
Writing Makefile for WebService::RTMAgent
+ /usr/bin/make
cp lib/WebService/RTMAgent.pm blib/lib/WebService/RTMAgent.pm
Manifying blib/man3/WebService::RTMAgent.3pm
+ /usr/bin/make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load.t ....... 1/1 # Testing WebService::RTMAgent 0.5_1, Perl 5.010001, /usr/bin/perl
t/00-load.t ....... ok   
t/auth.t .......... frobbed -- getting token
t/auth.t .......... 1/7 frobbed -- getting token
token token
t/auth.t .......... ok   
t/boilerplate.t ... ok   
t/init.t .......... ok   
t/pod-coverage.t .. skipped: Test::Pod::Coverage 1.08 required for testing POD coverage
t/pod.t ........... ok   
t/requests.t ...... 1/12 request:
POST http://www.rememberthemilk.com/services/rest/
Content-Type: application/x-www-form-urlencoded

method=rtm.tasks.add&nam=adding&api_key=key&auth_token=10438&timeline=114114&api_sig=3340edd30a22e9b2c67ff206283d0b67


response:
HTTP/1.1 200 OK
Connection: keep-alive
Date: Mon, 24 Dec 2007 11:49:10 GMT
Server: nginx/RTM
Vary: Accept-Encoding
Content-Type: text/xml; charset="utf-8"
Client-Date: Mon, 24 Dec 2007 11:50:39 GMT
Client-Peer: 75.126.232.204:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
Keep-Alive: timeout=300







t/requests.t ...... ok     
t/undo.t .......... ok   
All tests successful.

Test Summary Report
-------------------
t/boilerplate.t (Wstat: 0 Tests: 3 Failed: 0)
  TODO passed:   1-3
Files=8, Tests=35,  0 wallclock secs ( 0.05 usr  0.01 sys +  0.59 cusr  0.06 csys =  0.71 CPU)
Result: PASS
+ exit 0
Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.7ZK9SK
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ cd WebService-RTMAgent-0.5_1
+ '[' /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386 '!=' / ']'
+ rm -rf /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386
++ /usr/bin/perl -MExtUtils::MakeMaker -e ' print $ExtUtils::MakeMaker::VERSION <= 6.05 ? qq|PREFIX=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr| : qq|DESTDIR=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386| '
+ make prefix=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr exec_prefix=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr bindir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/bin sbindir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/sbin sysconfdir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/etc datadir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share includedir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/include libdir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/lib libexecdir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/libexec localstatedir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/var sharedstatedir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/var/lib mandir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/man infodir=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/info install DESTDIR=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386
Manifying blib/man3/WebService::RTMAgent.3pm
Installing /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/local/share/perl5/WebService/RTMAgent.pm
Installing /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/local/share/man/man3/WebService::RTMAgent.3pm
Appending installation info to /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/lib/perl5/perllocal.pod
+ cmd=/usr/share/spec-helper/compress_files
+ '[' -x /usr/share/spec-helper/compress_files ']'
+ cmd=/usr/lib/rpm/brp-compress
+ '[' -x /usr/lib/rpm/brp-compress ']'
+ /usr/lib/rpm/brp-compress
+ '[' -e /etc/SuSE-release -o -e /etc/UnitedLinux-release ']'
+ find /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386 -name perllocal.pod -o -name .packlist -o -name '*.bs'
+ xargs -i rm -f '{}'
+ find /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr -type d -depth -exec rmdir '{}' ';'
+ /usr/bin/perl -MFile::Find -le '
    find({ wanted => \&wanted, no_chdir => 1}, "/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386");
    print "%doc  Changes README";
    for my $x (sort @dirs, @files) {
        push @ret, $x unless indirs($x);
        }
    print join "\n", sort @ret;

    sub wanted {
        return if /auto$/;

        local $_ = $File::Find::name;
        my $f = $_; s|^\Q/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386\E||;
        return unless length;
        return $files[@files] = $_ if -f $f;

        $d = $_;
        /\Q$d\E/ && return for reverse sort @INC;
        $d =~ /\Q$_\E/ && return
            for qw|/etc /usr/man /usr/bin /usr/share|;

        $dirs[@dirs] = $_;
        }

    sub indirs {
        my $x = shift;
        $x =~ /^\Q$_\E\// && $x ne $_ && return 1 for @dirs;
        }
    '
+ '[' -z WebService-RTMAgent-0.5_1-filelist ']'
+ /usr/lib/rpm/brp-compress
+ /usr/lib/rpm/brp-strip
+ /usr/lib/rpm/brp-strip-static-archive
+ /usr/lib/rpm/brp-strip-comment-note
Processing files: perl-WebService-RTMAgent-0.5_1-1.noarch
Executing(%doc): /bin/sh -e /var/tmp/rpm-tmp.25mTCz
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ cd WebService-RTMAgent-0.5_1
+ DOCDIR=/home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/doc/perl-WebService-RTMAgent-0.5_1
+ export DOCDIR
+ rm -rf /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/doc/perl-WebService-RTMAgent-0.5_1
+ /bin/mkdir -p /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/doc/perl-WebService-RTMAgent-0.5_1
+ cp -pr Changes README /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386/usr/share/doc/perl-WebService-RTMAgent-0.5_1
+ exit 0
Provides: perl(WebService::RTMAgent) = 0.5
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 rpmlib(VersionedDependencies) <= 3.0.3-1
Requires: perl(Carp) perl(Digest::MD5) perl(LWP::UserAgent) perl(XML::Simple) perl(strict) perl(vars)
Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386
Wrote: /home/tcampbell/src/rpm/SRPMS/perl-WebService-RTMAgent-0.5_1-1.src.rpm
Wrote: /home/tcampbell/src/rpm/RPMS/noarch/perl-WebService-RTMAgent-0.5_1-1.noarch.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.xrdBo2
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ cd WebService-RTMAgent-0.5_1
+ '[' /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386 '!=' / ']'
+ rm -rf /home/tcampbell/src/rpm/BUILDROOT/perl-WebService-RTMAgent-0.5_1-1.i386
+ exit 0
Executing(--clean): /bin/sh -e /var/tmp/rpm-tmp.eKyVlR
+ umask 022
+ cd /home/tcampbell/src/rpm/BUILD
+ rm -rf WebService-RTMAgent-0.5_1
+ exit 0
RPM: /home/tcampbell/src/rpm/RPMS/noarch/perl-WebService-RTMAgent-0.5_1-1.noarch.rpm
SRPM: /home/tcampbell/src/rpm/SRPMS/perl-WebService-RTMAgent-0.5_1-1.src.rpm
-- Done --

Sweet! now we can install it.

:;  sudo rpm -ivh /home/tcampbell/src/rpm/RPMS/noarch/perl-WebService-RTMAgent-0.5_1-1.noarch.rpm
Preparing...                ########################################### [100%]
   1:perl-WebService-RTMAgen########################################### [100%]

Great. Now we grab the rtm command from the website.

:;  wget http://www.rutschle.net/rtm/rtm-0.5.gz
--2011-01-04 13:47:20--  http://www.rutschle.net/rtm/rtm-0.5.gz
Resolving www.rutschle.net... 82.235.147.6
Connecting to www.rutschle.net|82.235.147.6|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4212 (4.1K) [application/x-gzip]
Saving to: `rtm-0.5.gz'

100%[===========================================================================>] 4,212       24.3K/s   in 0.2s    

2011-01-04 13:47:21 (24.3 KB/s) - `rtm-0.5.gz' saved [4212/4212]

:; gunzip rtm-0.5.gz

Now, we run it and ... boom.

:;  ./rtm-0.5 
Use of uninitialized value in concatenation (.) or string at /usr/local/share/perl5/WebService/RTMAgent.pm line 312.
98: Login failed / Invalid auth token
Use of uninitialized value at /usr/local/share/perl5/WebService/RTMAgent.pm line 368

Looks like we need to get an auth token. Looking inside the script, it appears you have to run rtm with the --authorise option. This creates a URL that you put in the browser to authorize the client with RememberTheMilk. With that done, I could run rtm from the command line.

:;  ./rtm-0.5 
frobbed -- getting token
token [my token]
0: Turabelle photos
1: do something incredible
2: Ticket to manage notification of changes to bobqueue files.
3: Create ticket to Place bob configs under puppet/avn
4: Rebound payment

Looks like that frobbed/token header goes away after the first run.

Now I'm good to go with hitting RememberTheMilk from the command line. Sweet!

Finally updated this.

Final testing with a post from the iPhone. Yay!

The “Enterprise” …

From a discussion with a few peers in the industry.  I was entertained.

peer> Now when I hear someone use the word “enterprise”
   as an adjective, I have to ask them which of the four meanings
   they intend:
peer> 1.  defunct and destroyed (the Enterprise aircraft
   carrier from WW2)
peer> 2.  ancient and nearly dead (the Enterprise nuclear
    aircraft carrier)
peer> 3.  a nonfunctional mockup (the Enterprise space shuttle)
peer> 4.  imaginary (the starship Enterprise)
peer> Typically “enterprise software” fits perfectly in one of
   those four categories.

Name withheld to, of course, protect the guilty.

Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch

“Every time we’re down for an hour, that’s about 2,500 people inconvenienced,” Smit said. “They’re blaming my people for it and [state IT officials] have an obligation to fix it.”

Lack of backup foils Va.’s new IT system | Richmond Times-Dispatch.

One of the things we’ve been grappling with lately is some unfortunate unplanned outages of services.  You know what those are … random event blips caused by butterflies flapping their wings in the South Pacific that stir up turbulence which creates a small wind, that then turns into a hurricane, which rampages over a submarine cable used by the crucial bit of networking that connects you with the rest of the Internet civilization.

A blip.

Sometimes they’re momentary, sometimes they’re bad.  What all blips have in common is that they affect a class of your customers in a way that inconveniences them in some manner.  The hard part of dealing with an outage is understanding and quantifying what the business impact really is.  When a database server goes out, you implicitly understand that it potentially affects all database users plus all services and users downstream that depend upon the database being up.  So how do you realistically quantify that into a valuable metric?

I bring this up because, as a person in the trenches, I’m able to better understand the impact of something (and therefore, provide a better mitigation plan) if I can understand the size, length, and number of ripples in the fabric that spread out from the blip.

At large companies, this impact may be described as thousands of dollars per minute of cost charged against the bottom line.  Some places, like VA, point out the number of people per hour that an outage prevented from successfully interacting with the DMV.  Websites may see it as the number of advertising impressions that don’t go out due to the site being unavailable.

Whatever metric is used, it needs to be understandable and of an order of magnitude that someone can comprehend.  I understand impacting 2,500 people per hour of downtime.  I understand costing a company $1 million per minute that the factory is unable to reach its control network. I understand an outage costing an engineering team a day’s worth of work (which can ultimately affect the bottom line due to downstream slippage in timelines). What that metric comes down to is being able to understand, in measurable terms, how the blip impacts either people or money.
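
The arithmetic behind a metric like this is deliberately trivial, which is the point. As a toy sketch, taking the 2,500 people per hour figure from the article and a made-up 90-minute outage (the function name is also just for illustration):

```shell
# people affected = hourly rate * outage minutes / 60 (integer arithmetic)
people_affected() {
    echo $(( $1 * $2 / 60 ))
}

people_affected 2500 90   # hypothetical 90-minute outage
```

The hard part is not the multiplication but getting a rate you can defend.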

It’s important to understand these things.  Why?  Because it allows you to more adequately assess your risk of the (unplanned) outage and design your environment appropriately.  If you can point to a solid metric and show how it materially affects people or money, it’s certainly a lot easier to go to management and provide justification for improvements in your environment.  If you can only say, with vague hand waving, that there’s AN effect but no data to back that up, you’re just waffling.

So.  Have you created your appropriately detailed outage impact metrics?

I haven’t.  But I’m working on it.

Never change anything on Friday at 5pm.

SEVERE: Socket accept failed
java.net.SocketException: SSL handshake error
javax.net.ssl.SSLException: No available certificate or key corresponds to the SSL cipher suites which are enabled.
        at org.apache.tomcat.util.net.jsse.JSSESocketFactory.acceptSocket(JSSESocketFactory.java:150)
        at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:310)
        at java.lang.Thread.run(Unknown Source)

This is the reason you never change anything on Friday at 5pm. While attempting to update the SSL certificate for the MySQL Enterprise Monitor (for which the process has no documentation), I managed to break it in a way that caused a few hundred megs of these errors to dump to the catalina log for MEM. Oh, and it meant no monitoring was taking place for a few minutes.

Sigh.

Lesson re-learned. Glad I made a backup of the keystore before I started mucking with it. Now we wait for MySQL to provide me with the correct documentation (after they write it some time this weekend). You would think someone would have already encountered this with their product considering how long it’s been out there already.

At least we have monitoring back.