
Ugh. Four hours wasted.

I’m sitting here with a browser in one window and a text console to an installing system in another. Why? Because I’m waiting for the installation to finish. I’ve been debugging an odd bcfg2 failure during the kickstart post-install for our provisioning system. It started last night as I was leaving the office. I’d just fired off a reinstall of an IAM system to verify that it would work correctly from the production kickstart (as I’d just pushed out the first real production bits to it).

This morning, I got in only to stare at a console still stuck in the kickstart post-install. Sigh. Ok, dig around to find the magic remote rescue arcana so I can poke around for the logs. See that two files aren’t binding correctly in bcfg2, which potentially croaked the install (it certainly looked like it hung, that’s for sure). Get the kickstart updated to use the “right” profile for now.

Reboot, reinstall. Lather. Rinse. Repeat.

Ok, kickstart is completing successfully! Yay! Confetti and champagne for everyone!

Reboot.

Hey, grub doesn’t have the right setup. Easy fix in the repo by moving the TGenshi template processing into the right group. Go to run a quick update on the IAM system and .. hey, where’s bcfg2?

headdesk

No wonder post didn’t error out. It didn’t actually do anything! Well ... it did. It errored out on yum because ... the rpmforge repo got corrupted. Why did it get corrupted? Well, it appears that the stable thing we’ve been doing for months is now broken, because the server hosting the rpmforge GPG keys and yum repo setup isn’t answering requests.

Fix the URL, reinstall, and now I’m back where I started early this morning: a broken bcfg2 config that stalls out in post.

I love four hour snipe hunts.

At least we know where we need to fix some things, including:

  • Pulling the rpmforge-repo rpm locally
  • Possibly mirroring all of rpmforge for the repos we need
  • Better error handling in the kickstart post
  • Better portholes into the post-install to see where errors occur. Like, why isn’t our safety shell starting on tty2 like it should be?
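
For the error handling item, here is a rough sketch of the direction I have in mind, not what is in our kickstart today. The names `run_step`, `report`, `LOG`, and `FAILED` are made up for illustration, and the log path is a placeholder. The idea is that every step in post gets logged with its exit status, so a failed yum call can't quietly skip everything after it.

```shell
#!/bin/sh
# Sketch only: run_step, report, LOG and FAILED are hypothetical names,
# and the default log path is a placeholder.
LOG="${LOG:-/tmp/ks-post.log}"
FAILED=0

# Run one step, append its output to the log, and record a failure
# instead of silently carrying on.
run_step() {
    desc="$1"; shift
    echo "==> ${desc}" >> "$LOG"
    if "$@" >> "$LOG" 2>&1; then
        echo "ok: ${desc}" >> "$LOG"
    else
        rc=$?
        echo "FAILED (exit ${rc}): ${desc}" >> "$LOG"
        FAILED=$((FAILED + 1))
    fi
}

# Summarize at the end of post; returns non-zero if any step failed.
report() {
    echo "post finished with ${FAILED} failed step(s)" >> "$LOG"
    [ "$FAILED" -eq 0 ]
}

# Example use inside %post (the command is a stand-in):
#   run_step "install bcfg2" yum -y install bcfg2
#   report || echo "check ${LOG}" > /dev/console
```

Had something like this been in place, the log would have shown the yum failure on the first pass instead of a post that looked like it did nothing.
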
Travis Campbell
Staff Systems Engineer at ghostar
Travis Campbell is a seasoned Linux Systems Engineer with nearly two decades of experience, ranging from dozens to tens of thousands of systems in the semiconductor industry, higher education, and high volume sites on the web. His current focus is on High Performance Computing, Big Data environments, and large scale web architectures.
  1. I assume you tee the ks-%post spray to something reviewable? It’s pretty useful:

    %post
    #### Directory to throw install logs in
    mkdir -m700 /usr/local/kickstart
    (
    *** bunch of %post magic ***
    ) 2>&1 | tee /usr/local/kickstart/kickstart.log

    Also, I think http://ftp.utexas mirrors rpmforge (possibly under the dag.wieers namespace). If it’s not there, a quick note to TN will get a mirror set up on your behalf. If it’s out of date, also ping and demand it be watched more closely. Oscar is great at helping make http://ftp.utexas a worthwhile resource for the campus at-large.

    Cheers from ct.us.

  2. Yeah, I did something very similar, but specific to bcfg2. Part of the problem was that I had the console= ordering incorrect on the kernel command line, so I wasn’t able to get to the password prompt during reboot whenever init was failing. That made it more difficult to actually get to my logs. Once I realized it, things got much simpler.

    I’ll check out the rpmforge mirror, thanks for pointing it out!
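
On the console= ordering: the kernel prints its messages to every console listed, but /dev/console (where the prompt from a failing init shows up) is tied to the last console= argument on the command line. A quick way to check which one wins, as a small sketch with a made-up helper name (`last_console`):

```shell
#!/bin/sh
# Sketch only: last_console is a hypothetical helper name.
# Print the device named by the last console= argument in a
# kernel command line passed as a single string.
last_console() {
    last=""
    for arg in $1; do
        case "$arg" in
            console=*) last="${arg#console=}" ;;
        esac
    done
    echo "$last"
}

# Example use on a running system:
#   last_console "$(cat /proc/cmdline)"
```

If that prints the serial port while you are watching the VGA console (or the other way around), the prompt is waiting somewhere you can't see it.
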
