
Questions to ask when interviewing for data jobs

I recently discussed interview questions with someone looking to transition from front-end development and engineering into the Big Data and analytics industry. They had found several positions that looked very interesting but had no idea how to vet the maturity of the data organization to understand what they were walking into.

I’m lucky to have run data teams from the perspective of building and operating an extensive data operations infrastructure. I’ve also been on the sales and professional services side, working with customers to develop better frameworks and improve the sustainability and resilience of their operations. Having worked in this industry for over a decade, I’ve seen the whole spectrum of good and bad in team dynamics and technology, and I’ve had to figure out what was the right path, what was doable, and what was needed to get us through a burning fire.

Here are the questions I would ask. And these aren’t strictly things I’ve asked in job interviews: they’re questions built up from years of working with customers and coming up to speed on the problems, ongoing issues, successes, and looming technical debt in the environments I’m helping to fix. But, they’re pretty good indicators of how healthy a data organization is and how far along the maturity curve it sits.

Data Protection

What’s your company’s position on data governance, protection, and data sovereignty? How do you currently achieve these? What technical or business barriers are in place that make this more difficult than it needs to be?

Data protection is essential. More than essential, it’s critical to being an ethical data steward. Data people collect data to answer questions. Single pieces of information are innocuous. But, when you pair datasets together, you begin to build a map of a person’s life. The more data you join, the more complete this map becomes. Innocuous data points gravitate together, becoming data that we must protect.

With these questions, I’m looking to understand how mature that data governance process is. Do you know what your users are doing? Do you know why they’re doing it? Is it the Wild West? Is it Fort Knox?

These questions also tie into the self-service questions later. If protecting data or granting access to it requires a lot of manual work, it’s going to get tedious.

Software Development (and Data) Lifecycle

What does your development environment look like from the perspective of a data SDLC? How easy is it for a new developer to come in and become effective? What does the flow from Dev->QA->Production look like?

This lifecycle is much like any software development lifecycle. Change requires testing. Testing requires different levels of flow, from development to QA to production. At each level, the stability and resilience requirements become more rigorous. But, the flow needs to be smooth enough that you can get from development to production quickly. If they have only one environment that runs everything? That’s a red flag. If they have no intention of separating development workloads from production ones, that’s a red flag.

Isolation == safety == production reliability.

Remember: as Michael Stahnke says, “Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in.”
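To make the isolation point concrete, here’s a minimal sketch of one way to enforce it; every name, endpoint, and flag here is hypothetical, not a prescription:

```python
# Minimal sketch: per-environment configuration, hypothetical names throughout.
# The structural point: jobs resolve their environment explicitly, and
# destructive operations are refused where the environment forbids them.
import os

ENVIRONMENTS = {
    "dev":  {"namenode": "hdfs://dev-nn:8020",  "allow_destructive": True},
    "qa":   {"namenode": "hdfs://qa-nn:8020",   "allow_destructive": True},
    "prod": {"namenode": "hdfs://prod-nn:8020", "allow_destructive": False},
}

def get_config():
    # Fail loudly if DATA_ENV is unset; never silently default to an environment.
    return ENVIRONMENTS[os.environ["DATA_ENV"]]

def drop_table(table, config):
    # Guardrail: production config simply refuses destructive operations.
    if not config["allow_destructive"]:
        raise RuntimeError(f"refusing to drop {table}: destructive ops disabled here")
    print(f"DROP TABLE {table} -- against {config['namenode']}")
```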

What’s your stack?

What first and third-party tools are you expecting people to use in your data platform, and how do those help create an advantage for your team and company?

Companies often build their competitive advantage around a single set of tools or on purposeful lock-in, which may be good or bad. It’s good to be aware of it, at least, because some tools are bigger tire fires than others.

What technologies/tools/whatever are off-limits and why?

This question is related to the previous one. It’s good to know what you can’t use. If you have experience with or interest in BigQuery, but they’ll never become a Google shop, you’ll probably get frustrated. Also, knowing which technology suggestions to avoid can help you navigate touchy situations where a bad experience occurred.

Ongoing Technology Health

What was your last major data platform upgrade like? How long did it take to implement? How closely did your development community work with your data operations team?

Upgrades should be weeks to months of effort, not months to years. An environment that moves forward frequently but sustainably is going to be resilient and easier to work with. An environment that’s rarely touched beyond break-fix issues is stale and fragile. Why? Because you’re going to deal with the same problems over and over, rather than finding a permanent resolution. And, when you finally get to the point of needing to move, the institutional inertia will be significant because no one will want to take ownership of the effort.

A well-run environment doesn’t need to chase the latest and greatest software release constantly. But, regularly getting to a recent release vintage means you’ll steadily work through the ever-present list of problems that act like papercuts to the developers and operations teams.

Remember: if they’re afraid to touch it, they’re afraid to change it.

Towards the future

How do you like your current data platform? Are you looking at moving or migrating away from it? If so, why? And what does your timeline look like for completing that?

These questions are related to, and impacted by, the upgrade questions. People change platforms when they believe the effort is about the same as doing a major upgrade. The mentality is, “well, if I have to do the testing and validation anyway, I may as well look at redesigning everything.”

I’d be cautious about this and understand the fundamental technical and business reasons requiring the change. Sometimes, the problems are real and not solvable by the existing software. And sometimes, people like to chase shiny software in the name of progress rather than addressing the complex problems present in the current system.

Remember: someone willing to throw away a system will spend a lot of time building up the new system. Are they sure the opportunity cost is worth it?

Are you a self-service organization?

How well-developed is the self-service aspect of data usage, management, governance, and related tasks? I.e., how much do you enable users to do things themselves? What guardrails do you put in place to help them avoid the sharp edges?

We should enable users to do things themselves within reasonable and well-guardrailed paths. That takes you out of the critical path and frees you to focus on more interesting problems. Need access to a particular data set? Systems should be in place to let the user’s management chain handle approval and grant access. For example, manage access to data using roles, groups, and metadata tagging. That way, you’re not creating a new access policy for individual users; you’re reusing existing policy and granting membership in the right roles and groups the security team previously established. The user’s management only has to ensure membership in the right group.
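As a rough sketch of that idea (group and tag names invented for illustration), the policy lives on groups and metadata tags, and granting access becomes a membership change:

```python
# Hypothetical sketch: access policy is defined once on groups and tags.
# Granting a user access is a membership change, not a new policy.
GROUP_GRANTS = {
    "finance-analysts": {"finance:read", "pii:masked"},
    "data-engineers":   {"finance:read", "finance:write"},
}

USER_GROUPS = {
    "alice": {"finance-analysts"},
    "bob":   {"data-engineers"},
}

def can_access(user, required_tags):
    # A user may touch a dataset if their groups collectively grant every tag it requires.
    granted = set()
    for group in USER_GROUPS.get(user, set()):
        granted |= GROUP_GRANTS.get(group, set())
    return required_tags <= granted

# Onboarding Carol is a membership change that reuses existing policy:
USER_GROUPS["carol"] = {"finance-analysts"}
assert can_access("carol", {"finance:read"})
assert not can_access("carol", {"finance:write"})
```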

Better yet, work with HR to automate this process, tying department hierarchies to the roles and groups that manage their data. This flow places new hires into the correct data access groups from the beginning.
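That automation might look something like the following sketch, again with an invented HR feed and department-to-group mapping:

```python
# Hypothetical sketch: HR's department feed drives group membership, so new
# hires land in the right data-access groups without a manual request.
DEPARTMENT_GROUPS = {
    "finance":   {"finance-analysts"},
    "marketing": {"marketing-readers"},
}

def sync_memberships(hr_feed, user_groups):
    # Rebuild each user's groups from their department; HR stays the source of truth.
    for record in hr_feed:
        user_groups[record["user"]] = set(DEPARTMENT_GROUPS.get(record["department"], set()))
    return user_groups

memberships = sync_memberships(
    [{"user": "dana", "department": "finance"},
     {"user": "eli",  "department": "marketing"}],
    {},
)
assert memberships["dana"] == {"finance-analysts"}
```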

Solving the self-service problem frees up your time for better work. Look at all the tedious, repeatable, automatable-but-isn’t stuff. The utility work is almost always tactical. Routine. Repeatable. The same. This utility work takes time away from the interesting problem spaces, where longer-term, strategic work moves the business forward.

“The Incident”

What’s the last big data disaster that you had, and how did you recover from it?

Every company has this in their data mythos. The Incident. The Event. That Weekend That Shall Not Be Named.

What you want to understand is how bad things may be when everything suddenly and unexpectedly goes sideways. This includes understanding how they deal with the situation. Are they calm and collected? Is there yelling? Did someone go into a closet and scream? Was there blame and finger-pointing? Hopefully, no blaming, just solutions.

Data deletion or corruption happens. Accidents happen. You want to know how they responded to this. Companies either have robust disaster recovery and business continuity plans, or they don’t. And when they don’t, they throw people at the problem and have lots and lots of manual rebuild-from-source procedures, usually while someone else is yelling because some report isn’t available to the business.

A company (or team) willing to talk about its failures is one willing to learn from those failures.

I once had to recover a multi-petabyte Hadoop cluster because someone deleted the Linux CUPS packages from the system. An extremely non-obvious dependency chain led to the Hadoop packages also getting nuked. Someone thought they were going to get fired. No worries. Accidents happen, and complexity hides in every system. You can’t predict the behavior of every change you make. But, you can certainly learn from it and understand how to keep it from happening again.

Finally …

There are many, many more questions you can ask. Deep minutiae. Architecture. Strategy. What’s for lunch. Those are important, to be sure. You may get a vague, high-level understanding of how they connect tool A to pipeline B. You might even get a discussion of their 6-month plans. Or learn that Bob and Alice like SQL. But, I don’t think you’ll get profoundly resonant information here, the kind that seduces you into abandoning the ship you’re on now. Especially given that they’ll be asking you questions about your experience and interests as well.

Treat your questions as a barometer: a gauge of what may be occurring under the surface. Your goal is to figure out whether their problems are interesting, worth solving, worth solving by you, and not just rote, undifferentiated heavy lifting that engineers should have already solved.
