Tuesday, June 26, 2012

Fabric management redesigned!


People who know me probably know that when it comes to fabric management I'm a fan of Quattor. Quattor is a great tool that can manage nodes from installation (using PXE boot and kickstart) all the way to fine-tuning service features. While Quattor is a strong tool that can help administer hundreds or even thousands of nodes, it has some weak spots which I'd like to get rid of:
  • Very steep learning curve
  • Some operations may be quite time consuming (e.g. doing errata updates)
  • Use of a custom programming language (PAN) which is usually unknown to even senior admins
  • Inventory of the assets is based on what you describe and not on what exists

The above usually lead to misuse of Quattor, which in turn leads to unmaintained or difficult-to-maintain templates.
After a very long time of using Quattor and also evaluating other fabric management tools (e.g. Puppet, Chef), my conclusion is that while they can help admins scale their infrastructure, they work pretty much against the way administrators are used to working. Admins are used to ssh-ing into a node and doing their stuff, while these management systems don't know how to interpret such changes, so they simply skip them.
So let's move one step back and redesign it!
What if you had something monitoring your systems and had the ability to upload your local changes from the node to the central repository of your configuration?
I'm thinking of a solution which will have the following features:
  • A simple PXE image to inventory your nodes (that way you only need to know the MAC address of your nodes before importing them to the system).
  • PXE and kickstart (or equivalent) configuration to bootstrap nodes
  • Ability to take over a pre-installed node (e.g. a cloud VM)
  • Components (per feature) that will run on the nodes and will:
    1. Identify current configuration
    2. Change the configuration
    3. Prepare a feedback commit if the admin wants to push the local changes on the node to the configuration system
  • A simple tool/daemon that will run on the nodes as a wrapper for the components
  • All the configuration will be based on a tagging system, so that each node will have its own tag (to override configuration) and then a sorted list of tags to include (e.g. a tag called web-server and a tag called production would result in a node that is part of the production web servers)
  • The configuration server would store its data in a DB and timestamp every change. Changes should be easy to revert if needed.
  • The configuration server would provide an API with read-only access for other tools to consume data (e.g. a dynamic infrastructure monitoring system) and read-write access to update the data.
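
As a rough illustration of the tagging idea, here is a minimal Python sketch of how a node's final configuration could be resolved. It assumes a shallow merge in which later tags in the included list override earlier ones and the node's own tag is applied last. The function name resolve_config, the tag names and the keys are all hypothetical; nothing like this is implemented yet.

```python
from typing import Any, Dict, List

# Hypothetical tag store: tag name -> configuration fragment.
TAG_STORE: Dict[str, Dict[str, Any]] = {
    "web-server": {"packages": ["httpd"], "http_port": 80},
    "production": {"monitoring": True, "http_port": 8080},
    "node01.example.org": {"http_port": 8443},
}


def resolve_config(
    node_tag: str,
    included_tags: List[str],
    tag_store: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge tag fragments into one configuration.

    Assumption: tags are applied in list order (later wins) and the
    node's own tag is applied last so it can override everything.
    Unknown tags are treated as empty fragments.
    """
    merged: Dict[str, Any] = {}
    for tag in list(included_tags) + [node_tag]:
        merged.update(tag_store.get(tag, {}))
    return merged


config = resolve_config(
    "node01.example.org", ["web-server", "production"], TAG_STORE
)
print(config)
```

In this example the node ends up with the web-server packages, the production monitoring flag, and its own port 8443 instead of the values from the two shared tags. A real system would probably need a deep merge and a clearer rule for conflicts between tags.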

A list of the first components (i.e. before announcing an alpha release) that I'm thinking of is:
  1. package management
  2. file generator
  3. daemon handler
  4. user management

And a demo could be the installation of a node from scratch with a couple of users and the ssh service up and running.

So in principle this is going to be yet another fabric management tool, but with the addition of feedback from the nodes to the server. Of the fabric management tools that I've used, I found Chef to be the closest to what I'm thinking of, mainly because of the "ohai" utility that feeds information back to the server. I'll probably depend on it.

PS: This is just an idea for now, so please add comments; implementation will start when time allows. The implementation will follow Red Hat conventions (and thus be compatible with Red Hat Enterprise Linux and its clones) but should be modular enough to be extended to other distributions.

Saturday, March 31, 2012

A successful failure

This post is about a really special failure I had recently, which I consider a success.

About a month ago I had a totally unexpected contact from a nice guy from Mountain View. He sent me an email telling me that he was part of the Google Staffing team and asking me if I was interested in joining the Google.com Engineering team.

At first I was pretty sure that this was a mail scam and I replied so, but he insisted, with a complete description of the Google.com Engineering (a.k.a. SRE) team. At that moment I got really excited and wanted some more information on how he had found me. What made Google interested in me? It turned out to be my involvement in open source projects, especially around machine deployment (via Quattor) and monitoring (via Nagios).

We proceeded to an (unfortunately quick) round of interviews. First I had an interview with him as my recruiter; he introduced me to the team once more and told me that there were two groups in that team, one focused on development and another focused on system administration. That was a pretty good tip, as I couldn't see my development skills in languages like C, Java etc. matching Google's expectations. Of course he also knew this, so we focused on the sysadmin side. This first interview was successful. He asked me to rate myself on a series of things like programming languages, administration, scripting etc. on a scale of 1 to 10, keeping in mind that 1 means "never heard of it" and 10 means "I've written books about it". I'm not good at evaluating myself, so shortly after this I was expecting a response like "Thanks Christos, we'll contact you again when you raise some of them"; instead he continued with a list of questions on system administration. Up to this point everything went well. I answered all (or almost all) of them correctly, so we were both happy with the results.

He arranged the next interview about 10 days later with a member of the SRE team. He also sent me some interview tips and wished me good luck. I studied the resources he pointed me to, which were the Google research papers about the infrastructure that (I believe) Google still uses, like the Google File System, Bigtable and MapReduce, but also books on both development and system administration. These were some of the busiest but also most exciting 10 days I've had, as I continued my normal work while studying at night. I think at that point the Google dream was planted: "I will work on all these! I'll be a Googler!".

The days passed and I had the second interview. This was with the SRE team member. He first made sure that I was expecting this call, and then I totally lost him. He had proceeded with the interview but I was still in the "HEY, I'm talking with a Google engineer!" world. At that point I introduced him to my stressed side. He first asked about any strong skills that I wanted to highlight; I chose infrastructure monitoring. We had a conversation on this for about 5 minutes (which felt like about half an hour). Then he asked me some networking questions. Nothing really special, just things that computer scientists normally learn in college. It was about network blocks, but I was still in a "he's asking me something, I have to reply soon or ..." mood. I completely lost it at that point and started giving answers that I knew to be wrong (but still, wrong is better than nothing, or not?).

Next he proceeded to a more familiar question about system load. I answered quickly and was happy that I had passed this point, until his next question: "How would you calculate the system load?". OK, the theory is clear, but how do you do this in practice? This is totally the kernel's job! To make it even worse, he asked me to write a "simple" script to do so in the shared Google Docs document we both had open. There my mistakes started, one after another:
a) I used bash, awk etc. for this. Why is this bad? Because you get extra processes running :)
b) I only checked the Running processes from the ps output. I completely forgot about the blocked ones.
c) To get the load average I used expr for the final result, which didn't give me any decimals.
d) Probably many other things that I didn't realize, like the use of so many external programs.
We discussed it a little, and during our discussion I tried to fix some bits that would lead to syntax errors, but the point was already lost.
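
For reference, here is a minimal Python sketch of the general idea, written long after the fact and not a reconstruction of what the interviewer expected. It assumes the Linux notion of load: the number of tasks that are runnable (state R) or in uninterruptible sleep (state D), smoothed with an exponentially damped moving average sampled every 5 seconds. The function names are made up for illustration, and reading /proc/[pid]/stat counts processes only, while the kernel counts every thread, so treat the result as an approximation.

```python
import glob
import math
from typing import Iterable, List


def read_task_states() -> List[str]:
    """Read the state letter of every process from /proc (Linux only).

    No external programs are started, so the measurement itself does
    not add extra processes to the count.
    """
    states = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                content = handle.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Format is "pid (comm) state ..."; comm may contain spaces or
        # parentheses, so split on the last closing parenthesis.
        fields = content.rsplit(")", 1)[-1].split()
        if fields:
            states.append(fields[0])
    return states


def count_active(states: Iterable[str]) -> int:
    """Count runnable (R) and blocked, i.e. uninterruptible (D), tasks."""
    return sum(1 for state in states if state in ("R", "D"))


def update_load(
    previous: float, active: int, interval: float = 5.0, window: float = 60.0
) -> float:
    """One step of an exponentially damped moving average.

    window=60 approximates the 1-minute figure; 300 and 900 would give
    the 5- and 15-minute ones. Floats keep the decimals expr dropped.
    """
    decay = math.exp(-interval / window)
    return previous * decay + active * (1.0 - decay)
```

To use it, call read_task_states() every 5 seconds, pass the result through count_active() and feed that into update_load() together with the previous value, starting from 0.0. The output can then be compared with the first field of /proc/loadavg.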

After this he gave me some time to ask whatever I wanted about Google, and the interview ended. By the time I hung up the phone, every mistake (at least the ones I know about) came to mind along with the correct answer (or at least a better one), but it was too late. I hoped that there would be another interview to try to fix things, but the next day I got a call from my recruiter telling me politely that I was not their guy and that they would keep my profile for a possible future match. Fair enough, considering their expectations.

So why do I consider all this a success? Well, after the interview I was really disappointed. I felt that the Google dream had vanished. But then I thought back a month... would I ever have applied for an opening at Google? No! They found me! I got contacted by a leading corporation in the IT world because of my previous (public) work. My first ever real (outside the academic world) job interview was done by Google engineers! And most importantly, this gave me the push I needed to realize that what I do is important not only in the academic world but also to the IT industry out there.

I no longer feel that the Google dream is over. Google may contact me again in the future (given that the percentage of people rejected the first time and hired the second time is quite high), or it may not. But does it really matter whether it is Google? Knowing that my expertise is needed by such corporations makes my day.