Tuesday, June 26, 2012

Fabric management redesigned!


People who know me probably know that when it comes to fabric management I'm a fan of Quattor. Quattor is a great tool that can manage nodes all the way from installation (using PXE boot and Kickstart) to fine-tuning service features. While Quattor is a strong tool that can help administer hundreds or even thousands of nodes, it has some weak spots which I'd like to get rid of:
  • Very steep learning curve
  • Some operations may be quite time consuming (e.g. applying errata updates)
  • Use of a custom programming language (PAN) which is usually unknown even to senior admins
  • Inventory of the assets is based on what you describe, not on what actually exists

The above usually lead to misuse of Quattor, which in turn leads to unmaintained or hard-to-maintain templates.
After a very long time of using Quattor, and after evaluating other fabric management tools (e.g. Puppet, Chef), my conclusion is that while they can help admins scale their infrastructure, they work pretty much against the way administrators are used to working. Admins are used to SSHing in and doing their stuff, while these management systems don't know how to interpret such local changes, so they simply skip them.
So let's move one step back and redesign it!
What if you had something monitoring your systems, with the ability to upload local changes from a node back to the central repository of your configuration?
I'm thinking of a solution which will have the following features:
  • A simple PXE image to inventory your nodes (that way you only need to know your nodes' MAC addresses before importing them into the system).
  • PXE and Kickstart (or equivalent) configuration to bootstrap nodes
  • Ability to take over a pre-installed node (e.g. a cloud VM)
  • Components (per feature) that will run on the nodes and will:
    1. Identify current configuration
    2. Change the configuration
    3. Prepare a feedback commit if the admin wants to push the local changes on the node to the configuration system
  • A simple tool/daemon that will run on the nodes as a wrapper for the components
  • All the configuration will be based on a tagging system: each node will have its own tag (to override configuration) plus a sorted list of tags to include (e.g. a tag called web-server and a tag called production would result in a node that is part of the production web servers)
  • The configuration server will store its data in a DB and timestamp every change. Changes should be easily revertible when needed.
  • The configuration server will provide an API with read-only access for other tools to consume data (e.g. a dynamic infrastructure monitoring system) and read-write access to update the data.
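The tag mechanics above can be sketched in a few lines. A minimal illustration, assuming a dict-based layout and a `resolve_config` helper that are purely my own invention:

```python
# Sketch of the tagging idea: configuration from the sorted include list is
# merged in order (later tags override earlier ones), and the node's own tag
# is applied last so it can override anything.
def resolve_config(node_tag, include_tags, tag_configs):
    config = {}
    for tag in include_tags:                      # e.g. ["web-server", "production"]
        config.update(tag_configs.get(tag, {}))
    config.update(tag_configs.get(node_tag, {}))  # per-node overrides win
    return config

# Hypothetical data, just to show the merge order:
tag_configs = {
    "web-server": {"httpd": "enabled", "max_clients": 150},
    "production": {"max_clients": 512, "debug": False},
    "node01":     {"max_clients": 256},           # node-specific override
}
print(resolve_config("node01", ["web-server", "production"], tag_configs))
```

Here a node tagged node01 with includes [web-server, production] ends up as a production web server, keeping its own max_clients override.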

A list of the first components (i.e. those needed before an alpha release can be announced) that I'm thinking of:
  1. package management
  2. file generator
  3. daemon handler
  4. user management

And a demo could be the installation of a node from scratch, with a couple of users and the SSH service up and running.

So in principle this is going to be yet another fabric management tool, but with the addition of feedback from the nodes to the server. Of the fabric management tools I've used, I found Chef closest to what I have in mind, basically because of the "ohai" utility that feeds information back to the server. I'll probably depend on it.

PS: This is just an idea for now, so please add comments; implementation will start when time allows. The implementation will follow Red Hat conventions (thus be compatible with Red Hat Enterprise Linux and its clones) but should be modular enough to be extended to other distributions.

Saturday, March 31, 2012

A successful failure

This post is about a really special failure I had recently, one which I count as a success.

About a month ago I was contacted, totally unexpectedly, by a nice guy from Mountain View. He sent me an email telling me that he was part of the Google Staffing team and asking if I was interested in joining the Google.com Engineering team.

At first I was pretty sure this was a mail scam and replied as much, but he insisted, with a complete description of the Google.com Engineering (a.k.a. SRE) team. By that point I was really excited, and I wanted more information on how he had found me. What made Google interested in me? It turned out to be my involvement in open-source projects, especially around machine deployment (via Quattor) and monitoring (via Nagios).

We proceeded to an (unfortunately quick) round of interviews. First I had an interview with him as my recruiter. He introduced me to the team once more and told me there were two groups in it, one focused on development and another on system administration. That was a pretty good tip, as I couldn't see my development skills in languages like C or Java matching Google's expectations. Of course he knew this too, so we focused on the sysadmin side. This first interview was successful. He asked me to rate myself on a series of things like programming languages, administration, scripting and so on, on a scale of 1 to 10, where 1 means you've never heard of it and 10 means you've written books about it. I'm not good at evaluating myself, so shortly afterwards I was expecting a response like "Thanks Christos, we'll contact you again when you've raised some of these"; instead he continued with a list of questions on system administration. Up to this point everything went well. I answered all (or almost all) of them correctly, so we were both happy with the results.

He arranged the next interview, about 10 days later, with a member of the SRE team. He also sent me some interview tips and wished me good luck. I studied the resources he pointed me to, which were the Google research papers about the infrastructure that (I believe) Google still uses, such as the Google File System, BigTable and MapReduce, but also books on both development and system administration. These were some of the busiest but most exciting 10 days I've had, as I continued my normal work while studying at night. I think that's when the Google dream was planted: "I will work on all these! I'll be a Googler!".

The days passed and I had the second interview, with the SRE team member. He first made sure that I was expecting the call, and then I totally lost him. He had proceeded with the interview, but I was still in the "HEY, I'm talking with a Google engineer!" world. At that point I introduced him to my stressful side. He first asked about any strong skills I wanted to highlight; I chose infrastructure monitoring. We had a conversation on this for about 5 minutes (which felt like half an hour). Then he asked me some network questions. Nothing really special, just things computer scientists normally learn in college. It was about network blocks, but I was still in the "he's asking me something, I have to reply soon or ..." mood. I completely lost it at that point and started giving answers I knew to be wrong (but still, wrong is better than nothing, or is it?).

Next he proceeded to a more familiar question about system load. I answered quickly and was happy that I had passed this point, until his next question: "How would you calculate the system load?". OK, the theory is clear, but how do you do it in practice? That's the kernel's job! To make it even worse, he asked me to write a "simple" script to do so in the shared Google Doc we both had open. There my mistakes started, one after another:
a) I used bash, awk etc. for this. Why is this bad? Because you get extra processes running :)
b) I only counted the running processes from the ps output, completely forgetting about the blocked ones.
c) To get the load average, I used expr for the final result, which didn't give me any decimals.
d) Probably many other things I didn't notice, like the use of so many external programs.
We discussed it a little, and during our discussion I tried to fix some bits that would have led to syntax errors, but the point was already lost.
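For the record, the quantity the kernel averages is the number of tasks in state R (runnable) plus state D (uninterruptible sleep, i.e. blocked), so a better answer would have counted those straight from /proc instead of parsing ps. A rough, Linux-only sketch of that idea (not the script from the interview, just what I wish I had written):

```python
import os

def runnable_and_blocked():
    """Count tasks in state R (running) or D (blocked on I/O) via /proc.

    This is the instantaneous value that the kernel smooths into the
    1/5/15-minute load averages. Returns None where /proc is absent."""
    if not os.path.isdir("/proc"):
        return None
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % pid) as f:
                # Field 3 is the state; it comes after the "(comm)" field,
                # which may itself contain spaces, so split on the last ')'.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (IOError, OSError, IndexError):
            continue  # the process exited while we were reading it
        if state in ("R", "D"):
            count += 1
    return count

print(runnable_and_blocked())
```

The kernel then applies exponential damping to this count over time; expr-style integer arithmetic can't express that, which is exactly where mistake (c) bit me.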

After this he gave me some time to ask whatever I wanted about Google, and the interview ended. By the time I hung up the phone, every mistake (that I was aware of, at least) came to mind together with the correct answer, or at least a better one, but it was too late. I hoped there would be another interview where I could fix things, but the next day I got a call from my recruiter telling me politely that I was not their guy and that they would keep my profile for a possible future match. Fair enough, considering their expectations.

So why do I consider all this a success? Well, right after the interview I was really disappointed. I felt that the Google dream had vanished. But then I looked back a month... would I ever have applied for an opening at Google myself? No! They found me! I was contacted by a leading corporation of the IT world because of my previous (public) work. My first ever real (outside the academic world) job interview was conducted by Google engineers! And most importantly, this gave me the push I needed to realize that what I do matters not only in the academic world but also to the IT industry out there.

I no longer feel that the Google dream is over. Google may contact me again in the future (apparently the percentage of people dropped the first time and hired the second is quite high), or it may not. But does it really matter whether it's Google? Knowing that my expertise is in demand by such corporations makes my day.

Monday, July 4, 2011

Developing the proper way

Recently I wanted to submit some patches to the Lustre filesystem to allow the compilation of the kernel modules (the patchless client) for Fedora 14 kernels. I initially posted my patches to the discussion list, but I was pointed to the company that maintains the community-based Lustre fork (Whamcloud).

There I was amazed by the way the code is maintained, as it uses all the current development best practices I could think of. Thus I wanted to share them with my undefined readers.

Git repository:
First, the code is publicly available in a Git repository. People can clone the repository locally, make their modifications, and push them to Gerrit.

Gerrit:
Gerrit is a code-review web tool for Git repositories which lets people submit patches via "git push". Patches are then verified and reviewed. In Lustre's case, code verification is done by Jenkins, while review is done by at least two code reviewers.

Jenkins:
Jenkins monitors the execution of repeated jobs. In Lustre's case it is used for automated builds of the current master branch plus the submitted patchset.

Finally, when patches are approved, they are merged into the master branch.

Tuesday, February 8, 2011

Could you please nslookup my CRL?

How do you publish your CRLs?
The common answer is: via an http(s) URL or from an LDAP server.

If you host a CA that serves the Grid community, you soon find out that your CDPs are hit very hard. According to our logs we get more than 250 hits per minute, just to download a file that most of the time is less than 100 KB in size.

On the other hand, the Relying Parties of these CAs often blame the CAs for their outages or unavailability, and want a reliable caching mechanism for the CRLs, distributed across the whole world. Currently caching depends on the CA's webserver configuration AND on the clients' willingness to cache things. While big sites use Squid proxies to save bandwidth, you still see many requests from the same origin for the same file, getting the same "cache this for 1 hour" response but still trying to fetch a fresh copy from our server.

So yesterday, on my way home, I was thinking: what if we hosted all the CRLs on a DNS server?
DNS resolvers have a reliable caching mechanism, usually deployed as one resolver per site; the only limitation is how to store the data. TXT records have a limit of 1300 bytes (the maximum size recommended by a recent IETF draft, so that a response fits in a single 1500-byte Ethernet packet). The obvious way forward is to split the CRLs. CRLs in PEM format are base64-encoded, which means 65 bytes per line (64 of data plus a newline), and thus up to 20 lines per TXT record. Will this scale?
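A quick sketch of that splitting, assuming nothing more than the 20-lines-per-record arithmetic above (the function and any record-naming scheme are hypothetical):

```python
def crl_to_txt_records(pem_crl, lines_per_record=20):
    """Split a PEM CRL body into chunks of up to 20 base64 lines each,
    i.e. up to 1280 bytes, safely under the 1300-byte TXT recommendation."""
    lines = [l for l in pem_crl.splitlines()
             if l and not l.startswith("-----")]   # drop BEGIN/END markers
    return ["".join(lines[i:i + lines_per_record])
            for i in range(0, len(lines), lines_per_record)]
```

With a numbering convention (say crl0, crl1, ... as labels under the CA's zone) a client could then reassemble the CRL from ordered TXT lookups.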

Our current infrastructure (as part of EGI) supports ~100 CAs. The majority of the CAs are covered by fewer than 10 TXT records each. Fewer than 10 CAs need up to 40 TXT records, while about 5 CAs need more than 100 TXT records, the top one reaching 1770 records!

So yes, this should cover the needs of the "small" CAs in terms of CRL size, but it would be unreliable for the "big" ones.

I'm planning to do some testing on this in my next spare-time period and report back here with the results (hopefully sooner than my last post).

Tuesday, September 15, 2009

Hex-editing your GPFS Terabytes

I remember the old times when you could find those geek guys who would use a hex editor to "patch" your favorite game and become bullet-proof, or get 99...9 (still counting) lives, or do whatever you wanted in order to win. At that time I was sure that hex editors were powerful enough to save you from a "disaster", but I couldn't imagine what such a disaster might be.

We are using GPFS as the network filesystem for our clusters, and besides scratch space for MPI jobs it also hosts some local users' home directories. A local team needed to expand their GPFS filesystem, so we had to add a few disks to our array. The procedure sounded trivial: add the new disks to the array, create a new logical volume, and finally add the new raw device to the GPFS filesystem.

But of course something went wrong. The new volume was about 10 TB in size, which due to a GPFS limitation we had to split into at least two partitions. Easy work with parted, but what happens when parted (for a reason still unknown) "modifies" the partition table of another logical volume, one that is part of the GPFS filesystem as a whole disk (without a partition table)?

Well, parted simply ruins the first sectors of a GPFS NSD, which means it destroys all the valuable information (the disk IDs and NSD IDs as well as the filesystem descriptor) on that disk. The users report "We are receiving 'Input/Output error' when using file X" and everything gets worse and worse.

Fortunately there IS a solution to this disaster. Although we couldn't find any official IBM documentation on it (apart from some posts on the GPFS forum), there is a way to recover from this situation. What you need is a hex editor, the famous "dd", and a lot of patience.

First, copy sector 8 from each disk in the GPFS filesystem. This sector is the filesystem descriptor, and it is common to all disks. Next we have to recover sector 2 and sector 1. Sector 2 is the GPFS disk identifier (also known as the NSD ID). Finally, sector 1 contains information about the disk and is called the disk descriptor.
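The dd part of the job boils down to reading and writing 512-byte sectors at fixed offsets. A small sketch of those two operations (illustration only, with the dd equivalents in the comments; on a real NSD you would of course work on copies first):

```python
SECTOR_SIZE = 512

def read_sector(device, sector):
    """Read one 512-byte sector, e.g. sector 8 (the FS descriptor);
    equivalent to: dd if=<device> bs=512 skip=<sector> count=1."""
    with open(device, "rb") as f:
        f.seek(sector * SECTOR_SIZE)
        return f.read(SECTOR_SIZE)

def write_sector(device, sector, data):
    """Write a repaired sector back; equivalent to:
    dd of=<device> bs=512 seek=<sector> count=1 conv=notrunc."""
    assert len(data) == SECTOR_SIZE
    with open(device, "r+b") as f:
        f.seek(sector * SECTOR_SIZE)
        f.write(data)
```

The hex editor's job sits between the two calls: compare with the corresponding sectors of a healthy disk and adjust the per-disk fields before writing back.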

For legal reasons I'm not sure I'm allowed to reveal more on how to do this, but by carefully studying the sectors, starting from 8 and moving to 2 and then 1, you can recover your filesystem.