Tuesday, June 26, 2012

Fabric management redesigned!


People who know me probably know that when it comes to fabric management i'm a fan of Quattor. Quattor is a great tool that can manage nodes from installation part (utilizing pxeboot and kickstart) till fine-tuning service features. While Quattor is a strong tool that could help administering hundreds or even thousands of nodes, it has some weak spots which i'd like to get rid of:
  • Very steep learning curve
  • Some operations may be quite time consuming (i.e. doing errata updates)
  • Use of a custom programming language (PAN) which is usually unknown to even senior admins
  • Inventory of the assets is based on what you describe and not on what exists

The above usually lead to misuse of the Quattor which in turn leads to unmaintained or difficult to maintain templates.
After a very long time of using Quattor and also evaluating other fabric management tools (i.e. puppet, chef) my conclusion is that while they can help admins to scale their infrastructure, they work pretty much against the way that administrators are used to. Admins are used to ssh and do their stuff while these management systems don't know how to interpret these changes so they simply skip them.
So lets move one step back and redesign it!
What if you had something monitoring your systems and had the ability to upload your local changes from the node to the central repository of your configuration?
I'm thinking of a solution which will have the following features:
  • A simple PXE image to inventory your nodes (that way you only need to know the MAC address of your nodes before importing them to the system).
  • PXE and kickstart (or equivalent) configuration to bootstrap nodes
  • Ability to take-over a pre-installed node (i.e. cloud VM)
  • Components (per feature) that will run on the nodes and will:
    1. Identify current configuration
    2. Change the configuration
    3. Prepare a feedback commit if the admin wants to push the local changes on the node to the configuration system
  • A simple tool/daemon that will run on the nodes as a wrapper for the components
  • All the configuration will be based on a tagging system so that each node will have its own tag (to override configuration) and then a sorted list of tags to include (i.e. a tag called web-server and a tag called production would result a node that will be part of the production web-servers)
  • The configuration server would store its data on a DB and timestamp every change. Changes should be easily reverted in case it is needed.
  • The configuration server would provide API interface with read-only access for other tools to consume data (i.e. an dynamic infrastructure monitoring system) and read-write to update the data.

A list of the first components (i.e. before announce alpha release) that i'm thinking of is:
  1. package managing
  2. file generator
  3. daemon handler
  4. user management

And a demo could be the installation of a node from scratch with a couple of users and the ssh service up and running.

So in principal this is going to be yet another fabric management tool but with the addition of feedback from the nodes to the server. From the fabric management tools that i've used i found that Chef is closer to what i'm thinking of and basically because of the "ohai" util that feedbacks information to the server. I'll probably depend on this.

PS: This is just an idea for now so please add comments, implementation will start when time allows. The implementation will be done based on RedHat rules (thus compatible with RedHat Enterprise Linux and its clones) but should be modular enough to be extended to other distributions.