
Tuesday, June 26, 2012

Fabric management redesigned!


People who know me probably know that when it comes to fabric management I'm a fan of Quattor. Quattor is a great tool that can manage nodes from the installation stage (using PXE boot and Kickstart) all the way to fine-tuning service features. While Quattor is a strong tool that can help administer hundreds or even thousands of nodes, it has some weak spots that I'd like to get rid of:
  • Very steep learning curve
  • Some operations can be quite time-consuming (e.g. applying errata updates)
  • Use of a custom programming language (PAN) that is usually unknown even to senior admins
  • The asset inventory is based on what you describe, not on what actually exists

These issues usually lead to misuse of Quattor, which in turn leads to unmaintained or hard-to-maintain templates.
After a very long time of using Quattor, and after evaluating other fabric management tools (e.g. Puppet, Chef), my conclusion is that while they can help admins scale their infrastructure, they work pretty much against the way administrators are used to working. Admins are used to ssh-ing into a node and doing their work there, while these management systems don't know how to interpret such local changes, so they simply skip them.
So let's take a step back and redesign it!
What if you had something monitoring your systems that could also upload local changes from a node back to the central repository of your configuration?
I'm thinking of a solution which will have the following features:
  • A simple PXE image to inventory your nodes (that way you only need to know the MAC address of your nodes before importing them to the system).
  • PXE and kickstart (or equivalent) configuration to bootstrap nodes
  • Ability to take over a pre-installed node (e.g. a cloud VM)
  • Components (per feature) that will run on the nodes and will:
    1. Identify current configuration
    2. Change the configuration
    3. Prepare a feedback commit if the admin wants to push the local changes on the node to the configuration system
  • A simple tool/daemon that will run on the nodes as a wrapper for the components
  • All configuration will be based on a tagging system: each node will have its own tag (to override configuration) plus an ordered list of tags to include (e.g. a tag called web-server and a tag called production would result in a node that is one of the production web servers). A sketch of how tags could be merged follows this list.
  • The configuration server would store its data in a DB and timestamp every change. Changes should be easy to revert when needed.
  • The configuration server would provide an API with read-only access for other tools to consume data (e.g. a dynamic infrastructure monitoring system) and read-write access to update it.
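
To make the tagging idea a bit more concrete, here is a minimal sketch of how a node could assemble its configuration from the ordered tag list. The directory layout, file names and the tags.conf file are my own assumptions for illustration, not part of any existing tool:

#!/bin/bash
# Assemble a node's configuration from an ordered list of tags (hypothetical layout).
# /etc/fabric/tags.conf lists the tags in priority order, lowest first, e.g.:
#   web-server
#   production
# Per-tag fragments live under /srv/config/tags/<tag>/config and the
# node-specific override under /srv/config/nodes/<fqdn>/config; later entries win.

NODE=$(hostname -f)
OUT=/var/lib/fabric/merged.conf

: > "$OUT"
while read -r tag; do
    [ -f "/srv/config/tags/$tag/config" ] && cat "/srv/config/tags/$tag/config" >> "$OUT"
done < /etc/fabric/tags.conf

# The node's own fragment is applied last so it overrides everything else.
[ -f "/srv/config/nodes/$NODE/config" ] && cat "/srv/config/nodes/$NODE/config" >> "$OUT"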

The first components I'm thinking of (i.e. before announcing an alpha release) are the following; a sketch of the wrapper loop follows the list:
  1. package management
  2. file generation
  3. daemon handling
  4. user management
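
Each component could expose the same three verbs described above (identify, apply/change and feedback), with the wrapper simply looping over them. The directory, command names and verbs below are purely hypothetical:

#!/bin/bash
# Hypothetical wrapper: every component is an executable under
# /usr/libexec/fabric/components/ that accepts identify | apply | feedback.

COMPONENT_DIR=/usr/libexec/fabric/components
ACTION=${1:-apply}

for comp in "$COMPONENT_DIR"/*; do
    [ -x "$comp" ] || continue
    echo "Running $(basename "$comp") $ACTION"
    "$comp" "$ACTION" || echo "WARNING: $(basename "$comp") $ACTION failed" >&2
done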

And a demo could be the installation of a node from scratch with a couple of users and the SSH service up and running.

So in principle this is going to be yet another fabric management tool, but with the addition of feedback from the nodes to the server. Of the fabric management tools I've used, I found Chef to be the closest to what I have in mind, mainly because of the "ohai" utility that feeds information back to the server. I'll probably depend on it.

PS: This is just an idea for now, so please add comments; implementation will start when time allows. The implementation will be based on Red Hat conventions (and thus compatible with Red Hat Enterprise Linux and its clones) but should be modular enough to be extended to other distributions.

Monday, July 4, 2011

Developing the proper way

Recently I wanted to submit some patches to the Lustre filesystem to allow the compilation of the kernel modules (patchless client) for Fedora 14 kernels. I initially posted my patches to the discussion list, but I was pointed to the company that maintains the community-based Lustre fork (Whamcloud).

There I was amazed by the way the code is maintained, as it uses all the current development best practices I could think of. Thus I wanted to share them with my readers.

Git repository:
First, the code is publicly available in a Git repository. People can clone the repository locally, make their modifications and push them to Gerrit.
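
As a rough illustration of the workflow (the repository URL, branch name and commit message below are placeholders, not Whamcloud's actual ones), a contribution looks something like this:

# clone the repository (URL is illustrative)
git clone http://git.example.org/lustre-release.git
cd lustre-release

# work on a local branch and commit the patch
git checkout -b fc14-client-build
# ... edit the sources ...
git commit -a -s -m "Fix patchless client build against Fedora 14 kernels"

# push the commit to Gerrit for review; refs/for/<branch> is Gerrit's
# standard target for review submissions
git push origin HEAD:refs/for/master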

Gerrit:
Gerrit is a code-review web tool for Git repositories which allows people to submit patches via "git push". Patches are verified and reviewed. In Lustre's case, code verification is done by Jenkins, while review is done by at least two code reviewers.

Jenkins:
Jenkins monitors the execution of repeated jobs. In Lustre's case it is used for automated builds of the current master branch plus the submitted patchset.

Finally, when patches are approved they are merged into the master branch.

Sunday, June 28, 2009

You can do the DB schema work online!

I'm working on a project with colleagues from other institutes. Within this project we decided to do some work on a client-server application with a database back-end.

We had many mail exchanges, nice figures to describe workflows, phone meetings, video conference meetings but it was time to start doing some work.

The first thing we had to agree on was the DB schema we were going to work with, and I was surprised by a tool that one of the colleagues used to share his SQL model. The tool is called wwwsqldesigner and it is open source. Of course, there is a demo installation you can use if you don't want to install it yourself.

What I liked most is that you can export your design in XML format. You can then send the XML file to the rest of the development team, who can load it into their local installation or the demo one, make their changes, publish a new version, and so on. Of course, you can also save your model on the server, and others can simply select it from a list to view and change it.

I liked it so much that I'm thinking of installing it locally and uploading our project's schemas.

Saturday, June 20, 2009

OpenMP jobs on Grid? (The LCG-CE - PBS approach)

There was a user support request for OpenMP jobs on the Grid. OpenMP is a shared-memory implementation, which means that all processes must run on the same box.

Well, this can easily be achieved on the PBS side by using the directive:
#PBS -l nodes=1:ppn=X

Where "X" is the number of requested processes. But the main issue is HOW we can derive this requirement from what the WMS gives us at submission time.
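
For reference, a complete OpenMP job script submitted directly to PBS would look roughly like this (the executable name and thread count are just examples):

#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -N openmp-test
# run 4 OpenMP threads on the single allocated node
export OMP_NUM_THREADS=4
cd "$PBS_O_WORKDIR"
./my_openmp_app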

After googling this, it seems the "correct" solution can only be achieved with the CREAM CE, where users can specify a number of requirements that are not only used for the job-matching process at the WMS but are also passed on to the CE. You can find more info on this here.

LCG CEs, on the other hand, only get a poor RSL that carries almost none of the user's requirements. So let's dig into the LCG CE's internals...

First, a job reaches the globus-gatekeeper. At this phase the user's proxy is mapped to a pool account. The gatekeeper's task is to authenticate the user and the job and pass it on to the globus job manager.

The globus job manager uses the GRAM protocol to report the job state and submits the job to the globus-job-manager-marshal, which uses a Perl module to talk to the relevant queuing system.

This Perl module is responsible for creating the job (a shell script) that will be submitted to the PBS server. In this module the CpuNumber requirement is translated by default to:
#PBS -l nodes=X

So this is the part we need to change in order to create OpenMP jobs. The next issue is how we find out whether the user has asked for an OpenMP job. I've noticed that the JDL option "Environment" is passed to the job executable that will be submitted, so a definition like the following:
Environment = {"OPENMP=true"};
can do the trick.
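
The real change belongs in the PBS job manager Perl module, but the logic it needs is roughly the following (sketched here in shell for clarity; the variable names are mine):

# CPUNUMBER holds the translated CpuNumber requirement
if [ "$OPENMP" = "true" ]; then
    # all processes on one node: a single node with CpuNumber cores
    echo "#PBS -l nodes=1:ppn=$CPUNUMBER"
else
    # default behaviour: CpuNumber nodes with one core each
    echo "#PBS -l nodes=$CPUNUMBER"
fi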

The whole approach above works, though it certainly needs more polishing; as a proof of concept, however, it is more than OK...
In the (near) future I would like to test the CREAM CE which, as I said before, has a cleaner way to support requirements from JDLs using the CeForwardParameters definition.

Friday, June 19, 2009

Coding on multiple SVN repositories...

As a developer I use repositories (mainly SVN) for code versioning and for interacting with other developers.

Being involved in developments with other teams within a project usually requires that the (production) repository is hosted somewhere centrally. This gives us the advantage of having one code base where all developers work. The main disadvantage of this setup, though, is that developers usually don't commit until they have something really stable and working.

Another disadvantage is that it is not easy for anyone but yourself to find out what you are working on (and usually the "manager" guys need to do so).

It was proposed to me to use a local repository for all the development I do, where it would be easy to make "every change" commits, and to push only stable versions to the central repositories. This gives us both frequent commits (and thus a clear history) and the ability for others to see what you are working on and perhaps comment on it. At first I was strongly against this... It clearly adds a lot of additional work without giving us many clear advantages.

As this was a "manager's" proposition I had to try it. The initial thought was to work on our local repository and then, when I have something stable, take a diff since the last sync of the repositories and apply it to the remote (central) repository.

But... thinking about this again, isn't this just the "svn tagging" procedure, with trunk and tags living on different servers?

An implementation to test (a command-level sketch follows the list):
  1. Create a local test repository with trunk and tags trees
  2. Create a new repository to serve as "the remote central repository"
  3. Create a new tag in the first repository which will have an svn:externals link to "the remote central repository"
  4. Start tagging as normal in the local test repository, but always to the same tag.
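
A minimal sketch of that setup with plain svn commands (all repository paths, URLs and source paths are placeholders):

# 1. local test repository with trunk and tags trees
svnadmin create /srv/svn/local-repo
svn mkdir -m "initial layout" \
    file:///srv/svn/local-repo/trunk \
    file:///srv/svn/local-repo/tags

# 2. a second repository acting as "the remote central repository"
svnadmin create /srv/svn/central-repo

# 3. a tag in the local repository that pulls in the central repository
#    through an svn:externals property
svn checkout file:///srv/svn/local-repo/tags /tmp/tags-wc
cd /tmp/tags-wc
svn mkdir stable
svn propset svn:externals "central file:///srv/svn/central-repo" stable
svn commit -m "stable tag with an external to the central repository"

# 4. keep tagging as usual in the local repository; publishing a stable
#    version means copying the stable files into stable/central (the
#    external working copy) and committing there, which lands the commit
#    in the central repository
svn update                        # fetches the external into stable/central
cp -r ~/work/myproject/* stable/central/
svn add --force stable/central/*
svn commit stable/central -m "publish stable snapshot to the central repository"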