
Tuesday, February 8, 2011

Could you please nslookup my CRL?

How do you publish your CRLs?
The common answer to this is via an http(s) URL or from an LDAP server.

If you host a CA that serves the Grid community, you soon find out that your CDPs are hit very hard. According to our logs we get more than 250 hits per minute to download a file that is, most of the time, smaller than 100 KB.

On the other hand, Relying Parties to these CAs often blame the CAs for their outages or unavailability and want a reliable, world-wide caching mechanism for the distributed CRLs. Currently caching depends on the CA's webserver configuration AND on the client's willingness to cache things. While big sites use Squid proxies to save their bandwidth, you still see many requests from the same origin for the same file, all getting the same "cache this for 1 hour" response but still trying to fetch a fresh copy from our server.
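If you are curious what caching hints a CDP actually serves, a quick sketch like the one below will print them (the URL is a placeholder, not a real CDP):

# Print the caching-related headers a CDP sends back (Python 3 sketch).
import urllib.request

CDP_URL = "http://example.org/pki/crl.pem"  # hypothetical CDP URL

req = urllib.request.Request(CDP_URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    for header in ("Cache-Control", "Expires", "Last-Modified", "ETag"):
        print(header, ":", resp.headers.get(header, "(not set)"))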

So yesterday on my way home I was thinking: what if we host all the CRLs on a DNS server?
DNS resolvers have a reliable caching mechanism and are usually deployed one per site; the only limitation is how to store this info. TXT records have a limit of 1300 bytes (the maximum size recommended by a recent IETF draft so that a response fits in a single 1500-byte Ethernet packet). One way forward is to split them: CRLs in PEM format are base64-encoded, which means 65 bytes per line (64 data + newline), thus up to 20 lines per TXT record. Will this scale?
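As a rough sketch of that splitting, assuming a PEM CRL file on disk (the file name and the crl-N record labels below are made up for illustration):

# Split the base64 body of a PEM CRL into TXT-sized chunks.
# 20 base64 lines of 64 data bytes each stay below the ~1300-byte limit.
LINES_PER_RECORD = 20

def crl_to_txt_chunks(pem_path):
    with open(pem_path) as f:
        lines = [line.rstrip("\n") for line in f]
    # keep only the base64 body, drop the BEGIN/END CRL markers
    body = [line for line in lines if line and not line.startswith("-----")]
    for i in range(0, len(body), LINES_PER_RECORD):
        yield "".join(body[i:i + LINES_PER_RECORD])

if __name__ == "__main__":
    for n, chunk in enumerate(crl_to_txt_chunks("ca.crl.pem")):
        # one zone-file TXT line per chunk: crl-0, crl-1, ...
        print('crl-%d IN TXT "%s"' % (n, chunk))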

Our current infrastructure (as part of EGI) supports ~100 CAs. The majority of the CAs are covered by fewer than 10 TXT records. Fewer than 10 CAs need up to 40 TXT records, while about 5 CAs need more than 100 TXT records, with the top one reaching 1770 records!

So yes, this should cover the needs of the "small" CAs in terms of CRL size, but it would be unreliable for the "big" ones.

I'm planning to do some testing on this in my next stretch of spare time and report back here with my results (hopefully sooner than my last post).

Friday, July 17, 2009

Bye bye SL3! Bye bye gLite 3.0...

The last gLite 3.0 node was decommissioned this week with the shutdown of node001.grid.auth.gr.

This node had been the sBDII/lcg-CE for GR-01-AUTH for a long time and was also serving as the Torque server for the local PBS queues.

A new node (a XEN guest with SL4 x86_64 and gLite 3.1) was set up in April to take over this task, but the migration of the local users/queues was postponed until July. The node has already processed tens of thousands of jobs.

With this migration we finally achieved the milestone where ALL our Grid nodes are controlled by our Quattor installation.

Thursday, June 25, 2009

Let's cut some (gLite) Hydra heads

You may be familiar with the Lernaean Hydra. The complexity of this beast made it the perfect name for a gLite service that is used to encrypt/decrypt data.

This service is based on Shamir's Secret Sharing algorithm, where the encryption/decryption key is split into X parts and Y of them (where Y <= X) are needed to reconstruct the key.
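For intuition only, here is a toy sketch of the idea (a 2-out-of-3 split over a prime field, written in Python and bearing no relation to the actual Hydra code):

# Toy Shamir secret sharing: split a secret into X shares, any Y rebuild it.
import random

PRIME = 2**127 - 1  # all arithmetic is done modulo a prime

def split(secret, x_shares, y_needed):
    # random polynomial of degree y_needed-1 with the secret as constant term
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(y_needed - 1)]
    return [(i, sum(c * pow(i, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
            for i in range(1, x_shares + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret)
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split(123456789, x_shares=3, y_needed=2)
assert reconstruct(shares[:2]) == 123456789  # any 2 of the 3 shares suffice

With X=3 and Y=2 this mirrors the deployment described below.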

A requirement for data encryption was raised some years back, and we had deployed 3 gLite Hydra servers (each one holding a part of every user's key, with only 2 of them required for encryption/decryption operations) with clear geographic and administrative separation.

A software update to one of them led to a "funny" situation where no new keys could be registered and no old ones could be unregistered. (These are the only operations that require all the servers to be up and responding.) The tool provided to (re)configure the service had the very interesting behaviour of dropping every DB table and re-creating them using the predefined schema.

A re-configuration of the updated server left us in an "everything just doesn't work" state, which we had to resolve under user community pressure. Note that if the service simply stops working, users may lose lots of human/CPU hours, because all they can get is encrypted output which they can't decrypt.

Analysis of the DB at another gLite Hydra instance gave us an idea of how this service stores its data. By sheer luck the actual keys had not been deleted by the configuration script; only the relation between users and keys was gone.

A copy of the user database and some reverse engineering of the relation tables on a working Hydra instance were enough to recover the service at (almost?) no cost.

That reminded me of the classic Murphy's law where the backup you have is either unreadable at the time you need it or was last updated BEFORE your critical data was stored.

Saturday, June 20, 2009

OpenMP jobs on Grid? (The LCG-CE - PBS approach)

There was a user support request for OpenMP jobs on the Grid. OpenMP is a shared-memory parallelism model, which means that all processes must run on the same box.

Well, this can easily be achieved on the PBS side by using the directive:
#PBS -l nodes=1:ppn=X

Where "X" is the number of requested processes. But the main issue is HOW can we get this requirement based on what WMS gives to us on submission?

After googling this, it seems the "correct" solution can only be achieved on a CREAM CE, where users can specify a number of requirements that are not only used in the WMS job matching process but are also passed on to the CE. You can find more info on this here.

LCG CEs, on the other hand, only receive a poor RSL which carries almost none of the user's requirements. So let's dig into the LCG CE's internals...

First the job reaches the globus-gatekeeper. At this phase the user's proxy is mapped to a pool account. The gatekeeper's task is to authenticate the user and the job and pass it on to the globus job manager.

The globus job manager uses the GRAM protocol to report the job state and hands the job to the globus-job-manager-marshal, which uses a Perl module to talk to the relevant queuing system.

This Perl module is responsible for creating the job (a shell script) that will be submitted to the PBS server. In this module the CpuNumber requirement is translated by default to:
#PBS -l nodes=X

So this is the part we need to change in order to create OpenMP jobs. The next issue is how to find out whether the user has asked for an OpenMP job. I've noticed that the JDL option "Environment" is passed on to the job executable that will be submitted, thus a definition like the following:
Environment = {"OPENMP=true"};
can do the trick.
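
To make this concrete, the logic of the change looks roughly like the following (shown as a Python sketch for readability; the real change lives in the PBS Perl job manager module and the names below are made up):

# Pick the PBS directive: if the job's Environment carries OPENMP=true,
# pack all requested CPUs on a single node, otherwise keep the default
# translation of the CpuNumber requirement.
def pbs_cpu_directive(cpu_number, environment):
    if "OPENMP=true" in environment:
        return "#PBS -l nodes=1:ppn=%d" % cpu_number
    return "#PBS -l nodes=%d" % cpu_number

# e.g. a JDL with CpuNumber = 4; and Environment = {"OPENMP=true"};
print(pbs_cpu_directive(4, ["OPENMP=true"]))  # -> #PBS -l nodes=1:ppn=4
print(pbs_cpu_directive(4, []))               # -> #PBS -l nodes=4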

The whole approach above works, but it certainly needs more work; as a proof of concept, though, it is more than OK...
In the (near) future I would like to test the CREAM CE which, as I said before, has a cleaner way of supporting requirements from JDLs, using the CeForwardParameters definition.