OpenShop Job Framework

Project plans

These are some thoughts regarding the behaviours that the openshop protocol should support in order to facilitate resource sharing between widely separated clusters working on common projects.

Sharing

Sharing jobs between clusters means that it must be possible to partition a problem in arbitrary ways and dynamically distribute the partitions across a number of sites. Some typical cases are as follows.
  1. A project begins its life on a single cluster. As it grows, parts of the tree are broken off and moved to other sites. Parts of the tree which have been moved can be further split, and the process continues without bound. From the viewpoint of the project end-user, there is no visible change.
  2. A project begins its life as several independent branches at a number of sites, with one of the sites providing the basic job management infrastructure. At some point it is realized that there is too much manual labour involved in maintaining several independent projects, and the time has come to unify them under a single project heading. One of the sites builds a common root job which is to host the existing branches as children. A synchronization step is performed to replicate the base hierarchy on all sites, and the project then becomes a multi-sited project under a single head node. The user notices that an additional top-level node has been added to the hierarchy, and that new branches appear beside his own underneath it. His own project continues to function as before.
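The cases above both reduce to moving subtrees of a job tree between sites. As a rough illustration, here is a minimal sketch of that operation; the names (Job, split_to_site) are hypothetical and not part of the openshop framework, and the sketch ignores replication of the jobs above the split.

```python
# Hypothetical sketch: a job tree whose subtrees can be reassigned to
# other sites. Jobs above the split stay where they are.

class Job:
    def __init__(self, name, site):
        self.name = name
        self.site = site          # site currently hosting this job
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child

    def split_to_site(self, new_site):
        """Move this subtree to another site; the process can be
        repeated on any subtree without bound (case 1)."""
        self.site = new_site
        for child in self.children:
            child.split_to_site(new_site)

    def sites(self):
        """All sites hosting some part of this subtree."""
        found = {self.site}
        for child in self.children:
            found |= child.sites()
        return found

# Case 1: a project begins its life on a single cluster...
root = Job("project", "site-A")
branch = root.add_child(Job("analysis", "site-A"))
branch.add_child(Job("reduce", "site-A"))

# ...and later part of the tree is broken off and moved.
branch.split_to_site("site-B")
print(sorted(root.sites()))   # ['site-A', 'site-B']
```

From the end-user's viewpoint the tree is unchanged; only the site attribute of the moved jobs differs.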

A major problem with implementing such a scheme in openshop is that it requires some jobs (those above a site-split in the tree) to be replicated at more than one site. The framework allows this, provided that one of the copies is the authoritative copy in project space. But what happens if a job has descendants belonging to more than one site and the job is then rerun? The authoritative copy reruns and updates its internal state, but the other copies retain the old state. Rerunning any child nodes at other sites that depend on the data in question will then cause them to read stale data and be corrupted. This is an instance of a more general problem in replicated data networks: cache coherency.

Resolving the problem requires a robust set of job states and interlocks. A good way to avoid illegal or ill-defined conditions like the one described above is to define a set of rules restricting the conditions under which jobs are allowed to change their state. The actions that cause a job to change its state are implemented in its methods. At major steps in a job's evolution, the job is required to request permission for those changes from the job manager, which is the central component of the openshop framework. The manager decides whether to allow the state change, and if permission is granted it records the new state of the job in the project database. This centralized transaction processor is a powerful concept: not only does it allow all job objects to conform to the same state machine model, it can also implement complex rules that involve interactions between jobs. In particular, the manager can implement the locking rules that guarantee proper coherency across a multi-sited project.
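The manager-as-transaction-processor idea can be sketched as follows. The interlock rule enforced here is an assumption chosen for illustration: a replicated job may not rerun while any remote copy still holds the old state, so stale replicas must be invalidated first. All class and method names are hypothetical.

```python
# Hypothetical sketch: a central job manager that gates state changes
# and records the accepted ones, as the centralized transaction
# processor described above.

class JobManager:
    def __init__(self):
        self.states = {}       # job id -> recorded state ("project database")
        self.replicas = {}     # job id -> set of sites holding a copy

    def register(self, job_id, sites, state="idle"):
        self.states[job_id] = state
        self.replicas[job_id] = set(sites)

    def request_transition(self, job_id, new_state, site, home_site):
        """A job's methods call this before changing state; the manager
        decides whether to allow the change."""
        # Only the authoritative copy may change project-space state.
        if site != home_site:
            return False
        # Coherency interlock (illustrative rule): refuse a rerun while
        # other sites still hold potentially stale replicas.
        if new_state == "running" and self.replicas[job_id] - {home_site}:
            return False
        self.states[job_id] = new_state   # record in the project database
        return True

    def invalidate_replica(self, job_id, site):
        self.replicas[job_id].discard(site)

manager = JobManager()
manager.register("root/build", sites={"site-A", "site-B"})

# A rerun is refused while site-B's replica is still live...
allowed = manager.request_transition("root/build", "running",
                                     site="site-A", home_site="site-A")
print(allowed)   # False

# ...and granted once the remote copy has been invalidated.
manager.invalidate_replica("root/build", "site-B")
allowed = manager.request_transition("root/build", "running",
                                     site="site-A", home_site="site-A")
print(allowed)   # True
```

Because every job goes through the same request/grant cycle, rules that span several jobs (parent/child dependencies, cross-site locks) can live in one place rather than being scattered through job implementations.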