Cluster Job Management Systems
Job management is also known as workload management, load sharing, or load management. We will first discuss the basic issues facing a job management system and summarize the available soft-ware packages. A Job Management System (JMS) should have three parts:
• A user server lets the user submit jobs to one or more queues, specify resource requirements for each job, delete a job from a queue, and inquire about the status of a job or a queue.
• A job scheduler performs job scheduling and queuing according to job types, resource requirements, resource availability, and scheduling policies.
• A resource manager allocates and monitors resources, enforces scheduling policies, and collects accounting information.
1. JMS Administration
The functionality of a JMS is often distributed. For instance, a user server may reside in each host node, and the resource manager may span all cluster nodes. However, the administration of a JMS should be centralized. All configuration and log files should be maintained in one location. There should be a single user interface to use the JMS. It is undesirable to force the user to run PVM jobs through one software package, MPI jobs through another, and HPF jobs through yet another.
The JMS should be able to dynamically reconfigure the cluster with minimal impact on the running jobs. The administrator’s prologue and epilogue scripts should be able to run before and after each job for security checking, accounting, and cleanup. Users should be able to cleanly kill their own jobs. The administrator or the JMS should be able to cleanly suspend or kill any job. Clean means that when a job is suspended or killed, all its processes must be included. Otherwise, some “orphan” processes are left in the system, which wastes cluster resources and may eventually render the system unusable.
2. Cluster Job Types
Several types of jobs execute on a cluster. Serial jobs run on a single node. Parallel jobs use multi-ple nodes. Interactive jobs are those that require fast turnaround time, and their input/output is directed to a terminal. These jobs do not need large resources, and users expect them to execute immediately, not to wait in a queue. Batch jobs normally need more resources, such as large mem-ory space and long CPU time. But they do not need immediate responses. They are submitted to a job queue to be scheduled to run when the resource becomes available (e.g., during off hours).
While both interactive and batch jobs are managed by the JMS, foreign jobs are created outside the JMS. For instance, when a network of workstations is used as a cluster, users can submit inter-active or batch jobs to the JMS. Meanwhile, the owner of a workstation can start a foreign job at any time, which is not submitted through the JMS. Such a job is also called a local job, as opposed to cluster jobs (interactive or batch, parallel or serial) that are submitted through the JMS of the cluster. The characteristic of a local job is fast response time. The owner wants all resources to exe-cute his job, as though the cluster jobs did not exist.
3. Characteristics of a Cluster Workload
To realistically address job management issues, we must understand the workload behavior of clusters. It may seem ideal to characterize workload based on long-time operation data on real clusters. The parallel workload traces include both development and production jobs. These traces are then fed to a simulator to generate various statistical and performance results, based on different sequen-tial and parallel workload combinations, resource allocations, and scheduling policies. The following workload characteristics are based on a NAS benchmark experiment. Of course, different workloads may have variable statistics.
• Roughly half of parallel jobs are submitted during regular working hours. Almost 80 percent of parallel jobs run for three minutes or less. Parallel jobs running longer than 90 minutes account for 50 percent of the total time.
• The sequential workload shows that 60 percent to 70 percent of workstations are available to execute parallel jobs at any time, even during peak daytime hours.
• On a workstation, 53 percent of all idle periods are three minutes or less, but 95 percent of idle time is spent in periods of time that are 10 minutes or longer.
• A 2:1 rule applies, which says that a network of 64 workstations, with proper JMS software, can sustain a 32-node parallel workload in addition to the original sequential workload. In other words, clustering gives a supercomputer half of the cluster size for free!
4. Migration Schemes
A migration scheme must consider the following three issues:
• Node availability This refers to node availability for job migration. The Berkeley NOW project has reported such opportunity does exist in university campus environment. Even during peak hours, 60 percent of workstations in a cluster are available at Berkeley campus.
• Migration overhead What is the effect of the migration overhead? The migration time can significantly slow down a parallel job. It is important to reduce the migration overhead (e.g., by improving the communication subsystem) or to migrate only rarely. The slowdown is significantly reduced if a parallel job is run on a cluster of twice the size. For instance, for a 32-node parallel job run on a 60-node cluster, the slowdown caused by migration is no more than 20 percent, even when the migration time is as long as three minutes. This is because more nodes are available, and thus the migration demand is less frequent.
• Recruitment threshold What should be the recruitment threshold? In the worst scenario, right after a process migrates to a node, the node is immediately claimed by its owner. Thus, the process has to migrate again, and the cycle continues. The recruitment threshold is the amount of time a workstation stays unused before the cluster considers it an idle node.
5. Desired Features in A JMS
Here are some features that have been built in some commercial JMSes in cluster computing applications:
• Most support heterogeneous Linux clusters. All support parallel and batch jobs. However, Connect:Queue does not support interactive jobs.
• Enterprise cluster jobs are managed by the JMS. They will impact the owner of a workstation in running local jobs. However, NQE and PBS allow the impact to be adjusted. In DQS, the impact can be configured to be minimal.
• All packages offer some kind of load-balancing mechanism to efficiently utilize cluster resources. Some packages support checkpointing.
• Most packages cannot support dynamic process migration. They support static migration: A process can be dispatched to execute on a remote node when the process is first created.
However, once it starts execution, it stays in that node. A package that does support dynamic process migration is Condor.
• All packages allow dynamic suspension and resumption of a user job by the user or by the administrator. All packages allow resources (e.g., nodes) to be dynamically added to or deleted.
• Most packages provide both a command-line interface and a graphical user interface. Besides UNIX security mechanisms, most packages use the Kerberos authentication system.