Load Sharing Facility (LSF) for Cluster Computing
LSF is a commercial workload management system from Platform Computing. LSF emphasizes job management and load sharing for both parallel and sequential jobs. In addition, it supports checkpointing, availability, load migration, and SSI. LSF is highly scalable and can support a cluster of thousands of nodes. LSF has been implemented for various UNIX and Windows/NT platforms. Currently, LSF is being used not only in clusters but also in grids and clouds.
1. LSF Architecture
LSF supports most UNIX platforms and uses the standard IP for JMS communication. Because of this, it can convert a heterogeneous network of UNIX computers into a cluster. There is no need to change the underlying OS kernel. The end user utilizes the LSF functionalities through a set of utility commands. PVM and MPI are supported. Both a command-line interface and a GUI are provided. LSF also offers skilled users an API that is a runtime library called LSLIB (load sharing library). Using LSLIB explicitly requires the user to modify the application code, whereas using the utility commands does not. Two LSF daemons are used on each server in the cluster. The load information managers (LIMs) periodically exchange load information. The remote execution server (RES) executes remote tasks.
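The periodic load exchange among LIMs can be illustrated with a toy model. This is only a sketch under stated assumptions: the class LIM, the function exchange_round, the host names, and the all-to-all exchange pattern are hypothetical and chosen for simplicity. The text says only that LIMs exchange load information periodically, not how, and nothing here uses the real LSF daemons or LSLIB.

```python
class LIM:
    """Toy stand-in for the load information manager on one server (hypothetical)."""

    def __init__(self, host, load):
        self.host = host
        self.load = load
        self.table = {host: load}  # this LIM's view of cluster-wide load

    def report(self):
        # The local load figure this LIM advertises to its peers.
        return self.host, self.load

    def receive(self, host, load):
        # Record (or overwrite) the load last reported by a peer.
        self.table[host] = load


def exchange_round(lims):
    """One periodic exchange: every LIM sends its load to every other LIM.

    All-to-all exchange is an assumption made for this sketch.
    """
    for sender in lims:
        host, load = sender.report()
        for receiver in lims:
            if receiver is not sender:
                receiver.receive(host, load)


lims = [LIM("nodeA", 0.7), LIM("nodeB", 0.1), LIM("nodeC", 0.4)]
exchange_round(lims)

# After one round, any LIM can name the most lightly loaded host.
least_loaded = min(lims[0].table, key=lims[0].table.get)
print(least_loaded)  # nodeB
```

Once every server holds the same load table, a placement decision can be made from any node, and the chosen node's RES would then run the task.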
2. LSF Utility Commands
A cluster node may be a single-processor host or an SMP node with multiple processors, but each node always runs only a single copy of the operating system. Here are some notable features built into the LSF facilities:
• LSF supports all four combinations of interactive, batch, sequential, and parallel jobs. A job that is not executed through LSF is called a foreign job. A server node is one which can execute LSF jobs. A client node is one that can initiate or submit LSF jobs but cannot execute them. Only the resources on the server nodes can be shared. Server nodes can also initiate or submit LSF jobs.
• LSF offers a set of tools (lstools) to get information from LSF and to run jobs remotely. For instance, lshosts lists the static resources (discussed shortly) of every server node in the cluster. The command lsrun executes a program on a remote node.
• When a user types the command line % lsrun -R 'swp>100' myjob at a client node, the application myjob will be automatically executed on the most lightly loaded server node that has an available swap space greater than 100 MB.
• The lsbatch utility allows users to submit, monitor, and execute batch jobs through LSF.
• The lstcsh utility is a load-sharing version of the popular UNIX command interpreter tcsh. Once a user enters the lstcsh shell, every command issued will be automatically executed on a suitable node. This is done transparently: the user sees a shell exactly like a tcsh running on the local node.
• The lsmake utility is a parallel version of the UNIX make utility, allowing a makefile to be processed in multiple nodes simultaneously.
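The placement rule described for lsrun with the requirement 'swp>100' (pick the most lightly loaded server node whose available swap exceeds 100 MB) can be sketched as a filter followed by a minimum. The Host record, its field names, the sample values, and the function select_host are all hypothetical; this is not how LSF is implemented, only an illustration of the selection logic stated above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    is_server: bool  # only server nodes may execute LSF jobs
    load: float      # hypothetical load index (lower means more lightly loaded)
    swp_mb: int      # available swap space in MB


def select_host(hosts, min_swp_mb):
    """Return the most lightly loaded server with swap above the threshold, or None."""
    candidates = [h for h in hosts if h.is_server and h.swp_mb > min_swp_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.load)


hosts = [
    Host("server1", True, 0.9, 512),
    Host("server2", True, 0.2, 64),     # lightly loaded, but too little swap
    Host("server3", True, 0.4, 256),
    Host("client1", False, 0.0, 1024),  # client node: can submit, cannot execute
]

chosen = select_host(hosts, min_swp_mb=100)
print(chosen.name)  # server3
```

Note that server2 has the lowest load but fails the swap requirement, and client1 is excluded because client nodes cannot execute LSF jobs.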
Example 2.13 Application of the LSF on a Cluster of Computers
Suppose a cluster consists of eight expensive server nodes and 100 inexpensive client nodes (workstations or PCs). The server nodes are expensive due to better hardware and software, including application software. Licenses are available to install a FORTRAN compiler and a CAD simulation package, each valid for up to four users. Using a JMS such as LSF, all the hardware and software resources of the server nodes are made available to the clients transparently.
A user sitting in front of a client’s terminal feels as though the client node has all the software and speed of the servers locally. By typing lsmake my.makefile, the user can compile his source code on up to four servers. LSF selects the nodes with the least amount of load. Using LSF also benefits resource utilization. For instance, a user wanting to run a CAD simulation can submit a batch job. LSF will schedule the job as soon as the software becomes available.
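The two behaviors in this example, choosing up to four of the least loaded servers for a parallel build and holding a CAD job until a license seat frees up, can be modeled in a few lines. The names LicensePool and pick_build_servers, the load values, and the first-come-first-served queueing policy are assumptions made for illustration; the text does not specify how LSF orders waiting jobs.

```python
from collections import deque


class LicensePool:
    """Toy model of a software license valid for a fixed number of users."""

    def __init__(self, seats):
        self.free = seats
        self.running = []
        self.waiting = deque()  # FIFO order is an assumption of this sketch

    def submit(self, job):
        if self.free > 0:
            self.free -= 1
            self.running.append(job)
        else:
            self.waiting.append(job)  # held until a seat becomes available

    def finish(self, job):
        self.running.remove(job)
        if self.waiting:
            # The freed seat passes directly to the next waiting job.
            self.running.append(self.waiting.popleft())
        else:
            self.free += 1


def pick_build_servers(loads, max_servers=4):
    """Return up to max_servers host names, least loaded first."""
    return sorted(loads, key=loads.get)[:max_servers]


# Eight servers with hypothetical load values.
loads = {"s1": 0.9, "s2": 0.1, "s3": 0.5, "s4": 0.3,
         "s5": 0.8, "s6": 0.2, "s7": 0.7, "s8": 0.6}
print(pick_build_servers(loads))  # ['s2', 's6', 's4', 's3']

cad = LicensePool(seats=4)
for job in ["cad1", "cad2", "cad3", "cad4", "cad5"]:
    cad.submit(job)
print(list(cad.waiting))  # ['cad5']

cad.finish("cad2")
print(cad.running)  # ['cad1', 'cad3', 'cad4', 'cad5']
```

The fifth CAD job is not rejected; it simply waits and starts as soon as one of the four seats is released, which is the utilization benefit the example describes.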