RADICAL-Pilot 1.4.0 Documentation

RADICAL-Pilot (RP) is a Pilot system [1] [2] written in Python and specialized in executing applications composed of many computational tasks on high performance computing (HPC) platforms. As a Pilot system, RP separates resource acquisition from using those resources to execute application tasks. Resources are acquired by submitting a job to the batch system of an HPC machine. Once the job is scheduled on the requested resources, RP can directly schedule and launch application tasks on those resources. Thus, tasks are not scheduled via the batch system of the HPC platform, but directly on the acquired resources.

Like every Pilot system, RP offers two main benefits: (1) high-throughput task execution; and (2) concurrent and sequential task execution on the same pilot. High throughput is possible because the user exclusively owns the resources on which those tasks are executed for as long as the job submitted to the HPC platform remains active. Depending on resource availability, tasks are scheduled concurrently and, when more tasks remain than the free resources can host at once, one after the other. In this way, tasks can execute both concurrently and sequentially on the same pilot.
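The mix of concurrent and sequential execution on a single pilot can be illustrated with a small, self-contained simulation. This is only a toy model, not RP's scheduler and not the RP API: the names Task and simulate_pilot are hypothetical, and it assumes tasks start in submission order as soon as enough cores are free on the pilot.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cores: int       # cores the task occupies while it runs
    duration: float  # runtime in arbitrary time units


def simulate_pilot(pilot_cores, tasks):
    """Toy model (hypothetical, not RP code): start tasks in submission
    order as soon as enough cores are free on the pilot.

    Returns a dict mapping task name to (start_time, end_time).
    """
    free = pilot_cores
    now = 0.0
    running = []   # min-heap of (end_time, sequence_number, cores)
    schedule = {}

    for seq, task in enumerate(tasks):
        if task.cores > pilot_cores:
            raise ValueError(f"{task.name} needs more cores than the pilot holds")

        # Not enough free cores: advance time until running tasks release them.
        while free < task.cores:
            end, _, cores = heapq.heappop(running)
            now = max(now, end)
            free += cores

        schedule[task.name] = (now, now + task.duration)
        heapq.heappush(running, (now + task.duration, seq, task.cores))
        free -= task.cores

    return schedule


# A pilot holding 4 cores and a small heterogeneous workload.
tasks = [
    Task("t1", cores=1, duration=2.0),
    Task("t2", cores=1, duration=2.0),
    Task("t3", cores=2, duration=3.0),
    Task("t4", cores=2, duration=1.0),
    Task("t5", cores=4, duration=1.0),
]

schedule = simulate_pilot(4, tasks)
for name, (start, end) in schedule.items():
    print(f"{name}: start={start} end={end}")
```

In this run, t1, t2, and t3 fill the four cores and start together, t4 waits until t1 and t2 release their cores, and t5 runs last because it needs the whole pilot. None of these steps involves the batch system: the pilot's cores are simply reused as they become free.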

RP offers four unique features when compared to other pilot systems or tools that enable the execution of multi-task applications on HPC platforms: (1) concurrent execution of different types of tasks on the same pilot, e.g., single-core, OpenMP, MPI, single- and multi-GPU; (2) support for all the major HPC batch systems, e.g., slurm, torque, pbs, lsf; (3) support for more than 14 methods to launch tasks, e.g., ssh, mpirun, aprun, jsrun, prrte; and (4) a general-purpose architecture.

RADICAL-Pilot is not a static system; rather, it provides the user with a programming library (“Pilot-API”) that offers abstractions for resource access and task management. With this library, the user can develop everything from simple “submission scripts” to arbitrarily complex applications, higher-level services, and tools.

The chapter RADICAL-Pilot Overview offers more information about tasks, workloads, pilots, pilot systems, and the RP implementation. Users are strongly encouraged to read that chapter carefully before starting to use RP.