Using HPC Platforms

RADICAL-Pilot (RP) has two main components: Client and Agent. The Client component initiates and handles the process managers (rp.Session, rp.PilotManager, rp.TaskManager) as well as the pilot and task descriptions (rp.PilotDescription, rp.TaskDescription). The Agent component executes tasks within the pilot once the requested resources have been allocated. Client and Agent can run either on the same machine or on different ones.

Note: Whether Client and Agent can run on different machines (e.g., Client on a user workstation and Agent on the target HPC platform) depends on the access policy of the target platform, i.e., the platform where the computing tasks will be executed. We advise you to check the platform's user guide and the supported HPC platforms for more details.

RP provides two ways to use supported HPC platforms to execute workloads:

Launching RP application from the target platform (local access)
- Run the Client component on the platform's login nodes or within a batch job (using either an interactive session or a batch script) on compute nodes. The Agent component will run on the batch node, i.e., the launcher node within the job allocation (also called the MOM node), which can be a regular compute node if the platform doesn't provide a dedicated node type.
Launching RP application outside the target platform (remote access)
- Run the Client component on a machine that is not associated with the target platform. The Client will submit the job remotely, and the Agent will run within the job allocation in the same way as in the previous mode.

Launching from a batch job

We recommend launching RP applications from a batch job, since in this mode no RP processes run on login nodes. Some HPC platforms limit the number of processes running on login nodes and the amount of resources those processes use (see Examples of login nodes policies), and system daemons might terminate any user process that violates the corresponding rules and policies. If this is not the case for your chosen platform, feel free to launch your RP application from a login node. The downside of “launching from a batch job” is that the user has to perform one of the following steps manually: either start an interactive session (i.e., an interactive job) or submit a batch script that calls the RP application.

Note: The command to acquire an interactive job and the script language for a batch job depend on the batch system deployed on the HPC platform and on its configuration. You may therefore have to use different commands or scripts depending on the HPC platform that you want to use. See the guide for each supported HPC platform for more details.

Note: Make sure that the amount of resources specified in the rp.PilotDescription of your RP application (pd.nodes and pd.runtime) corresponds to the amount of resources requested for the batch job.
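One way to keep the two in sync is to derive the pilot values from the batch job request. The following is a minimal sketch, assuming that pd.runtime is given in minutes and that the walltime has the HH:MM:SS form used in the examples in this guide. The helper names (walltime_to_minutes, pilot_resources) are hypothetical and not part of RP.

```python
def walltime_to_minutes(walltime):
    """Convert an 'HH:MM:SS' walltime string to whole minutes.

    Seconds are dropped (rounded down), so the pilot runtime never
    exceeds the walltime requested for the batch job.
    """
    hours, minutes, _seconds = (int(part) for part in walltime.split(':'))
    return hours * 60 + minutes


def pilot_resources(job_nodes, job_walltime):
    """Return the 'nodes' and 'runtime' values matching a batch job request."""
    return {'nodes'  : job_nodes,
            'runtime': walltime_to_minutes(job_walltime)}


# values taken from a batch request such as `-N 1 -t 00:30:00`
pd_values = pilot_resources(job_nodes=1, job_walltime='00:30:00')
print(pd_values)   # {'nodes': 1, 'runtime': 30}
```

The resulting dictionary can be merged into the dictionary passed to rp.PilotDescription, next to the resource label, project and queue.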

Examples of interactive jobs

As with any job, an interactive job is queued until the specified number of nodes is available. Once the job has started as an interactive session, activate the virtual environment in which the RP package is installed and launch the RP application with python rp_application.py.

SLURM Scheduler. Initiate an interactive job with salloc. It is recommended to use salloc over srun --pty $SHELL, since srun has certain limitations regarding interactive jobs.

OLCF/ORNL Frontier

salloc -A PROJECT_NAME -p PARTITION_NAME -J JOB_NAME \
       -N 1 -t 00:30:00

NCSA Delta

salloc --account=PROJECT_NAME --partition=gpuA40x4-interactive,gpuA100x4-interactive \
       --nodes=1 --cpus-per-task=2 --gpus-per-node=1 --mem=16g --time=00:30:00

PBSPro Scheduler. The qsub command is used to request an interactive job.

ALCF/ANL Polaris

qsub -I  -A PROJECT_NAME -q PARTITION_NAME \
         -l select=1 -l filesystems=home:eagle -l walltime=00:30:00

Examples of batch scripts

Batch jobs are submitted through a batch script using the corresponding job submission command: sbatch for SLURM and qsub for PBSPro. The batch script specifies your resource requirements, the application run time, and the RP application that you want to execute.

SLURM Scheduler. Job submission: sbatch jobscript.slurm, where jobscript.slurm looks as follows:

OLCF/ORNL Frontier

#!/bin/bash

#SBATCH -A PROJECT_NAME
#SBATCH -p PARTITION_NAME
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -J JOB_NAME
#SBATCH -o %x-%j.out

source ~/ve_rp/bin/activate
python rp_application.py

PBSPro Scheduler. Job submission: qsub jobscript.pbs, where jobscript.pbs looks as follows:

ALCF/ANL Polaris

#!/bin/bash -l

#PBS -A PROJECT_NAME
#PBS -q PARTITION_NAME
#PBS -l select=4:ncpus=256
#PBS -l filesystems=home:eagle
#PBS -l walltime=0:30:00
#PBS -N JOB_NAME

source ~/ve_rp/bin/activate
python rp_application.py

Launching from a login node

Warning: Launching applications from login nodes might be restricted by platform rules and policies. Please check platform user guides regarding such restrictions (see Examples of login nodes policies).

To run your RP application on the login node of a supported HPC platform, ssh into the login node, activate the Python virtual environment (see Getting Started), and launch your RP application. RP will start the Client-related processes on the login node and keep them running until the RP application finishes. RP will submit a job on the user's behalf and will start executing tasks once the corresponding batch job starts (i.e., the pilot state rp.PMGR_ACTIVE indicates that the job has started).

ssh username@target_platform
# assuming that the virtual environment is already prepared with the RP package in it
source ~/ve_rp/bin/activate
python rp_application.py

# within `rp_application.py` for ALCF/ANL Polaris
pd = rp.PilotDescription({'resource' : 'anl.polaris',  # target platform
                          'project'  : 'PROJECT_NAME',
                          'queue'    : 'PARTITION_NAME',
                          'runtime'  : 30})
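For context, here is a minimal sketch of how such a pilot description might be used within rp_application.py. The method names (submit_pilots, add_pilots, submit_tasks, wait_tasks, pilot.wait, session.close) reflect our understanding of the RP API and are not spelled out in this section, so please verify them against the RP API reference. The task (/bin/date) and the nodes value are hypothetical placeholders.

```python
# Pilot description values for ALCF/ANL Polaris (placeholders as above)
PILOT = {'resource': 'anl.polaris',
         'project' : 'PROJECT_NAME',
         'queue'   : 'PARTITION_NAME',
         'nodes'   : 1,            # hypothetical value
         'runtime' : 30}

# Hypothetical placeholder task
TASK = {'executable': '/bin/date'}


def run_application():
    # imported here so that the definitions above can be inspected
    # on a machine where RP is not installed
    import radical.pilot as rp

    session = rp.Session()
    try:
        pmgr = rp.PilotManager(session=session)
        tmgr = rp.TaskManager(session=session)

        # RP submits the batch job on the user's behalf
        pilot = pmgr.submit_pilots(rp.PilotDescription(PILOT))
        tmgr.add_pilots(pilot)

        tmgr.submit_tasks(rp.TaskDescription(TASK))

        # rp.PMGR_ACTIVE indicates that the batch job has started
        pilot.wait(state=rp.PMGR_ACTIVE)
        print('pilot is active, tasks can start executing')

        tmgr.wait_tasks()
    finally:
        # always close the session so that Client processes are cleaned up
        session.close()


# call run_application() from an environment where RP is installed
```

Closing the session in a finally clause matters on login nodes in particular, since it avoids leaving Client processes behind if the application fails.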

Examples of login nodes policies

OLCF/ORNL Frontier:

When you connect to the system, you are placed on a login node. Login nodes are used for tasks such as code editing, compiling, etc. They are shared among all users of the system, so it is not appropriate to run tasks that are long/computationally intensive on login nodes. Users should also limit the number of simultaneous tasks on login nodes (e.g. concurrent tar commands, parallel make).

Compute-intensive, memory-intensive, or other disruptive processes running on login nodes may be killed without warning.

TACC Frontera Conduct:

Each HPC resource’s login nodes are shared amongst all users. Depending on the resource, dozens of users may be logged on at one time accessing the shared file systems. A single user running computationally expensive or disk intensive task/s will negatively impact performance for other users. Running jobs on the login nodes is one of the fastest routes to account suspension. Instead, run on the compute nodes via an interactive session (idev) or by submitting a batch job.

Launching remotely

Warning: Remote submission does not work with two-factor authentication. The target HPC platform needs to support passphrase-protected SSH keys as a login method without a second authentication factor. Usually, the user needs to reach an agreement with the system administrators of the platform to allow SSH connections from a specific IP address.

Warning: Remote submission requires the SSH connection to stay alive for the entire duration of the application run. If the SSH connection fails while the application runs, the application will fail. This can leave an orphaned RP Agent running on the HPC platform, consuming allocation and unable to properly execute any new application task. Remote submission should not be attempted from a laptop on a Wi-Fi connection, and the risk of an interrupted SSH connection grows with the time the application takes to complete.

If you can manually ssh into the target HPC platform, RP can do the same. You will have to set up an SSH key; follow this guide if you need to become more familiar with the process. RP will not work unless ssh-agent is configured, and you will need to enter your SSH key passphrase to access the HPC platform.
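Before launching an RP application remotely, it can help to verify that a non-interactive login actually works. The following is a minimal sketch of such a pre-flight check and does not reproduce how RP itself connects. The function names are hypothetical. It assumes an OpenSSH client, whose BatchMode and ConnectTimeout options are used, and relies on ssh-agent exporting the SSH_AUTH_SOCK environment variable.

```python
import os
import subprocess


def ssh_agent_configured(env=None):
    """Return True if an ssh-agent socket is advertised in the environment."""
    env = os.environ if env is None else env
    return bool(env.get('SSH_AUTH_SOCK'))


def ssh_check_command(target):
    """Build an ssh command that runs `true` on the target and exits.

    BatchMode=yes makes ssh fail instead of prompting for a passphrase
    or password, so the check cannot hang on user input.
    """
    return ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=10',
            target, 'true']


def can_login_noninteractively(target):
    """Return True if `target` accepts a key-based login without prompts."""
    if not ssh_agent_configured():
        return False
    result = subprocess.run(ssh_check_command(target), capture_output=True)
    return result.returncode == 0


# example (needs network access and a configured SSH key):
#   can_login_noninteractively('username@target_platform')
```

If the check fails, add the key to the agent with ssh-add and confirm that the platform permits key-based logins without a second factor.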

After setting up and configuring ssh, you can instruct RP to run its Client on your local workstation and its Agent on one or more HPC platforms. In remote submission mode, you need to set an access_schema that points to the corresponding endpoints for job submission and filesystem access. All other parameters stay the same as for launching from a login node.

pd = rp.PilotDescription({'resource'     : 'tacc.frontera',
                          'access_schema': 'ssh',
                          'project'      : 'PROJECT_NAME',
                          'queue'        : 'PARTITION_NAME',
                          'runtime'      : 6000})

# where `tacc.frontera` configuration has the following endpoints set:
#
#    "schemas"                     : {
#        "ssh"                     : {
#            "job_manager_endpoint": "slurm+ssh://frontera.tacc.utexas.edu/",
#            "filesystem_endpoint" : "sftp://frontera.tacc.utexas.edu/"
#        }