Using HPC Platforms

RADICAL-Pilot consists of a client and a so-called agent. Client and agent can execute on two different machines, e.g., the former on your workstation, the latter on an HPC platform’s compute node. Alternatively, the client can execute on the login node of an HPC platform and the agent on a compute node, or both client and agent can execute on a compute node.

How to deploy RADICAL-Pilot depends on the platform’s policies that regulate access to the platform (ssh, DUO, hardware token), and the amount and type of resources that can be used on a login node (usually minimal). Further, each HPC platform will require a specific resource configuration file (provided with RADICAL-Pilot) and, in some cases, some user-dependent configuration.

RADICAL-Pilot (RP) provides three ways to use supported HPC platforms to execute workloads:

  • Remote submission: users can execute their RP application from their workstation, and then RP accesses the HPC platform via ssh.

  • Interactive submission: users can submit an interactive or batch job on the HPC platform, and then execute RP from a compute node.

  • Login submission: users can ssh into the login node of the HPC platform, and then launch their RP application from that shell.

Remote submission

Warning: Remote submission does not work with two-factor authentication. Target HPC platforms need to support passphrase-protected ssh keys as a login method, without requiring a second authentication factor. Usually, the user needs to reach an agreement with the system administrators of the platform to allow ssh connections from a specific IP address. Putting such an agreement in place ranges from difficult to impossible, and requires a fixed IP address.

Warning: Remote submissions require an ``ssh`` connection to stay alive for the entire duration of the application. If the ssh connection fails while the application executes, the application will fail. This has the potential of leaving an orphan RP agent running on the HPC platform, consuming allocation and failing to properly execute any new application task. Remote submissions should not be attempted from a laptop on a Wi-Fi connection, and the risk of interrupting the ssh connection increases with the time the application takes to complete.

If you can manually ssh into the target HPC platform, RADICAL-Pilot can do the same. You will have to set up an ssh key; if you are not familiar with the process, follow, for example, this guide.

Note: RADICAL-Pilot will not work without a configured ssh-agent, and it will require entering the user’s ssh key passphrase to access the HPC platform.
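
For reference, a minimal key and agent setup might look as follows; the key type, file name, and target host are illustrative defaults, and some platforms require you to install the public key through a web portal instead of ssh-copy-id:

ssh-keygen -t ed25519                            # create a passphrase-protected key pair
eval "$(ssh-agent -s)"                           # start the ssh-agent in this shell
ssh-add ~/.ssh/id_ed25519                        # register the key; prompts for the passphrase
ssh-copy-id username@frontera.tacc.utexas.edu    # install the public key on the platform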

After setting up and configuring ssh, you will be able to instruct RP to run its client on your local workstation and its agent on one or more HPC platforms. With the remote submission mode, you:

  1. Create a pilot description object;

  2. Specify the RP resource ID of the supported HPC platform;

  3. Specify the access schema you want to use to access that platform.

[1]:
import radical.pilot as rp

session = rp.Session()
pd_init = {'resource'     : 'tacc.frontera',
           'access_schema': 'ssh'
          }

pdesc = rp.PilotDescription(pd_init)
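
For context, a pilot description alone does not launch anything. The following is a minimal sketch of how such a description is typically submitted and used to execute a task via RP’s pilot and task managers; the variable names and the /bin/date executable are only placeholders:

pmgr  = rp.PilotManager(session=session)
tmgr  = rp.TaskManager(session=session)

# Submit the pilot: RP connects to the platform (here via ssh) and
# bootstraps its agent on the acquired resources.
pilot = pmgr.submit_pilots(pdesc)
tmgr.add_pilots(pilot)

# Run a single placeholder task on the pilot and wait for its completion.
td   = rp.TaskDescription({'executable': '/bin/date'})
task = tmgr.submit_tasks(td)
tmgr.wait_tasks()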

Note: For a list of supported HPC platforms, see List of Supported Platforms. Resource configuration files are located at radical/pilot/configs/ in the RADICAL-Pilot git repository.

Interactive submission

Users can perform an interactive submission of an RP application on a supported HPC platform in two ways:

  • Submitting an interactive job to the batch system to acquire a shell and then executing the RP application from that shell.

  • Submitting a batch script to the batch system that, once scheduled, will execute the RP application.

Note: The command to acquire an interactive job and the script language used to write a batch job depend on the batch system deployed on the HPC platform and on its configuration. That means that you may have to use different commands or scripts depending on the HPC platform that you want to use. See the guide for each supported HPC platform for more details.

Configuring an RP application for interactive submission

You will need to set the access_schema in your pilot description to interactive. All the other parameters of your application remain the same and are independent of how you execute your RP application. For example, assume that your application requires 4096 cores, will terminate in 10 hours, and you want to execute it on TACC Frontera. To run it from an interactive job, you will have to use the following pilot description:

[2]:
pd_init = {'resource'     : 'tacc.frontera',
           'access_schema': 'interactive',
           'runtime'      : 600,  # minutes, i.e., 10 hours
           'exit_on_error': True,
           'project'      : 'myproject',
           'queue'        : 'normal',
           'cores'        : 4096,
           'gpus'         : 0
          }

pdesc = rp.PilotDescription(pd_init)
session.close(cleanup=True)

Submitting an interactive job

To run an RP application with that pilot description in interactive computing mode, you must request the amount and type of resources needed to execute your application. That means that, if your application requires N cores and M GPUs, you will have to submit an interactive job requesting enough nodes to provide those N cores and M GPUs. Consult the user guide of the resource you want to use to find out how many cores/GPUs each compute node has.
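
As a quick sanity check, the required node count can be computed with a short Python sketch (numbers taken from the Frontera example below):

import math

cores_required = 4096          # cores needed by the application
cores_per_node = 56            # cores per Cascade Lake (CLX) node on Frontera

nodes = math.ceil(cores_required / cores_per_node)
print(nodes, nodes * cores_per_node)   # 74 nodes, providing 4144 cores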

For our example application, you will need to do the following:

  1. ssh into Frontera’s login node. To find out Frontera’s FQDN, check its user guide.

  2. Check how many nodes you need on Frontera to get at least 4096 cores. Following the user guide, each Cascade Lake (CLX) compute node of Frontera has 56 cores. Thus, you will need 74 nodes (you may want to consider whether your application could scale to use all the available 4144 cores).

  3. Find on Frontera’s user guide the command and the options required to submit an interactive job.

  4. Issue the appropriate command, in our case, assuming that your application will take no more than 10 hours to complete:

idev -p normal -N 74 -n 56 -m 600
  5. Once your job is scheduled and returns a shell, execute your RP application from that shell, e.g. with:

python3 -m venv ~/ve/my_rp_ve
. ~/ve/my_rp_ve/bin/activate
pip install radical.pilot     # install RP into the fresh environment
python3 my_application.py

Submitting a batch job

To run RP in a batch job, you must create a batch script that specifies your resource requirements, the application execution time, and the RP application that you want to execute. Continuing the example above, the following script could be used on TACC Frontera:

#!/bin/bash

#SBATCH -J myjob           # Job name
#SBATCH -o myjob.o%j       # Name of stdout output file
#SBATCH -e myjob.e%j       # Name of stderr error file
#SBATCH -p normal          # Queue (partition) name
#SBATCH -N 74              # Total # of nodes
#SBATCH -n 56              # Total # of mpi tasks
#SBATCH -t 10:00:00        # Run time (hh:mm:ss)
#SBATCH --mail-type=all    # Send email at begin and end of job
#SBATCH -A myproject       # Project/Allocation name (req'd if you have more than 1)

. ~/ve/my_rp_ve/bin/activate
python3 my_application.py

Once saved into a file named myjobscript.sbatch, you can submit your batch job on Frontera with:

sbatch myjobscript.sbatch
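
After submission, you can monitor your job with the standard Slurm tools, for example:

squeue -u $USER    # list your pending and running jobs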

Login submission

Warning: Very likely, a login submission will violate the login node usage policies, and your application will be killed by the system administrators. Login submission should be used as a last resort, only when neither remote nor interactive submission is available.

To run your RP application on the login node of a supported HPC platform, you will need to ssh into the login node, load the Python environment, and execute your RP application. For the example application above, you would do the following:

ssh username@frontera.tacc.utexas.edu
python3 -m venv ~/ve/my_rp_ve
. ~/ve/my_rp_ve/bin/activate
pip install radical.pilot     # install RP into the fresh environment
python3 my_application.py

But you would be breaching the login node usage policies on Frontera.