Configuration System

RADICAL-Pilot (RP) uses a configuration system to set control and management parameters for the initialization of its components and to define resource entry points for the target platform.

It includes:

  • Run description

    • Resource label for a target platform configuration file;

    • Project allocation name (i.e., account/project, specific for HPC platforms);

    • Job queue name (i.e., queue/partition, specific for HPC platforms);

    • Amount of resources (e.g., cores, GPUs, memory) to allocate for the runtime period;

    • Mode to access the target platform (e.g., local, ssh, batch/interactive).

  • Target platform description

    • Batch system (e.g., SLURM, LSF, etc.);

    • Provided launch methods (e.g., SRUN, MPIRUN, etc.);

    • Environment setup (including package manager, working directory, etc.);

    • Entry points: batch system URL, file system URL.

Run description

Users have to describe at least one pilot in each RP application. That is done by instantiating a radical.pilot.PilotDescription object. Among that object’s attributes, resource is mandatory and is referred to as a resource label (or platform ID), which corresponds to a target platform configuration file (see the section Platform description). Users need to know which ID corresponds to the HPC platform on which they want to execute their RP application.

Allocation parameters

Every run should explicitly state the project name (i.e., allocation account), the preferred queue for job submission, and the amount of required resources, unless it is a local run that does not access any batch system.

import radical.pilot as rp

pd = rp.PilotDescription({
    'resource': 'ornl.frontier',  # platform ID
    'project' : 'XYZ000',         # allocation account
    'queue'   : 'debug',          # optional (default value might be set in the platform description)
    'cores'   : 32,               # amount of CPU slots
    'gpus'    : 8,                # amount of GPU slots
    'runtime' : 15                # maximum runtime for a pilot (in minutes)
})
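
For a local run that does not access a batch system, a minimal description is sufficient. The snippet below is a sketch only, assuming the predefined local.localhost platform configuration that ships with RP:

pd_local = rp.PilotDescription({
    'resource': 'local.localhost',  # predefined local platform (assumed available)
    'cores'   : 4,                  # CPU slots on the local machine
    'runtime' : 10                  # maximum runtime for a pilot (in minutes)
})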

Resource access schema

The resource access schema (pd.access_schema) is provided as part of a platform description; if more than one access schema is available, users can select a specific one in radical.pilot.PilotDescription (see the sketch after this list). The following schemas may be available, depending on the target platform:

  • local: launch RP application from the target platform (e.g., login nodes of the specific machine).

  • ssh: launch RP application outside the target platform and use ssh protocol and corresponding SSH client to access the platform remotely.

  • gsissh: launch RP application outside the target platform and use GSI-enabled SSH to access the platform remotely.

  • interactive: launch RP application from the target platform within the interactive session after being placed on allocated resources (e.g., batch or compute nodes).

  • batch: launch RP application by a submitted batch script at the target platform.
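
As a sketch of selecting a schema explicitly, the pilot description below extends the earlier example with an access_schema value (the ssh schema is used purely for illustration; check the target platform’s configuration file for the schemas it actually provides):

import radical.pilot as rp

pd = rp.PilotDescription({
    'resource'     : 'ornl.frontier',
    'project'      : 'XYZ000',
    'queue'        : 'debug',
    'cores'        : 32,
    'runtime'      : 15,
    'access_schema': 'ssh'    # one of the schemas defined for the platform
})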

Note: For details on submission of applications on HPC see the tutorial Using RADICAL-Pilot on HPC Platforms.

Platform description

RADICAL-Pilot uses configuration files to keep track of supported platforms. Each configuration file identifies a facility (e.g., ACCESS, TACC, ORNL, ANL, etc.), is written in JSON, and is named following the resource_<facility_name>.json convention. Each facility configuration file contains a set of platform names/labels with corresponding configuration parameters. The resource label (or platform ID) follows the <facility_name>.<platform_name> convention, and users set it as the resource attribute of their radical.pilot.PilotDescription object.

Predefined configurations

The RADICAL-Pilot development team maintains a growing set of predefined configuration files for supported HPC platforms (see the list of platform descriptions in RP’s GitHub repository).

For example, if users want to execute their RP application on Frontera, they will have to search for the resource_tacc.json file and, inside that file, for the key(s) that start with the name frontera. The file resource_tacc.json contains the keys frontera, frontera_rtx, and frontera_prte. Each key identifies a specific set of configuration parameters: frontera offers a general-purpose set of configuration parameters; frontera_rtx enables the use of the rtx queue for GPU nodes; and frontera_prte enables the use of the PRTE-based launch method to execute the application’s tasks. Thus, for Frontera, the value for resource will be tacc.frontera, tacc.frontera_rtx or tacc.frontera_prte.
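
With RP installed, a predefined configuration can also be inspected programmatically, using the same helper that appears later in this tutorial. The sketch below assumes the tacc.frontera entry is available in the installed package:

import radical.pilot as rp

frontera_cfg = rp.utils.get_resource_config(resource='tacc.frontera')
print(frontera_cfg['resource_manager'])   # batch system type
print(frontera_cfg['cores_per_node'])     # CPU cores per compute node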

Customizing a predefined configuration

Users can customize existing platform configurations by overwriting existing key/value pairs with ones from configuration files that have the same names but are located in user space. The default location for user-defined configuration files is $HOME/.radical/pilot/configs/.

Note: To change the location of user-defined platform configuration files, use the environment variable RADICAL_CONFIG_USER_DIR, which replaces HOME in the path above. Make sure that the corresponding path exists before creating configuration files there.

Two examples of customized configurations are below: (i) for ornl.summit, the parameter system_architecture.options is changed, and (ii) for tacc.frontera, MPIEXEC is set as the default launch method. With these files in place, every pilot description using 'resource': 'ornl.summit' or 'resource': 'tacc.frontera' would use the new values. The changed parameters are described in the following section.

resource_ornl.json

{
    "summit": {
        "system_architecture": {
            "options": ["gpumps", "gpudefault"]
        }
    }
}

resource_tacc.json

{
    "frontera": {
        "launch_methods": {
            "order"  : ["MPIEXEC"],
            "MPIEXEC": {}
        }
    }
}
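
To make RP pick up such a customization, the file has to be saved under the user configuration directory with the same name as the predefined one. The snippet below is a minimal sketch, assuming the default location (or RADICAL_CONFIG_USER_DIR, if set):

import os
import radical.utils as ru

# resolve the user configuration directory (RADICAL_CONFIG_USER_DIR overrides HOME)
base    = os.environ.get('RADICAL_CONFIG_USER_DIR') or os.path.expanduser('~')
cfg_dir = os.path.join(base, '.radical/pilot/configs')
os.makedirs(cfg_dir, exist_ok=True)

# overwrite only the listed keys of the predefined "frontera" entry
frontera_override = {'frontera': {'launch_methods': {'order'  : ['MPIEXEC'],
                                                     'MPIEXEC': {}}}}
ru.write_json(frontera_override, os.path.join(cfg_dir, 'resource_tacc.json'))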

User-defined configuration

Users can write a whole new configuration for an existing or a new platform with an arbitrary platform ID. In the example below, you will create a custom platform configuration entry (frontera_tutorial) that will later be saved to a resource_tacc.json file in user space. That file will be loaded into RP’s radical.pilot.Session object alongside the other configurations for TACC-related platforms.

[1]:
resource_tacc_tutorial = \
{
    "frontera_tutorial":
    {
        "description"                 : "Short description of the resource",
        "notes"                       : "Notes about resource usage",

        "default_schema"              : "local",
        "schemas"                     : {
            "local"                   : {
                "job_manager_endpoint": "slurm://frontera.tacc.utexas.edu/",
                "filesystem_endpoint" : "file://frontera.tacc.utexas.edu/"
            },
            "ssh"                     : {
                "job_manager_endpoint": "slurm+ssh://frontera.tacc.utexas.edu/",
                "filesystem_endpoint" : "sftp://frontera.tacc.utexas.edu/"
            },
            "batch"                   : {
                "job_manager_endpoint": "fork://localhost/",
                "filesystem_endpoint" : "file://localhost/"
            },
            "interactive"             : {
                "job_manager_endpoint": "fork://localhost/",
                "filesystem_endpoint" : "file://localhost/"
            },
        },

        "default_queue"               : "production",
        "resource_manager"            : "SLURM",

        "cores_per_node"              : 56,
        "gpus_per_node"               : 0,
        "system_architecture"         : {
                                         "smt"           : 1,
                                         "options"       : ["nvme", "intel"],
                                         "blocked_cores" : [],
                                         "blocked_gpus"  : []
                                        },

        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "default_remote_workdir"      : "$HOME",

        "pre_bootstrap_0"             : [
                                        "module unload intel impi",
                                        "module load   intel impi",
                                        "module load   python3/3.9.2"
                                        ],
        "launch_methods"              : {
                                         "order"  : ["MPIRUN"],
                                         "MPIRUN" : {
                                             "pre_exec_cached": [
                                                 "module load TACC"
                                             ]
                                         }
                                        },

        "python_dist"                 : "default",
        "virtenv_mode"                : "local"
    }
}

Each field is defined as follows:

  • description (optional) - human-readable description of the platform.

  • notes (optional) - information needed to form valid pilot descriptions, such as what parameters are required, etc.

  • schemas - allowed values for the pd.access_schema attribute of the pilot description; default_schema names the schema used when none is set explicitly. For each schema, a subsection is needed, which specifies job_manager_endpoint and filesystem_endpoint.

  • job_manager_endpoint - access URL for pilot submission (interpreted by RADICAL-SAGA).

  • filesystem_endpoint - access URL for file staging (interpreted by RADICAL-SAGA).

  • default_queue (optional) - queue name to be used for pilot submission to a corresponding batch system (see job_manager_endpoint).

  • resource_manager - the type of job management system. Valid values are: CCM, COBALT, FORK, LSF, PBSPRO, SLURM, TORQUE, YARN.

  • cores_per_node (optional) - number of available CPU cores per compute node. If not provided, it will be discovered by RADICAL-SAGA and by the Resource Manager in RADICAL-Pilot.

  • gpus_per_node (optional) - number of available GPUs per compute node. If not provided, it will be discovered by RADICAL-SAGA and by the Resource Manager in RADICAL-Pilot.

  • system_architecture (optional) - set of options that describe platform features:

    • smt - Simultaneous MultiThreading (i.e., threads per physical core). If not provided, the default value 1 is used. It can be overridden with the environment variable RADICAL_SMT, exported before running the RADICAL-Pilot application. RADICAL-Pilot uses cores_per_node x smt to calculate the number of available cores/CPUs per node.

    • options - list of job management system-specific attributes/constraints, which are passed to RADICAL-SAGA.

      • COBALT uses the --attrs option, e.g., filesystems=home,grand for file systems, mcdram=flat for MCDRAM mode, and numa=quad for NUMA mode;

      • LSF uses the -alloc_flags option to support, e.g., gpumps and nvme;

      • PBSPRO uses the -l option, e.g., filesystems=grand:home for file systems and place=scatter for placement;

      • SLURM uses the --constraint option for compute node filtering.

    • blocked_cores - list of indices of CPU cores that the RADICAL-Pilot Scheduler does not use for task assignment.

    • blocked_gpus - list of indices of GPUs that the RADICAL-Pilot Scheduler does not use for task assignment.

  • agent_config - configuration file for the RADICAL-Pilot Agent (the default value default corresponds to the file agent_default.json).

  • agent_scheduler - Scheduler in RADICAL-Pilot (default value is CONTINUOUS).

  • agent_spawner - Executor in RADICAL-Pilot, which spawns task execution processes (default value is POPEN).

  • default_remote_workdir (optional) - directory for the agent sandbox (see the tutorials Getting Started and Staging Data with RADICAL-Pilot). If not provided, the current directory ($PWD) is used.

  • forward_tunnel_endpoint (optional) - name of the host that can be used to create SSH tunnels from the compute nodes to the outside of the platform.

  • pre_bootstrap_0 (optional) - list of commands executed during the bootstrapping process that launches the RADICAL-Pilot Agent.

  • pre_bootstrap_1 (optional) - list of commands to execute for the initialization of sub-agents, which are used to run additional instances of RADICAL-Pilot components such as the Executor and Stager.

  • launch_methods - set of supported launch methods. Valid values are APRUN, CCMRUN, FLUX, FORK, IBRUN, JSRUN (JSRUN_ERF), MPIEXEC (MPIEXEC_MPT), MPIRUN (MPIRUN_CCMRUN, MPIRUN_DPLACE, MPIRUN_MPT, MPIRUN_RSH), PRTE, RSH, SRUN, SSH. For each launch method, a subsection is needed, which specifies pre_exec_cached with a list of commands to be executed to configure the launch method, and method-related options (e.g., dvm_count for PRTE).

    • order - sets the order in which launch methods are selected for task placement (the first value in the list is the default launch method).

  • python_dist - python distribution. Valid values are default and anaconda.

  • virtenv_mode - defines how the bootstrapping process sets up the environment for the RADICAL-Pilot Agent:

    • create - create a python virtual environment from scratch;

    • recreate - delete the existing virtual environment and build it from scratch, if not found then create;

    • use - use the existing virtual environment, if not found then create;

    • update - update the existing virtual environment, if not found then create (default);

    • local - use the client’s existing virtual environment (the environment from which the RADICAL-Pilot application was launched).

  • virtenv (optional) - path to, or name of, an existing virtual environment with the pre-installed RCT stack; use it only when virtenv_mode=use.

  • rp_version - how RADICAL-Pilot is installed or reused:

    • local - install from tarballs created from the client’s existing environment (default);

    • release - install the latest released version from PyPI;

    • installed - do not install; the target virtual environment already provides it.

Examples

Note: In our examples, we do not show a progress bar while waiting for an operation to complete, e.g., while waiting for a pilot to stop. That is because the progress bar offered by RP’s reporter does not work within a notebook. You can use it when executing an RP application as a standalone Python script.

[2]:
%env RADICAL_REPORT_ANIME=FALSE
env: RADICAL_REPORT_ANIME=FALSE
[3]:
# ensure that the location for user-defined configurations exists
!mkdir -p "${RADICAL_CONFIG_USER_DIR:-$HOME}/.radical/pilot/configs/"
[4]:
import os

import radical.pilot as rp
import radical.utils as ru

With the next steps, you will save the configuration created earlier into the file resource_tacc.json, located in user space. You will then read that file back and print some of its attributes to confirm that they are in place.

[5]:
# save earlier defined platform configuration into the user-space
ru.write_json(resource_tacc_tutorial, os.path.join(os.path.expanduser('~'), '.radical/pilot/configs/resource_tacc.json'))
[6]:
tutorial_cfg = rp.utils.get_resource_config(resource='tacc.frontera_tutorial')

for attr in ['schemas', 'resource_manager', 'cores_per_node', 'system_architecture']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))
schemas              : {'batch': Config: {'filesystem_endpoint': 'file://localhost/', 'job_manager_endpoint': 'fork://localhost/'}, 'interactive': Config: {'filesystem_endpoint': 'file://localhost/', 'job_manager_endpoint': 'fork://localhost/'}, 'local': Config: {'filesystem_endpoint': 'file://frontera.tacc.utexas.edu/', 'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/'}, 'ssh': Config: {'filesystem_endpoint': 'sftp://frontera.tacc.utexas.edu/', 'job_manager_endpoint': 'slurm+ssh://frontera.tacc.utexas.edu/'}}
resource_manager     : SLURM
cores_per_node       : 56
system_architecture  : {'blocked_cores': [], 'blocked_gpus': [], 'options': ['nvme', 'intel'], 'smt': 1}
[7]:
print('job_manager_endpoint : ', rp.utils.get_resource_job_url(resource='tacc.frontera_tutorial', schema='ssh'))
print('filesystem_endpoint  : ', rp.utils.get_resource_fs_url (resource='tacc.frontera_tutorial', schema='ssh'))
job_manager_endpoint :  slurm+ssh://frontera.tacc.utexas.edu/
filesystem_endpoint  :  sftp://frontera.tacc.utexas.edu/
[8]:
session = rp.Session()
pmgr    = rp.PilotManager(session=session)
[9]:
tutorial_cfg = session.get_resource_config(resource='tacc.frontera_tutorial', schema='batch')
for attr in ['label', 'launch_methods', 'job_manager_endpoint', 'filesystem_endpoint']:
    print('%-20s : %s' % (attr, ru.as_dict(tutorial_cfg[attr])))
label                : tacc.frontera_tutorial
launch_methods       : {'MPIRUN': {'pre_exec_cached': ['module load TACC']}, 'order': ['MPIRUN']}
job_manager_endpoint : fork://localhost/
filesystem_endpoint  : file://localhost/

The platform description created above is also available within the radical.pilot.Session object. Let’s confirm that the newly created resource description is within the session. The Session object holds all provided platform configurations (both pre- and user-defined ones); thus, for a pilot, you just need to select a particular configuration and a corresponding access schema (as part of the pilot description).

[10]:
pd = rp.PilotDescription({
    'resource'     : 'tacc.frontera_tutorial',
    'project'      : 'XYZ000',
    'queue'        : 'production',
    'cores'        : 56,
    'runtime'      : 15,
    'access_schema': 'batch',
    'exit_on_error': False
})

pilot = pmgr.submit_pilots(pd)
[11]:
from pprint import pprint

pprint(pilot.as_dict())
{'client_sandbox': '/home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/stable/docs/source/tutorials',
 'description': {'access_schema': 'batch',
                 'app_comm': [],
                 'candidate_hosts': [],
                 'cleanup': False,
                 'cores': 56,
                 'exit_on_error': False,
                 'gpus': 0,
                 'input_staging': [],
                 'job_name': None,
                 'layout': 'default',
                 'memory': 0,
                 'nodes': 0,
                 'output_staging': [],
                 'prepare_env': {},
                 'project': 'XYZ000',
                 'queue': 'production',
                 'resource': 'tacc.frontera_tutorial',
                 'runtime': 15,
                 'sandbox': None,
                 'services': [],
                 'uid': None},
 'endpoint_fs': 'file://localhost/',
 'js_hop': 'fork://localhost/',
 'js_url': 'fork://localhost/',
 'log': None,
 'pilot_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox/rp.session.907d1adc-fd44-11ee-9ee6-0242ac110002/pilot.0000/',
 'pmgr': 'pmgr.0000',
 'resource': 'tacc.frontera_tutorial',
 'resource_details': None,
 'resource_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox',
 'resources': None,
 'session': 'rp.session.907d1adc-fd44-11ee-9ee6-0242ac110002',
 'session_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox/rp.session.907d1adc-fd44-11ee-9ee6-0242ac110002',
 'state': 'PMGR_LAUNCHING',
 'stderr': None,
 'stdout': None,
 'type': 'pilot',
 'uid': 'pilot.0000'}

After exploring the pilot setup and configuration, we close the session.

[12]:
session.close(cleanup=True)