Configuration System
RADICAL-Pilot (RP) uses a configuration system to set control and management parameters for the initialization of its components and to define resource entry points for the target platform.
It includes:
- Run description:
  - Resource label for a target platform configuration file;
  - Project allocation name (i.e., account/project) - specific for HPC platforms;
  - Job queue name (i.e., queue/partition) - specific for HPC platforms;
  - Amount of the resources (e.g., cores, gpus, memory) to allocate for the runtime period;
  - Mode to access the target platform (e.g., local, ssh) - optional, default is “local”.
- Platform description:
  - Batch system (e.g., SLURM, LSF, etc.);
  - Provided launch methods (e.g., SRUN, MPIRUN, etc.);
  - Environment setup (including package manager, working directory, etc.);
  - Entry points: batch system URL, file system URL.
Run description
Users have to describe at least one pilot in each RP application. That is done by instantiating an rp.PilotDescription object. Among that object’s attributes, pd.resource is mandatory; it is referred to as a resource label (or platform ID) and corresponds to a target platform configuration file (see the section Platform description). Users need to know which ID corresponds to the HPC platform on which they want to execute their RP application.
Allocation parameters
Every run should explicitly state the project name (i.e., allocation account), the preferred queue for job submission, and the amount of required resources, unless it runs on localhost without accessing any batch system.
import radical.pilot as rp
pd = rp.PilotDescription({
    'resource': 'ornl.frontier', # platform ID
    'project' : 'XYZ000',        # allocation account
    'queue'   : 'debug',         # optional (default value is in the platform description)
    'cores'   : 32,              # amount of CPU slots
    'gpus'    : 8,               # amount of GPU slots
    'runtime' : 15               # maximum runtime for a pilot (in minutes)
})
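The rule above can be turned into a small pre-flight check on a plain dictionary before it is handed to rp.PilotDescription. This is only a sketch: the helper allocation_problems is hypothetical (it is not part of RADICAL-Pilot), and the label local.localhost is assumed here to denote a run without a batch system.

```python
# Hypothetical pre-flight check; not part of the RADICAL-Pilot API.
LOCAL_LABEL = 'local.localhost'          # assumed label for a run without a batch system
RESOURCE_KEYS = ('cores', 'gpus', 'memory')


def allocation_problems(pd_dict):
    """Return a list of human-readable problems found in a pilot description dict."""
    problems = []
    if 'resource' not in pd_dict:
        problems.append('resource (platform ID) is mandatory')
    if pd_dict.get('resource') == LOCAL_LABEL:
        return problems                  # local run: nothing else has to be stated
    if not pd_dict.get('project'):
        problems.append('project (allocation account) is missing')
    if not any(pd_dict.get(key) for key in RESOURCE_KEYS):
        problems.append('no resource amount (cores, gpus or memory) is given')
    return problems


complete = {'resource': 'ornl.frontier', 'project': 'XYZ000',
            'cores': 32, 'gpus': 8, 'runtime': 15}
print(allocation_problems(complete))
print(allocation_problems({'resource': 'ornl.frontier'}))
```

The queue is deliberately not checked, since the platform description can supply a default for it.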
Resource access schema
The resource access schema (pd.access_schema) defines a set of endpoints for job submission and file system access. It is provided as part of a platform description; if a platform offers more than one access schema, users can select a specific one in rp.PilotDescription. Check schema availability per target platform:
Launching an RP application from the target platform:
local (default) - allows running the application from login nodes of the specific machine, from compute nodes within an interactive session, or within a batch script.
Launching an RP application outside the target platform:
ssh - use the SSH protocol and a corresponding SSH client to access the platform remotely.
gsissh - use GSI-enabled SSH to access the platform remotely.
Warning: Most platforms do not allow remote access for job submission, so users should not set this parameter and should rely on the default value. For details on submitting applications on HPC platforms, see the tutorial Using RADICAL-Pilot on HPC Platforms.
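To make the selection logic concrete, here is a minimal sketch of choosing a schema from a platform description held as a plain dictionary. The helper pick_schema is hypothetical and is not how RP resolves schemas internally; the key names (schemas, default_schema, job_manager_endpoint, filesystem_endpoint) follow the platform description format shown in this tutorial, and the fallback to the first listed schema is an assumption.

```python
# Hypothetical helper; not part of the RADICAL-Pilot API.
def pick_schema(platform_cfg, requested=None):
    """Return the name of the access schema to use for a platform."""
    schemas = platform_cfg['schemas']
    if requested is None:
        # Assumption: prefer an explicit default, else the first schema listed.
        return platform_cfg.get('default_schema') or next(iter(schemas))
    if requested not in schemas:
        raise ValueError('schema %r is not available, choose from %s'
                         % (requested, sorted(schemas)))
    return requested


platform_cfg = {
    'default_schema': 'local',
    'schemas': {
        'local': {'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/',
                  'filesystem_endpoint' : 'file://frontera.tacc.utexas.edu/'},
        'ssh'  : {'job_manager_endpoint': 'slurm+ssh://frontera.tacc.utexas.edu/',
                  'filesystem_endpoint' : 'sftp://frontera.tacc.utexas.edu/'},
    },
}

chosen = pick_schema(platform_cfg, 'ssh')
print(chosen, platform_cfg['schemas'][chosen]['job_manager_endpoint'])
```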
Platform description
RADICAL-Pilot uses configuration files to keep track of supported platforms. Each configuration file identifies a facility (e.g., ACCESS, TACC, ORNL, ANL, etc.), is written in JSON, and is named following the resource_<facility_name>.json convention. Each facility configuration file contains a set of platform names/labels with corresponding configuration parameters. The resource label (or platform ID) follows the <facility_name>.<platform_name> convention, and users set it as the pd.resource attribute of their rp.PilotDescription object.
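The two naming conventions can be illustrated with a few lines of Python. The function split_resource_label below is a hypothetical helper written for this tutorial, not an RP call; it only applies the string conventions described above.

```python
# Hypothetical helper; it only mirrors the naming conventions described above.
def split_resource_label(label):
    """Map '<facility_name>.<platform_name>' to (config file name, platform key)."""
    facility, sep, platform = label.partition('.')
    if not sep or not facility or not platform:
        raise ValueError('expected <facility_name>.<platform_name>, got %r' % label)
    return 'resource_%s.json' % facility, platform


print(split_resource_label('ornl.frontier'))
print(split_resource_label('tacc.frontera_rtx'))
```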
Predefined configurations
The RADICAL-Pilot development team maintains a growing set of predefined configuration files for supported HPC platforms (see the list of platform descriptions in RP’s GitHub repo).
For example, if users want to execute their RP application on Frontera, they will have to search for the resource_tacc.json file and, inside that file, for the key(s) that start with the name frontera. The file resource_tacc.json contains the keys frontera, frontera_rtx, and frontera_prte. Each key identifies a specific set of configuration parameters:
frontera offers a general-purpose set of configuration parameters; frontera_rtx enables the use of the rtx queue for GPU nodes; and frontera_prte enables the use of the PRRTE-based launch method to execute the application’s tasks. Thus, for Frontera, the value for pd.resource will be tacc.frontera, tacc.frontera_rtx or tacc.frontera_prte.
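The lookup described above amounts to filtering the keys of a facility file by prefix. The sketch below uses an in-memory stand-in for resource_tacc.json with empty entries; the key other_platform is a made-up placeholder, and labels_for is a hypothetical helper.

```python
# Stand-in for the content of resource_tacc.json: only the keys matter here.
# 'other_platform' is a made-up placeholder, not a real entry.
tacc_cfg = {'frontera': {}, 'frontera_rtx': {}, 'frontera_prte': {},
            'other_platform': {}}


def labels_for(facility, facility_cfg, prefix):
    """List the platform IDs of a facility whose key starts with `prefix`."""
    return sorted('%s.%s' % (facility, name)
                  for name in facility_cfg if name.startswith(prefix))


for label in labels_for('tacc', tacc_cfg, 'frontera'):
    print(label)
```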
Customizing a predefined configuration
Users can customize an existing platform configuration by overriding its key/value pairs with those from a configuration file that has the same name but is located in user space. The default location of user-defined configuration files is $HOME/.radical/pilot/configs/.
Note: To change the location of user-defined platform configuration files, set the environment variable RADICAL_CONFIG_USER_DIR, which will be used instead of the environment variable HOME in the path above. Make sure that the corresponding path exists before creating configs there.
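The path resolution described in this note can be sketched as follows. The helper user_config_dir is hypothetical; it only reproduces the rule stated above (RADICAL_CONFIG_USER_DIR replaces HOME when set) and takes an optional mapping so that it can be tried without touching the real environment.

```python
import os


# Hypothetical helper; it mirrors the rule stated in the note above.
def user_config_dir(environ=None):
    """Return the directory where user-defined platform configs are expected."""
    environ = os.environ if environ is None else environ
    base = environ.get('RADICAL_CONFIG_USER_DIR') \
        or environ.get('HOME') \
        or os.path.expanduser('~')
    return os.path.join(base, '.radical', 'pilot', 'configs')


print(user_config_dir({'HOME': '/home/alice'}))
print(user_config_dir({'HOME': '/home/alice',
                       'RADICAL_CONFIG_USER_DIR': '/scratch/alice'}))
```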
Two examples of customized configurations are shown below: (i) one for ornl.frontier changes the parameter system_architecture.options, and (ii) another for tacc.frontera sets MPIEXEC as the default launch method. With those files in place, every pilot description using pd.resource = 'ornl.frontier' or pd.resource = 'tacc.frontera' would use the new values. The changed parameters are described in the following section.
resource_ornl.json
{
    "frontier": {
        "system_architecture": {
            "options": ["nvme"]
        }
    }
}
resource_tacc.json
{
    "frontera": {
        "launch_methods": {
            "order"  : ["MPIEXEC"],
            "MPIEXEC": {}
        }
    }
}
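One way to picture how a user-space file overrides a predefined one is a recursive dictionary overlay in which user values win. The sketch below is an illustration under that assumption only; RP's actual merge logic may differ in details (for example, in how nested sections are combined), and the predefined entry shown is a simplified stand-in, not the real content of resource_tacc.json.

```python
import copy


def overlay(base, override):
    """Return a copy of `base` with values from `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)    # descend into sections
        else:
            merged[key] = copy.deepcopy(value)           # user value wins
    return merged


# Simplified stand-in for a predefined entry (values are illustrative).
predefined = {'frontera': {'default_queue' : 'production',
                           'launch_methods': {'order' : ['MPIRUN'],
                                              'MPIRUN': {}}}}
# User-space customization, as in the resource_tacc.json example above.
user_cfg   = {'frontera': {'launch_methods': {'order'  : ['MPIEXEC'],
                                              'MPIEXEC': {}}}}

merged = overlay(predefined, user_cfg)
print(merged['frontera']['launch_methods']['order'])
print(merged['frontera']['default_queue'])
```

Keys that the user file does not mention, such as default_queue here, keep their predefined values.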
User-defined configuration
Users can write a whole new configuration for an existing or a new platform with an arbitrary platform ID. For example, you will create a custom platform configuration entry, saved locally as resource_tacc.json. That file will be loaded into the rp.Session object alongside the other configurations for TACC-related platforms.
[1]:
resource_tacc_tutorial = \
{
    "frontera_tutorial":
    {
        "description"                 : "Short description of the resource",
        "notes"                       : "Notes about resource usage",
        "default_schema"              : "local",
        "schemas"                     : {
            "local"                   : {
                "job_manager_endpoint": "slurm://frontera.tacc.utexas.edu/",
                "filesystem_endpoint" : "file://frontera.tacc.utexas.edu/"
            },
            "ssh"                     : {
                "job_manager_endpoint": "slurm+ssh://frontera.tacc.utexas.edu/",
                "filesystem_endpoint" : "sftp://frontera.tacc.utexas.edu/"
            },
            "no_submission"           : {
                "job_manager_endpoint": "fork://localhost/",
                "filesystem_endpoint" : "file://localhost/"
            }
        },
        "default_queue"               : "production",
        "resource_manager"            : "SLURM",
        "cores_per_node"              : 56,
        "gpus_per_node"               : 0,
        "system_architecture"         : {
            "smt"                     : 1,
            "options"                 : ["nvme", "intel"],
            "blocked_cores"           : [],
            "blocked_gpus"            : []
        },
        "agent_config"                : "default",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "default_remote_workdir"      : "$HOME",
        "pre_bootstrap_0"             : [
            "module unload intel impi",
            "module load intel impi",
            "module load python3/3.9.2"
        ],
        "launch_methods"              : {
            "order"                   : ["MPIRUN"],
            "MPIRUN"                  : {
                "pre_exec_cached"     : [
                    "module load TACC"
                ]
            }
        },
        "virtenv_mode"                : "local"
    }
}
The definition of each field:
description (optional) - human-readable description of the platform.
notes (optional) - information needed to form valid pilot descriptions, such as what parameters are required, etc.
schemas - allowed values for the pd.access_schema attribute of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed, which specifies job_manager_endpoint and filesystem_endpoint.
  job_manager_endpoint - access URL for pilot submission (e.g., used by PSI/J).
  filesystem_endpoint - access URL for file staging.
default_queue (optional) - queue name to be used for pilot submission to a corresponding batch system (see job_manager_endpoint).
resource_manager - the type of job management system. Valid values are: COBALT, FORK, LSF, PBSPRO, SLURM.
cores_per_node (optional) - number of available CPU cores per compute node. If not provided then it will be discovered by Resource Manager in RADICAL-Pilot.
gpus_per_node (optional) - number of available GPUs per compute node. If not provided then it will be discovered by Resource Manager in RADICAL-Pilot.
system_architecture (optional) - set of options that describe platform features:
  smt - Simultaneous MultiThreading (i.e., threads per physical core). If it is not provided then the default value 1 is used. It could be reset with env variable RADICAL_SMT exported before running RADICAL-Pilot application. RADICAL-Pilot uses cores_per_node x smt to calculate all available cores/CPUs per node.
  options - list of job management system specific attributes/constraints (e.g., provided to PSI/J).
    COBALT uses option --attrs for configuring location as filesystems=home,grand, mcdram as mcdram=flat, numa as numa=quad;
    LSF uses option -alloc_flags to support gpumps, nvme;
    PBSPRO uses option -l for configuring location as filesystems=grand:home, placement as place=scatter;
    SLURM uses option --constraint for compute nodes filtering.
  blocked_cores - list of cores/CPUs indices, which are not used by Scheduler in RADICAL-Pilot for tasks assignment.
  blocked_gpus - list of GPUs indices, which are not used by Scheduler in RADICAL-Pilot for tasks assignment.
agent_config - configuration file for RADICAL-Pilot Agent (default value is default for a corresponding file agent_default.json).
agent_scheduler - Scheduler in RADICAL-Pilot (default value is CONTINUOUS).
agent_spawner - Executor in RADICAL-Pilot, which spawns task execution processes (default value is POPEN).
default_remote_workdir (optional) - directory for agent sandbox (see the tutorials Getting Started and Staging Data with RADICAL-Pilot). If not provided then the current directory is used ($PWD).
forward_tunnel_endpoint (optional) - name of the host, which can be used to create ssh tunnels from the compute nodes to the outside of the platform.
pre_bootstrap_0 (optional) - list of commands to execute for the bootstrapping process to launch RADICAL-Pilot Agent.
pre_bootstrap_1 (optional) - list of commands to execute for initialization of sub-agent, which are used to run additional instances of RADICAL-Pilot components such as Executor and Stager.
launch_methods - set of supported launch methods. Valid values are APRUN, CCMRUN, FLUX, FORK, IBRUN, JSRUN (JSRUN_ERF), MPIEXEC (MPIEXEC_MPT), MPIRUN (MPIRUN_CCMRUN, MPIRUN_DPLACE, MPIRUN_MPT, MPIRUN_RSH), PRTE, RSH, SRUN, SSH. For each launch method, a subsection is needed, which specifies pre_exec_cached with list of commands to be executed to configure the launch method, and method related options (e.g., dvm_count for PRTE).
  order - sets the order of launch methods to be selected for the task placement (the first value in the list is a default launch method).
python_dist - python distribution. Valid values are default and anaconda.
virtenv_mode - bootstrapping process set the environment for RADICAL-Pilot Agent (default value is local):
  create - create a python virtual environment from scratch;
  recreate - delete the existing virtual environment and build it from scratch, if not found then create;
  use - use the existing virtual environment, if not found then create;
  update - update the existing virtual environment, if not found then create;
  local - use the client existing virtual environment (environment from where RADICAL-Pilot application was launched).
virtenv (optional) - path to the existing virtual environment or its name with the pre-installed RADICAL stack; use it only when virtenv_mode=use.
rp_version - RADICAL-Pilot installation or reuse process (default value is installed):
  local - install from tarballs, from client existing environment;
  release - install the latest released version from PyPI;
  installed - do not install, target virtual environment has it.
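Several of the rules in this field list can be checked mechanically before a custom entry is saved. The sketch below is a hypothetical sanity check, not an RP validator: it verifies that each schema has both endpoints, that resource_manager is one of the values listed above, and that every launch method named in order has its own subsection. It also applies the cores_per_node x smt rule with smt defaulting to 1.

```python
# Hypothetical sanity checks for a platform entry; not part of RADICAL-Pilot.
VALID_RESOURCE_MANAGERS = {'COBALT', 'FORK', 'LSF', 'PBSPRO', 'SLURM'}
ENDPOINT_KEYS = ('job_manager_endpoint', 'filesystem_endpoint')


def check_platform_entry(cfg):
    """Return a list of problems found in a single platform entry."""
    problems = []
    for name, schema in cfg.get('schemas', {}).items():
        for key in ENDPOINT_KEYS:
            if key not in schema:
                problems.append('schema %r lacks %s' % (name, key))
    if cfg.get('resource_manager') not in VALID_RESOURCE_MANAGERS:
        problems.append('unknown resource_manager %r' % cfg.get('resource_manager'))
    methods = cfg.get('launch_methods', {})
    for name in methods.get('order', []):
        if name not in methods:
            problems.append('launch method %r is ordered but has no subsection' % name)
    return problems


def logical_cores_per_node(cfg):
    """Apply the cores_per_node x smt rule (smt defaults to 1)."""
    smt = cfg.get('system_architecture', {}).get('smt', 1)
    return cfg['cores_per_node'] * smt


# Excerpt of the tutorial entry defined earlier in this document.
entry = {
    'schemas': {'local': {'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/',
                          'filesystem_endpoint' : 'file://frontera.tacc.utexas.edu/'}},
    'resource_manager'   : 'SLURM',
    'cores_per_node'     : 56,
    'system_architecture': {'smt': 1},
    'launch_methods'     : {'order': ['MPIRUN'],
                            'MPIRUN': {'pre_exec_cached': ['module load TACC']}},
}

print(check_platform_entry(entry))
print(logical_cores_per_node(entry))
```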
Examples
Note: In our examples, we will not show a progress bar while waiting for some operation to complete, e.g., while waiting for a pilot to stop. That is because the progress bar offered by RP’s reporter does not work within a notebook. You could use it when executing an RP application as a standalone Python script.
[2]:
%env RADICAL_REPORT=TRUE
%env RADICAL_REPORT_ANIME=FALSE
env: RADICAL_REPORT=TRUE
env: RADICAL_REPORT_ANIME=FALSE
[3]:
# ensure that the location for user-defined configurations exists
!mkdir -p "${RADICAL_CONFIG_USER_DIR:-$HOME}/.radical/pilot/configs/"
[4]:
import os
import radical.pilot as rp
import radical.utils as ru
In the next steps, you will save the configuration created earlier for a target platform into the file resource_tacc.json, located in user space. You will then read that file and print some of its attributes to confirm that they are in place.
[5]:
# save earlier defined platform configuration into the user-space
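# honor RADICAL_CONFIG_USER_DIR if it is set, as the mkdir command above does
cfg_base = os.environ.get('RADICAL_CONFIG_USER_DIR') or os.path.expanduser('~')
ru.write_json(resource_tacc_tutorial,
              os.path.join(cfg_base,
                           '.radical/pilot/configs/resource_tacc.json'))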
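If you want to try the save-and-reload step without touching your real configuration directory, the same round trip can be sketched with the standard library only. This is not what the tutorial cell does (it uses ru.write_json); the reduced entry and the temporary directory below are stand-ins chosen for illustration.

```python
import json
import os
import tempfile

# Reduced stand-in for the tutorial entry; values are taken from this document.
entry = {'frontera_tutorial': {'resource_manager': 'SLURM',
                               'cores_per_node'  : 56}}

with tempfile.TemporaryDirectory() as tmp:
    # Mimic the layout <base>/.radical/pilot/configs/ inside a scratch directory.
    cfg_dir = os.path.join(tmp, '.radical', 'pilot', 'configs')
    os.makedirs(cfg_dir, exist_ok=True)
    cfg_path = os.path.join(cfg_dir, 'resource_tacc.json')

    with open(cfg_path, 'w') as fout:
        json.dump(entry, fout, indent=4)

    with open(cfg_path) as fin:
        loaded = json.load(fin)

print(loaded['frontera_tutorial']['resource_manager'])
```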
[6]:
tutorial_cfg = rp.utils.get_resource_config(resource='tacc.frontera_tutorial')
for attr in ['schemas', 'resource_manager', 'cores_per_node', 'system_architecture']:
    print('%-20s : %s' % (attr, tutorial_cfg[attr]))
schemas : {'local': Config: {'filesystem_endpoint': 'file://frontera.tacc.utexas.edu/', 'job_manager_endpoint': 'slurm://frontera.tacc.utexas.edu/'}, 'no_submission': Config: {'filesystem_endpoint': 'file://localhost/', 'job_manager_endpoint': 'fork://localhost/'}, 'ssh': Config: {'filesystem_endpoint': 'sftp://frontera.tacc.utexas.edu/', 'job_manager_endpoint': 'slurm+ssh://frontera.tacc.utexas.edu/'}}
resource_manager : SLURM
cores_per_node : 56
system_architecture : {'blocked_cores': [], 'blocked_gpus': [], 'options': ['nvme', 'intel'], 'smt': 1}
[7]:
print('job_manager_endpoint : ',
rp.utils.get_resource_job_url(resource='tacc.frontera_tutorial', schema='ssh'))
print('filesystem_endpoint : ',
rp.utils.get_resource_fs_url (resource='tacc.frontera_tutorial', schema='ssh'))
job_manager_endpoint : slurm+ssh://frontera.tacc.utexas.edu/
filesystem_endpoint : sftp://frontera.tacc.utexas.edu/
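The endpoint URLs printed above can be taken apart with the standard library, which is handy when checking a custom schema. In the sketch below, describe_endpoint is a hypothetical helper, and reading the part after "+" in the URL scheme as the transport used to reach the batch system is an interpretation made for this example.

```python
from urllib.parse import urlsplit


# Hypothetical helper; the "+<transport>" reading of the scheme is an assumption.
def describe_endpoint(url):
    """Split an endpoint URL into its scheme, optional transport and host."""
    parts = urlsplit(url)
    scheme, _, transport = parts.scheme.partition('+')
    return {'scheme'   : scheme,
            'transport': transport or None,
            'host'     : parts.hostname}


for url in ['slurm+ssh://frontera.tacc.utexas.edu/',
            'sftp://frontera.tacc.utexas.edu/',
            'fork://localhost/']:
    print(url, describe_endpoint(url))
```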
[8]:
session = rp.Session()
pmgr = rp.PilotManager(session=session)
new session: [rp.session.69f9ebc6-4693-11f0-b4ec-764078d63b85] \
zmq proxy : [tcp://172.17.0.2:10001] ok
create pilot manager ok
[9]:
tutorial_cfg = session.get_resource_config(resource='tacc.frontera_tutorial', schema='local')
for attr in ['label', 'launch_methods', 'job_manager_endpoint', 'filesystem_endpoint']:
    print('%-20s : %s' % (attr, ru.as_dict(tutorial_cfg[attr])))
label : tacc.frontera_tutorial
launch_methods : {'MPIRUN': {'pre_exec_cached': ['module load TACC']}, 'order': ['MPIRUN']}
job_manager_endpoint : slurm://frontera.tacc.utexas.edu/
filesystem_endpoint : file://frontera.tacc.utexas.edu/
The platform description created above is also available within the rp.Session object. Let’s confirm that the newly created resource description is within the session. The rp.Session object holds all provided platform configurations (both predefined and user-defined ones), so for a pilot you only need to select a particular configuration and, if needed, a corresponding access schema (as part of the pilot description).
[10]:
pd = rp.PilotDescription({
    'resource'     : 'tacc.frontera_tutorial',
    'project'      : 'XYZ000',
    'queue'        : 'production',
    'cores'        : 56,
    'runtime'      : 15,
    'exit_on_error': False,
    'access_schema': 'no_submission'  # user defined schema (see configuration file)
})
pilot = pmgr.submit_pilots(pd)
submit 1 pilot(s)
pilot.0000 tacc.frontera_tutorial 56 cores 0 gpus ok
[11]:
from pprint import pprint
pprint(pilot.as_dict())
{'client_sandbox': '/home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/devel/docs/source/tutorials',
'description': {'access_schema': 'no_submission',
'app_comm': [],
'cleanup': False,
'cores': 56,
'enable_ep': False,
'exit_on_error': False,
'gpus': 0,
'input_staging': [],
'job_name': None,
'memory': 0,
'nodes': 0,
'output_staging': [],
'prepare_env': {},
'project': 'XYZ000',
'queue': 'production',
'reconfig_src': None,
'resource': 'tacc.frontera_tutorial',
'runtime': 15,
'sandbox': None,
'services': [],
'uid': None},
'endpoint_fs': 'file://localhost/',
'js_hop': 'fork://localhost/',
'js_url': 'fork://localhost/',
'log': None,
'nodelist': None,
'pilot_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox/rp.session.69f9ebc6-4693-11f0-b4ec-764078d63b85/pilot.0000/',
'pmgr': 'pmgr.0000',
'resource': 'tacc.frontera_tutorial',
'resource_details': None,
'resource_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox',
'resources': None,
'session': 'rp.session.69f9ebc6-4693-11f0-b4ec-764078d63b85',
'session_sandbox': 'file://localhost/home/docs/radical.pilot.sandbox/rp.session.69f9ebc6-4693-11f0-b4ec-764078d63b85',
'state': 'PMGR_LAUNCHING',
'stderr': None,
'stdout': None,
'type': 'pilot',
'uid': 'pilot.0000'}
After exploring pilot setup and configuration we close the session.
[12]:
session.close()
closing session rp.session.69f9ebc6-4693-11f0-b4ec-764078d63b85 \
close pilot manager \
wait for 1 pilot(s)
ok
ok
session lifetime: 4.4s ok