{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Configuration System\n", "\n", "RADICAL-Pilot (RP) uses a configuration system to set control and management parameters for the initialization of its components and to define resource entry points for the target platform.\n", "\n", "It includes:\n", "\n", "* [Run description](#Run-description)\n", " * Resource label for a target platform configuration file;\n", " * Project allocation name (i.e., account/project) - _specific for HPC platforms_;\n", " * Job queue name (i.e., queue/partition) - _specific for HPC platforms_;\n", " * Amount of the resources (e.g., `cores`, `gpus`, `memory`) to allocate for the runtime period;\n", " * Mode to access the target platform (e.g., `local`, `ssh`) - _optional, default is \"local\"_.\n", "* [Target platform description](#Platform-description)\n", " * Batch system (e.g., `SLURM`, `LSF`, etc.);\n", " * Provided launch methods (e.g., `SRUN`, `MPIRUN`, etc.);\n", " * Environment setup (including package manager, working directory, etc.);\n", " * Entry points: batch system URL, file system URL.\n", "\n", "## Run description\n", "\n", "Users have to describe at least one pilot in each RP application. That is done by instantiating a [rp.PilotDescription](../apidoc.rst#radical.pilot.PilotDescription) object. Among that object's attributes, `pd.resource` is mandatory and is referred as a resource label (or platform ID), which corresponds to a target platform configuration file (see the section [Platform description](#Platform-description)). Users need to know what ID corresponds to the HPC platform on which they want to execute their RP application.\n", "\n", "### Allocation parameters\n", "\n", "Every run should state the project name (i.e., allocation account), preferable queue for a job submission, and the amount of required resources explicitly, unless it is a run on _localhost_ without accessing any batch system.\n", "\n", "```python\n", "import radical.pilot as rp\n", "\n", "pd = rp.PilotDescription({\n", " 'resource': 'ornl.frontier', # platform ID\n", " 'project' : 'XYZ000', # allocation account\n", " 'queue' : 'debug', # optional (default value is in the platform description)\n", " 'cores' : 32, # amount of CPU slots\n", " 'gpus' : 8, # amount of GPU slots\n", " 'runtime' : 15 # maximum runtime for a pilot (in minutes)\n", "})\n", "```\n", "\n", "### Resource access schema\n", "\n", "Resource access schema (`pd.access_schema`) defines a set of endpoints for job submission and file system access. It is provided as part of a platform description, and in case of more than one access schemas users can set a specific one in [rp.PilotDescription](../apidoc.rst#radical.pilot.PilotDescription). Check schema availability per target platform:\n", "\n", "* Launching RP application **from the target platform**:\n", " * `local` (_default_) - allows to run application from login nodes of the specific machine, compute nodes while within the interactive session, or within a batch script.\n", "* Launching RP application **outside the target platform**:\n", " * `ssh` - use SSH protocol and corresponding SSH client to access the platform remotely.\n", " * `gsissh` - use GSI-enabled SSH to access the platform remotely.\n", "\n", "
\n", "\n", "__Warning:__ Most platforms do not allow remote access for job submissions, thus this parameter shouldn't be set by user and default value should be used. For details on submission of applications on HPC see the tutorial [Using RADICAL-Pilot on HPC Platforms](submission.ipynb).\n", "\n", "
\n", "\n", "## Platform description\n", "\n", "The RADICAL-Pilot uses configuration files for bookkeeping of supported platforms. Each configuration file identifies a facility (e.g., ACCESS, TACC, ORNL, ANL, etc.), is written in JSON and is named following the `resource_.json` convention. Each facility configuration file contains a set of platform names/labels with corresponding configuration parameters. Resource label (or platform ID) follows the `.` convention, and users use it for the `pd.resource` attribute of their [rp.PilotDescription](../apidoc.rst#radical.pilot.PilotDescription) object.\n", "\n", "### Predefined configurations\n", "\n", "The RADICAL-Pilot development team maintains a growing set of pre-defined configuration files for supported HPC platforms (list platform descriptions in RP's [GitHub repo](https://github.com/radical-cybertools/radical.pilot/tree/master/src/radical/pilot/configs)).\n", "\n", "For example, if users want to execute their RP application on Frontera, they will have to search for the [resource_tacc.json](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/resource_tacc.json) file and, inside that file, for the key(s) that start with the name `frontera`. The file `resource_tacc.json` contains the keys `frontera`, `frontera_rtx`, and `frontera_prte`. Each key identifies a specific set of configuration parameters: `frontera` offers a general-purpose set of configuration parameters; `frontera_rtx` enables the use of the `rtx` queue for GPU nodes; and `frontera_prte` enables the use of the PRRTE-based launch method to execute the application's tasks. Thus, for Frontera, the value for `pd.resource` will be `tacc.frontera`, `tacc.frontera_rtx` or `tacc.frontera_prte`. \n", "\n", "### Customizing a predefined configuration\n", "\n", "Users can customize existing platform configuration files by overwriting existing key/value pairs with ones from configuration files, which have the same names, but located in a user space. Default location of user-defined configuration files is `$HOME/.radical/pilot/configs/`. \n", "\n", "
\n", "\n", "__Note:__ To change the location for user-defined platform configuration files, please, use env variable `RADICAL_CONFIG_USER_DIR`, which will be used instead of env variable `HOME` in the path above. Make sure that the corresponding path exists, before creating configs there.\n", "\n", "
\n", "\n", "Two examples of customized configurations are below: (i) in one for `ornl.frontier` you change parameter __system_architecture.options__, and (ii) in another for `tacc.frontera` you set a default launch method `MPIEXEC`. With that files, every pilot description using `pd.resource = 'ornl.frontier'` or `pd.resource = 'tacc.frontera'` would use that new values. Changed parameters are described in the following section.\n", "\n", "__resource_ornl.json__\n", "```json\n", "{\n", " \"frontier\": {\n", " \"system_architecture\": {\n", " \"options\": [\"nvme\"]\n", " }\n", " }\n", "}\n", "```\n", "\n", "__resource_tacc.json__\n", "```json\n", "{\n", " \"frontera\": {\n", " \"launch_methods\": {\n", " \"order\" : [\"MPIEXEC\"],\n", " \"MPIEXEC\": {}\n", " }\n", " }\n", "}\n", "```\n", "\n", "### User-defined configuration\n", "\n", "Users can write whole new configuration for an existing or a new platform with arbitrary platform ID. For example, you will create a custom platform configuration entry `resource_tacc.json` locally. That file will be loaded into the [rp.Session](../apidoc.rst#radical.pilot.Session) object alongside with other configurations for TACC-related platforms." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "resource_tacc_tutorial = \\\n", "{\n", " \"frontera_tutorial\":\n", " {\n", " \"description\" : \"Short description of the resource\",\n", " \"notes\" : \"Notes about resource usage\",\n", "\n", " \"default_schema\" : \"local\",\n", " \"schemas\" : {\n", " \"local\" : {\n", " \"job_manager_endpoint\": \"slurm://frontera.tacc.utexas.edu/\",\n", " \"filesystem_endpoint\" : \"file://frontera.tacc.utexas.edu/\"\n", " },\n", " \"ssh\" : {\n", " \"job_manager_endpoint\": \"slurm+ssh://frontera.tacc.utexas.edu/\",\n", " \"filesystem_endpoint\" : \"sftp://frontera.tacc.utexas.edu/\"\n", " },\n", " \"no_submission\" : {\n", " \"job_manager_endpoint\": \"fork://localhost/\",\n", " \"filesystem_endpoint\" : \"file://localhost/\"\n", " }\n", " },\n", "\n", " \"default_queue\" : \"production\",\n", " \"resource_manager\" : \"SLURM\",\n", "\n", " \"cores_per_node\" : 56,\n", " \"gpus_per_node\" : 0,\n", " \"system_architecture\" : {\n", " \"smt\" : 1,\n", " \"options\" : [\"nvme\", \"intel\"],\n", " \"blocked_cores\" : [],\n", " \"blocked_gpus\" : []\n", " },\n", "\n", " \"agent_config\" : \"default\",\n", " \"agent_scheduler\" : \"CONTINUOUS\",\n", " \"agent_spawner\" : \"POPEN\",\n", " \"default_remote_workdir\" : \"$HOME\",\n", "\n", " \"pre_bootstrap_0\" : [\n", " \"module unload intel impi\",\n", " \"module load intel impi\",\n", " \"module load python3/3.9.2\"\n", " ],\n", " \"launch_methods\" : {\n", " \"order\" : [\"MPIRUN\"],\n", " \"MPIRUN\" : {\n", " \"pre_exec_cached\": [\n", " \"module load TACC\"\n", " ]\n", " }\n", " },\n", " \n", " \"virtenv_mode\" : \"local\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The definition of each field:\n", "\n", "* __description__ (_optional_) - human-readable description of the platform.\n", "* __notes__ (_optional_) - information needed to form valid pilot descriptions, such as what parameters are required, etc.\n", "* __schemas__ - allowed values for the `pd.access_schema` attribute of the pilot description. The first schema in the list is used by default. For each schema, a subsection is needed, which specifies __job_manager_endpoint__ and __filesystem_endpoint__.\n", " * __job_manager_endpoint__ - access URL for pilot submission (e.g., used by [PSI/J](https://exaworks.org/psij)).\n", " * __filesystem_endpoint__ - access URL for file staging.\n", "* __default_queue__ (_optional_) - queue name to be used for pilot submission to a corresponding batch system (see __job_manager_endpoint__).\n", "* __resource_manager__ - the type of job management system. Valid values are: `COBALT`, `FORK`, `LSF`, `PBSPRO`, `SLURM`.\n", "* __cores_per_node__ (_optional_) - number of available CPU cores per compute node. If not provided then it will be discovered by Resource Manager in RADICAL-Pilot.\n", "* __gpus_per_node__ (_optional_) - number of available GPUs per compute node. If not provided then it will be discovered by Resource Manager in RADICAL-Pilot.\n", "* __system_architecture__ (_optional_) - set of options that describe platform features:\n", " * __smt__ - Simultaneous MultiThreading (i.e., threads per physical core). If it is not provided then the default value `1` is used. It could be reset with env variable `RADICAL_SMT` exported before running RADICAL-Pilot application. RADICAL-Pilot uses `cores_per_node x smt` to calculate all available cores/CPUs per node.\n", " * __options__ - list of job management system specific attributes/constraints (e.g., provided to [PSI/J](https://exaworks.org/psij)).\n", " * `COBALT` uses option `--attrs` for configuring location as `filesystems=home,grand`, `mcdram` as `mcdram=flat`, `numa` as `numa=quad`;\n", " * `LSF` uses option `-alloc_flags` to support `gpumps`, `nvme`;\n", " * `PBSPRO` uses option `-l` for configuring location as `filesystems=grand:home`, placement as `place=scatter`;\n", " * `SLURM` uses option `--constraint` for compute nodes filtering.\n", " * __blocked_cores__ - list of cores/CPUs indices, which are not used by Scheduler in RADICAL-Pilot for tasks assignment.\n", " * __blocked_gpus__ - list of GPUs indices, which are not used by Scheduler in RADICAL-Pilot for tasks assignment.\n", "* __agent_config__ - configuration file for RADICAL-Pilot Agent (default value is `default` for a corresponding file [agent_default.json](https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/configs/agent_default.json)).\n", "* __agent_scheduler__ - Scheduler in RADICAL-Pilot (default value is `CONTINUOUS`).\n", "* __agent_spawner__ - Executor in RADICAL-Pilot, which spawns task execution processes (default value is `POPEN`).\n", "* __default_remote_workdir__ (_optional_) - directory for agent sandbox (see the tutorials [Getting Started](../getting_started.ipynb#Generated-Output) and [Staging Data with RADICAL-Pilot](staging_data.ipynb#Locations)). If not provided then the current directory is used (`$PWD`).\n", "* __forward_tunnel_endpoint__ (_optional_) - name of the host, which can be used to create ssh tunnels from the compute nodes to the outside of the platform.\n", "* __pre_bootstrap_0__ (_optional_) - list of commands to execute for the bootstrapping process to launch RADICAL-Pilot Agent.\n", "* __pre_bootstrap_1__ (_optional_) - list of commands to execute for initialization of sub-agent, which are used to run additional instances of RADICAL-Pilot components such as Executor and Stager.\n", "* __launch_methods__ - set of supported launch methods. Valid values are `APRUN`, `CCMRUN`, `FLUX`, `FORK`, `IBRUN`, `JSRUN` (`JSRUN_ERF`), `MPIEXEC` (`MPIEXEC_MPT`), `MPIRUN` (`MPIRUN_CCMRUN`, `MPIRUN_DPLACE`, `MPIRUN_MPT`, `MPIRUN_RSH`), `PRTE`, `RSH`, `SRUN`, `SSH`. For each launch method, a subsection is needed, which specifies __pre_exec_cached__ with list of commands to be executed to configure the launch method, and method related options (e.g., __dvm_count__ for `PRTE`).\n", " * __order__ - sets the order of launch methods to be selected for the task placement (the first value in the list is a default launch method).\n", "* __python_dist__ - python distribution. Valid values are `default` and `anaconda`.\n", "* __virtenv_mode__ - bootstrapping process set the environment for RADICAL-Pilot Agent (default value is `local`):\n", " * `create` - create a python virtual environment from scratch;\n", " * `recreate` - delete the existing virtual environment and build it from scratch, if not found then `create`;\n", " * `use` - use the existing virtual environment, if not found then `create`;\n", " * `update` - update the existing virtual environment, if not found then `create`;\n", " * `local` - use the client existing virtual environment (environment from where RADICAL-Pilot application was launched).\n", "* __virtenv__ (_optional_) - path to the existing virtual environment or its name with the pre-installed RADICAL stack; use it only when `virtenv_mode=use`.\n", "* __rp_version__ - RADICAL-Pilot installation or reuse process (default value is `installed`):\n", " * `local` - install from tarballs, from client existing environment;\n", " * `release` - install the latest released version from PyPI;\n", " * `installed` - do not install, target virtual environment has it.\n", "\n", "## Examples\n", "\n", "
\n", "\n", "__Note:__ In our examples, we will not show a progression bar while waiting for some operation to complete, e.g., while waiting for a pilot to stop. That is because the progression bar offered by RP's reporter does not work within a notebook. You could use it when executing an RP application as a standalone Python script.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env RADICAL_REPORT=TRUE\n", "%env RADICAL_REPORT_ANIME=FALSE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ensure that the location for user-defined configurations exists\n", "!mkdir -p \"${RADICAL_CONFIG_USER_DIR:-$HOME}/.radical/pilot/configs/\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import radical.pilot as rp\n", "import radical.utils as ru" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the next steps, you will save the earlier created configuration for a target platform into the file `resource_tacc.json`, located in a user-space. You also will be able to read that file and print some of its attributes to confirm that they are in place." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save earlier defined platform configuration into the user-space\n", "ru.write_json(resource_tacc_tutorial, \n", " os.path.join(os.path.expanduser('~'), \n", " '.radical/pilot/configs/resource_tacc.json'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tutorial_cfg = rp.utils.get_resource_config(resource='tacc.frontera_tutorial')\n", "\n", "for attr in ['schemas', 'resource_manager', 'cores_per_node', 'system_architecture']:\n", " print('%-20s : %s' % (attr, tutorial_cfg[attr]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('job_manager_endpoint : ', \n", " rp.utils.get_resource_job_url(resource='tacc.frontera_tutorial', schema='ssh'))\n", "print('filesystem_endpoint : ', \n", " rp.utils.get_resource_fs_url (resource='tacc.frontera_tutorial', schema='ssh'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = rp.Session()\n", "pmgr = rp.PilotManager(session=session)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tutorial_cfg = session.get_resource_config(resource='tacc.frontera_tutorial', schema='local')\n", "for attr in ['label', 'launch_methods', 'job_manager_endpoint', 'filesystem_endpoint']:\n", " print('%-20s : %s' % (attr, ru.as_dict(tutorial_cfg[attr])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Platform description created above is also available within the [rp.Session](../apidoc.rst#radical.pilot.Session) object. Let's confirm that newly created resource description is within the session. [rp.Session](../apidoc.rst#radical.pilot.Session) object has all provided platform configurations (pre- and user-defined ones), thus for a pilot you just need to select a particular configuration and a corresponding access schema if needed (as part of the pilot description)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd = rp.PilotDescription({\n", " 'resource' : 'tacc.frontera_tutorial',\n", " 'project' : 'XYZ000',\n", " 'queue' : 'production',\n", " 'cores' : 56,\n", " 'runtime' : 15,\n", " 'exit_on_error': False,\n", " 'access_schema': 'no_submission' # user defined schema (see configuration file)\n", "})\n", "\n", "pilot = pmgr.submit_pilots(pd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "pprint(pilot.as_dict())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After exploring pilot setup and configuration we close the session." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }