Getting Started

This notebook walks you through executing a hello_world application written with RADICAL-Pilot (RP) and executed locally on a GNU/Linux operating system. The application consists of a Bag of Tasks with heterogeneous requirements: each task uses a different number of CPU cores/GPUs and runs for a different amount of time. In this simple application, tasks have no data requirements, but see data staging for how to manage data in RP.

Warning: We assume you understand what a pilot is and how it enables the concurrent and sequential execution of compute tasks on its resources. See our Brief Introduction to RP video to familiarize yourself with the architectural concepts and execution model of a pilot.

Installation

Warning: RP must be installed in a Python environment. RP will not work properly when installed as a system-wide package. You must create and activate a virtual environment before installing RP.

You can create a Python environment suitable for RP using Virtualenv, Venv or Conda. Once you have created and activated a virtual environment, install RP as a Python module via pip, Conda, or Spack.

Note: Please see using virtual environments with RP for more options and detailed information. That will be especially useful when executing RP on supported high performance computing (HPC) platforms.

Virtualenv

virtualenv ~/.ve/radical-pilot
. ~/.ve/radical-pilot/bin/activate
pip install radical.pilot

Venv

python -m venv ~/.ve/radical-pilot
. ~/.ve/radical-pilot/bin/activate
pip install radical.pilot

Conda

If Conda is not pre-installed, here is a distilled set of commands to install Miniconda on a GNU/Linux x86_64 OS. Find more (and possibly updated) information in the official Conda documentation.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ./miniconda.sh
chmod +x ./miniconda.sh
./miniconda.sh -b -p ./conda
source ./conda/bin/activate

Once Conda is available:

conda create -y -n radical-pilot
conda activate radical-pilot
conda install -y -c conda-forge radical.pilot

Spack

If Spack is not pre-installed, here is a distilled set of commands to install Spack on a GNU/Linux x86_64 OS. Find more (and possibly updated) information in the official Spack documentation.

git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh

Once Spack is available:

spack env create ve.rp
spack env activate ve.rp
spack install py-radical-pilot

Note: Simplified stack for local usage: When using RP on a local machine or cluster there is no remote file staging or job submission. In that situation, RP can fall back to a simplified software stack. If this is the case, run pip install psij-python (or equivalents for conda or spack), and RP will transparently switch to PSIJ for local job submission.

Check the installed version

Often, we need to know which version of RP is installed. For example, you will need it when opening a support ticket with the RADICAL development team.

RP installs a command that prints information about all the installed RADICAL Cybertools:

[1]:
!radical-stack

  python               : /home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/envs/devel/bin/python3
  pythonpath           :
  version              : 3.7.17
  virtualenv           :

  radical.gtod         : 1.52.0
  radical.pilot        : v1.52.1-3-gc98f7e8@HEAD-detached-at-origin-devel
  radical.utils        : 1.52.0

Write your first application

RP executes in batch mode:

  • Write an application using RP API.

  • Launch that application.

  • Wait a variable amount of time doing something else.

  • When the application exits, come back, collect and check the results.

Each RP application has a distinctive pattern:

  1. Create a session

  2. Create a pilot manager

  3. Describe the pilot on which you want to run your application tasks:

    • Define the platform on which you want to execute the application

    • Define the amount/type of resources you want to use to run your application tasks.

  4. Assign the pilot description to the pilot manager

  5. Create a task manager

  6. Describe the computational tasks that you want to execute:

    • Executable launched by the task

    • Arguments to pass to the executable command if any

    • Amount of each type of resource used by the executable, e.g., CPU cores and GPUs

    • When the executable is MPI/OpenMP, number of ranks for the whole executable, number of ranks per core or GPU

    • Many other parameters. See the API specification for full details

  7. Assign the task descriptions to the task manager

  8. Submit tasks for execution

  9. Wait for tasks to complete execution

Some of RP's behavior can be configured via environment variables. RP's progress bar does not work properly with Jupyter notebooks, thus you may want to set it to FALSE.

[2]:
%env RADICAL_REPORT_ANIME=FALSE
env: RADICAL_REPORT_ANIME=FALSE

As with every Python application, first you import all the required modules.

[3]:
import radical.pilot as rp

Enable user feedback

As RP implements a batch programming model, by default, it returns a minimal amount of information. After submitting the tasks for execution, RP will remain silent until all the tasks have completed. In practice, when developing and debugging your application, you will want more feedback. We wrote a reporter module that you can use with RP and all the other RADICAL-Cybertools.

To use the reporter:

  • Configure RP by exporting a shell environment variable.

    export RADICAL_PILOT_REPORT=True
    
  • Import radical.utils, create a reporter and start using it to print meaningful messages about the state of the application execution.

Note: See our tutorial about Profiling a RADICAL-Pilot Application for a guide on how to trace and profile your application.

Note: See our tutorial about Debugging a RADICAL-Pilot Application for a guide on how to debug your application.

[4]:
import radical.utils as ru

report = ru.Reporter(name='radical.pilot')
report.title('Getting Started (RP version %s)' % rp.version)

================================================================================
 Getting Started (RP version v1.52.1)
================================================================================


Creating a session

radical.pilot.Session is the root object of all the other objects of RP.

[5]:
session = rp.Session()

Creating a pilot manager

You need to manage the resources you will acquire with a pilot either locally or, more commonly and usefully, on a supported HPC platform. An instance of radical.pilot.PilotManager attached to your session will do that for you.

Note: One radical.pilot.PilotManager can manage multiple pilots. See our tutorial about Using Multiple Pilots with RADICAL-Pilot to see why and how.

[6]:
pmgr = rp.PilotManager(session=session)

Configuring pilot resources

You can use a dictionary to specify location, amount and other properties of the resources you want to acquire with a pilot; and use that dictionary to initialize a radical.pilot.PilotDescription object. See the radical.pilot.PilotDescription API for a full list of properties you can specify for each pilot.

In this example, we want to run our hello_world application on our local GNU/Linux machine, for no more than 30 minutes, using 4 cores.

Warning: We chose a 30-minute runtime, but the application could take less or more time to complete. 30 minutes is the upper bound, but RP will exit as soon as all the tasks have reached a final state (DONE, CANCELED, FAILED). Conversely, RP will always exit once the runtime expires, even if some tasks still need to be executed.

Note: We could choose to use as many CPU cores as we have available on our local machine. RP will allocate all of them, but it will use only the cores required by the application tasks. If all the tasks together require fewer cores than those available, the remaining cores will go unused. Conversely, if there are more tasks than cores, RP will schedule each task as soon as the required number of cores becomes available. In this way, RP keeps the available resources as busy as possible, and the application tasks run both concurrently and sequentially, depending on resource availability.
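
This packing behavior can be illustrated with a toy simulation. The sketch below is not RP's actual scheduler (RP backfills continuously as individual tasks finish, rather than waiting for a whole batch): it only shows how tasks needing 1 or 2 cores end up running both concurrently and sequentially on a 4-core pilot.

```python
# Toy illustration (NOT RP's real scheduler): greedily pack tasks onto a fixed
# pool of cores, producing "generations" of tasks that can run concurrently.
def schedule(core_needs, total_cores):
    """core_needs: cores required by each task (each must fit total_cores).
    Returns a list of generations, each a list of task indices that fit on
    the pilot at the same time."""
    generations = []
    pending     = list(range(len(core_needs)))
    while pending:
        free  = total_cores
        batch = []
        for i in pending[:]:
            if core_needs[i] <= free:
                batch.append(i)
                free -= core_needs[i]
                pending.remove(i)
        generations.append(batch)
    return generations

# Six tasks needing 1 or 2 cores each, on a 4-core pilot:
gens = schedule([2, 2, 1, 1, 2, 1], total_cores=4)
print(gens)   # → [[0, 1], [2, 3, 4], [5]]
```

Tasks 0 and 1 fill the pilot and run concurrently; the remaining tasks run in later generations, i.e., sequentially with respect to the first batch.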

Note: 'exit_on_error': False allows us to compile this notebook without errors. You should probably not use it with a stand-alone RP application.

[7]:
pd_init = {'resource'     : 'local.localhost',
           'runtime'      : 30,  # pilot runtime in minutes
           'project'      : None,
           'queue'        : None,
           'cores'        : 4,
           'gpus'         : 0,
           'exit_on_error': False}
pdesc = rp.PilotDescription(pd_init)

Submitting the pilot

We now have a pilot manager, we know how many resources we want and on what platform. We are ready to submit our request!

Note: On a local machine, RP acquires the requested resources as soon as we submit the pilot. On a supported HPC platform, our request will queue a job into the platform’s batch system. The actual resources will become available only when the batch system schedules the job. This is not under the control of RP and, barring reservation, the actual queue time will be unknown.

We use the submit_pilots method of our pilot manager and pass it the pilot description.

[8]:
report.header('submit pilot')
pilot = pmgr.submit_pilots(pdesc)

--------------------------------------------------------------------------------
submit pilot


Creating a task manager

We have acquired the resources we asked for (or we are waiting in a queue to get them) so now we need to do something with those resources, i.e., executing our application tasks :-) First, we create a radical.pilot.TaskManager and associate it with our session. That manager will take our task descriptions and send them to our pilot so that it can execute those tasks on the allocated resources.

[9]:
tmgr = rp.TaskManager(session=session)

Registering the pilot with the task manager

We tell the task manager what pilot it should use to execute its tasks.

[10]:
tmgr.add_pilots(pilot)

Describing the application tasks

In this example, we want to run simple tasks that require a different number of CPU cores and run for a variable amount of time. Thus, we use the executable radical-pilot-hello.sh, which we crafted to occupy a configurable amount of resources for a configurable amount of time.

Each task is an instance of radical.pilot.TaskDescription with some properties defined (for a complete list of task properties, see the TaskDescription API):

  • executable: the name of the executable we want to launch with the task

  • arguments: the arguments to pass to executable. In this case, the number of seconds it needs to run for.

  • ranks: the number of processes (ranks) the task launches. Here it is set to 1, as we are running all our tasks on our local computer. See Describing tasks in RADICAL-Pilot for details about executing tasks that use the message passing interface (MPI).

  • cores_per_rank: the number of cores that each (rank of the) task utilizes. In our case, each task will randomly use either 1 or 2 of the 4 cores we requested.

Warning: Executing MPI tasks (i.e., ones with radical.pilot.TaskDescription.ranks > 1) requires an MPI implementation to be available on the machine on which you will run the task. That is usually taken care of by the system administrator, but if you are managing your own cluster, you will have to install and make available one of the many MPI distributions for GNU/Linux.
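
As a sketch of how these properties combine for an MPI task, here is a plain dict mirroring the TaskDescription field names used above (the executable name and arguments are hypothetical, not part of this example's application):

```python
# Hypothetical MPI task description, expressed as a plain dict with the same
# field names as rp.TaskDescription ('./my_mpi_app' is a made-up binary).
mpi_td = {
    'executable'    : './my_mpi_app',
    'arguments'     : ['--steps', '100'],
    'ranks'         : 4,   # 4 MPI processes
    'cores_per_rank': 2,   # each rank uses 2 cores
}

# The pilot must offer at least ranks * cores_per_rank cores for this task:
total_cores = mpi_td['ranks'] * mpi_td['cores_per_rank']
print(total_cores)   # → 8
```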

We run 10 tasks: enough to see both concurrent and sequential executions on the amount of resources we requested, but not so many as to clutter the example.

Note: We use the reporter to produce a progress bar while we loop over the task descriptions

[11]:
import os
import random

n = 10

report.progress_tgt(n, label='create')
tds = list()
for i in range(n):

    td = rp.TaskDescription()
    td.executable     = 'radical-pilot-hello.sh'
    td.arguments      = [random.randint(1, 10)]
    td.ranks          =  1
    td.cores_per_rank =  random.randint(1, 2)

    tds.append(td)
    report.progress()

report.progress_done()
create: ########################################################################

Submitting tasks for execution

Now that we have all the elements of the application we can execute its tasks. We submit the list of application tasks to the task manager that, in turn, will submit them to the indicated pilot for execution. Upon receiving the list of task descriptions, the pilot will schedule those tasks on its available resources and then execute them.

Note: For RP, tasks are black boxes, i.e., it knows nothing about the code executed by a task. RP only knows that a task has been launched on the requested amount of resources, and it will wait until the task exits. In that way, RP is agnostic about task details such as the language used for its implementation, the type of scientific computation it performs, how it uses data, etc. This is why RP can serve a wide range of scientists, independently of their scientific domain.

[12]:
report.header('submit %d tasks' % n)
tmgr.submit_tasks(tds)

--------------------------------------------------------------------------------
submit 10 tasks


[12]:
[<Task object, uid task.000000>,
 <Task object, uid task.000001>,
 <Task object, uid task.000002>,
 <Task object, uid task.000003>,
 <Task object, uid task.000004>,
 <Task object, uid task.000005>,
 <Task object, uid task.000006>,
 <Task object, uid task.000007>,
 <Task object, uid task.000008>,
 <Task object, uid task.000009>]

Waiting for the tasks to complete

Wait for all tasks to reach a final state (DONE, CANCELED or FAILED). This is a blocking call: the application will wait without exiting and, thus, the shell from which you launched the application should not exit either. Do not close your laptop or exit from a remote connection without first leaving the shell running in the background or using a terminal multiplexer like tmux.
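
One way to keep the application alive after you disconnect is to launch it in the background with nohup; a minimal sketch, where my_rp_app.py is a hypothetical stand-in for your RP script:

```shell
# Launch the RP application detached from the terminal; output goes to a log
# file you can inspect later ('my_rp_app.py' is a hypothetical script name).
nohup python my_rp_app.py > rp_app.log 2>&1 &
APP_PID=$!
echo "running as PID ${APP_PID}; check progress with: tail rp_app.log"
```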

Note: After the wait call returns, you can describe and/or submit more tasks/pilots as your RP session will still be open.

Note: You can wait for the execution of a subset of the tasks you defined. See Describing tasks in RADICAL-Pilot for more information.

[13]:
tmgr.wait_tasks()
[13]:
['DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE',
 'DONE']

Once the wait is finished, let us know and exit!

[14]:
report.header('finalize')
session.close(cleanup=True)

--------------------------------------------------------------------------------
finalize


Generated Output

RP is a distributed system, even when all its components run on a single machine as in this example. RP has two main components (Client and Agent), and each stores its output into a sandbox at a specific filesystem location:

  • Client sandbox: A directory created within the working directory from where the RP application was launched. The sandbox is named after the session ID, e.g., rp.session.nodename.username.018952.0000.

  • Agent sandbox: A directory created at a different location, depending on the machine on which the application executes. The Agent sandbox, named radical.pilot.sandbox, contains the following nested directories: <session_sandbox_ID>/<pilot_sandbox_ID>/<task_sandbox_ID>.

When running RP locally, the Agent sandbox is located at $HOME/radical.pilot.sandbox. When using a supported HPC platform, the location of the Agent sandbox depends on the filesystem capabilities of the platform. You can see the pre-configured location for the Agent sandbox in the RP git repository, at src/radical/pilot/configs/resource_*.json.
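
The nested layout above can be navigated programmatically; a minimal sketch, assuming the default local sandbox location and the directory names shown in this notebook:

```python
# Sketch: list task sandbox directories under an agent sandbox root, following
# the <session_sandbox_ID>/<pilot_sandbox_ID>/<task_sandbox_ID> layout.
import glob
import os

def task_sandboxes(root):
    """Return sorted task sandbox paths found under 'root'."""
    pattern = os.path.join(root, 'rp.session.*', 'pilot*', 'task.*')
    return sorted(glob.glob(pattern))

# Locally, the default root is $HOME/radical.pilot.sandbox:
# task_sandboxes(os.path.expanduser('~/radical.pilot.sandbox'))
```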

Warning: When executing RP on a supported HPC platform, the output file(s) of each task are saved in the task_sandbox_ID directory of that task. Without specific staging instructions (see our tutorial Staging Data with RADICAL-Pilot), you will have to manually retrieve those files. When executing locally, you can retrieve them from $HOME/radical.pilot.sandbox/<session_sandbox_ID>/<pilot_sandbox_ID>/<task_sandbox_ID>.

Note: When enabling debugging (see our tutorial Debugging a RADICAL-Pilot Application) and/or tracing (see our tutorial Tracing and Profiling a RADICAL-Pilot Application), RP writes the debug and/or trace files in the Client and Agent sandbox. On large/production runs, RP can produce hundreds of debug files. Please contact the RADICAL development team if you need further assistance.

Here are the output files of the task.000000 of the application we just executed in this notebook:

[15]:
! ls $HOME/radical.pilot.sandbox/rp.session.*/pilot*/task.000000/
task.000000.err      task.000000.launch.out  task.000000.out
task.000000.exec.sh  task.000000.launch.sh   task.000000.prof

Here is the “result” produced by task.000000:

[16]:
! cat `ls $HOME/radical.pilot.sandbox/rp.session.*/pilot*/task.000000/task.000000.out`
0 : PID     : 3639
0 : NODE    : build-24105523-project-13481-radicalpilot
0 : CPUS    : 00
0 : GPUS    :
0 : RANK    : 0
0 : THREADS : 1
0 : SLEEP   : 6