Getting Started

This notebook walks you through executing a hello_world application written with RADICAL-Pilot (RP) and executed locally on a GNU/Linux operating system. The application consists of a Bag of Tasks with heterogeneous requirements: each task uses a different number of CPU cores/GPUs and runs for a different amount of time. In this simple application, tasks have no data requirements; see the Data Staging tutorial for how to manage data in RP.

Warning: We assume you understand what a pilot is and how it enables you to execute compute tasks concurrently and sequentially on its resources. See our Brief Introduction to RP video to familiarize yourself with the architectural concepts and execution model of a pilot.

Installation

Warning: RP must be installed in a Python environment. RP will not work properly when installed as a system-wide package. You must create and activate a virtual environment before installing RP.

You can create a Python environment suitable for RP using Virtualenv, Venv or Conda. Once you have created and activated a virtual environment, install RP as a Python module via pip, Conda, or Spack.

Note: Please see using environment variables with RP for more options and detailed information. That will be especially useful when executing RP on supported high performance computing (HPC) platforms.

Virtualenv

virtualenv ~/ve_rp
. ~/ve_rp/bin/activate
pip install radical.pilot

Venv

python -m venv ~/ve_rp
. ~/ve_rp/bin/activate
pip install radical.pilot

Conda

If Conda is not pre-installed, here is a distilled set of commands to install Miniconda on a GNU/Linux x86_64 OS. Find more (and possibly updated) information in the official Conda documentation.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ./miniconda.sh
chmod +x ./miniconda.sh
./miniconda.sh -b -p ./conda
source ./conda/bin/activate

Once Conda is available:

conda create -y -n ve_rp python=3.9
conda activate ve_rp
conda install -y -c conda-forge radical.pilot

Spack

If Spack is not pre-installed, here is a distilled set of commands to install Spack on a GNU/Linux x86_64 OS. Find more (and possibly updated) information in the official Spack documentation.

git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh

Once Spack is available:

spack env create ve_rp
spack env activate ve_rp
spack install py-radical-pilot

Note: We recommend setting the PYTHONNOUSERSITE environment variable before activating your virtual environment to prevent user site packages from interfering: export PYTHONNOUSERSITE=True.
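
The two requirements above (an active, isolated environment and no interference from user site packages) can be checked from Python before installing or importing RP. The following is a minimal sketch that uses only the standard library. The function name environment_report is hypothetical, and the CONDA_PREFIX and SPACK_ENV checks assume that Conda and Spack export those variables while an environment is active.

```python
import os
import site
import sys


def environment_report():
    """Return a few facts about the Python environment in use."""
    return {
        # venv and recent virtualenv make sys.prefix differ from sys.base_prefix
        'in_venv': sys.prefix != getattr(sys, 'base_prefix', sys.prefix),
        # assumption: Conda exports CONDA_PREFIX for an active environment
        'in_conda': bool(os.environ.get('CONDA_PREFIX')),
        # assumption: Spack exports SPACK_ENV for an active environment
        'in_spack_env': bool(os.environ.get('SPACK_ENV')),
        # False when user site packages are disabled for this interpreter
        'user_site_enabled': bool(site.ENABLE_USER_SITE),
    }


for key, value in environment_report().items():
    print(f'{key:18}: {value}')
```

If none of the first three entries is True, RP is probably about to be installed system-wide, which the warning at the top of this section advises against.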

Check the installed version

Often, we need to know what version of RP we installed. For example, you will need that information when opening a support ticket with the RADICAL development team.

RP installs a command, radical-stack, that prints information about all the installed RADICAL-Cybertools:

[1]:
!radical-stack

  python               : /home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/envs/devel/bin/python3
  pythonpath           :
  version              : 3.9.22
  virtualenv           :

  radical.gtod         : 1.102.0
  radical.pilot        : 1.103.0-v1.102.0-27-gfe796b4@detached
  radical.utils        : 1.102.0
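
When the version is needed inside a script rather than on the command line, the package metadata can be queried with the standard library. This is a sketch, not a replacement for radical-stack: the helper name installed_version is hypothetical, and the version recorded in the package metadata may be less detailed than the string printed above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the radical-stack output above
for name in ('radical.pilot', 'radical.utils', 'radical.gtod'):
    print(f'{name:15}: {installed_version(name) or "not installed"}')
```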

Write your first application

RP executes in batch mode:

  • Write an application using the RP API.

  • Launch that application.

  • Wait a variable amount of time doing something else.

  • When the application exits, come back, collect and check the results.

Each RP application has a distinctive pattern:

  1. Create a session

  2. Create a pilot manager

  3. Describe the pilot on which you want to run your application tasks:

    • Define the platform on which you want to execute the application

    • Define the amount/type of resources you want to use to run your application tasks.

  4. Assign the pilot description to the pilot manager

  5. Create a task manager

  6. Describe the computational tasks that you want to execute:

    • Executable launched by the task

    • Arguments to pass to the executable command if any

    • Amount of each type of resource used by the executable, e.g., CPU cores and GPUs

    • When the executable is MPI/OpenMP, number of ranks for the whole executable, number of ranks per core or GPU

    • Many other parameters. See the API specification for full details

  7. Assign the task descriptions to the task manager

  8. Submit tasks for execution

  9. Wait for tasks to complete execution

Some of RP's behavior can be configured via environment variables. RP's progress bar does not work properly in Jupyter notebooks, so you may want to set RADICAL_REPORT_ANIME to FALSE.

[2]:
%env RADICAL_REPORT_ANIME=FALSE
env: RADICAL_REPORT_ANIME=FALSE
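
The %env magic is only available in IPython and Jupyter. In a standalone Python script, the same variable can be set through os.environ. The sketch below assumes that RP reads the variable when it and its reporter are initialized, so the assignment is placed before any import of radical.pilot.

```python
import os

# Set before `import radical.pilot`; assumption: the variable is read when
# RP and its reporter are initialized.
os.environ['RADICAL_REPORT_ANIME'] = 'FALSE'

print(os.environ['RADICAL_REPORT_ANIME'])
```

Exporting the variable in the shell before launching the script has the same effect and keeps the setting out of the application code.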

As with every Python application, first you import all the required modules.

[3]:
import radical.pilot as rp

Enable user feedback

As RP implements a batch programming model, by default, it returns a minimal amount of information. After submitting the tasks for execution, RP will remain silent until all the tasks have completed. In practice, when developing and debugging your application, you will want more feedback. We wrote a reporter module that you can use with RP and all the other RADICAL-Cybertools.

To use the reporter:

  • Configure RP by exporting a shell environment variable.

    export RADICAL_PILOT_REPORT=TRUE
    
  • Import radical.utils, create a reporter and start to use it to print meaningful messages about the state of the application execution.

Note: See our tutorial about Profiling a RADICAL-Pilot Application for a guide on how to trace and profile your application.

Note: See our tutorial about Debugging a RADICAL-Pilot Application for a guide on how to debug your application.

[4]:
import radical.utils as ru

report = ru.Reporter(name='radical.pilot')
report.title('Getting Started (RP version %s)' % rp.version)

================================================================================
 Getting Started (RP version 1.102.0)
================================================================================


Creating a session

rp.Session is the root object of all the other objects of RP.

[5]:
session = rp.Session()

Creating a pilot manager

You need to manage the resources you will acquire with a pilot either locally or, more commonly and usefully, on a supported HPC platform. An instance of rp.PilotManager attached to your session will do that for you.

Note: One rp.PilotManager can manage multiple pilots. See our tutorial about Using Multiple Pilots with RADICAL-Pilot to see why and how.

[6]:
pmgr = rp.PilotManager(session=session)

Configuring pilot resources

You can use a dictionary to specify the location, amount and other properties of the resources you want to acquire with a pilot, and then use that dictionary to initialize an rp.PilotDescription object. See the Configuration tutorial for more details.

In this example, we want to run our hello_world application on our local GNU/Linux machine, for no more than 30 minutes, using 4 cores.

Warning: We choose a 30-minute runtime, but the application could take less or more time to complete. 30 minutes is the upper bound, but RP will exit as soon as all tasks have reached a final state (DONE, CANCELED, FAILED). Conversely, RP will always exit once the runtime expires, even if some tasks still need to be executed.

Note: We could choose to use as many CPU cores as we have available on our local machine. RP will allocate all of them, but it will use only the cores required by the application tasks. If all the tasks together require fewer cores than those available, the remaining cores will go unused. Conversely, if the tasks require more cores than are available, RP will schedule each task as soon as the required number of cores becomes available. In this way, RP keeps the available resources as busy as possible, and the application tasks run both concurrently and sequentially, depending on resource availability.
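
The interplay between concurrent and sequential execution can be illustrated with a toy model. The sketch below is not RP's scheduler: it is a simple first-fit simulation, written only for illustration, in which every waiting task starts as soon as enough cores are free. The function name simulate and the task values are made up.

```python
import heapq


def simulate(tasks, total_cores):
    """Toy first-fit schedule.

    tasks: list of (cores, seconds) tuples, in submission order.
    Returns a list of (start, end) times, one per task.
    """
    free = total_cores
    waiting = list(range(len(tasks)))
    running = []                      # heap of (end_time, task_index)
    schedule = [None] * len(tasks)
    now = 0
    while waiting or running:
        # start every waiting task that fits, in submission order
        still_waiting = []
        for i in waiting:
            cores, seconds = tasks[i]
            if cores <= free:
                free -= cores
                schedule[i] = (now, now + seconds)
                heapq.heappush(running, (now + seconds, i))
            else:
                still_waiting.append(i)
        waiting = still_waiting
        if not running:
            raise ValueError('a task needs more cores than are available')
        # advance time to the next completion and release its cores
        now, done = heapq.heappop(running)
        free += tasks[done][0]
    return schedule


tasks = [(2, 3), (2, 3), (1, 2), (2, 1)]   # (cores, seconds), made-up values
for (cores, _), (start, end) in zip(tasks, simulate(tasks, total_cores=4)):
    print(f'{cores} core(s): start={start:2d} end={end:2d}')
```

With 4 cores, the first two tasks run concurrently and fill the pilot; the remaining tasks start only once cores are released, which is the mixed concurrent and sequential behavior described in the note.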

Note: 'exit_on_error': False allows us to compile this notebook without errors. You should probably not use it with a standalone RP application.

[7]:
pdesc = rp.PilotDescription({'resource'     : 'local.localhost',
                             'project'      : None,
                             'queue'        : None,
                             'cores'        : 4,
                             'gpus'         : 0,
                             'runtime'      : 30,  # pilot runtime minutes
                             'exit_on_error': False})

Submitting the pilot

We now have a pilot manager, and we know how many resources we want and on what platform. We are ready to submit our request!

Note: On a supported HPC platform, our request will queue a job into the platform’s batch system. The actual resources will become available only when the batch system schedules the job. This is not under the control of RP and, barring reservation, the actual queue time will be unknown.

We use the rp.PilotManager.submit_pilots() method of our pilot manager and pass to it the pilot description.

[8]:
report.header('submit pilot')
pilot = pmgr.submit_pilots(pdesc)

# preserve pilot sandbox
pilot_sandbox  = ru.Url(pilot.pilot_sandbox).path

# wait for pilot to become ACTIVE
pilot.wait(rp.PMGR_ACTIVE)

# report pilot state
report.info('<<Pilot state:')
report.ok(f'>>{pilot.state}\n')

--------------------------------------------------------------------------------
submit pilot

Pilot state:                                                         PMGR_ACTIVE

Creating a task manager

We have acquired the resources we asked for (or we are waiting in a queue to get them), so now we need to do something with those resources, i.e., execute our application tasks. First, we create an rp.TaskManager and associate it with our session. That manager takes our task descriptions and sends them to our pilot so that it can execute those tasks on the allocated resources.

[9]:
tmgr = rp.TaskManager(session=session)

Registering the pilot with the task manager

We tell the task manager what pilot it should use to execute its tasks.

[10]:
tmgr.add_pilots(pilot)

Describing the application tasks

In this example, we want to run simple tasks that require different numbers of CPU cores and run for variable amounts of time. Thus, we use the executable radical-pilot-hello.sh, which we crafted to occupy a configurable amount of resources for a configurable amount of time.

Each task is an instance of rp.TaskDescription with some defined properties:

  • executable: the name of the executable we want to launch with the task.

  • arguments: the arguments to pass to executable. In this case, the number of seconds it needs to run for.

  • ranks: this is the number of processes (i.e., MPI ranks) on which the task should run. See Describing tasks in RADICAL-Pilot and the details about executing tasks that use the message passing interface (MPI).

  • cores_per_rank: the number of cores that each rank of the task uses. In our case, each task will randomly use either 1 or 2 of the cores we have requested.

Warning: Executing MPI tasks (i.e., tasks with rp.TaskDescription.ranks > 1) requires an MPI implementation to be available on the machine on which you will run the tasks. That is usually taken care of by the system administrator, but if you are managing your own cluster, you will have to install and make available one of the many MPI distributions available for GNU/Linux.

We run 10 tasks, which should be enough to see both concurrent and sequential execution on the resources we requested, but not so many as to clog the example.

Note: We use the reporter to produce a progress bar while we loop over the task descriptions.

[11]:
import os
import random

n = 10

report.progress_tgt(n, label='create')
tds = list()
for i in range(n):

    td = rp.TaskDescription()
    td.executable     = 'radical-pilot-hello.sh'
    td.arguments      = [random.randint(1, 10)]
    td.ranks          =  1
    td.cores_per_rank =  random.randint(1, 2)
    td.named_env      = 'rp'

    tds.append(td)
    report.progress()

report.progress_done()
create: ########################################################################
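
Because the cell above draws durations and core counts at random, every run produces a different bag of tasks. When debugging, a reproducible bag can be easier to reason about. The sketch below generates the same four properties as plain dictionaries from a seeded random generator; the function name make_task_specs and the seed value are hypothetical, and the dictionaries are only stand-ins for rp.TaskDescription objects.

```python
import random


def make_task_specs(n, seed=42):
    """Return n task specifications; the same seed yields the same list."""
    rng = random.Random(seed)         # private generator, global state untouched
    specs = []
    for _ in range(n):
        specs.append({
            'executable'    : 'radical-pilot-hello.sh',
            'arguments'     : [rng.randint(1, 10)],   # seconds to run
            'ranks'         : 1,
            'cores_per_rank': rng.randint(1, 2),
        })
    return specs


specs = make_task_specs(10)
total_cores = sum(s['ranks'] * s['cores_per_rank'] for s in specs)
print(f'{len(specs)} tasks requesting {total_cores} cores in total')
```

Each value can then be assigned to the attribute of the same name on an rp.TaskDescription, as done in the cell above.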

Submitting tasks for execution

Now that we have all the elements of the application we can execute its tasks. We submit the list of application tasks to the task manager that, in turn, will submit them to the indicated pilot for execution. Upon receiving the list of task descriptions, the pilot will schedule those tasks on its available resources and then execute them.

Note: For RP, tasks are black boxes, i.e., it knows nothing about the code executed by a task. RP just knows that a task has been launched on the requested amount of resources, and it will wait until the task exits. In that way, RP is agnostic towards task details like the language used for its implementation, the type of scientific computation it performs, how it uses data, etc. This is why RP can serve a wide range of scientists, independent of their scientific domain.

[12]:
report.header('submit %d tasks' % n)
tasks = tmgr.submit_tasks(tds)

--------------------------------------------------------------------------------
submit 10 tasks


Waiting for the tasks to complete

Wait for all tasks to reach a final state (DONE, CANCELED or FAILED). This is a blocking call, i.e., the application will wait without exiting and, thus, the shell from which you launched the application should not exit either. In practice, do not close your laptop or exit from a remote connection without first leaving the shell running in the background or using a terminal multiplexer like tmux.

Note: After the wait call returns, you can describe and/or submit more tasks/pilots as your RP session will still be open.

Note: You can wait for the execution of a subset of the tasks you defined. See Describing tasks in RADICAL-Pilot for more information.

[13]:
tmgr.wait_tasks(uids=[t.uid for t in tasks])
report.header(f'finished {len(tasks)} tasks')

task_states   = [t.state for t in tasks]
state_counter = {state: task_states.count(state) for state in set(task_states)}
for state, counter in sorted(state_counter.items()):
    report.info('<<Tasks state:')
    report.ok(f'>>({counter}) {state}\n')

--------------------------------------------------------------------------------
finished 10 tasks

Tasks state:                                                           (10) DONE
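
The tally in the cell above can also be written with collections.Counter. The sketch below works on plain strings that stand in for the t.state values, so it runs without RP; the sample states and the helper name summarize are made up, and the set of final states is the one listed earlier in this section.

```python
from collections import Counter

# Final states as listed in this tutorial
FINAL_STATES = {'DONE', 'CANCELED', 'FAILED'}


def summarize(states):
    """Return (count per state, whether every state is final)."""
    counts = Counter(states)
    all_final = all(state in FINAL_STATES for state in counts)
    return counts, all_final


states = ['DONE'] * 8 + ['FAILED', 'CANCELED']   # made-up sample
counts, all_final = summarize(states)
for state, count in sorted(counts.items()):
    print(f'({count}) {state}')
print('all final:', all_final)
```

In the notebook, the list passed to summarize would be [t.state for t in tasks].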

Once the wait is over, we report it and close the session.

[14]:
report.header('finalize')
session.close()

--------------------------------------------------------------------------------
finalize
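
Outside a notebook, the cells above are usually collected into one script. The following sketch does that using only the RP calls shown in this tutorial, and wraps them in try/finally so that session.close() runs even if an earlier step raises. The function name main is hypothetical, radical.pilot is imported inside the function so that the file can be loaded without RP installed, and 'exit_on_error': False is left out, as suggested in the note about standalone applications.

```python
def main(n_tasks=10):
    """Run a bag of hello tasks on localhost and return their final states."""
    import random
    import radical.pilot as rp        # requires an environment with RP installed

    session = rp.Session()
    try:
        pmgr = rp.PilotManager(session=session)
        tmgr = rp.TaskManager(session=session)

        # same pilot as in the tutorial, minus 'exit_on_error'
        pdesc = rp.PilotDescription({'resource': 'local.localhost',
                                     'project' : None,
                                     'queue'   : None,
                                     'cores'   : 4,
                                     'gpus'    : 0,
                                     'runtime' : 30})
        pilot = pmgr.submit_pilots(pdesc)
        tmgr.add_pilots(pilot)

        tds = []
        for _ in range(n_tasks):
            td = rp.TaskDescription()
            td.executable     = 'radical-pilot-hello.sh'
            td.arguments      = [random.randint(1, 10)]
            td.ranks          = 1
            td.cores_per_rank = random.randint(1, 2)
            td.named_env      = 'rp'
            tds.append(td)

        tasks = tmgr.submit_tasks(tds)
        tmgr.wait_tasks(uids=[t.uid for t in tasks])
        return [t.state for t in tasks]
    finally:
        # always release the session, even if a step above raised
        session.close()
```

Calling main() from an activated RP environment executes the whole application; the returned list can be tallied in the same way as in the previous cell.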


Generated Output

RP is a distributed system, even when all its components run on a single machine, as in this example. RP has two main components (Client and Agent), and each stores its output in a sandbox at a specific filesystem location:

  • Client sandbox: A directory created within the working directory from where the RP application was launched. The sandbox is named after the session ID, e.g., rp.session.nodename.username.018952.0000.

  • Agent sandbox: A directory created at a different location, depending on the machine on which the application executes. The Agent sandbox, named radical.pilot.sandbox, contains the following nested directories: <session_ID>/<pilot_ID>/<task_ID> which represent session-sandbox, pilot-sandbox(es) and task-sandbox(es) respectively.

When running RP on localhost, the Agent sandbox is located at $HOME/radical.pilot.sandbox. When using a supported HPC platform, the location of the Agent sandbox depends on the filesystem capabilities of the platform. You can see the pre-configured location for the Agent sandbox in the RP git repository, at src/radical/pilot/configs/resource_*.json.

Warning: When executing RP on a supported HPC platform, the output file(s) of each task are saved in the task-sandbox, named after task_ID of that task. Without specific staging instructions (see our tutorial Staging Data with RADICAL-Pilot), you will have to manually retrieve those files. When executing on localhost, you can retrieve them from $HOME/radical.pilot.sandbox/<session_ID>/<pilot_ID>/<task_ID>.
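
On localhost, the directory layout described above can be turned into a path with pathlib. This is a sketch under the assumption that the Agent sandbox is at $HOME/radical.pilot.sandbox, as stated for localhost; the helper names and the session ID in the usage lines are hypothetical.

```python
from pathlib import Path


def task_sandbox(session_id, pilot_id, task_id, root=None):
    """Build <root>/<session_ID>/<pilot_ID>/<task_ID>."""
    # default root: the localhost Agent sandbox named in this tutorial
    root = Path(root) if root else Path.home() / 'radical.pilot.sandbox'
    return root / session_id / pilot_id / task_id


def list_outputs(sandbox):
    """Return the sorted file names in a task sandbox, or [] if it is absent."""
    sandbox = Path(sandbox)
    if not sandbox.is_dir():
        return []
    return sorted(p.name for p in sandbox.iterdir() if p.is_file())


# 'rp.session.example' is a placeholder, not a real session ID
sandbox = task_sandbox('rp.session.example', 'pilot.0000', 'task.000000')
print(sandbox)
print(list_outputs(sandbox))
```

On an HPC platform the root differs, so the root argument would have to be set to the location configured for that platform.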

Note: When enabling debugging (see our tutorial Debugging a RADICAL-Pilot Application) and/or tracing (see our tutorial Tracing and Profiling a RADICAL-Pilot Application), RP writes the debug and/or trace files in the Client and Agent sandboxes. On large/production runs, RP can produce hundreds of debug files. Please contact the RADICAL development team if you need further assistance.

Here are the output files of task.000000 from the application we just executed in this notebook:

[15]:
!echo $pilot_sandbox
!ls $pilot_sandbox/task.000000/
/home/docs/radical.pilot.sandbox/rp.session.35db48f8-4693-11f0-bd8b-764078d63b85/pilot.0000/
task.000000.err      task.000000.launch.out  task.000000.out
task.000000.exec.sh  task.000000.launch.sh   task.000000.prof
task.000000.files    task.000000.ofiles

Here is the “result” produced by task.000000:

[16]:
!cat $pilot_sandbox/task.000000/task.000000.out
0 : PID     : 3881
0 : NODE    : build-28471584-project-13481-radicalpilot
0 : CPUS    : 00
0 : GPUS    :
0 : RANK    : 0
0 : THREADS : 1
0 : SLEEP   : 3
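
To use these values programmatically, the output can be parsed into a dictionary. The sketch below assumes the layout shown above, i.e., three colon-separated fields per line, and assumes that the leading field is the rank index; the function name parse_hello_output is hypothetical, and the sample text is copied from the output of task.000000.

```python
def parse_hello_output(text):
    """Map each leading index (assumed to be the rank) to its fields."""
    ranks = {}
    for line in text.splitlines():
        parts = line.split(':', 2)
        if len(parts) != 3:
            continue                  # skip lines that do not follow the layout
        rank, key, value = (p.strip() for p in parts)
        ranks.setdefault(rank, {})[key] = value
    return ranks


sample = """\
0 : PID     : 3881
0 : NODE    : build-28471584-project-13481-radicalpilot
0 : CPUS    : 00
0 : GPUS    :
0 : RANK    : 0
0 : THREADS : 1
0 : SLEEP   : 3
"""

parsed = parse_hello_output(sample)
print(parsed['0']['NODE'])
print(parsed['0']['SLEEP'])
```

All values are kept as strings, and an empty field such as GPUS becomes an empty string.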