Executing Tasks with RAPTOR

This notebook introduces you to RAPTOR, a high-throughput subsystem of RADICAL-Pilot (RP) that executes function tasks and non-MPI executable tasks at scale on supported HPC platforms. This tutorial guides you through the setup and configuration of RAPTOR and through the specification and execution of a variety of task types.

Warning: We assume you have already worked through our getting started and describing tasks tutorials.

Warning: All examples in this notebook are executed locally on your machine. You need to have MPI installed before executing these examples. RADICAL-Pilot and RAPTOR support OpenMPI, MPICH, MVAPICH, or any other MPI flavor that provides a standards-compliant mpiexec command.

When to Use RAPTOR

Use RAPTOR when you want to concurrently execute free/serialized Python functions, Python class methods and shell commands. For example, you may want to concurrently execute up to 10^5 machine learning functions across thousands of GPUs and CPUs. RAPTOR supports single-process, multi-process and MPI Python functions.

You should also use RAPTOR when your application requires non-MPI tasks that execute for less than 5 minutes. You could use RADICAL-Pilot without RAPTOR for that workload, but you would incur high scheduling and launching overheads.

What is RAPTOR

RAPTOR is a subsystem of RP, so you have to execute RP in order to use RAPTOR. RAPTOR launches a configurable number of masters and workers on the resources you acquired via RP. Once up and running, each RAPTOR master receives task execution requests from RP. In turn, each master dispatches those requests to the workers, which are optimized to execute small, short-running tasks at scale.
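The master/worker pattern described above can be illustrated with a small analogy. The sketch below is not RAPTOR code and does not use RP: it relies only on Python's standard library, and the names `short_task` and `run_master` are hypothetical. It only shows the general idea of one dispatcher handing many small requests to a fixed set of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def short_task(n: int) -> int:
    """Hypothetical stand-in for a small, short-running task."""
    return n * n


def run_master(requests, n_workers: int = 2):
    """Dispatch each request to a pool of workers; return results in request order."""
    with ThreadPoolExecutor(max_workers=n_workers) as workers:
        return list(workers.map(short_task, requests))


results = run_master(range(5))
print(results)  # prints [0, 1, 4, 9, 16]
```

In RAPTOR the workers run on the pilot's resources rather than as local threads, but the division of labor between the dispatching side and the executing side is the same in spirit.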

Unlike RP's 'normal' task execution, RAPTOR can execute a variety of task types:

executables: similar to RP’s native task execution, but without MPI support
free Python functions
Python class methods
serialized Python functions
plain Python code
shell commands

Importantly, all function invocations can make use of MPI by defining multiple ranks.

RAPTOR has a number of advanced capabilities, such as:

new task types can be added by applications
the RAPTOR Master class can be overloaded by applications
the RAPTOR Worker class can be overloaded by applications
Master and Worker layout can be tuned in a variety of ways
different Worker implementations are available with different capabilities and scaling properties
workload execution can mix RAPTOR task execution and ‘normal’ RP task execution

These topics are not covered in this basic tutorial.

Prepare an RP pilot to host RAPTOR

We will launch a pilot with sufficient resources to run both the RAPTOR master (using 1 core) and two worker instances (using 8 cores each), i.e., 17 cores in total:

[1]:

import os

# do not use animated output in notebooks
os.environ['RADICAL_REPORT_ANIME'] = 'False'

import radical.pilot as rp
import radical.utils as ru

# determine the path of the currently active virtual environment (ve) to simplify some examples below
ve_path = os.path.dirname(os.path.dirname(ru.which('python3')))

# create session and managers
session = rp.Session()
pmgr    = rp.PilotManager(session)
tmgr    = rp.TaskManager(session)

# submit a pilot
pilot = pmgr.submit_pilots(rp.PilotDescription({'resource'     : 'local.localhost',
                                                'runtime'      : 60,
                                                'cores'        : 17,
                                                'exit_on_error': False}))

# add the pilot to the task manager and wait for the pilot to become active
tmgr.add_pilots(pilot)
pilot.wait(rp.PMGR_ACTIVE)
print('pilot is up and running')

pilot is up and running
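The core count requested for the pilot follows from the layout described above: one master on 1 core plus two workers on 8 cores each. A minimal sketch of that arithmetic, using a hypothetical helper `pilot_cores` that is not part of RP, may help when sizing a pilot for a different layout:

```python
def pilot_cores(n_masters: int, cores_per_master: int,
                n_workers: int, cores_per_worker: int) -> int:
    """Total cores needed to host the given number of masters and workers."""
    return n_masters * cores_per_master + n_workers * cores_per_worker


# layout used in this tutorial: 1 master (1 core) and 2 workers (8 cores each)
cores = pilot_cores(n_masters=1, cores_per_master=1,
                    n_workers=2, cores_per_worker=8)
print(cores)  # prints 17
```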

We now have an active pilot with sufficient resources and can start executing the RAPTOR master and worker instances. Both master and worker need to run in an environment which has radical.pilot installed, so we place them in the pilot agent environment rp (the named_env attribute is covered in the describing tasks tutorial):

[2]:

master_descr = {'mode'     : rp.RAPTOR_MASTER,
                'named_env': 'rp'}
worker_descr = {'mode'     : rp.RAPTOR_WORKER,
                'named_env': 'rp'}

raptor  = pilot.submit_raptors( [rp.TaskDescription(master_descr)])[0]
workers = raptor.submit_workers([rp.TaskDescription(worker_descr),
                                 rp.TaskDescription(worker_descr)])
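The cell above lists the same worker description twice. For a larger number of workers, the list could be generated instead. The sketch below uses plain dictionaries so that it runs without RP; `make_worker_descriptions` is a hypothetical helper, and the mode string is a placeholder standing in for `rp.RAPTOR_WORKER`. In the notebook, each dictionary would still be wrapped in `rp.TaskDescription` before being passed to `raptor.submit_workers`.

```python
def make_worker_descriptions(n_workers: int, mode, named_env: str = 'rp'):
    """Return n_workers independent description dicts with identical content."""
    return [{'mode': mode, 'named_env': named_env} for _ in range(n_workers)]


# placeholder mode value; in the notebook this would be rp.RAPTOR_WORKER
descriptions = make_worker_descriptions(2, mode='RAPTOR_WORKER_PLACEHOLDER')
print(len(descriptions))  # prints 2
```

Building a fresh dictionary per worker avoids sharing one mutable object between descriptions, which matters if individual workers are later customized.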

Task execution

At this point the pilot is set up and running and the master task has been started; upon initialization, the master starts the worker tasks. The RAPTOR overlay is now ready to execute a Python function. Note that a function must be decorated with the rp.pythontask decorator to be callable as a RAPTOR function.

[3]:

# function for raptor to execute
@rp.pythontask
def msg(val: int):
    if val % 2 == 0:
        print('Regular message')
    else:
        print(f'This is a very odd message: {val}')

# create a minimal function task
td   = rp.TaskDescription({'mode'    : rp.TASK_FUNCTION,
                           'function': msg(3)})
task = raptor.submit_tasks([td])[0]
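Before handing a function to RAPTOR, it can be useful to check its logic as an ordinary local call. The sketch below repeats the body of `msg` without the `rp.pythontask` decorator, so it runs without RP; the helper `captured_output` is hypothetical and only collects what the function prints.

```python
import contextlib
import io


def msg(val: int):
    """Same logic as the RAPTOR function above, without the decorator."""
    if val % 2 == 0:
        print('Regular message')
    else:
        print(f'This is a very odd message: {val}')


def captured_output(func, *args) -> str:
    """Call func(*args) and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args)
    return buffer.getvalue()


odd_out  = captured_output(msg, 3)
even_out = captured_output(msg, 4)
print(odd_out, end='')   # prints This is a very odd message: 3
print(even_out, end='')  # prints Regular message
```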

The task will be scheduled for execution on the pilot we created above. We now wait for the task to complete, i.e., to reach one of the final states DONE, CANCELED or FAILED:

[4]:

print(task)
tmgr.wait_tasks([task.uid])
print('id: %s [%s]:\n    out:\n%s\n    ret: %s\n'
     % (task.uid, task.state, task.stdout, task.return_value))

['task.000000', '', 'AGENT_STAGING_INPUT_PENDING']
id: task.000000 [DONE]:
    out:
This is a very odd message: 3

    ret: None
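When many tasks are submitted, it helps to summarize which of them have reached a final state. The sketch below works on plain state strings so that it runs without RP: the set of final states is taken from the text above, while `summarize_states` and the uids `task.000001` and `task.000002` are hypothetical and used for illustration only.

```python
from collections import Counter

# final states as named in this tutorial
FINAL_STATES = {'DONE', 'CANCELED', 'FAILED'}


def summarize_states(task_states: dict):
    """Count tasks per state and list the uids that are not yet final."""
    counts = dict(Counter(task_states.values()))
    unfinished = sorted(uid for uid, state in task_states.items()
                        if state not in FINAL_STATES)
    return counts, unfinished


# illustrative data; in the notebook the pairs would come from task.uid and task.state
example = {'task.000000': 'DONE',
           'task.000001': 'FAILED',
           'task.000002': 'AGENT_STAGING_INPUT_PENDING'}
counts, unfinished = summarize_states(example)
print(unfinished)  # prints ['task.000002']
```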

[5]:

session.close()