Debugging

RADICAL-Pilot is a complex runtime system which employs multiple distributed components to orchestrate workload execution. It is also research software, funded by research grants. As such, it may not be comparable to commercially supported software systems.

Also, RADICAL-Pilot mostly targets academic HPC environments and high-end machines which are usually at the cutting edge of hardware and software development. Those machines thus usually have their own custom, sometimes peculiar, and evolving system environments.

All that is to say that you may need to investigate various failure modes: failures related to the execution of your workload tasks, and possibly also failures related to RADICAL-Pilot’s own operation.

This notebook attempts to guide you through different means of investigating possible failure modes. That is not necessarily an intuitive process, but it hopefully covers the most common problems. We encourage you to seek support from the RADICAL development community via TODO if the presented means prove insufficient.

Setting the stage

We run a simple RP application which triggers specific failures on purpose. The resulting session is used as a demonstrator for the remainder of this notebook. We submit one task which succeeds (task.000000) and one with an invalid executable which fails (task.000001).

[1]:
%env RADICAL_LOG_LVL=DEBUG

%env RADICAL_REPORT=TRUE
%env RADICAL_REPORT_ANIME=FALSE
env: RADICAL_LOG_LVL=DEBUG
env: RADICAL_REPORT=TRUE
env: RADICAL_REPORT_ANIME=FALSE
[2]:

import radical.pilot as rp
import radical.utils as ru

client_sandbox = None
pilot_sandbox  = None

with rp.Session() as session:

    pmgr  = rp.PilotManager(session=session)
    tmgr  = rp.TaskManager(session=session)

    pilot = pmgr.submit_pilots(
        rp.PilotDescription({'resource'     : 'local.localhost',
                             'cores'        : 4,
                             'runtime'      : 10,
                             'exit_on_error': False}))
    tmgr.add_pilots(pilot)

    td_1  = rp.TaskDescription({'executable': 'date'})
    td_2  = rp.TaskDescription({'executable': 'data'})  # <- unknown executable
    tasks = tmgr.submit_tasks([td_1, td_2])
    tmgr.wait_tasks()

    client_sandbox = ru.Url(pilot.client_sandbox).path + '/' + session.uid
    pilot_sandbox  = ru.Url(pilot.pilot_sandbox).path

for task in tasks:
    print('%s: %s' % (task.uid, task.state))

print('client sandbox: %s' % client_sandbox)
print('pilot  sandbox: %s' % pilot_sandbox)
new session: [rp.session.6e80a856-4693-11f0-8eb1-764078d63b85]                 \
zmq proxy  : [tcp://172.17.0.2:10001]                                         ok
create pilot manager                                                          ok
create task manager                                                           ok
submit 1 pilot(s)
        pilot.0000   local.localhost           4 cores       0 gpus           ok
submit: ########################################################################
wait  : ########################################################################
     DONE      :     1
      FAILED    :     1
                                                                              ok
closing session rp.session.6e80a856-4693-11f0-8eb1-764078d63b85                \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 29.5s                                                       ok

task.000000: DONE
task.000001: FAILED
client sandbox: /home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/devel/docs/source/tutorials/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85
pilot  sandbox: /home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85/pilot.0000/

Investigating Task Failures

You created a task description and submitted your tasks, and they end up in FAILED state. At the API level, you can inspect the tasks’ stdout and stderr values as follows:

[3]:
import os

import radical.pilot as rp

for task in tasks:
    if task.state == rp.FAILED:
        print('%s stderr: %s' % (task.uid, task.stderr))
    elif task.state == rp.DONE:
        print('%s stdout: %s' % (task.uid, task.stdout))
task.000000 stdout: Wed Jun 11 07:12:43 UTC 2025

task.000001 stderr: /home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85//pilot.0000//task.000001/task.000001.exec.sh: line 56: data: command not found

Note, though, that the available length of both values is limited to 1024 characters. If that is insufficient, you can still inspect the complete values on the file system of the target resource. To do so, navigate to the task sandbox (whose location can be inspected via task.sandbox).

That sandbox usually has a set of files similar to the example shown below. The <task.uid>.out and <task.uid>.err files will have captured the task’s stdout and stderr streams, respectively:

[4]:
tid = tasks[1].uid

%cd $pilot_sandbox/$tid

!ls -l
!cat "$tid".err

/home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85/pilot.0000/task.000001
total 16
-rw-r--r-- 1 docs docs  160 Jun 11 07:12 task.000001.err
-rwxr--r-- 1 docs docs 2575 Jun 11 07:12 task.000001.exec.sh
-rw-r--r-- 1 docs docs    0 Jun 11 07:12 task.000001.files
-rw-r--r-- 1 docs docs    0 Jun 11 07:12 task.000001.launch.out
-rwxr--r-- 1 docs docs 2261 Jun 11 07:12 task.000001.launch.sh
-rw-r--r-- 1 docs docs    0 Jun 11 07:12 task.000001.ofiles
-rw-r--r-- 1 docs docs    0 Jun 11 07:12 task.000001.out
-rw-r--r-- 1 docs docs  912 Jun 11 07:12 task.000001.prof
/home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85//pilot.0000//task.000001/task.000001.exec.sh: line 56: data: command not found
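The same files can also be read from Python, for example when the sandbox is on a shared file system that the client can access. The following is a minimal sketch, not part of the RP API: the helper name read_task_streams is hypothetical, and it only assumes the <task.uid>.out and <task.uid>.err naming shown above.

```python
from pathlib import Path


def read_task_streams(task_sandbox, task_uid):
    """Return the full contents of <task_uid>.out and <task_uid>.err.

    Hypothetical helper: it relies only on the file naming shown in the
    listing above. A missing file is reported as None instead of raising.
    """
    sandbox = Path(task_sandbox)
    streams = {}
    for name, suffix in (('stdout', 'out'), ('stderr', 'err')):
        fpath = sandbox / ('%s.%s' % (task_uid, suffix))
        # errors='replace' keeps the read robust against non-UTF-8 output
        streams[name] = (fpath.read_text(errors='replace')
                         if fpath.is_file() else None)
    return streams
```

In this notebook's session, the helper could be called with the directory of the failed task (the pilot sandbox path joined with tasks[1].uid) and tasks[1].uid as arguments. Unlike task.stdout and task.stderr, the returned strings are not limited in length.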

A very common cause of task failures is an invalid environment setup: scientific applications frequently require software modules to be loaded, virtual environments to be activated, etc. Those actions are specified in the task description’s pre_exec statements. You may want to inspect <task.uid>.exec.sh in the task sandbox to check whether the environment setup is indeed what you expect it to be.
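A quick way to run that check is to search the script for the setup commands you expect. The sketch below is an illustration under one loud assumption: that the pre_exec commands appear verbatim in <task.uid>.exec.sh. The helper name find_missing_setup is hypothetical, and if RP rewrites or wraps the commands in your version, a manual look at the script is the more reliable option.

```python
from pathlib import Path


def find_missing_setup(exec_script, expected_commands):
    """Return the expected commands which do not occur in the script.

    Assumption: setup commands are written to the script verbatim, so a
    plain substring search is sufficient for a first check.
    """
    text = Path(exec_script).read_text(errors='replace')
    return [cmd for cmd in expected_commands if cmd not in text]
```

An empty result means that all expected commands were found in the script. It does not show that they succeeded, so the task's stderr file remains the place to look for errors raised by the setup itself.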

Investigating RADICAL-Pilot Failures

If the investigation of the task sandbox did not yield any clues as to the origin of the failure, but your task still ends up in FAILED state or RP itself fails in any other way, we suggest the following sequence of commands, in that order, to investigate the problem further.

First, check the client side session sandbox for any ERROR log messages or error messages in general:

[5]:

%cd $client_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err
[Errno 2] No such file or directory: '/home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/devel/docs/source/tutorials/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85'
/home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85/pilot.0000/task.000001
grep: *log: No such file or directory
-rw-r--r-- 1 docs docs 160 Jun 11 07:12 task.000001.err
-rw-r--r-- 1 docs docs   0 Jun 11 07:12 task.000001.launch.out
-rw-r--r-- 1 docs docs   0 Jun 11 07:12 task.000001.out

You would expect no ERROR lines to show up in the log files, and all stdout/stderr files of the RP components to be empty.

The next step is to repeat that process in the pilot sandbox:

[6]:

%cd $pilot_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err
/home/docs/radical.pilot.sandbox/rp.session.6e80a856-4693-11f0-8eb1-764078d63b85/pilot.0000
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_0.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_0.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_collecting_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_collecting_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_executing.0000.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_executing.0000.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_executing_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_executing_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_schedule_pubsub.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_schedule_pubsub.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_scheduling.0000.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_scheduling.0000.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_scheduling_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_scheduling_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_input.0000.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_input.0000.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_input_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_input_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_output.0000.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_output.0000.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_output_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_staging_output_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_unschedule_pubsub.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 agent_unschedule_pubsub.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 bootstrap_0.err
-rw-r--r-- 1 docs docs 20171 Jun 11 07:12 bootstrap_0.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 client_pubsub.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 client_pubsub.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 client_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 client_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 control_pubsub.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 control_pubsub.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 raptor_scheduling_queue.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 raptor_scheduling_queue.out
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 state_pubsub.err
-rw-r--r-- 1 docs docs     0 Jun 11 07:12 state_pubsub.out

Here you will always find bootstrap_0.out to be populated with the output of RP’s shell bootstrapper. If no other errors in the log or stdio files show up, you may want to look at that bootstrap_0.out output to see if and why the pilot bootstrapping failed.
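The two checks applied to the client and pilot sandboxes can also be sketched as a single Python helper. This is an illustration only: the name scan_sandbox is hypothetical, and the function mirrors the shell commands used above (grep 'ERROR' *log and ls -l *.out *.err) without using any RP API.

```python
from pathlib import Path


def scan_sandbox(sandbox):
    """Collect ERROR log lines and non-empty stdio files in a sandbox.

    Only the top level of the directory is inspected, like the shell
    globs *log, *.out and *.err used in this notebook.
    """
    sandbox = Path(sandbox)

    error_lines = []
    for log in sorted(sandbox.glob('*log')):
        if not log.is_file():
            continue
        with log.open(errors='replace') as fin:
            for lineno, line in enumerate(fin, 1):
                if 'ERROR' in line:
                    error_lines.append((log.name, lineno, line.rstrip()))

    # stdio files are expected to be empty: report only those with content
    stdio = sorted(list(sandbox.glob('*.out')) + list(sandbox.glob('*.err')))
    non_empty = [(f.name, f.stat().st_size)
                 for f in stdio if f.is_file() and f.stat().st_size > 0]

    return error_lines, non_empty
```

When applied to the pilot sandbox, bootstrap_0.out is expected to show up in the list of non-empty files, as described above. Any other entry in either list is a good starting point for further investigation.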

Ask for Help from the RADICAL Team

If none of the above steps provided any insight into the causes of the observed failures, please execute the following steps:

  • create a tarball of the client sandbox

  • create a tarball of the session sandbox

  • open an issue at https://github.com/radical-cybertools/radical.pilot/issues/new and attach both tarballs

  • describe the observed problem and include the following additional information:

    • output of the radical-stack command

    • information about any changes to the configuration file of the target platform

We will likely be able to infer the causes of the problem from the provided sandbox tarballs and will be happy to help you correct them, or we will ask for further information about the environment your application is running in.
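The tarballs mentioned in the steps above can be created with any tool. As a minimal sketch, Python's standard tarfile module is sufficient. The helper name make_tarball and the label argument are hypothetical. The label is only there to keep the two archive names apart, since both sandbox directories may carry the session ID as their name.

```python
import tarfile
from pathlib import Path


def make_tarball(sandbox, label, target_dir='.'):
    """Pack a sandbox directory into <label>.<dirname>.tar.gz.

    target_dir should be located outside of the sandbox, otherwise the
    archive would attempt to include itself.
    """
    sandbox = Path(sandbox).resolve()
    target = Path(target_dir) / ('%s.%s.tar.gz' % (label, sandbox.name))
    with tarfile.open(str(target), 'w:gz') as tar:
        # arcname stores paths relative to the sandbox directory name
        tar.add(str(sandbox), arcname=sandbox.name)
    return target
```

For example, the client sandbox could be packed with the label client and the sandbox on the target resource with the label session. Both resulting files can then be attached to the issue.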