Debugging
RADICAL-Pilot is a complex runtime system which employs multiple distributed components to orchestrate workload execution. It is also research software, funded by research grants. As such, it is possibly not quite comparable to commercially supported software systems.
Also, RADICAL-Pilot mostly targets academic HPC environments and high-end machines, which are usually at the cutting edge of hardware and software development. Those machines thus usually have their own custom, sometimes peculiar, and evolving system environments.
All that is to say that it might be necessary to investigate various possible failure modes: failures related to the execution of your workload tasks, and possibly also failures related to RADICAL-Pilot’s own operation.
This notebook attempts to guide you through different means of investigating possible failure modes. That is not necessarily an intuitive process, but the notebook hopefully covers the most common problems. We encourage you to seek support from the RCT developer community via TODO if the presented means prove insufficient.
Setting the stage
We run a simple RP application which triggers specific failures on purpose. The resulting session serves as the example for the remainder of this notebook. We submit one task which succeeds (task.000000) and one with an invalid executable which fails (task.000001).
[1]:
import radical.pilot as rp
import radical.utils as ru

# filled in within the session scope, used by the later cells
client_sandbox = None
pilot_sandbox = None

with rp.Session() as session:

    # request a small local pilot: 4 cores for 10 minutes
    pmgr = rp.PilotManager(session=session)
    pilot = pmgr.submit_pilots(rp.PilotDescription({'resource': 'local.localhost',
                                                    'cores' : 4,
                                                    'runtime' : 10}))

    tmgr = rp.TaskManager(session=session)
    tmgr.add_pilots(pilot)

    # 'date' is a valid executable, 'data' is not and will fail on purpose
    td_1 = rp.TaskDescription({'executable': 'date'})
    td_2 = rp.TaskDescription({'executable': 'data'})
    tasks = tmgr.submit_tasks([td_1, td_2])
    tmgr.wait_tasks()

    for task in tasks:
        print('%s: %s' % (task.uid, task.state))

    # remember the sandbox locations while the session is still open
    client_sandbox = ru.Url(pilot.client_sandbox).path + '/' + session.uid
    pilot_sandbox = ru.Url(pilot.pilot_sandbox).path

print('client sandbox: %s' % client_sandbox)
print('pilot sandbox: %s' % pilot_sandbox)
task.000000: DONE
task.000001: FAILED
client sandbox: /home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/stable/docs/source/tutorials/rp.session.1b975200-ced6-11ef-8a35-0242ac110002
pilot sandbox: /home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002/pilot.0000/
Investigating Task Failures
You created a task description and submitted your task, but it ends up in FAILED state. At the API level, you can inspect the task’s stdout and stderr values as follows:
[2]:
import os
import radical.pilot as rp

for task in tasks:
    if task.state == rp.FAILED:
        print('%s stderr: %s' % (task.uid, task.stderr))
    elif task.state == rp.DONE:
        print('%s stdout: %s' % (task.uid, task.stdout))
task.000000 stdout: Thu Jan 9 22:07:37 UTC 2025
task.000001 stderr: /home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002//pilot.0000//task.000001/task.000001.exec.sh: line 55: data: command not found
Note, though, that both values are truncated to 1024 characters. If that is insufficient, you can still inspect the complete values on the file system of the target resource. To do so, navigate to the task sandbox (whose location can be obtained via task.sandbox).
That sandbox usually contains a set of files similar to the example shown below. The <task.uid>.out and <task.uid>.err files capture the task’s stdout and stderr streams, respectively:
[3]:
tid = tasks[1].uid
%cd $pilot_sandbox/$tid
!ls -l
!cat "$tid".err
/home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002/pilot.0000/task.000001
total 16
-rw-r--r-- 1 docs docs 160 Jan 9 22:07 task.000001.err
-rwxr--r-- 1 docs docs 2408 Jan 9 22:07 task.000001.exec.sh
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.files
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.launch.out
-rwxr--r-- 1 docs docs 2145 Jan 9 22:07 task.000001.launch.sh
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.ofiles
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.out
-rw-r--r-- 1 docs docs 909 Jan 9 22:07 task.000001.prof
/home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002//pilot.0000//task.000001/task.000001.exec.sh: line 55: data: command not found
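Outside of a notebook, the same inspection can be scripted. The following is a minimal sketch which assumes only the <task.uid>.out and <task.uid>.err file names described above. The helper name read_task_streams is hypothetical and not part of the RADICAL-Pilot API.

```python
from pathlib import Path


def read_task_streams(task_sandbox, task_uid):
    """Return the complete (stdout, stderr) text found in a task sandbox.

    Hypothetical helper: it relies only on the `<task.uid>.out` and
    `<task.uid>.err` naming convention. A missing file is reported as
    None instead of raising an exception.
    """
    sandbox = Path(task_sandbox)
    streams = []
    for suffix in ('out', 'err'):
        fpath = sandbox / f'{task_uid}.{suffix}'
        if fpath.is_file():
            # tolerate non-UTF-8 bytes in the application output
            streams.append(fpath.read_text(errors='replace'))
        else:
            streams.append(None)
    return tuple(streams)
```

With the variables of the cell above, a call such as read_task_streams(pilot_sandbox + '/' + tid, tid) would return the untruncated streams of the failed task, provided the pilot sandbox is reachable from the machine running the script.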
A very common cause of task failures is an invalid environment setup: scientific applications frequently require software modules to be loaded, virtual environments to be activated, etc. Those actions are specified in the task description’s pre_exec statements. You may want to inspect <task.uid>.exec.sh in the task sandbox to check whether the environment setup is indeed what you expect it to be.
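For long scripts it can help to list only the lines that touch the environment. The sketch below is a heuristic and makes no assumption about the exact layout of the generated <task.uid>.exec.sh: it simply reports lines that start with common environment commands. The helper name env_setup_lines and the list of commands are illustrative choices, not part of RADICAL-Pilot.

```python
import re
from pathlib import Path

# Heuristic and not exhaustive: shell commands that usually alter the
# environment a task runs in (assumed list, extend as needed).
ENV_SETUP = re.compile(r'^\s*(module\s|source\s|\.\s|export\s|conda\s|spack\s)')


def env_setup_lines(script_path):
    """Return (line_number, text) pairs for environment-related script lines."""
    hits = []
    text = Path(script_path).read_text(errors='replace')
    for num, line in enumerate(text.splitlines(), start=1):
        if ENV_SETUP.match(line):
            hits.append((num, line.strip()))
    return hits
```

Comparing the reported lines against the pre_exec statements of the task description shows quickly whether a module load or an activation step is missing or appears in an unexpected order.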
Investigate RADICAL-Pilot Failures
If the investigation of the task sandbox did not yield any clues as to the origin of the failure, but your task still ends up in FAILED state or RP itself fails in any other way, we suggest the following sequence of steps, in the given order, to investigate the problem further.
First, check the client side session sandbox for any ERROR log messages or error messages in general:
[4]:
%cd $client_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err
[Errno 2] No such file or directory: '/home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/stable/docs/source/tutorials/rp.session.1b975200-ced6-11ef-8a35-0242ac110002'
/home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002/pilot.0000/task.000001
grep: *log: No such file or directory
-rw-r--r-- 1 docs docs 160 Jan 9 22:07 task.000001.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.launch.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 task.000001.out
You would expect no ERROR lines to show up in the log files, and all stdout/stderr files of the RP components to be empty.
The next step is to repeat that process in the pilot sandbox:
[5]:
%cd $pilot_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err
/home/docs/radical.pilot.sandbox/rp.session.1b975200-ced6-11ef-8a35-0242ac110002/pilot.0000
agent_0.log:1736460457.148 : agent_0 : 6583 : 139893594388032 : ERROR : invalid command: [register_named_env]
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_0.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_0.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_collecting_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_collecting_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_executing.0000.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_executing.0000.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_executing_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_executing_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_schedule_pubsub.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_schedule_pubsub.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_scheduling.0000.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_scheduling.0000.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_scheduling_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_scheduling_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_input.0000.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_input.0000.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_input_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_input_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_output.0000.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_output.0000.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_output_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_staging_output_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_unschedule_pubsub.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 agent_unschedule_pubsub.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 bootstrap_0.err
-rw-r--r-- 1 docs docs 17616 Jan 9 22:07 bootstrap_0.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 client_pubsub.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 client_pubsub.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 client_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 client_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 control_pubsub.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 control_pubsub.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 raptor_scheduling_queue.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 raptor_scheduling_queue.out
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 state_pubsub.err
-rw-r--r-- 1 docs docs 0 Jan 9 22:07 state_pubsub.out
Here you will always find bootstrap_0.out populated with the output of RP’s shell bootstrapper. If no other errors show up in the log or stdio files, you may want to look at that bootstrap_0.out output to see if and why the pilot bootstrapping failed.
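The two checks above (ERROR lines in the log files, non-empty stdio files) can also be combined into one small script, which is convenient when the sandbox lives on a remote machine without a notebook. This is a sketch under the assumption that logs end in .log and stdio files end in .out or .err, as in the listings above. The function name scan_sandbox is hypothetical.

```python
from pathlib import Path


def scan_sandbox(sandbox):
    """Collect ERROR log lines and non-empty stdio files of a sandbox.

    Returns a tuple (errors, stdio): errors is a list of
    (file_name, line) pairs, stdio a list of (file_name, size_in_bytes).
    """
    sandbox = Path(sandbox)

    # equivalent of: grep 'ERROR' *log
    errors = []
    for log in sorted(sandbox.glob('*.log')):
        with log.open(errors='replace') as fin:
            for line in fin:
                if 'ERROR' in line:
                    errors.append((log.name, line.rstrip('\n')))

    # equivalent of: ls -l *.out *.err, reduced to the non-empty files
    stdio = []
    for pattern in ('*.out', '*.err'):
        for fpath in sorted(sandbox.glob(pattern)):
            size = fpath.stat().st_size
            if size > 0:
                stdio.append((fpath.name, size))

    return errors, stdio
```

When run on a pilot sandbox, bootstrap_0.out is expected to appear in the second list, since it always holds the bootstrapper output. Any other entry deserves a closer look.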
Ask for Help from the RADICAL Team
If none of the above steps provided any insight into the causes of the observed failures, please take the following steps:
create a tarball of the client sandbox
create a tarball of the session sandbox
open an issue at https://github.com/radical-cybertools/radical.pilot/issues/new and attach both tarballs
describe the observed problem and include the following additional information: the output of the radical-stack command, and information about any changes to the resource configuration of the target resource
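The tarballs can be created with the tar command line tool or, as sketched below, with Python’s standard tarfile module. The helper name make_tarball and its label argument are illustrative. The label keeps the two archives apart, since both sandboxes carry the same session ID as their directory name.

```python
import tarfile
from pathlib import Path


def make_tarball(sandbox, label, target_dir='.'):
    """Pack a sandbox directory into <label>.<sandbox name>.tar.gz.

    Returns the path of the created archive. The target directory
    should not be located inside the sandbox itself.
    """
    sandbox = Path(sandbox).resolve()
    if not sandbox.is_dir():
        raise FileNotFoundError(f'no such sandbox directory: {sandbox}')

    archive = Path(target_dir) / f'{label}.{sandbox.name}.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        # store paths relative to the sandbox name, not as absolute paths
        tar.add(sandbox, arcname=sandbox.name)
    return archive
```

Assuming, as the paths printed at the top of this notebook suggest, that the session sandbox is the parent directory of the pilot sandbox, the two archives could be created with make_tarball(client_sandbox, 'client') and make_tarball(Path(pilot_sandbox).parent, 'session').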
We will likely be able to infer the causes of the problem from the provided sandbox tarballs and will be happy to help you correct them, or we will ask for further information about the environment your application is running in.