Debugging

RADICAL-Pilot is a complex runtime system which employs multiple distributed components to orchestrate workload execution. It is also research software, funded by research grants. As such, it is possibly not quite comparable to commercially supported software systems.

Also, RADICAL-Pilot mostly targets academic HPC environments and high-end machines, which are usually at the cutting edge of hardware and software development. Those machines thus usually have their own custom, sometimes peculiar, and evolving system environments.

All that is to say that it might be necessary to investigate various possible failure modes: failures related to the execution of your workload tasks, and possibly also failures related to RADICAL-Pilot’s own operation.

This notebook attempts to guide you through different means of investigating possible failure modes. That is not necessarily an intuitive process, but the notebook hopefully covers the most common problems. We encourage you to seek support from the RCT developer community via TODO if the presented means prove insufficient.

Setting the stage

We run a simple RP application which triggers specific failures on purpose. The resulting session will be used as a demonstrator for the remainder of this notebook. We submit one task which succeeds (task.000000) and one with an invalid executable which fails (task.000001).

[1]:

import radical.pilot as rp
import radical.utils as ru

client_sandbox = None
pilot_sandbox  = None

with rp.Session() as session:

    pmgr  = rp.PilotManager(session=session)
    pilot = pmgr.submit_pilots(rp.PilotDescription({'resource': 'local.localhost',
                                                    'cores'   : 4,
                                                    'runtime' : 10}))

    tmgr  = rp.TaskManager(session=session)
    tmgr.add_pilots(pilot)
    td_1  = rp.TaskDescription({'executable': 'date'})
    td_2  = rp.TaskDescription({'executable': 'data'})
    tasks = tmgr.submit_tasks([td_1, td_2])
    tmgr.wait_tasks()

    for task in tasks:
        print('%s: %s' % (task.uid, task.state))

    client_sandbox = ru.Url(pilot.client_sandbox).path + '/' + session.uid
    pilot_sandbox  = ru.Url(pilot.pilot_sandbox).path

print('client sandbox: %s' % client_sandbox)
print('pilot  sandbox: %s' % pilot_sandbox)

task.000000: DONE
task.000001: FAILED
client sandbox: /home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/stable/docs/source/tutorials/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002
pilot  sandbox: /home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002/pilot.0000/

Investigating Task Failures

You created task descriptions and submitted your tasks, and one or more of them end up in FAILED state. On the API level, you can inspect each task’s stdout and stderr values as follows:

[2]:

import radical.pilot as rp

for task in tasks:
    if task.state == rp.FAILED:
        print('%s stderr: %s' % (task.uid, task.stderr))
    elif task.state == rp.DONE:
        print('%s stdout: %s' % (task.uid, task.stdout))

task.000000 stdout: Thu Apr 18 05:29:39 UTC 2024

task.000001 stderr: /home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002//pilot.0000//task.000001/task.000001.exec.sh: 49: data: not found

Note, though, that both values are truncated to 1024 characters. If that is insufficient, you can still inspect the complete values on the file system of the target resource. To do so, navigate to the task sandbox (whose location is available via task.sandbox).

That sandbox usually has a set of files similar to the example shown below. The <task.uid>.out and <task.uid>.err files will have captured the task’s stdout and stderr streams, respectively:

[3]:

tid = tasks[1].uid

%cd $pilot_sandbox/$tid

!ls -l
!cat "$tid".err

/home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002/pilot.0000/task.000001
total 16
-rw-r--r-- 1 docs docs  147 Apr 18 05:29 task.000001.err
-rwxr--r-- 1 docs docs 1927 Apr 18 05:29 task.000001.exec.sh
-rw-r--r-- 1 docs docs    0 Apr 18 05:29 task.000001.launch.out
-rwxr--r-- 1 docs docs 2033 Apr 18 05:29 task.000001.launch.sh
-rw-r--r-- 1 docs docs    0 Apr 18 05:29 task.000001.out
-rw-r--r-- 1 docs docs  909 Apr 18 05:29 task.000001.prof
/home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002//pilot.0000//task.000001/task.000001.exec.sh: 49: data: not found
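
The same inspection can also be scripted without shell magics. The following is a minimal sketch, not part of the RP API: the helper name read_task_streams is hypothetical, and it assumes the sandbox layout shown above, where the streams are stored as <task.uid>.out and <task.uid>.err directly in the task sandbox.

```python
from pathlib import Path


def read_task_streams(task_sandbox, uid):
    """Return the complete (stdout, stderr) text captured for one task.

    Assumption: the streams live in <task_sandbox>/<uid>.out and
    <task_sandbox>/<uid>.err, as in the listing above.  A missing file
    yields an empty string instead of raising an exception.
    """
    sandbox = Path(task_sandbox)
    streams = []
    for suffix in ('out', 'err'):
        fpath = sandbox / ('%s.%s' % (uid, suffix))
        if fpath.is_file():
            # undecodable bytes are replaced rather than raising an error
            streams.append(fpath.read_text(errors='replace'))
        else:
            streams.append('')
    return tuple(streams)
```

With the variables from the cells above, a call such as read_task_streams(Path(pilot_sandbox) / tid, tid) would return the untruncated output of the failed task.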

A very common cause of task failures is an invalid environment setup: scientific applications frequently require software modules to be loaded, virtual environments to be activated, etc. Those actions are specified in the task description’s pre_exec statements. You may want to inspect <task.uid>.exec.sh in the task sandbox to check whether the environment setup is indeed what you expect.
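
One way to make that check repeatable is a small script which looks for the setup commands you expect. This is a sketch under a stated assumption: it presumes that pre_exec commands appear verbatim on a line of <task.uid>.exec.sh, which may not hold for every RP version. The helper name missing_setup_commands and the example command are hypothetical.

```python
from pathlib import Path


def missing_setup_commands(exec_script, expected_commands):
    """Return the expected commands which are not found in the script.

    Assumption: each pre_exec command shows up verbatim within a single
    line of the exec script.  Leading and trailing whitespace is ignored.
    """
    text = Path(exec_script).read_text(errors='replace')
    lines = [line.strip() for line in text.splitlines()]
    return [cmd for cmd in expected_commands
            if not any(cmd.strip() in line for line in lines)]
```

For example, missing_setup_commands(Path(pilot_sandbox) / tid / (tid + '.exec.sh'), ['module load gcc']) would return an empty list if the (hypothetical) module load line made it into the script, and the list of absent commands otherwise.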

Investigating RADICAL-Pilot Failures

If the investigation of the task sandbox did not yield any clues as to the origin of the failure, but your task still ends up in FAILED state or RP itself fails in any other way, we suggest the following sequence of commands, in that order, to investigate the problem further.

First, check the client side session sandbox for any ERROR log messages or error messages in general:

[4]:

%cd $client_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err

[Errno 2] No such file or directory: '/home/docs/checkouts/readthedocs.org/user_builds/radicalpilot/checkouts/stable/docs/source/tutorials/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002'
/home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002/pilot.0000/task.000001
grep: *log: No such file or directory
-rw-r--r-- 1 docs docs 147 Apr 18 05:29 task.000001.err
-rw-r--r-- 1 docs docs   0 Apr 18 05:29 task.000001.launch.out
-rw-r--r-- 1 docs docs   0 Apr 18 05:29 task.000001.out

You would expect no ERROR lines to show up in the log files, and all stdout/stderr files of the RP components to be empty.

The next step is to repeat that process in the pilot sandbox:

[5]:

%cd $pilot_sandbox
! grep 'ERROR' *log
! ls -l *.out *.err

/home/docs/radical.pilot.sandbox/rp.session.97204d0a-fd44-11ee-9eb9-0242ac110002/pilot.0000
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_0.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_0.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_collecting_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_collecting_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_executing.0000.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_executing.0000.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_executing_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_executing_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_schedule_pubsub.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_schedule_pubsub.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_scheduling.0000.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_scheduling.0000.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_scheduling_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_scheduling_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_input.0000.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_input.0000.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_input_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_input_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_output.0000.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_output.0000.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_output_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_staging_output_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_unschedule_pubsub.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 agent_unschedule_pubsub.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 bootstrap_0.err
-rw-r--r-- 1 docs docs 14302 Apr 18 05:29 bootstrap_0.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 control_pubsub.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 control_pubsub.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 raptor_scheduling_queue.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 raptor_scheduling_queue.out
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 state_pubsub.err
-rw-r--r-- 1 docs docs     0 Apr 18 05:29 state_pubsub.out

Here you will always find bootstrap_0.out to be populated with the output of RP’s shell bootstrapper. If no other errors in the log or stdio files show up, you may want to look at that bootstrap_0.out output to see if and why the pilot bootstrapping failed.
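
The grep and ls checks used for both sandboxes can be combined into one reusable function. The sketch below uses only the Python standard library; the name scan_sandbox is hypothetical, and it assumes that log files match *log and stdio files match *.out and *.err, as in the commands above.

```python
from pathlib import Path


def scan_sandbox(sandbox):
    """Collect ERROR log lines and non-empty stdio files in a sandbox.

    Mirrors `grep 'ERROR' *log` and `ls -l *.out *.err` from the cells
    above.  Returns a tuple (errors, non_empty) where errors holds
    (file name, line number, line) and non_empty holds (file name, size).
    """
    sandbox = Path(sandbox)

    errors = []
    for log in sorted(sandbox.glob('*log')):
        if not log.is_file():
            continue
        text = log.read_text(errors='replace')
        for lineno, line in enumerate(text.splitlines(), 1):
            if 'ERROR' in line:
                errors.append((log.name, lineno, line))

    non_empty = []
    for pattern in ('*.out', '*.err'):
        for fpath in sandbox.glob(pattern):
            size = fpath.stat().st_size if fpath.is_file() else 0
            if size > 0:
                non_empty.append((fpath.name, size))

    return errors, sorted(non_empty)
```

Calling scan_sandbox(pilot_sandbox) on a healthy pilot should yield no error lines and report only bootstrap_0.out as non-empty; anything else is a good starting point for further investigation.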

Ask for Help from the RADICAL Team

If none of the above steps provided any insight into the causes of the observed failures, please take the following steps:

- create a tarball of the client sandbox
- create a tarball of the session sandbox
- open an issue at https://github.com/radical-cybertools/radical.pilot/issues/new and attach both tarballs
- describe the observed problem and include the following additional information:
  - output of the radical-stack command
  - information about any changes to the resource configuration of the target resource
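
The tarballs can be created with the tar command or, as sketched below, with Python’s tarfile module. The helper name make_tarball and the archive names are hypothetical choices for illustration; the sandboxes need distinct archive names because both directories may be named after the session ID.

```python
import tarfile
from pathlib import Path


def make_tarball(sandbox, name, out_dir='.'):
    """Pack a sandbox directory into <out_dir>/<name>.tar.gz.

    The archive contains the sandbox directory itself as top-level entry.
    out_dir should not be located inside the sandbox, otherwise the
    archive would attempt to include itself.
    """
    sandbox = Path(sandbox).resolve()
    target = Path(out_dir) / (name + '.tar.gz')
    with tarfile.open(str(target), 'w:gz') as tar:
        # add the directory recursively under its own name
        tar.add(str(sandbox), arcname=sandbox.name)
    return target
```

A call such as make_tarball(client_sandbox, 'client_sandbox') would then produce client_sandbox.tar.gz in the current directory, ready to be attached to the issue.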

We will likely be able to infer the causes of the problem from the provided sandbox tarballs and will be happy to help you correct them, or we will ask for further information about the environment your application is running in.