{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Debugging\n", "\n", "RADICAL-Pilot is a complex runtime system which employs multiple distributed components to orchestrate workload execution. It is also a **_research software_**, funded by research grants. As such, it is possibly not comparable to commercially supported software systems.\n", "\n", "Also, RADICAL-Pilot targets mostly academic HPC environments and high-end machines which are usually at the cutting edge of hard and software development. Those machines thus usually have their own custom and sometimes peculiar and evolving system environment.\n", "\n", "All that is to say that it might be necessary to investigate various possible failure modes, both failures related to the execution of your workload tasks, and also possibly failures related to RADICAL-Pilot's own operation.\n", "\n", "This notebook attempts to guide you through different means to investigate possible failure modes. That is not necessarily an intuitive process, but hopefully serves to cover the most common problems. We want to encourage you to seek support from the RADICAL development community via TODO if the presented means proof insufficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting the stage\n", "\n", "We run a simple RP application which triggers specific failures on purpose - the resulting session will be used as demonstrator for the remainder of this notebook. We submit one task which succeeds (`task.000000`) and one with an invalid executable which fails (`task.000001`). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env RADICAL_LOG_LVL=DEBUG\n", "\n", "%env RADICAL_REPORT=TRUE\n", "%env RADICAL_REPORT_ANIME=FALSE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import radical.pilot as rp\n", "import radical.utils as ru\n", "\n", "client_sandbox = None\n", "pilot_sandbox = None\n", "\n", "with rp.Session() as session:\n", "\n", " pmgr = rp.PilotManager(session=session)\n", " tmgr = rp.TaskManager(session=session)\n", " \n", " pilot = pmgr.submit_pilots(\n", " rp.PilotDescription({'resource' : 'local.localhost', \n", " 'cores' : 4,\n", " 'runtime' : 10,\n", " 'exit_on_error': False}))\n", " tmgr.add_pilots(pilot)\n", "\n", " td_1 = rp.TaskDescription({'executable': 'date'})\n", " td_2 = rp.TaskDescription({'executable': 'data'}) # <- unknown executable\n", " tasks = tmgr.submit_tasks([td_1, td_2])\n", " tmgr.wait_tasks()\n", "\n", " client_sandbox = ru.Url(pilot.client_sandbox).path + '/' + session.uid\n", " pilot_sandbox = ru.Url(pilot.pilot_sandbox).path\n", "\n", "for task in tasks:\n", " print('%s: %s' % (task.uid, task.state))\n", "print('client sandbox: %s' % client_sandbox)\n", "print('pilot sandbox: %s' % pilot_sandbox)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigating Task Failures\n", "\n", "You created a task description, submitted your task, and they end up in `FAILED` state. On the API level, you can inspec the tasks `stdout` and `stderr` values as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import radical.pilot as rp\n", "\n", "for task in tasks:\n", " if task.state == rp.FAILED:\n", " print('%s stderr: %s' % (task.uid, task.stderr))\n", " elif task.state == rp.DONE:\n", " print('%s stdout: %s' % (task.uid, task.stdout))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note though that the available length of both values is shortened to 1024 characters. If that is inefficient you can still inspect the complete values on the file system of the target resource. For that you would navigate to the task sandbox (whose value can be inspected via `task.sandbox`). \n", "\n", "That sandbox usually has a set of files similar to the example shown below. The `.out` and `.err` files will have captured the task's stdout and stderr streams, respectively:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tid = tasks[1].uid\n", "\n", "%cd $pilot_sandbox/$tid\n", "\n", "!ls -l\n", "!cat \"$tid\".err\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very common problem for task failures is an invalid environment setup: scientific applications frequently requires software modules to be loaded, virtual environments to be activated, etc. Those actions are specified in the task description's `pre_exec` statements. You may want to investigate `.exec.sh` in the task sandbox to check if the environment setup is indeed as you expect it to be." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigate RADICAL-Pilot Failures\n", "\n", "If the investigation of the task sandbox did not yield any clues as to the origin of the failure, but your task still ends up in `FAILED` state or RP itself fails in any other way, we suggest the following sequence of commands, in that order, to investigate the problem further.\n", "\n", "First, check the client side session sandbox for any ERROR log messages or error messages in general:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "%cd $client_sandbox\n", "! grep 'ERROR' *log\n", "! ls -l *.out *.err\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You would expect no `ERROR` lines to show up in the log files, and all stdout/stderr files of the RP components to be empty.\n", "\n", "The next step is to repeat that process in the pilot sandbox:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "%cd $pilot_sandbox\n", "! grep 'ERROR' *log\n", "! ls -l *.out *.err \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Here you will always find `bootstrap_0.out` to be populated with the output of RP's shell bootstrapper. If no other errors in the log or stdio files show up, you may want to look at that `bootstrap_0.out` output to see if and why the pilot bootstrapping failed.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ask for Help from the RADICAL Team\n", "\n", "If neither of the above steps provided any insight into the causes of the observed failures, please execute the following steps:\n", "\n", " - create a tarball of the client sandbox\n", " - create a tarball of the session sandbox\n", " - open an issue at https://github.com/radical-cybertools/radical.pilot/issues/new and attach both tarballs\n", " - describe the observed problem and include the following additional information:\n", " - output of the `radical-stack` command\n", " - information of any change to the configuration file of the target platform\n", " \n", "We will likely be able to infer the problem causes from the provided sandbox tarballs and will be happy to help you in correcting those, or we will ask for forther information about the environment your application is running in." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }