[Dwarf-Discuss] How best to represent multiple programs in a single file...

John DelSignore John.DelSignore@roguewave.com
Tue Jan 4 18:29:38 GMT 2011


Hi Richard,

I don't know much about OpenCL on GPUs, but I do know a lot about debugging CUDA on GPUs and debugging the Cell Broadband Engine (PPC64/SPU).

IMHO, when debugging a GPU (or the Cell), a very important issue that must be dealt with is separating the address space of the host program (x86*/PPC) from the address space of the accelerator (GPU/SPU). Once those two address spaces are kept separate, the debugger's modeling of the process, address spaces, and threads must change to match. So roughly speaking (a code sketch follows the list below):

* The process becomes a "bag" of the following stuff:
** A host address space.
** A collection of host threads that share the host address space.
** A collection of threads with discrete address spaces representing the GPU/SPU address space and its associated execution context.
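
To make that concrete, here is a rough sketch of what such a process "bag" might look like as plain C data structures. All of the type and field names here are made up for illustration; they are not TotalView's actual internals:

#include <stdint.h>

/* One loadable image (executable, shared library, or accelerator ELF)
   mapped into some address space. */
typedef struct image {
    const char   *path;       /* or an in-memory buffer for embedded images */
    uint64_t      load_bias;  /* load address minus link address */
    struct image *next;
} image_t;

/* An address space (host or accelerator), each with its own image list. */
typedef struct addr_space {
    image_t *images;
} addr_space_t;

/* A thread carries its own architecture, so one process can mix host
   (e.g., x86_64) and accelerator (e.g., GPU/SPU) threads. */
typedef struct thread {
    int            arch;    /* e.g., ARCH_X86_64, ARCH_GPU, ARCH_SPU */
    addr_space_t  *aspace;  /* host threads share one space; each
                               accelerator context gets its own */
    struct thread *next;
} thread_t;

/* The process is a "bag" of address spaces and threads. */
typedef struct process {
    addr_space_t host_space; /* shared by all host threads */
    thread_t    *threads;    /* host and accelerator threads, mixed */
} process_t;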

The TotalView CUDA debugging model is explained in more detail in the following document: http://www.totalviewtech.com/support/documentation/pdf/CUDATotalViewDebuggerUsersGuideSupplement_V4.pdf

Note that the "processor architecture" is modeled at the thread level, so a process can contain threads with different architectures. For example, in a Linux-x86_64 CUDA program, the host threads are linux-x86_64 and the GPU threads are SAS. And in a Linux-Cell program, the host threads are linux-ppc64 and the SPU threads are SPU.

Anyway... The interesting part for this discussion is on page 5. Associated with each address space (host and GPU) are one or more ELF images. The Linux address space has associated with it the Linux-x86_64 executable and shared libraries. Each CUDA address space has its own list of ELF images. There is no attempt to smash together host and GPU information into a single ELF image; I think attempting to do so is a fundamental mistake, and the resulting mess would be nearly impossible to untangle. You may already be doing this, but it seems to me that at a minimum you should separate the GPU ELF image from the host ELF image, even if the GPU ELF image is embedded (as data) in the host ELF image.

In both the CUDA and Cell worlds, when an ELF image is loaded onto the accelerator, the debugger receives an event, much like a "dlopen()" event. For Cell, the SPU ELF image may be part of the host executable file or stored separately. For CUDA, the GPU ELF image is calculated by the device driver. In both cases, the SPU/GPU ELF image is a self-contained ELF file containing debug information purely for the accelerator; it does not contain any host debug information.
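
As an aside, if the accelerator ELF image is embedded (as data) in the host ELF file, pulling it back out is straightforward: the embedded image is just the raw bytes of some container section, and those bytes are themselves a complete ELF file. Here is a minimal sketch using only <elf.h>; the section name ".gpu_elf" is a placeholder I made up, since the real container section name is toolchain-specific:

#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Find a named section in a 64-bit host ELF image already read into
   memory, and return its contents: the embedded accelerator ELF.
   Bounds checking is omitted for brevity. */
static const void *find_embedded_elf(const uint8_t *file, size_t len,
                                     const char *secname, size_t *out_len)
{
    if (len < sizeof(Elf64_Ehdr) || memcmp(file, ELFMAG, SELFMAG) != 0)
        return NULL;
    const Elf64_Ehdr *eh = (const Elf64_Ehdr *)file;
    const Elf64_Shdr *sh = (const Elf64_Shdr *)(file + eh->e_shoff);
    const char *strtab = (const char *)(file + sh[eh->e_shstrndx].sh_offset);

    for (int i = 0; i < eh->e_shnum; i++) {
        if (strcmp(strtab + sh[i].sh_name, secname) == 0) {
            const uint8_t *blob = file + sh[i].sh_offset;
            /* Sanity check: the payload should itself be an ELF file. */
            if (sh[i].sh_size >= SELFMAG &&
                memcmp(blob, ELFMAG, SELFMAG) == 0) {
                *out_len = sh[i].sh_size;
                return blob;
            }
        }
    }
    return NULL;
}

/* Usage: find_embedded_elf(buf, buflen, ".gpu_elf", &n) hands back a
   self-contained accelerator ELF that the debugger can treat exactly
   like a file loaded via a dlopen()-style event. */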

Some of what follows is easily observable in any CUDA program, but given that you're "@amd.com" I assume that at some level you're probably competing with NVIDIA. So you should stop reading now if you don't want to be contaminated with knowledge of how multiple programs are represented in CUDA.

More comments in-line below...

Relph, Richard wrote:
> Michael,
>     Thanks for the reply. Below, answers to your questions...
> 
>> A couple questions:
>>
>> How is this ELF executable file organized?
>>
>> How are these "multiple programs" represented in the executable file?
> Each kernel's "text" is concatenated and placed in the same section
> and entries are made in the ELF symbol table to note the offset into
> that section for each kernel. But only ONE kernel is ever loaded at a
> time and the loaded kernel's text is always located starting at 0. So
> if we have N kernels, we have N copies of address 0, N copies of address 1, and so on.

OK, so this is a little different than the CUDA case. In CUDA, the GPU ELF image can contain multiple kernels. Each kernel is stored in its own .text_<kernel_name> section and dynamically relocated when the image is loaded onto the GPU. All of the GPU .text_<kernel_name> sections are linked at 0 and loaded at (a seemingly random) non-0 address. The debugger has a way to figure out which kernel is going to be executed.
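
Once the per-kernel sections are linked at 0 and loaded at non-0 addresses, mapping between link and load addresses is just a per-section bias. A sketch (the table type is made up; the entries would be filled in from the debug API, or by comparing images as described below):

#include <stdint.h>
#include <stddef.h>

/* One entry per loaded .text_<kernel_name> section. */
typedef struct {
    uint64_t link_addr;  /* address the section was linked at (often 0) */
    uint64_t load_addr;  /* address the driver actually loaded it at    */
    uint64_t size;
} sect_map_t;

/* Map a load (runtime) address back to its link address, e.g., for a
   DWARF line-table lookup against the non-relocated image. */
static int load_to_link(const sect_map_t *map, size_t n,
                        uint64_t pc, uint64_t *link_pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= map[i].load_addr && pc < map[i].load_addr + map[i].size) {
            *link_pc = pc - map[i].load_addr + map[i].link_addr;
            return 1;
        }
    }
    return 0;  /* pc is not in any known kernel section */
}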

When we get a "GPU ELF image loaded" event, the CUDA debug API gives us a copy of both the non-relocated GPU ELF image (from the linker) and the relocated GPU ELF image (from the device driver). Since TotalView is a parallel debugger and wants to share the GPU ELF image across process boundaries, we fix up the non-relocated GPU ELF image such that the GPU .text_<kernel_name> sections have unique (non-0) link addresses. TotalView then applies the .rel* section relocations to correct the link addresses in the DWARF sections. This gives us a fixed-up non-relocated GPU ELF image that can be shared across processes. TotalView then uses the relocated GPU ELF image provided by the debug API to figure out the image load relocations performed by the device driver for that address space instance, which allows us to map a link address to/from a load address.
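
For what it's worth, applying the relocations to a DWARF section looks roughly like the sketch below. I've shown only 64-bit absolute relocations; real code has to dispatch on the target-specific relocation types, and R_ABS64 here just stands in for whatever the ABI defines (e.g., R_X86_64_64 on x86-64):

#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define R_ABS64 1  /* placeholder for the target's 64-bit absolute type */

/* Apply Elf64_Rela entries to the raw bytes of a debug section
   (e.g., .debug_info) after the kernel sections have been given
   unique link addresses. Symbol values in 'symtab' must already
   reflect the fixed-up link addresses. */
static void apply_relas(uint8_t *sect, size_t sect_len,
                        const Elf64_Rela *rela, size_t nrela,
                        const Elf64_Sym *symtab)
{
    for (size_t i = 0; i < nrela; i++) {
        if (ELF64_R_TYPE(rela[i].r_info) != R_ABS64)
            continue;  /* a real implementation handles every type */
        if (rela[i].r_offset + 8 > sect_len)
            continue;  /* malformed entry */
        const Elf64_Sym *sym = &symtab[ELF64_R_SYM(rela[i].r_info)];
        uint64_t value = sym->st_value + (uint64_t)rela[i].r_addend;  /* S + A */
        memcpy(sect + rela[i].r_offset, &value, sizeof value);
    }
}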

>> How are these "separately loaded kernels" selected and loaded?
> At execution time, the host program will access the entire ELF file
> and then load each kernel as needed... sort of like a DLL being
> identified and then entries in the DLL being invoked as needed...
> except that in our case, "relocation" or loading occurs as each
> kernel is invoked, not when the ELF is identified, which (I think) is
> the crux of the problem.

This shouldn't be a problem as long as there is no overlap in the kernel link addresses, or you have a way to fix up the original non-relocated GPU ELF image. You said that the kernels are appended together, but is that a "dumb" append, or are they linked together such that each kernel gets its own link address? If each kernel gets its own link address, the kernel boundaries are clear (e.g., each kernel is in its own section), and the relocation information is preserved, then the debugger can tell which kernel is loaded and should be able to sort it out.
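
A quick way for a tool to tell a "dumb" append from a proper link is to check whether the executable sections' link-address ranges overlap; if every kernel is linked at 0, they all will. A sketch:

#include <elf.h>
#include <stdint.h>

/* Returns 1 if any two SHF_EXECINSTR sections have overlapping link
   address ranges (e.g., every kernel linked at 0), which the debugger
   cannot disambiguate without extra help; 0 otherwise. */
static int text_sections_overlap(const Elf64_Shdr *sh, int shnum)
{
    for (int i = 0; i < shnum; i++) {
        if (!(sh[i].sh_flags & SHF_EXECINSTR) || sh[i].sh_size == 0)
            continue;
        for (int j = i + 1; j < shnum; j++) {
            if (!(sh[j].sh_flags & SHF_EXECINSTR) || sh[j].sh_size == 0)
                continue;
            if (sh[i].sh_addr < sh[j].sh_addr + sh[j].sh_size &&
                sh[j].sh_addr < sh[i].sh_addr + sh[i].sh_size)
                return 1;
        }
    }
    return 0;
}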

>> DWARF doesn't depend on any particular program structure.  There is
>> no requirement to have a main() function or a single entry point.
> Understood. But is there an implied requirement that a given
> instruction location map to no more than one 'source line'?

I'm not sure what DWARF says about this, but it sure doesn't make any sense to me to have a given instruction location map to more than one source line, block, or subroutine.

> That is,
> after the (fixed length) bootstrap code, we have at location x
> different functions depending on which kernel has been invoked. I
> don't see how to represent this one-to-many mapping in a single
> instance of, say, the line number program.

I don't think you can. Like I said above, for CUDA programs TotalView fixes up the GPU ELF images to separate out the kernels, and applies the .rel* section relocations to fix up the DWARF.

>> It would seem that the DWARF data for your program, with multiple
>> "kernels", should describe them  correctly.  Of course, when the ELF
>> file is modified to strip out code, you need to make the corresponding
>> changes to the DWARF data as well.
> Understood. Given an original source program, its "full"
> representation as code, and the corresponding DWARF, I can produce
> for each individual kernel its 'stripped' code and DWARF for that.
> And I know how to concatenate the stripped code for each kernel into
> the form that our system expects so that each kernel can be loaded
> and executed separately. But I don't know how to store the DWARF for
> each kernel in a single ELF file so that the standard tools can still
> be useful.

I suppose your choices are:
1) have the device driver provide the relocated executable to the tool, or
2) preserve the .rel* relocation sections and require the tool to apply the relocations.

AFAIK, CUDA-GDB (provided by NVIDIA) uses technique #1. TotalView uses technique #2.
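
Either way, the tool ends up needing a link-to-load mapping. One way to build the per-section bias table from the earlier sketch is to pair up the sections of the non-relocated and relocated copies of the same image, assuming (as holds in the CUDA case I described) that both copies keep the same section count and order:

#include <elf.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {            /* same shape as the earlier mapping sketch */
    uint64_t link_addr;
    uint64_t load_addr;
    uint64_t size;
} sect_map_t;

/* Pair up the sections of the non-relocated (linker output) and
   relocated (driver-provided) copies of the same image to recover the
   per-section load bias performed by the device driver. */
static size_t build_bias_table(const Elf64_Shdr *linked,
                               const Elf64_Shdr *loaded,
                               int shnum, sect_map_t *out)
{
    size_t n = 0;
    for (int i = 0; i < shnum; i++) {
        if (!(linked[i].sh_flags & SHF_ALLOC) || linked[i].sh_size == 0)
            continue;
        out[n].link_addr = linked[i].sh_addr;
        out[n].load_addr = loaded[i].sh_addr;
        out[n].size      = linked[i].sh_size;
        n++;
    }
    return n;
}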

Hope this helps...

Cheers, John D.


> Thanks,
> Richard



