[Dwarf-Discuss] How best to represent multiple programs in a single file...

Tue Jan 4 19:19:15 GMT 2011

John,
    Thanks for the reply. Responses in-line...

> I don't know much about OpenCL on GPUs, but I do know a lot about
> debugging CUDA on GPUs and debugging the Cell Broadband Engine
> (PPC64/SPU).
That's helpful, I'm sure.

> IMHO, when debugging a GPU (or the Cell), a very important issue that
> must be dealt with is separating the address space of the host program
> (x86*/PPC) from the accelerator (GPU/SPU).
True enough. We're using DW_AT_address_class for this. In our implementation of OpenCL, there are host, constant, global, region, local, and private memory spaces, all distinct. We don't concern ourselves with the host address space, since the kernels we are describing and debugging don't have access to it. The kernel is executed strictly on the GPU, under the control of the CPU, which has the means to allocate, map to host memory, read, and write SOME of the address spaces.
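
To make the address-class idea concrete, here is a rough sketch of tagging every offset with its memory space. The numeric values and the qualified_address helper are made up for illustration; DWARF leaves the permissible DW_AT_address_class values to each target architecture.

```python
from enum import IntEnum


class AddressClass(IntEnum):
    # Illustrative numbering only: DWARF leaves the meaning of
    # DW_AT_address_class values to the target architecture.
    CONSTANT = 1
    GLOBAL = 2
    REGION = 3
    LOCAL = 4
    PRIVATE = 5


def qualified_address(address_class, offset):
    """Pair an offset with its address class (hypothetical helper).

    Equal offsets in different memory spaces then never compare equal.
    """
    return (AddressClass(address_class), offset)
```

With this, offset 0x40 in local memory and offset 0x40 in global memory are distinct keys, which is all the address class is meant to guarantee here.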

> So roughly speaking:
> 
> * The process becomes a "bag" of the following stuff:
> ** A host address space.
> ** A collection of host threads that share the host address space.
> ** A collection of threads with discrete address spaces representing
> the GPU/SPU address space and its associated execution context.
Our ELF/DWARF is solely concerned with the last of these... and these are created dynamically by the host at run-time.

> Anyway... The interesting part for this discussion is on page 5.
> Associated with each address space (host and GPU) is one or more ELF
> images. The Linux address space has associated with it the Linux-x86_64
> executable and shared libraries. The CUDA address spaces have
> their own list of ELF images. There is no attempt to smash together
> host and GPU information into a single ELF image; I think attempting to
> do so is a fundamental mistake, and it will be nearly impossible to
> untangle the mess. You may already be doing this, but it seems to me
> that at a minimum you should be separating the GPU ELF image from the
> host ELF image, even if the GPU ELF image is embedded (as data) in the
> host ELF image.
We don't try to commingle host and GPU code in a single image. We are solely concerned with the representation of the code and data that may be run on the GPU. Debugging host code is a separate problem, but one that developers already understand well on their preferred development platform. We don't attempt to replace Visual Studio or gdb for host code debugging.

Perhaps backing up a bit will help here... OpenCL defines a set of APIs for creating, compiling, and executing kernels at run-time. Most OpenCL-based programs exist solely as a host program while on disk and not executing. It's only when the host code is executed that the OpenCL kernels are compiled (usually; there is an 'off-line' compilation capability, but it is not used much). Yes, we have a full OpenCL compiler in the APIs that implement the OpenCL run-time. Compilation of an OpenCL program consisting of one or more kernels is one API call. Buffers and values are then bound to kernel arguments, and a specific kernel is 'enqueued' for execution on the GPU (which usually involves hundreds to millions of threads being dispatched on the GPU). The ELF/DWARF we're referring to here is transient... it only lives as long as the host program wants it to. True, it CAN be saved, but seldom is.

> >> How are these "multiple programs" represented in the executable
> >> file?
> > Each kernel's "text" is concatenated and placed in the same section
> > and entries are made in the ELF symbol table to note the offset into
> > that section for each kernel. But only ONE kernel is ever loaded at a
> > time and the loaded kernel's text is always located starting at 0. So
> > if we have N kernels, we have N address 0, N address 1, and so on.
> 
> OK, so this is a little different than the CUDA case. In CUDA, the GPU
> ELF image can contain multiple kernels. Each kernel is stored in its
> own .text_<kernel_name> section and dynamically relocated when the
> image is loaded onto the GPU. All of the GPU .text_<kernel_name>
> sections are linked at 0 and loaded at (a seemingly random) non-0
> address. The debugger has a way to figure out which kernel is going to
> be executed.
This sounds like it more closely matches the DLL model, where relocation of a set of kernels occurs when the ELF image is opened.

> >> How are these "separately loaded kernels" selected and loaded?
> > At execution time, the host program will access the entire ELF file
> > and then load each kernel as needed... sort of like a DLL being
> > identified and then entries in the DLL being invoked as needed...
> > except that in our case, "relocation" or loading occurs as each
> > kernel is invoked, not when the ELF is identified, which (I think) is
> > the crux of the problem.
> 
> This shouldn't be a problem as long as there is no overlap in the
> kernel link addresses or you have a way to fix up the original non-
> relocated GPU ELF image.
That's the problem. There IS overlap in the kernel link addresses. ALL kernels start at 0. Only 1 kernel is loaded at a time.
(As an aside, there's no "relocation" necessary. Each kernel 'image' is ready for loading at 0.)

> You said that the kernels are appended
> together, but is that a "dumb" append or are they linked together such
> that each kernel gets its own link address?
It's a dumb append. The first kernel is at offset 0, then the second kernel follows immediately after the first, and so on.
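
A minimal sketch of that layout, with made-up kernel names and sizes: the symbol table records each kernel's offset into the shared text section, while every kernel still sees its own code starting at load address 0.

```python
# Hypothetical kernel names and text sizes in bytes, in append order.
KERNELS = [("scale", 0x120), ("reduce", 0x300), ("blur", 0x1C0)]


def build_symbol_table(kernels):
    """Return {kernel name: (section offset, size)} for a plain append."""
    table = {}
    offset = 0
    for name, size in kernels:
        table[name] = (offset, size)
        offset += size  # the next kernel follows immediately
    return table


def section_offset(table, kernel, pc):
    """Translate a load address (kernel text starts at 0) to a section offset."""
    start, size = table[kernel]
    if not 0 <= pc < size:
        raise ValueError(f"pc {pc:#x} is outside kernel {kernel!r}")
    return start + pc


SYMTAB = build_symbol_table(KERNELS)
```

Here load address 0 corresponds to section offset 0 for "scale", 0x120 for "reduce", and 0x420 for "blur", so a bare address means nothing until you also know which kernel is loaded.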

> If each kernel gets its own
> link address, the debugger can tell which kernel is loaded, the kernel
> boundaries are clear (e.g., in their own section), and the relocation
> information is preserved, then the debugger should be able to sort it
> out.
The debugger knows what kernel is 'loaded' because the debugger is snooping the OpenCL APIs and 'knows' which kernel the host program requested.
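
Roughly, the bookkeeping on the debugger side looks like the sketch below. The class, the hook name, and the line-table contents are all hypothetical; the only assumption is that the debugger sees the host's enqueue request (for example, a call such as clEnqueueNDRangeKernel) before the kernel runs.

```python
class KernelTracker:
    """Remembers which kernel the host most recently enqueued."""

    def __init__(self, line_tables):
        # line_tables: {kernel name: {pc: (source file, line number)}}
        self.line_tables = line_tables
        self.current = None

    def on_enqueue(self, kernel_name):
        # Called from a (hypothetical) hook on the host's enqueue call.
        if kernel_name not in self.line_tables:
            raise KeyError(f"no debug info for kernel {kernel_name!r}")
        self.current = kernel_name

    def source_line(self, pc):
        # Resolve pc against the active kernel's table only.
        if self.current is None:
            raise RuntimeError("no kernel has been enqueued yet")
        return self.line_tables[self.current].get(pc)
```

The same pc resolves to a different source line depending on which kernel was last enqueued, which only works because the lookup goes through a per-kernel table.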

> >> DWARF doesn't depend on any particular program structure.  There is
> >> no requirement to have a main() function or a single entry point.
> > Understood. But is there an implied requirement that a given
> > instruction location map to no more than one 'source line'?
> 
> I'm not sure what DWARF says about this, but it sure doesn't make any
> sense to me to have a given instruction location map to more than one
> source line, block, or subroutine.
Agreed. Hence my sense that I need to split the DWARF information for each kernel somehow, so that the DWARF for one kernel doesn't get confused with the DWARF for other kernels. I was thinking of simply appending the kernel name to each of the standard DWARF section names to create a set of kernel-specific DWARF sections, but that would obviously 'break' tools like libdwarf, dwarfdump, etc. So I came to the list to see if there was something I'm missing...
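
To illustrate the naming idea and why it trips up existing tools, here is a small sketch. The "." separator, the helper names, and the short list of sections are assumptions for illustration; the point is only that a reader matching on the exact standard names finds nothing.

```python
# A few of the standard DWARF section names (not an exhaustive list).
STANDARD_DEBUG_SECTIONS = (
    ".debug_info",
    ".debug_abbrev",
    ".debug_line",
    ".debug_str",
)


def kernel_section_names(kernel, sep="."):
    """Map each standard name to a kernel-specific one (hypothetical scheme)."""
    return {base: f"{base}{sep}{kernel}" for base in STANDARD_DEBUG_SECTIONS}


def visible_to_standard_tools(section_names):
    """Sections that a reader matching on exact standard names would find."""
    return [name for name in section_names if name in STANDARD_DEBUG_SECTIONS]
```

For a kernel named "blur" this yields .debug_line.blur and so on, and visible_to_standard_tools() on those names returns an empty list.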

> > That is,
> > after the (fixed length) bootstrap code, we have at location x
> > different functions depending on which kernel has been invoked. I
> > don't see how to represent this one-to-many mapping in a single
> > instance of, say, the line number program.
> 
> I don't think you can.
Sadly, I don't think I can either... But I was hoping. ;-)
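
For the record, a toy model of why a single table can't work. The kernel names, addresses, and line numbers are invented, and a plain dict stands in for the real line number program; merging per-kernel tables keyed only by address immediately produces conflicts.

```python
def merge_line_tables(per_kernel):
    """Merge {kernel: {pc: line}} into one address-keyed table.

    Returns (merged, conflicts), where conflicts lists every address
    claimed by more than one kernel with a different mapping.
    """
    merged = {}
    conflicts = {}
    for kernel, table in per_kernel.items():
        for pc, line in table.items():
            if pc in merged and merged[pc] != (kernel, line):
                # First claimant seeds the list, later ones are appended.
                conflicts.setdefault(pc, [merged[pc]]).append((kernel, line))
            else:
                merged[pc] = (kernel, line)
    return merged, conflicts
```

Any address past the shared bootstrap code that two kernels both use lands in conflicts, so the merged table alone cannot answer "which source line is at x?".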

Thanks,
Richard