[Dwarf-Discuss] Multiple address space architectures and DW_AT_frame_base

John DelSignore John.DelSignore@roguewave.com
Mon May 23 15:28:20 GMT 2011


Relph, Richard wrote:
> John,
> Thanks for the reply. Your implication engine is working well. ;-)
> OpenCL source, AMD GPU target. In between, though, we have an
> intermediate language called AMD IL (published specs available
> on-line).

Probably similar to NVidia's PTX (Parallel Thread Execution pseudo-assembly language), I assume.

> Rather than answer your many excellent questions, I'll provide a bit
> more background on my specific problem so you can see what I'm trying
> to deal with.
> AMD IL describes the private memory space in terms of "indexed
> temporaries". This is just a syntactic device to defer actual
> locating of thread-local storage to the AMD IL compiler (yes, there's
> a compiler between AMD IL and the GPU-specific hardware instruction
> set. It knows nothing of DWARF or debugging.)

So that sounds like a fundamental difference between what you're doing and what happens with CUDA. In CUDA, the GPU ELF image is not at the PTX level; it's at the actual physical device level. That is, the DWARF the debugger sees reflects the properties of the actual hardware device, not PTX. That said, the compiler does allow some PTX-isms to peek through in the DWARF in the area of PTX virtual registers, which, depending on PC, may bind to a hardware register, to the local memory space, or be "dead".
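That PC-dependent binding is essentially what a DWARF location list expresses. Here is a minimal illustrative model in Python; the PC ranges, register names, and bindings are all made up for the example and do not reflect any actual CUDA or DWARF tooling:

```python
# Illustrative model of a location list for a virtual register that,
# depending on PC, binds to a hardware register, is spilled to local
# memory, or is dead. Ranges and bindings are invented for this sketch.

LOCATION_LIST = [
    (0x000, 0x040, ("reg", "R4")),        # bound to hardware register R4
    (0x040, 0x080, ("local_mem", 0x20)),  # spilled to local memory slot
    # gap 0x080-0x0c0: no entry -> variable is dead/unavailable here
    (0x0c0, 0x100, ("reg", "R7")),
]

def locate(pc):
    """Return the variable's location at pc, or None if it is dead."""
    for lo, hi, loc in LOCATION_LIST:
        if lo <= pc < hi:
            return loc
    return None

print(locate(0x50))  # ('local_mem', 32)
print(locate(0x90))  # None -> the variable is "dead" at this PC
```

The point of the gap entry is the important part: "dead" is not a location, it is the absence of one, which is exactly how location lists encode it.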

> Sometimes the OpenCL
> compiler will put variables in indexed temps (e.g., private arrays),
> sometimes in regular AMD IL registers (e.g., private scalars and
> vectors). While the hardware doesn't support the concept of a
> "stack", obviously the LLVM-based compiler assumes one, so we do our
> best using indexed temps and our rich register set.

CUDA didn't support stacks until the 3.1 release, which forced a substantive change in the way the debug information had to be emitted and handled.

> These indexed temps are referenced in AMD IL source as "x#[N]", where
> # and N can have very large ranges (at least 64K). # specifies which
> private memory space. N is the offset in that space. The AMD IL
> compiler (SC, short for shader compiler) will look at the need for
> registers throughout the "shader" (aka kernel) and decide if it can
> allocate general purpose registers to implement the indexed temps, or
> it has to resort to putting them in memory.

Yes, sounds familiar :-)

> The current OpenCL compiler for AMD IL puts indexed temps all in a
> single private memory space... #1. (Most variables end up in general
> purpose registers, not indexed temps.) The debugger, of course is
> oblivious to all this, but it uses an API provided by the debug agent
> to access objects. To access the objects, the debug agent wants to be
> as general as possible and assume as little as possible, since it
> wants to support more than just debugging OpenCL. The API it provides
> allows the debugger to specify both # and N.

OK, I see, I think. In the general case, there can be thousands of private memory spaces, even though your OpenCL compiler uses a much smaller number.

> Right now, since the compiler does put all thread local variables in
> a single private memory space that pretty well mimics a conventional
> architecture's stack, I'm trying to leverage the DW_AT_frame_base
> attribute of DW_TAG_subprogram. The "correct" thing for the current
> compiler to do is to indicate, somehow, that "the stack" is in AMD IL
> private memory space #1.

I don't know much about your hardware or software, but it sounds to me like a lot of the complication here stems from trying to make the DWARF target AMD IL instead of the actual hardware.

> That's my "today" problem. But to allow SC to make better choices
> about which indexed temps to put in registers and which to put in
> memory, the OpenCL compiler would have to split variables out from
> the current monolithic "stack" space in to individual "variable"
> spaces. Even non-pointer variables would have to indicate which
> private memory space they reside in.


> In trying to "do the right thing" where specifying DWARF for AMD IL
> is concerned, I'm trying to allow description of as much of the AMD
> IL language's capabilities as possible so that the compiler making
> the decision can pass this information through the debugger to the
> debug API, where it is really needed. And avoid having to have the
> debugger or the debug agent assume such things.

So it sounds like your debug API also operates at the higher-level AMD IL instead of the hardware level.

> So I think I want a "location" to permit an "op" that specifies the
> memory space for a pending dereference - including the final implicit
> one - without mucking with offsets. Something like DW_AT_memory_class
> but in a location expression. Imagine needing to reference a pointer
> in one space, apply an index from another space, add an offset from
> another, to compute the address of an object in yet another space.

Sure, it happens all the time in CUDA. The solution I chose for TotalView, as I said in my previous email, is to represent memory spaces as type qualifiers. Behind the scenes, TotalView injects location operations that set the address space from which to read. The CUDA compiler chose to tack DW_AT_address_class attributes onto certain DIEs, but that's not what the debugger wanted. In short, TotalView translates the DW_AT_address_class attributes into type qualifiers, and then uses the type qualifiers to modify the location operations during address resolution to set "segment values". The "segment values" are then used by the lowest levels of the debugger when calling into the CUDA debug API.
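The pipeline just described (DW_AT_address_class attribute on a DIE, translated to a type qualifier, then to a segment value handed to the low-level debug API) can be sketched roughly like this. The numeric address-class codes and segment values below are invented for illustration; they are not the real CUDA or TotalView encodings:

```python
# Hypothetical sketch: DW_AT_address_class -> type qualifier -> segment
# value. All numeric codes here are made up for the example.

ADDRESS_CLASS_TO_QUALIFIER = {
    0: "@generic",    # aka global
    1: "@parameter",
    2: "@local",
    3: "@register",
}

QUALIFIER_TO_SEGMENT = {
    "@generic": 0x0,
    "@parameter": 0x1,
    "@local": 0x2,
    "@register": 0x3,
}

def qualify(base_type, address_class):
    """Turn a DIE's DW_AT_address_class into a qualified type string."""
    return f"{ADDRESS_CLASS_TO_QUALIFIER[address_class]} {base_type}"

def segment_value(qualified_type):
    """Follow the leading qualifier to the segment value used when
    calling into the low-level debug API."""
    qualifier = qualified_type.split()[0]
    return QUALIFIER_TO_SEGMENT[qualifier]

print(qualify("Matrix", 1))            # @parameter Matrix
print(segment_value("@local Matrix"))  # 2
```

The benefit of routing everything through the qualified type, rather than consulting the DIE attribute directly at evaluation time, is composability: a cast to a qualified pointer type carries its address space with it, so arbitrary user-constructed expressions resolve correctly.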

> But I'll settle for a solution to just my "today" problem... ;-)

IMHO, you won't be able to find the "right" solution on this mailing list. You'll have to hash this out with your debugger developer, since whatever you generate has to be something the debugger can digest. I would think that whoever is working on the debugger would have a very strong opinion about this... or are you also the debugger developer?

Cheers, John D.

> Richard
>> -----Original Message-----
>> From: John DelSignore [mailto:John.DelSignore at roguewave.com]
>> Sent: Friday, May 20, 2011 7:04 AM
>> To: Relph, Richard
>> Cc: DWARF
>> Subject: Re: [Dwarf-Discuss] Multiple address space architectures and
>> DW_AT_frame_base
>> Hi Richard,
>> I have experience with "imaginary" architectures like this :-). A
>> couple of comments/questions...
>> Relph, Richard wrote:
>>> Imagine an architecture with multiple disjoint memory spaces.
>> It would be helpful if you could be more specific.
>> I have to assume that this is a GPU architecture (your email address gave
>> it away) where at the very least there is a host memory space and a
>> device memory space. The device memory space is further subdivided into
>> a discrete number of memory spaces, such as code, global, local,
>> constant, parameter, shared, texture, etc. Further, it's not a
>> segmented architecture, so there is no segmented addressing (as on the
>> x86), and there is exactly one of each kind of memory space.
>> What is the language? Is it OpenCL, which has address space qualifiers?
>> From the OpenCL quick reference guide:
>> "
>> Address Space Qualifiers [6.5]
>> __global, global __local, local
>> __constant, constant __private, private
>> "
>>> Imagine
>>> that the frame base could be in any of them.
>> Are you saying that the stack itself could be in more than one memory
>> space?  Or is the stack always in, for example, the "local" memory
>> space?
>> Does the architecture even have stacks or is everything inlined?
>>> Other than expanding the
>>> size of an address, how can I do this?
>> Depends on what is actually going on.
>>> I've thought about DW_AT_address_class, but I can't figure out how to
>>> associate it with a DW_TAG_subprogram's DW_AT_frame_base location.
>> The
>>> spec says DW_AT_address_class for DW_TAG_subprogram describes how to
>>> access the subprogram itself.
>> Maybe it's true, but I find it hard to believe that the stacks
>> themselves can be in more than one memory space.
>> I can believe that the stack is in a "local" memory space, function
>> parameters are in a "parameter" memory space, function local variables
>> can be in either local memory or probably more commonly registers, you
>> can declare variables in "shared" memory space, you can have pointers
>> to "global" memory, etc.
>> Is that true or do I have it all wrong?
>>> DW_AT_address_class isn't technically allowed in a
>> DW_TAG_compile_unit
>>> (it will be the same address space across the entire compilation
>> unit).
>>> A location description doesn't have a mechanism to specify an address
>> class. It would be implied by the type information associated with
>>> the object being described (though if the object shifts address
>> spaces,
>>> that seems impossible to describe as well.)
>> If my above assumptions are true, I think the best way to represent
>> this is with address space type qualifiers. For many objects, the
>> address space is implied by the kind of object, but can also be
>> explicitly expressed through type qualification. For example, assume
>> you have the following device function:
>> void f(int *p)
>> {
>>   int *q = p;
>> ...
>> The debugger can know that "p" is a parameter, so the address space of
>> the pointer itself is implicitly "parameter", and the thing the pointer
>> points to might be in the "global" address space. As with "const" and
>> "volatile" and other kinds of type qualifiers, the compiler could
>> introduce address space qualifiers such as "parameter", "global",
>> "local", etc. So, "p" is of type "parameter pointer to global int", and
>> "q" is of type "local pointer to global int".
>> This is exactly how TotalView represents address spaces in CUDA on
>> NVidia hardware.
>> d1.-1> l
>>   87   // Matrix multiplication kernel called by MatrixMul()
>>   88   __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
>>   89   {
>>   90     // Block row and column
>>   91     int blockRow = blockIdx.y;
>>   92@>   int blockCol = blockIdx.x;
>>   93     // Each thread block computes one sub-matrix Csub of C
>>   94     Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
>>   95     // Each thread computes one element of Csub
>>   96     // by accumulating results into Cvalue
>> d1.-1> dwhat A
>> Name: A; Type: @parameter Matrix; Size: 24 bytes; Addr: 0x00000020
>> d1.-1> dwhat Csub
>> Name: Csub; Type: @local Matrix; Size: 24 bytes; Addr: 0x00fffb20
>> d1.-1> dwhat Matrix
>> Type name: struct Matrix; Size: 24 bytes; Category: Structure
>>     Fields in type:
>>     {
>>     width        int          (32 bits)
>>     height       int          (32 bits)
>>     stride       int          (32 bits)
>>     elements     float @generic * (64 bits)
>>     }
>> d1.-1> dwhat blockRow
>> Name: blockRow; Type: @register int; Size: 4 bytes; Addr: R0
>> d1.-1>
>> As you might have observed above, the reason it is important to express
>> the address space in the type is for composability. In what address
>> space does the "elements" member of Matrix reside? The answer is: in
>> the address space of the object of that type. So A.elements is in
>> "@parameter" space and Csub.elements is in "@local" space. In both
>> cases, Matrix::elements points to "@generic" (aka global) space. Also,
>> by using type qualification, casts of arbitrary addresses can be
>> constructed:
>> d1.-1> p &A
>>  &A = 0x00000020 -> (Matrix @parameter)
>> d1.-1> p A
>>  A = {
>>    width = 0x00000002 (2)
>>    height = 0x00000002 (2)
>>    stride = 0x00000002 (2)
>>    elements = 0xfb00000000 -> 0
>>  }
>> d1.-1> p {*(Matrix @parameter *)0x20}
>>  *(Matrix @parameter *)0x20 = {
>>    width = 0x00000002 (2)
>>    height = 0x00000002 (2)
>>    stride = 0x00000002 (2)
>>    elements = 0xfb00000000 -> 0
>>  }
>> d1.-1>
>> The debugger can follow the type qualification chain to determine in
>> which address space the object/address resides.
>>> Unfortunately, there is no type information for DW_AT_frame_base.
>>> Any ideas?
>> Get a debugger expert involved.
>> Hope this helped!
>> Cheers, John D.
>>> *Richard Relph
>>> *MTS | Stream Compute SW | AMD*
>>> *o. 408.749.6659
