[Dwarf-Discuss] Multiple address space architectures and DW_AT_frame_base
Mon May 23 16:52:03 GMT 2011
Most interesting bits towards the bottom. The top mostly confirms or clarifies.
John DelSignore wrote:
> Relph, Richard wrote:
> > John,
> > Thanks for the reply. Your implication engine is working well. ;-)
> > OpenCL source, AMD GPU target. In between, though, we have an
> > intermediate language called AMD IL (published specs available
> > on-line).
> Probably similar to NVidia's PTX (Parallel Thread Execution pseudo-
> assembly language), I assume.
So I've been told... I don't know much about PTX.
> > Rather than answer your many excellent questions, I'll provide a bit
> > more background on my specific problem so you can see what I'm trying
> > to deal with.
> > AMD IL describes the private memory space in terms of "indexed
> > temporaries". This is just a syntactic device to defer actual
> > locating of thread-local storage to the AMD IL compiler (yes, there's
> > a compiler between AMD IL and the GPU-specific hardware instruction
> > set. It knows nothing of DWARF or debugging.)
> So that sounds like a fundamental difference between what you're doing
> and what happens with CUDA. In CUDA, the GPU ELF image is not at the PTX
> level, it's at the actual physical device level. That is, the DWARF the
> debugger sees reflects the properties of the actual hardware device,
> not PTX. Though, the compiler allows some PTX-isms to peek through the
> DWARF in the area of PTX virtual registers, which depending on PC, may
> bind to a hardware register, local memory space, or be "dead".
Interesting. Our IL-to-ISA translator is DWARF unaware and doesn't provide any information to the debug APIs. This obviously forced us to operate the debug APIs at the IL level. Not ideal, but the best we can do at the moment.
> > Sometimes the OpenCL
> > compiler will put variables in indexed temps (e.g., private arrays),
> > sometimes in regular AMD IL registers (e.g., private scalars and
> > vectors). While the hardware doesn't support the concept of a
> > "stack", obviously the LLVM-based compiler assumes one, so we do our
> > best using indexed temps and our rich register set.
> CUDA didn't support stacks until the 3.1 release, which forced a
> substantive change in the way the debug information had to be emitted
> and handled.
> > These indexed temps are referenced in AMD IL source as "x#[N]", where
> > # and N can have very large ranges (at least 64K). # specifies which
> > private memory space. N is the offset in that space. The AMD IL
> > compiler (SC, short for shader compiler) will look at the need for
> > registers throughout the "shader" (aka kernel) and decide if it can
> > allocate general purpose registers to implement the indexed temps, or
> > it has to resort to putting them in memory.
> Yes, sounds familiar :-)
> > The current OpenCL compiler for AMD IL puts indexed temps all in a
> > single private memory space... #1. (Most variables end up in general
> > purpose registers, not indexed temps.) The debugger, of course, is
> > oblivious to all this, but it uses an API provided by the debug agent
> > to access objects. To access the objects, the debug agent wants to be
> > as general as possible and assume as little as possible, since it
> > wants to support more than just debugging OpenCL. The API it provides
> > allows the debugger to specify both # and N.
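As an aside, a two-coordinate access interface of that shape is roughly what I have in mind. Here is a hedged sketch (all names are hypothetical; this is not our actual debug API, just an illustration of addressing by space # and offset N):

```python
# Hedged sketch: a debug-agent interface that addresses private memory
# by (space, offset), as described above. All names are hypothetical.

class DebugAgent:
    def __init__(self):
        # Each indexed-temp space x#[N] is modeled as its own byte buffer.
        self.spaces = {}

    def write_private(self, space, offset, data):
        # Write raw bytes at offset N within private memory space #space.
        buf = self.spaces.setdefault(space, bytearray(4096))
        buf[offset:offset + len(data)] = data

    def read_private(self, space, offset, size):
        # The debugger passes both coordinates: which private memory
        # space (#) and the offset within it (N).
        buf = self.spaces.setdefault(space, bytearray(4096))
        return bytes(buf[offset:offset + size])

agent = DebugAgent()
agent.write_private(1, 12, b"\xef\xbe\xad\xde")
```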
> OK, I see, I think. In the general case, there can be thousands of
> private memory spaces, even though your OpenCL compiler uses a much
> smaller number.
Yup. Compiler uses precisely 1 at the moment. We could hard-wire this "knowledge" into the debugger, or the debug APIs, but that's really not "correct" and it bugs me.
> > Right now, since the compiler does put all thread local variables in
> > a single private memory space that pretty well mimics a conventional
> > architecture's stack, I'm trying to leverage the DW_AT_frame_base
> > attribute of DW_TAG_subprogram. The "correct" thing for the current
> > compiler to do is to indicate, somehow, that "the stack" is in AMD IL
> > private memory space #1.
> I don't know much about your hardware or software, but it sounds to me
> like a lot of the complication here stems from trying to make the DWARF
> target AMD IL instead of the actual hardware.
Well, it's certainly true that having a "black box" SC presents certain design challenges. ;-)
I suppose if we didn't have to abstract this mechanism for deferring data location to SC, we'd be better off. Not an option, though.
> > That's my "today" problem. But to allow SC to make better choices
> > about which indexed temps to put in registers and which to put in
> > memory, the OpenCL compiler would have to split variables out from
> > the current monolithic "stack" space into individual "variable"
> > spaces. Even non-pointer variables would have to indicate which
> > private memory space they reside in.
Whew. I was fretting over that paragraph a fair bit...
> > In trying to "do the right thing" where specifying DWARF for AMD IL
> > is concerned, I'm trying to allow description of as much of the AMD
> > IL language's capabilities as possible so that the compiler making
> > the decision can pass this information through the debugger to the
> > debug API, where it is really needed. And avoid having to have the
> > debugger or the debug agent assume such things.
> So it sounds like your debug API also operates at the higher-level AMD
> IL instead of the hardware level.
> > So I think I want a "location" to permit an "op" that specifies the
> > memory space for a pending dereference - including the final implicit
> > one - without mucking with offsets. Something like DW_AT_memory_class
> > but in a location expression. Imagine needing to reference a pointer
> > in one space, apply an index from another space, add an offset from
> > another, to compute the address of an object in yet another space.
> Sure, it happens all the time in CUDA. The solution I chose for
> TotalView, as I said in my previous email, is to represent memory
> spaces as type qualifiers.
Well, this confuses me a bit. My understanding of DWARF type qualifiers is that they appear in DIEs to describe user variables, not in location descriptions that describe machine objects. So there's no type DIE for the "frame base" or "base registers" used in fbreg or breg location ops, so no place to put a type qualifier.
Or am I missing something basic? Is the CUDA DWARF spec published anywhere public?
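To make my confusion concrete, here is a sketch of the DIE layout as I understand it (illustrative only, not real compiler output): the qualifier would hang off the variable's type chain, but the frame-base register appears only inside a location expression, with no DIE of its own to qualify.

```
DW_TAG_subprogram
  DW_AT_frame_base   DW_OP_reg2          <- r2 has no type DIE, so there is
                                            no place to hang a space qualifier
  DW_TAG_variable
    DW_AT_name       "x"
    DW_AT_type       -> (space-qualified type DIE)
    DW_AT_location   DW_OP_fbreg +12
```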
> Behind the scenes, TotalView injects
> location operations that set the address space from which to read. The
> CUDA compiler chose to tack DW_AT_address_class attributes onto certain
> DIEs, but that's not what the debugger wanted.
Right. Nice to get confirmation that someone else independently chose to use DW_AT_address_class for this.
> In short, TotalView
> translates the DW_AT_address_class attributes into type qualifiers, and
> then uses the type qualifiers to modify the location operations during
> address resolution to set "segment values". The "segment values" are
> then used by the lowest levels of the debugger when calling into the
> CUDA debug API.
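If I've followed the pipeline correctly, it amounts to something like this (a hedged sketch; the names and values are made up, not actual TotalView or CUDA debug API identifiers):

```python
# Hedged sketch of the translation described above; all names and
# values are hypothetical, not real TotalView or CUDA identifiers.

# Step 1: vendor-defined DW_AT_address_class values seen on DIEs.
ADDR_CLASS_GLOBAL, ADDR_CLASS_SHARED, ADDR_CLASS_LOCAL = 1, 2, 3

# Step 2: translate the attribute into a type qualifier on the
# variable's type, much like "const" or "volatile".
ADDR_CLASS_TO_QUALIFIER = {
    ADDR_CLASS_GLOBAL: "__global__",
    ADDR_CLASS_SHARED: "__shared__",
    ADDR_CLASS_LOCAL:  "__local__",
}

# Step 3: during address resolution, map the qualifier to the
# "segment value" the low-level debug API expects.
QUALIFIER_TO_SEGMENT = {
    "__global__": 0,
    "__shared__": 1,
    "__local__":  2,
}

def resolve(address_class, offset):
    """Return the (segment, offset) pair handed to the debug API."""
    qualifier = ADDR_CLASS_TO_QUALIFIER[address_class]
    return QUALIFIER_TO_SEGMENT[qualifier], offset

print(resolve(ADDR_CLASS_SHARED, 0x40))  # -> (1, 64)
```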
I'll have to think about this. My uncertainty lies in the implicit assumption that every machine memory reference at the debug API level will have a corresponding DWARF type to figure out the memory space from.
This would lead to saying: x is a user variable in private memory space (#N) at fbreg + 12.
I was hoping to be able to say: x is a user variable at fbreg + 12, where fbreg is in private memory space (#N).
My original question was how to describe the "where" clause in DWARF.
But perhaps describing machine characteristics is not the "correct" way to do this. Still, it's tempting to define a location op (DW_OP_memory_space) that allows the latter form of expression. The subprogram DIE could then have a frame base attribute that says DW_OP_memory_space N, reg 2, meaning that r2 is the base register and it points into memory space N. That would simplify the specification of all the user variables' types.
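For what it's worth, such a vendor op would be cheap to encode. Here's a hedged sketch: DW_OP_memory_space is NOT a standard DWARF operation, and 0xE0 is just a plausible choice from the vendor range DW_OP_lo_user..DW_OP_hi_user (0xE0..0xFF) that DWARF reserves for extensions.

```python
# Hedged sketch: encoding the proposed expression as DWARF bytes.
DW_OP_memory_space = 0xE0   # hypothetical vendor opcode (DW_OP_lo_user range)
DW_OP_reg0 = 0x50           # standard: DW_OP_reg0..DW_OP_reg31 are 0x50..0x6F

def uleb128(value):
    """Encode an unsigned int in DWARF's ULEB128 variable-length form."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def frame_base_in_space(space, reg):
    """Build a DW_AT_frame_base expression meaning: register <reg>,
    which points into private memory space <space>."""
    return bytes([DW_OP_memory_space]) + uleb128(space) + bytes([DW_OP_reg0 + reg])

# "DW_OP_memory_space 1, DW_OP_reg2": r2 is the frame base and points
# into private memory space #1.
expr = frame_base_in_space(1, 2)
```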
> > But I'll settle for a solution to just my "today" problem... ;-)
> IMHO, you won't be able to find the "right" solution on this mailing
> list. You'll have to hash this out with your debugger developer, since
> what you should be generating has to be something the debugger can
> digest. I would think that whoever is working on the debugger would
> have a very strong opinion about this... or are you also the debugger developer?
Close... The entire development effort is in-house. I'm attached to the compiler team, and the debugger and debug API teams are content to have me define the DWARF.
Thank you for your thoughtful and thought-provoking replies. Very helpful. I owe you dinner and a beverage...