[Dwarf-Discuss] address values in constant forms

Fri Dec 11 23:10:30 GMT 2009

For some DWARF consumers it is important to know what parts of "constant
data" are meant as address constants as opposed to truly "uninterpreted
bytes".  Some obvious interpretations of the spec on the producer side can
make this rather difficult.

Consider this C function:

	#include <stdint.h>		/* for intptr_t */

	extern void bar (int *, int *, int *);

	void foo (void)
	{
	  static int i[10];
	  int *const ip = &i[5];
	  const struct { int *pfield; intptr_t ifield; } s = { &i[7], 23 };
	  bar (&i[1], ip, s.pfield);
	}

A clever compiler will allocate no storage for "ip" or "s", but will
still want their DW_TAG_variable entries to indicate their values.
How should it do so?

The first obvious approach is to give each a DW_AT_const_value.  This
attribute is specified to have only constant or block forms.  Hence, we
get for "ip" a DW_FORM_data* whose value after relocation is that of 5
int-sizes past the static "i" symbol's value.  For "s", a DW_FORM_block*
whose contents after relocation are the two words &i[7] (i.e. the address
7 int-sizes past the "i" symbol's value) and 23.

I said "after relocation" twice there.  In the normal course of events,
the relocation that happens is at final-link time of the executable or
DSO containing these DWARF entries.  By the time the DWARF consumer
comes along, such relocation information has already been lost.

There are a variety of reasons that a consumer might want to distinguish
address constants from other integer constants.  The reason I will focus
on here is to support position-independent code, as is normal for code
compiled into a DSO.  With PIC, while the individual relocation
information has been lost at static link time, DWARF consumers understand
the runtime semantics of applying a runtime address offset to the static
address constants.  To do this, they of course must know when an integer
constant is in fact an address constant.
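
To make that adjustment concrete, here is a minimal sketch in C of what a
consumer does once it knows a constant is an address.  The function name
and the notion of a single per-module "load bias" are assumptions made
for illustration; a real platform may need something more elaborate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from any real library.  load_bias is the
   difference between the address the DSO was actually loaded at and
   the address it was linked at.  */
uint64_t
apply_load_bias (uint64_t link_time_addr, uint64_t load_bias,
                 unsigned address_size)
{
  /* This sketch only handles 4-byte and 8-byte target addresses.  */
  assert (address_size == 4 || address_size == 8);

  uint64_t sum = link_time_addr + load_bias;

  /* Truncate to the target's address width.  */
  return address_size == 8 ? sum : (sum & 0xffffffffu);
}
```

The arithmetic is trivial.  The hard part, and the subject of the rest of
this message, is deciding which constants it should be applied to.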

I see two broad perspectives that one can take to consider issues such as
this one.

1. The DWARF data is to be taken as a whole, wherein the semantics of one
   component of the data are not well-defined without integrating all the
   knowledge represented in DWARF.  In this instance, that means that
   DW_AT_const_value does not give exact "uninterpreted bytes" on its own.
   Instead, it gives a value blob that is only meaningful as specially
   interpreted in the context where it appears.  That is, you cannot know
   the runtime byte-equivalent of "((char[sizeof(int *)] *) &ip)" from the
   DW_AT_const_value attribute alone.  Instead, you must consider the
   value blob from DW_AT_const_value along with that entry's DW_AT_type
   and the complete layout and semantics it refers to.  So you follow
   the type entries to discover that "ip" has a DW_TAG_pointer_type and
   thus discern that the value blob from its DW_FORM_data* should be
   understood as an address constant.  Likewise, for "s" you see it has a
   structure type containing DW_TAG_pointer_type in the first field, and
   so know to interpret that portion of the value blob from DW_FORM_block*
   as an address constant.

   Taking this tack has some practical complications, philosophy aside.
   For example, consider:

	int *const a = 0;
	int *const b = (int *) 0x1234;

   The representations of these would be no different from the case above.
   For "a", it might be simple enough to say that the address constant 0
   is special and consumers know not to apply any runtime offset to it.
   For "b", there is no obvious distinction at all and it's hard to see
   how a DWARF consumer could possibly know to distinguish this from the
   "&symbol+n" cases--the integer address literal should not be adjusted
   at runtime while the symbol-relative address literal should be--so as
   to come to the correct answers about what values both "a" and "b"
   actually have in the runtime semantics of the program.

2. Each "stratum" of DWARF data gives complete information at its level of
   concern, without reference to any higher-level understanding of that
   information.  In this instance, that means that DW_AT_const_value and
   location expressions and so forth give the "raw bytes" stratum of DWARF
   information about the program.  This stratum of DWARF alone should give
   complete information on how to reconstruct a target memory image of
   uninterpreted bytes that corresponds to the semantics of the program.
   That is, DW_AT_const_value or complex location expressions can be
   thought of as describing a target memory image, and it's as if the
   variable had a trivial location expression that yields the address of
   that memory image.  (The only wrinkle in that metaphor as a way to
   imagine reading the runtime semantic value is that this putative target
   memory image can be described to have inaccessible holes.  The metaphor
   breaks down further if considered for values that a debugger can
   change, as in general each bit of that memory image can also be
   described as residing in multiple copies in target memory/register bits.)

   This is the perspective that I favor.  I'll admit that my strongest
   reason for this view is simply that it fits the existing software
   structure layers of the DWARF consumer code that I work with.  But,
   IMHO this is also the "natural layering" for thinking about DWARF in
   the abstract and an important way to keep the tasks of DWARF consumers
   tractable and comprehensible both to implement and to specify.

Within this latter paradigm, I see two basic ways to think about the
particular subject of address constants.

1. All forms of relocation are outside the scope of DWARF.  This says that
   a consumer applies some external means to resolve all the DWARF data
   into precise memory-image bits appropriate for the given runtime
   context, before interpreting the raw bytes of DWARF encoding as
   specified.  For example, carry ELF relocation sections for .debug_info
   that indicate all address adjustments that need to be applied at runtime.
   In this instance, that means relocs for the parts of .debug_info where
   the DW_AT_const_value data* for "ip" appears and where the "pfield"
   portion of the DW_AT_const_value block* for "s" appears.

   In this view, there is little practical reason to have a DW_FORM_addr
   or DW_OP_addr distinct from the data* forms.  They serve to indicate an
   "addressness" semantic, but a "raw bytes stratum" doesn't really make a
   distinction between an address and another integer in a semantic sense,
   and there is no material need to treat them differently in coming up
   with the imagined target memory image.

   IMHO this would be an undue burden on DWARF consumers.  In current
   practice, final-linked objects (executables and DSOs) do not carry any
   relocation information for the DWARF data and consumers do not consider
   the possibility of having to use any.

2. The "raw bytes stratum" distinguishes two kinds of "raw bytes": truly
   uninterpreted bytes, and bytes that are part of an address constant.
   An address constant is known to require adjustment by the consumer in
   ways that are well-defined on the platform (though themselves entirely
   outside the scope of DWARF).

   In this view, the precise meaning of DW_FORM_addr or DW_OP_addr is that
   it's an address constant requiring appropriate adjustment.  In contrast,
   data* forms are uninterpreted integer constants at this stratum,
   even when by the holistic semantics they are pointer/address values.

   Concretely, this means address forms are appropriate for the "ip" and
   "s.pfield" examples above, but integer (data*) forms are appropriate
   for the "a" and "b" examples.
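
As a rough illustration of the difference, the sketch below contrasts a
consumer that infers "addressness" from the type (the first perspective)
with one that looks only at the form (the view just described).  The
struct and function names are hypothetical, and allowing DW_FORM_addr on
DW_AT_const_value is the extension being proposed here, not something the
draft permits.  The numeric form codes are the ones the spec assigns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Form codes as assigned by the DWARF specification.  */
enum
{
  DW_FORM_addr = 0x01,
  DW_FORM_data2 = 0x05,
  DW_FORM_data4 = 0x06,
  DW_FORM_data8 = 0x07,
  DW_FORM_data1 = 0x0b
};

/* Hypothetical decoded DW_AT_const_value, for illustration only.  */
struct const_value
{
  unsigned form;		/* form the producer chose */
  bool type_is_pointer;		/* DW_AT_type leads to DW_TAG_pointer_type */
  uint64_t encoded;		/* value as found in .debug_info */
};

/* Type-driven: adjust every nonzero pointer-typed constant.  */
uint64_t
value_by_type (const struct const_value *v, uint64_t load_bias)
{
  if (v->type_is_pointer && v->encoded != 0)
    return v->encoded + load_bias;
  return v->encoded;
}

/* Form-driven: adjust only what the producer marked as an address.  */
uint64_t
value_by_form (const struct const_value *v, uint64_t load_bias)
{
  switch (v->form)
    {
    case DW_FORM_addr:
      return v->encoded + load_bias;
    case DW_FORM_data1:
    case DW_FORM_data2:
    case DW_FORM_data4:
    case DW_FORM_data8:
      return v->encoded;
    default:
      assert (!"form not handled in this sketch");
      return v->encoded;
    }
}
```

With a load bias of 0x10000, value_by_type turns "b" (encoded as 0x1234)
into 0x11234, which is wrong, while value_by_form leaves it at 0x1234 and
still adjusts an "ip" that was emitted with the address form.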

So I would like to settle on the latter view.  I don't really know how to
make this more explicit in the standard as the general paradigm.
(Probably some careful scanning of all the "address" wording in the spec
would indicate a good approach to that.)  For now I'll concentrate on
specific ramifications for the concrete examples I've given above.

(These references are to the version 4 "WORKING DRAFT 3" of May 22, 2009.)
I note that in 2.3.8.2 Concrete Inlined Instances, for a different use of
DW_AT_const_value the wording is, "... whose value may be of any form that
is appropriate for the representation of the subroutine's return value."
That wording is unlike the other cases, which specifically list the forms
allowed.  This looser wording would seem, on its face, to permit address
forms for this one use of DW_AT_const_value.  I'm not entirely clear on
whether the tables like Figure 21 are meant to be normative in giving
exclusive lists of acceptable forms/classes--if so, I guess that serves to
disambiguate the text that seems to conflict on its face, and so rule out
address forms even for a concrete inlined subroutine's return value.

IMHO it would be natural and sensible for DW_AT_const_value to admit the
address form/class across the board.  This easily covers the "ip"
example above.  For the cases where it applies, this is the most compact
representation and the most intuitively straightforward one for that
case.  The changes to the DW_AT_const_value wording for this are simple
and obvious, as is adding "address" to the Figure 21 table.  It's not
entirely clear to me how compatibility with consumers of earlier versions
is meant to be handled, but I suppose that a producer would be obliged to
restrict using DW_FORM_addr in DW_AT_const_value to only when also setting
the CU version field to 4 (if these changes were made to the next version
4 draft).

With version 4 as it stands in draft 3, we can represent the "ip" case
another way.  Instead of DW_AT_const_value, give it DW_AT_location of:

	DW_OP_addr (&i[5] address) DW_OP_stack_value

(This consumes three bytes more in .debug_info than DW_AT_const_value.)
Likewise, this style can cover the more complex "s" case too (here
assuming 8 is the address size):

	DW_OP_addr (symbol i + 7*sizeof(int)) DW_OP_stack_value DW_OP_piece 8
	DW_OP_lit23 DW_OP_stack_value DW_OP_piece 8

I don't see any better alternative for complex cases such as this one.
For the simple cases, extending DW_AT_const_value makes sense to me.
But with the spec as it stands today, we can just recommend that producers
use DW_OP_addr DW_OP_stack_value locations instead.  Whether or not we
extend DW_AT_const_value, I think we should endeavor to clarify the spec
as to the requirement to distinguish adjustable address constants from
other integer literals in this way.
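
For what it's worth, here is a minimal sketch of how a consumer might
turn the two-piece expression for "s" into a value image.  It handles
only the four operations used above, and it assumes 8-byte little-endian
addresses, pieces of at most 8 bytes, and a single load bias.  The
function name is hypothetical; the opcode values are the ones the spec
assigns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values as assigned by the DWARF specification.  */
enum
{
  DW_OP_addr = 0x03,
  DW_OP_lit0 = 0x30,
  DW_OP_lit31 = 0x4f,
  DW_OP_piece = 0x93,
  DW_OP_stack_value = 0x9f
};

/* Hypothetical evaluator for a tiny subset of location expressions.
   Writes the pieces into IMAGE and returns the number of bytes written.  */
size_t
eval_const_pieces (const uint8_t *expr, size_t len, uint64_t load_bias,
                   uint8_t *image, size_t image_size)
{
  uint64_t top = 0;	/* a one-entry "stack" suffices for this subset */
  size_t out = 0;
  size_t i = 0;

  while (i < len)
    {
      uint8_t op = expr[i++];

      if (op == DW_OP_addr)
        {
          assert (len - i >= 8);
          top = 0;
          for (int b = 7; b >= 0; b--)
            top = (top << 8) | expr[i + b];
          top += load_bias;	/* the only place adjustment happens */
          i += 8;
        }
      else if (op >= DW_OP_lit0 && op <= DW_OP_lit31)
        top = op - DW_OP_lit0;	/* plain integer, never adjusted */
      else if (op == DW_OP_stack_value)
        ;			/* the value itself is on the stack */
      else if (op == DW_OP_piece)
        {
          /* Sizes below 128 fit in a single ULEB128 byte.  */
          assert (i < len && expr[i] < 0x80);
          size_t n = expr[i++];
          assert (n <= 8 && image_size - out >= n);
          for (size_t b = 0; b < n; b++)
            image[out++] = (uint8_t) (top >> (8 * b));
        }
      else
        assert (!"operation outside this sketch's subset");
    }

  return out;
}
```

The point of the sketch is that the evaluator never needs to look at the
type of "s".  The DW_OP_addr operand is adjusted and the DW_OP_lit23
value is not, purely because of how the producer encoded them.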

Note that this clarification entails drawing the distinction exactly
this way: the "a" and "b" examples must use DW_FORM_data* and must not
use DW_FORM_addr, to indicate that runtime address adjustment does not
apply.  This really means that handling address adjustments is the only
purpose at all for address forms, and thus the "raw bytes stratum" does
not give consumers a reliable semantic distinction between address and
non-address for any other purpose.  (I didn't mention any other such
purposes, but did say there are others.  For a high-level purpose like
deciding how to display a value to the user, it might be deemed
worthwhile to use symbolic adjustability as the deciding factor, but if
real "addressness" is the criterion of interest then it would have to be
determined from higher-level information such as types being pointers.)

Comments?

Thanks,
Roland