[Dwarf-Discuss] string reduction techniques

Todd Allen todd.allen@concurrent-rt.com
Mon Nov 1 20:52:25 GMT 2021


Dave,

If I understand right: The space saving you're expecting is the near-elimination
of DW_AT_name strings.  If they are only simple names like "T" and "int", they
can be placed into the string table once each, and it should be very small.  But
you're expecting the DW_AT_linkage_name attributes still to have lots of
replication because of the large composed names.  So I gather that was where
your estimate of 1/2 reduction came from.
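
To make that expected saving concrete, here is a minimal Python sketch. The
TypeDie class and the helper names are hypothetical stand-ins for a type DIE
with template parameter children, not any real DWARF library API. It compares a
deduplicated string table holding full template names against one holding only
simple names, with the full name rebuilt recursively on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeDie:
    """Hypothetical model of a type DIE: a simple name plus template arguments."""
    name: str
    params: tuple = ()


def full_name(die):
    """Rebuild a name such as 't1<t1<int>>' from simple names, recursively."""
    if not die.params:
        return die.name
    return die.name + "<" + ", ".join(full_name(p) for p in die.params) + ">"


def collect(die, seen):
    """Gather every distinct DIE reachable from `die`."""
    seen.add(die)
    for p in die.params:
        collect(p, seen)
    return seen


def str_bytes(strings):
    """Size of a deduplicated string table: each unique string once, NUL-terminated."""
    return sum(len(s) + 1 for s in set(strings))


int_t = TypeDie("int")
inner = TypeDie("t1", (int_t,))          # t1<int>
outer = TypeDie("t1", (inner, inner))    # t1<t1<int>, t1<int>>

dies = collect(outer, set())
full = str_bytes(full_name(d) for d in dies)   # every DIE carries its full name
simple = str_bytes(d.name for d in dies)       # only "t1" and "int" are stored
print(full, simple)
```

In this toy model, adding another instantiation of t1 over existing types leaves
the simple-name table unchanged, while the full-name table gains a new, longer
string each time.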

I was trying to figure out how we came to opposite conclusions, and I think it's
that I have this (implicit) assumption of a sort of "DWARF Moore's Law", that
the size of debug info/strings/etc. would double periodically, just based on the
tendency of software systems to grooooooow.  I'm likening it to Moore's Law,
because I expect it's the same sort of vague, rough estimate that somehow still
applies to the real world.

Assuming it does apply, your halving of the string table amounts to buying
yourself one doubling period, and then you're back to requiring DWARF64 string
tables.  (Meanwhile, DWARF64 gives us 32 doubling periods over DWARF32.  So
hopefully that will last us for a while...)
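
The arithmetic behind that, as a small Python sketch. It assumes the simplified
model above (the table size doubles once per period) and a hypothetical string
table that sits exactly at the 4 GiB DWARF32 limit; the function name is made up
for illustration.

```python
DWARF32_LIMIT = 2**32   # largest size addressable with 32-bit offsets
DWARF64_LIMIT = 2**64   # largest size addressable with 64-bit offsets


def doublings_before_overflow(size, limit):
    """Count how many times `size` can double without exceeding `limit`."""
    periods = 0
    while size * 2 <= limit:
        size *= 2
        periods += 1
    return periods


at_limit = DWARF32_LIMIT   # hypothetical table that has just hit the DWARF32 limit

# Staying on DWARF32 with no reduction: no headroom left.
print(doublings_before_overflow(at_limit, DWARF32_LIMIT))

# Halving the table but staying on DWARF32: one period of headroom.
print(doublings_before_overflow(at_limit // 2, DWARF32_LIMIT))

# Moving the same table to DWARF64: 32 periods of headroom.
print(doublings_before_overflow(at_limit, DWARF64_LIMIT))
```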

I can't be sure about this exponential growth.  I don't have the data to back it
up.  But I will say, when we created DWARF64, I was skeptical that it would be
needed during my career.  And yet here we are...

...

The reduction for DW_AT_linkage_name does seem like a tougher nut to crack.  As
you mentioned, there is a tendency to eliminate *some* of the replication
because of the mangler's use of substitution strings (S_, S0_, S1_, etc.)  But
that same feature probably would make it a lot harder to do anything clever
about chopping up the linkage names into substrings.
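
A toy illustration of why substitutions get in the way, in Python. This encoder
is deliberately NOT the Itanium ABI mangling rules; it is a simplified
back-reference scheme invented for this sketch (first occurrence spelled out,
repeats replaced by S<index>_). It only shows that the same type ends up with
different spellings depending on what preceded it in the name.

```python
def encode(node, seen=None):
    """Toy back-reference encoder (not the real mangling scheme).

    `node` is a (name, args) tuple. The first occurrence of a node is spelled
    out in full; later occurrences become S<index>_ references into `seen`.
    """
    if seen is None:
        seen = []
    if node in seen:
        return f"S{seen.index(node)}_"
    name, args = node
    text = f"{len(name)}{name}"
    if args:
        text += "I" + "".join(encode(a, seen) for a in args) + "E"
    seen.append(node)
    return text


int_t = ("int", ())
inner = ("t1", (int_t,))            # t1<int>
outer = ("t1", (inner, inner))      # t1<t1<int>, t1<int>>
mixed = ("f", (int_t, inner))       # f<int, t1<int>>

standalone = encode(inner)   # t1<int> spelled out in full
nested = encode(outer)       # second t1<int> collapses to a back-reference
shifted = encode(mixed)      # here the "int" inside t1<int> is the back-reference
print(standalone, nested, shifted)
```

In `nested` the second t1<int> is just a back-reference, and in `shifted` the
spelling of t1<int> differs from `standalone`, so a pool of shared substrings
keyed on the type would not find matching text to share.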

Honestly, I've never been sure why gcc generates DW_AT_linkage_name.  Our
debugger almost never uses it.  (There is one use to detect "GNU indirect"
functions.)  I wonder if it would be possible to avoid them if you provided
enough info about the template parameters, if the debugger had its own name
mangler.  I had to write one for our debugger a couple years ago, and it
definitely was a persnickety beast.  But doable with enough information.  Mind
you, I'm not sure there is enough information to do it perfectly with the state
of DWARF & gcc right now.

Todd

On Mon, Nov 01, 2021 at 01:06:33PM -0700, David Blaikie wrote:
>    Hey Todd,
> 
>    Just some details regarding the string reduction strategies I'm pursuing
>    to address DWARF32 overflowing .debug_str.dwo/.debug_str_offsets.dwo
>    sections in some large binaries at Google.
> 
>    So the extreme cases I'm dealing with are predominantly C++ Expression
>    templates (in TensorFlow and Eigen) - these produce types with very large
>    DW_AT_names ("f1<int>") and DW_AT_linkage_names (eg: "_Z2f1IiEvv") (but
>    with many more template parameters, none of which are ever user-written
>    but deduced).
> 
>    So the main fix I'm pursuing (roughly called "simplified template names")
>    is to omit template parameter lists from DW_AT_names of templates in most
>    cases, allowing the consumer to reconstruct the name from
>    DW_TAG_template_*_parameter DIEs itself, recursively. Further discussion and
>    details
>    here: [1]https://groups.google.com/g/llvm-dev/c/ekLMllbLIZg/m/-dhJ0hO1AAAJ
>    - in terms of how this affects scaling factors, it means that adding an
>    additional template instantiation of existing types would add no new data
>    to .debug_str (eg: going from a program with "t1<int>" to "t1<t1<int>>"
>    would add no new entries to .debug_str). Not all names can be readily
>    reconstructed - so I'm opting the feature out on those, but we could have
>    a deeper discussion about how to handle them if we wanted to make
>    this a full-fledged/robust feature (maybe one the DWARF spec
>    suggests/encourages).
> 
>    GDB seems to handle this sort of debug info OK - I guess someone did real
>    work to support that at some point (so maybe some other debugger already
>    generates DWARF like this).
> 
>    The other half, though, is DW_AT_linkage_names - and in theory similar
>    rebuilding could be done, but that'd require baking a lot of
>    implementation knowledge into the DWARF Consumer that DWARF is meant to
>    help avoid... so I'm unsure what the right solution is there just now, but
>    there's a few ideas I'm still kicking around. At least linkage names have
>    less redundancy (within a single name they avoid redundancy - "t1<t1<int>,
>    t1<int>>" only ends up with a single description of "t1<int>" instead of
>    two of them like you get with the DW_AT_name) than DW_AT_names, so they do
>    scale a bit better already.
> 
>    Happy to discuss these ideas in specific, or their impact on debug_str
>    growth in more detail any time (here, video chat, discords, etc).
> 
>    - Dave
> 
> References
> 
>    Visible links
>    1. https://groups.google.com/g/llvm-dev/c/ekLMllbLIZg/m/-dhJ0hO1AAAJ


