[Dwarf-discuss] DWARF problem with Debugging Information
Michael Eager
eager@eagercon.com
Tue Apr 2 14:00:24 GMT 2024
Hi Akin --
I've CC'ed the DWARF discussion group. There may be others who have
thoughts about your question. Most questions about the use of DWARF
should be directed to the mailing list. You may need to join the
mailing list to submit questions or reply.
A decompiler is an ambitious project. The compilation process is one
where considerable information about the source is discarded as code is
generated. DWARF will only be able to recover some of that information.
Regarding your question about literal values:
The line table associates object code addresses with source line
numbers. The line table does not describe data, whether variables or
literals.
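As a rough illustration of what the line table does provide, here is a
minimal Python sketch of an address-to-line lookup. The row list and the
function name are hypothetical; only the row mapping 0x1160 to line 6 comes
from this thread. It also assumes a single sequence of rows sorted by
address, which ignores end-of-sequence markers in a real line table.

```python
from bisect import bisect_right

# Hypothetical line-table rows as (address, source line), sorted by address.
# Only the 0x1160 -> line 6 row is taken from this thread; the rest are made up.
LINE_ROWS = [(0x1149, 4), (0x1151, 5), (0x1160, 6), (0x117E, 7)]


def line_for_address(rows, addr):
    """Return the source line of the last row at or below addr, or None."""
    addrs = [a for a, _ in rows]
    i = bisect_right(addrs, addr) - 1
    return rows[i][1] if i >= 0 else None


print(line_for_address(LINE_ROWS, 0x1168))  # 6
```

The lookup says that the code at 0x1168 belongs to line 6, and nothing
more. It carries no information about what that instruction references.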
Variables are described by DW_TAG_variable entries. They have a
DW_AT_location attribute which (when decoded) describes where the
variable is stored.
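For a global such as globalVar, the location is typically a single
DW_OP_addr operation (opcode 0x03) followed by the address. The sketch
below decodes only that one case. The function name and the expression
bytes are illustrative, and it assumes 8-byte little-endian addresses.
Locals usually use other operations, which this sketch simply rejects.

```python
DW_OP_ADDR = 0x03  # DW_OP_addr opcode as defined by the DWARF specification


def decode_static_location(expr, addr_size=8):
    """Return the address if expr is exactly one DW_OP_addr operation, else None."""
    if len(expr) != 1 + addr_size or expr[0] != DW_OP_ADDR:
        return None
    return int.from_bytes(expr[1:], "little")


# Illustrative expression bytes for a variable stored at 0x4010
# (assumption: 8-byte little-endian target addresses).
expr = bytes([0x03, 0x10, 0x40, 0, 0, 0, 0, 0, 0])
print(hex(decode_static_location(expr)))  # 0x4010
```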
For the most part, literal values are not described in DWARF. Literal
values may be stored in memory or they may be dynamically generated in
the object code. In either case, DWARF does not contain information
describing literals.
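Since DWARF does not help here, a string literal that is stored in memory
has to be recovered from the section contents directly. A minimal sketch,
assuming you already have the raw bytes of .rodata and its load address
(both made up below so that "Test" lands at 0x2004 and the presumed format
string at 0x2009); the function name is hypothetical:

```python
def read_c_string(section_data, section_addr, addr):
    """Return the NUL-terminated string at virtual address addr in a section."""
    start = addr - section_addr
    if not 0 <= start < len(section_data):
        raise ValueError("address is outside the section")
    end = section_data.find(b"\x00", start)
    if end == -1:  # no terminator found: read to the end of the section
        end = len(section_data)
    return section_data[start:end].decode("ascii", errors="replace")


# Made-up .rodata contents, assumed to be loaded at 0x2000.
RODATA_ADDR = 0x2000
RODATA = b"\x01\x00\x02\x00Test\x00%d\x00"

print(read_c_string(RODATA, RODATA_ADDR, 0x2004))  # Test
print(read_c_string(RODATA, RODATA_ADDR, 0x2009))  # %d
```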
In your example, line 6 is the printf of globalVar which starts at
address 0x1160. That is clear from the line table. (Executing objdump
with -S will make this more obvious.) The line table only describes
where the object code for a source line can be found. It does not
describe any references to variables, literals, functions, or anything
else. So, no, there is no entry for address 0x1168 where address 2009
(which I presume is the format string) is referenced.
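One way to get the association you are after is to take the references
from the disassembly itself and use the line table only to attribute each
instruction to a source line. The sketch below assumes GNU objdump output
for x86-64, where RIP-relative operands are annotated with a trailing
"# address <symbol>" comment. The disassembly lines are illustrative: the
addresses 0x1160, 0x1168, 2009 and 4010 come from this thread, while the
instructions, bytes and symbol names are assumptions. The function name and
the row list are hypothetical as well.

```python
import re
from bisect import bisect_right

# Illustrative objdump-style lines (instructions and bytes are assumptions).
DISASM = [
    "    1160: 8b 05 aa 2e 00 00     mov    0x2eaa(%rip),%eax   # 4010 <globalVar>",
    "    1166: 89 c6                 mov    %eax,%esi",
    "    1168: 48 8d 05 9a 0e 00 00  lea    0xe9a(%rip),%rax    # 2009 <_IO_stdin_used+0x9>",
]

# Hypothetical line-table rows as (address, source line), sorted by address.
ROWS = [(0x1160, 6), (0x117E, 7)]

# Instruction address at the start, referenced address in the trailing comment.
REF = re.compile(r"^\s*([0-9a-f]+):.*#\s+([0-9a-f]+)\s+<")


def references_by_line(disasm_lines, rows):
    """Map each source line to the addresses its instructions reference."""
    addrs = [a for a, _ in rows]
    refs = {}
    for text in disasm_lines:
        m = REF.match(text)
        if not m:
            continue  # instruction has no annotated memory reference
        insn_addr, target = int(m.group(1), 16), int(m.group(2), 16)
        i = bisect_right(addrs, insn_addr) - 1
        if i >= 0:
            refs.setdefault(rows[i][1], []).append(target)
    return refs


for line, targets in references_by_line(DISASM, ROWS).items():
    print(line, [hex(t) for t in targets])  # 6 ['0x4010', '0x2009']
```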
I hope this helps.
On 3/31/24 08:15, Burhan Akin Y wrote:
> Dear Mr. Eager,
>
> I am a student from the Heidelberg University in Germany and I am
> interested in using DWARF Debugging Information for our decompiler
> development project.
>
> The idea is to train a Seq2Seq model on translating between Disassembly
> and C Code (our focus is non-optimized programs).
> https://github.com/nokitoino/DecompilerAI
>
> We are trying to solve a little problem. Our training is done
> function-wise so far.
>
> The problem is that some literal values are not stored locally at the
> current dumped disassembly of the function, but rather at some memory
> offset.
>
> The idea is to prepare the training data like this:
>
> E.g.:
> int globalVar = 10; -> int globalVar = /*int_4010*/;
> char* test = "Test"; -> char* test = /*str_2004*/;
> ...
> Where 2004 is the address at which the string is stored in the
> .rodata section, and 4010 the address of the integer value in the
> .data section.
>
> That means the model should learn to emit the memory addresses when it
> is predicting C code from disassembly (since it is trained function-wise,
> it has no access to other information such as .rodata, only to the
> addresses). Later we can manually post-process the output to extract the
> string from memory.
>
> We have thought about using DWARF decodedline, which tells us which line
> in our C source code is associated with which address in the disassembly.
>
> If you take a look at the image I have attached to this e-mail, you will
> notice that not all literal occurrences are associated with a memory
> address. The printf("%d",globalVar) should have an association to the
> disassembly line at 1168, where the address 2009 appears.
>
> Could I have some explanation for this behaviour and advice on how we
> could solve this problem?
>
> We would appreciate any help.
>
> Best regards
>
> Akin Yilmaz
>
>
--
Michael Eager
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DWARF Demonstration.png
Type: image/png
Size: 109914 bytes
Desc: not available
URL: <https://lists.dwarfstd.org/pipermail/dwarf-discuss/attachments/20240402/8ddb4ee0/attachment-0001.png>