This is the =readme= file for the "decomp" decompiler by Jim Reuter. ----- This README file describes the decompiler that resides in this directory. WHAT IT IS: The decompiler does just what the name implies--it decompiles object files into C source files (almost). WHO WROTE IT: Jim Reuter reuter@cookie.dec.com ...!decwrl!cookie.dec.com!reuter COOKIE::REUTER WHY I WROTE IT: To decompile Peter Langston's EMPIRE game and other goodies into source form. This was desired because I wanted to port the games to VMS and to fix some bugs in the games. HOW TO USE IT: In order to decompile object files, the files must contain certain information. Specifically, this decompiler only works with 4.X BSD "a.out" files that contain global symbol information AND relocation tables. It can't deal with stripped executable files or executable files that have no relocation table. Any unlinked object file has the necessary information. The decompiler was motivated by a desire to get sources from some object files that had no sources available. These files were compiled with the "cc -O" command. No "-g" symbolic debugger information was available. Therefore the decompiler was not designed to understand this information. If you have files compiled with the "-g" switch, the decompiler won't use the extra information. The decompiler does not generate C source. It generates something close to C source that needs some hand editing to become the real thing. Writing a decompiler that does everything would be extremely difficult, requiring more time than the hand-editing needed to finish the job (at least in my case). The decompiler makes several passes over an object file. The first pass (over the symbol table) finds function entry points. The next pass builds tables of parameters, local variables, register usage, and branch targets for each function. The third pass breaks the machine instructions into basic blocks. Two more passes mangle the directed graph of basic blocks into a form that represents the high-level language tree structure of the program. The final pass prints formatted almost-C output. The decompiler control flow analysis knows how to make 'if then else' statements, 'while ()' loops, 'do while ()' loops, and 'switch()' groups. It knows when to generate 'break' and 'continue' statements. It does not know about 'for ()' loops. Any loops that do not map to a 'while()' or 'do while()' form get placed in a 'while (1) { }' form with appropriate 'if' statements to break out of the loop. When code does not fit into any of the nicely structured forms, trusty old 'goto' statements and labels are used. (Take that, N. Wirth.) Because of the amount of hand-munging needed to get real source code, some form of checking is useful. Here's what I did when decompiling Empire: 1) decompile an object file 2) munge the decompiler output with "sed" scripts 3) hand edit the sed output into compilable source code 4) compile this source into object form 5) decompile this new object file 6) "diff" the decompiler output for my object file against the decompiler output of the original object file. Look for major discrepencies. I obtained nearly error free decompilation using this method. WHAT IT DOES NOT DO: The decompiler accomplishes a lot with control flow analysis, but not everything. In order to do a complete job of decompiling, data flow analysis must be used. The thought of this level of complexity was too much for me. Besides, the C compiler itself doesn't go to that much trouble. So the decompiler doesn't do it. What does this mean? 1) The decompiler doesn't know what a compound statement is. So code that uses lots of scratch registers to compute compound statements will be decompiled into lots of simple statements that use lots of scratch variables. The code is correct C, but not terribly intuitive. Hand-editing is used to make improvements. 2) The decompiler can't build argument lists for functions. It knows when an argument is being put on the stack, and it knows when a call occurs, but because of limitation (1), it can't put the two together. Also, because of the lack of data flow analysis, it does not know when a function returns a value. This means that it CANNOT GENERATE CORRECT FUNCTION CALLS or 'return' statements. Instead it generates a code that is obviously wrong, but visually very easy to see and edit into the correct form (assuming that you have a decent a screen editor.) The decompiled code is of the form: @arg@ = foo; @arg@ = bar; @val@ = mumble( @2 args@ ); The user is responsible for editing this into the form: floop = mumble( bar, foo ); Return statements are decompiled into: return @value@; The user is responsible for determining whether register R0 contained a value that will be used by the caller. NOTE that the @arg@ statements appear in REVERSE ORDER from the parameter order in the function call. Also note that the function call skeleton tells you the correct number of arguments. USE THIS NUMBER! Code can appear in the form: @arg@ = fee; @arg@ = fi + 1; @arg@ = fo; @val@ = fum( @2 args@ ); @arg@ = r00l; @val@ = foo( @2 args@ ); which translates to: foo( fum( fo, fi + 1 ), fee ); BE SURE YOU UNDERSTAND THIS EXAMPLE before attempting to edit the decompiler output. (Hint: editing the file from the bottom up helps.) 3) The decompiler can't distinguish between crappy user code and peephole optimizer munging. Sometimes the peephole optimizer will smash two separate but similar pieces of code together. For example, ... fprintf( stderr, "bye bye" ); return; ... fprintf( stderr, "hi ma!" ); return; may appear as ... @arg@ = D0456l; G0020: @arg@ = __iob->O025b; @val@ = fprintf( @2 args@ ); return @value@; ... @arg@ = D0464l; goto G0020; See how the "fprintf( stderr, " and "return" portion of each sub-block was collapsed by the peephole optimizer? Have fun with these. 4) Integers used as unsigned types are difficult to detect. Sometimes the decompiler will get it right. The only real problem here is with unsigned comparisons. To help detect problems, any unsigned comparisons will appear with the suffix 'u', such as in if ( blah < u bletch ) { ... This is not correct C code, but will force the user to recognize what is happening. The user is responsible for checking the type declarations for correctness. 5) Because VAX conditional branches use global condition codes, several branches can follow one conditional test. The decompiler cannot repeat the source form of the test in each if statement because the source form may have side effects like "++" in it. The decompiler punts and uses the string "@prev@" to indicate that the if statement uses the same test as the conditional test that preceedes it. The user must fix these cases. 6) The format of Unix 4.2BSD object files, especially the values used in the symbol table, are poorly documented. I determined most of the information used by the decompiler empirically. No doubt, I missed a few cases. Whenever the decompiler becomes confused by an addressing mode, global symbol information, relocation information, or whatever, it will print errors on stderr and will also embed the error messages at the proper place in the decompiled output. All embedded error messages contain the string "ERR". 7) The decompiler understands VAX the indexed addressing modes, but cannot easily generate semantically correct source code. The user is responsible for examining anything that looks like an array reference to verify correctness. Note how the decompiler always uses syntatically incorrect strings when it can't do something. The @blah@ form is used for things that are simply beyond the capabilities of the decompiler. The "ERR" string is used when something unexpected occurs. The ">=u" case is used to show unsigned comparisons. 8) Some kinds of instructions are just plain difficult to turn directly into source code. For example, bitfield insertion and extraction is not converted into source code. Instead, the assembler instruction is dropped right into the decompiled code and the user is responsible for intuiting the data structure and editing the code accordingly. 9) The structuring part of the decompiler knows about compound "if" statements and will try to generate these. Normally this works quite well and results in very clean decompiled code. Compound "if" detection fails when the "if" expression is complex enough to require temporary variables--the resulting code no longer contains adjacent tests. Large, complex "if" statements that have this problem can turn into pretty messy decompiled code. 10) The MOD (%) operator generates a small jumble of division, multiplication, and subtraction. Learn to recognize this pattern and convert it into the simpler mod operator. Items (1) and (2) will cause the most trouble. There are several other things you should know about the decompiler output format that will help you reconstruct the source program. a) The decompiler will always use the correct name for a symbol when the name appears in the global symbol table. (Functions will always have the correct names.) When a name isn't available, the decompiler will generate one. It uses a naming convention that avoids editing ambiguity problems. b) Labels are always of the form "Gnnnn", where 'n' is a decimal digit. There are ALWAYS four digits--leading zeros are used when needed. This makes editing much less painful (e.g. no confusion between "G12" and "G120".) c) Function arguments always appear as "Annnt", where 'n' is a decimal digit and 't' is a type suffix. The 'nnn' is actually the decimal offset of the argument on the stack. The type suffix indicates the data type of the symbol and is derived by analyzing how the code uses the data. If a function uses the second argument as both an "int" and a "char", you will see two arguments: A004c and A004l. The user is responsible for recognizing these cases and fixing the code. The type suffixes are: c: char uc: unsigned char s: short us: unsigned short l: long (int on VAX) ul: unsigned long p: pointer (the decompiler can't compute pointer type) f: floating d: double q: quadword (if you get this from the 4.2 C compiler, good luck) ERR: something is terribly wrong--time to panic For the following items, assume 'n' is a decimal digit and 't' is a type suffix, as above. d) All 'static' types global to the object file appear as "Dnnnnt" (D stands for Data). These variables are also declared at the end of the decompiled source file, with initializers. In some cases, these variables represent program constants (these are marked with "/* const */") and the constant values can be edited back into the source code. Note that CONSTANT DETECTION IS NOT RELIABLE. If no code is found that writes the variable, it is assumed to be constant. Without data flow analysis, the decompiler can't find indirect references to the variable through pointers. Treat "/* const */" as a hint, not a fact, and use your intuition to decide. Also, the initialized data declarations may contain holes, for the same reason that constant hints are not reliable (pointers). e) Local (stack) variables have the symbol format "Lnnnnt". The digits 'nnnn' represent the decimal offset from the beginning of the stack frame. f) Register local variables have the symbol format "rnnt". The 'nn' part is the actual register number. Since the Unix C compiler likes to use register 0 as the return value from functions, and registers 0 and 1 as the main scratch registers, you can use this information to intuit return values and pieces of compound statements. g) Pointer offsets (usually structure elements) have the symbol format "Onnnt". In all cases, the "nnn..." parts of all the above symbols are based on actual program structure. For arguments ("Annnt") the number is the offset up the stack from the argument pointer. For local variables ("Lnnnnt") the number is the offset down the stack from same. For registers ("rnnt") the number is the register number. For static data ("Dnnnnt") the number is based on the offset into the object file. For offsets ("Onnnt") the number is the actual offset value. All values are decimal. The decompiler cannot detect implicit type casting. In all cases, the user is responsible for recognizing cases where the same offset is used as different types. HACKER HINTS: A good hacker with lots of spare time might want to improve this program. I suggest the following: 1) Find and fix omissions in the many combinations of symbol and relocation types. These are the things that generate the "ERR" messages. Any other improvements require some form of dataflow analysis. If you are really dedicated and want to do this, I would strongly suggest doing the dataflow analysis after the basic block analysis (done by block()) and before any of the structuring (hier(), hll(), and format()). This is because dataflow analysis could eliminate some of the problems encountered by the structuring. The kinds of things that dataflow analysis could fix are: 2) Get rid of temporary variables. PCC only uses R0 and R1 as temporaries, so temporaries would be easy to detect. Just be sure to generate source code that reflects the correct implicit and explicit operator precedence. If done correctly, this would also fix up the compound "if" problem mentioned above. 3) Assemble correct argument lists. The hardest parts of this would be a) fixing peephole optimizer munging and b) understanding whether or not a function returns a value. These last two changes would dramatically reduce the amount of hand- editing needed. If anybody actually does improve this thing, I'd appreciate getting a copy of the changes. Jim Reuter ----- CategoryDecompilation