Journal Articles

Overload Journal #119 - February 2014 + Programming Topics
Browse in : All > Journals > Overload > o119 (6)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: Anatomy of a Java Decompiler

Author: Martin Moene

Date: 03 February 2014 10:21:07 +00:00 or Mon, 03 February 2014 10:21:07 +00:00

Summary: Java byte code can be reverted back into source code. Lee Benfield and Mike Strobel show how.

Body: 

A decompiler, simply put, attempts to reverse the transformation of source code to object code. But there are many interesting complexities – Java source code is structured; bytecode certainly isn’t. Moreover, the transformation isn’t one-to-one: two different Java programs may yield identical bytecode. We need to apply heuristics in order to get a reasonable approximation of the original source.

(A tiny) bytecode refresher

In order to understand how a decompiler works, it’s necessary to understand the basics of byte code. If you’re already familiar with byte code, feel free to skip ahead to the next section.

The JVM is a stack-based machine (as opposed to a register-based machine), meaning instructions operate on an evaluation stack. Operands may be popped off the stack, various operations performed, and the results pushed back onto the stack for further evaluation. Consider the following method:

  public static int plus(int a, int b) {
    int c = a + b;
    return c;
  }

Note: All byte code shown in this article is output from javap, e.g., javap -c -p MyClass. (Comments added for clarity.)

public static int plus(int, int);
 Code:
  stack=2, locals=3, arguments=2
   0: iload_0
      // load 'a' from slot 0, push onto stack
   1: iload_1
      // load 'b' from slot 1, push onto stack
   2: iadd
      // pop 2 integers, add them together,
      // and push the result
   3: istore_2
      // pop the result, store as 'c' in slot 2
   4: iload_2
      // load 'c' from slot 2, push onto stack
   5: ireturn
      // return the integer at the top of the stack
			
Listing 1

A method’s local variables (including arguments to the method) are stored in what the JVM refers to as the local variable array. We’ll refer to a value (or reference) stored in location #x in the local variable array as ‘slot #x’ for brevity (see JVM Specification §2.6.2 [JVMa]).

For instance methods, the value in slot #0 is always the this pointer. Then come the method arguments, from left to right, followed by any local variables declared within the method. In the example above, the method is static, so there is no this pointer; slot #0 holds parameter x, and slot #1 holds parameter y. The local variable sum resides in slot #2.

It’s interesting to note that each method has a max stack size and max local variable storage, both of which are determined at compile time.

One thing that’s immediately obvious from here, which you might not initially expect, is that the compiler made no attempt to optimise the code. In fact, javac almost never emits optimised bytecode. This has multiple benefits, including the ability to set breakpoints at all executable source lines: if we were to eliminate the redundant load/store operations, we’d lose that capability. Thus, most of the heavy lifting is performed at runtime by a just-in-time (JIT) compiler.

Decompiling

So, how can you take unstructured, stack-based byte code and translate it back into structured Java code? One of the first steps is usually to get rid of the operand stack, which we do by mapping stack values to variables and inserting the appropriate load and store operations.

A ‘stack variable’ is only assigned once, and consumed once. You might note that this will lead to a lot of redundant variables – more on this later! The decompiler may also reduce the byte code into an even simpler instruction set, but we won’t consider that here.

We'll use the notation s0 (etc.) to represent stack variables, and v0 to represent true local variables referenced in the original byte code (and stored in slots).

Bytecode Stack Variables Copy Propagation
0 iload_0 s0 = v0
1 iload_1 s1 = v1
2 iadd s2 = s0 + s1
3 istore_2 v2 = s2 v2 = v0 + v1
4 iload_2 s3 = v2
5 ireturn return s3 return v2

We get from bytecode to variables by assigning identifiers to every value pushed or popped, e.g., iadd pops two operands to add and pushes the result.

Restoring variable names
If variables are reduced to slot references in the byte code, how then do we restore the original variable names? It’s possible we can’t. To improve the debugging experience, the byte code for each method may include a special section called the local variable table. For each variable in the original source, there exists an entry that specifies the name, the slot number, and the bytecode range for which the name applies. This table (and other useful metadata) can be included in the javap disassembly by including the -l option. For our plus() method above, the table javap emits looks like this:
Start Length Slot Name Signature
0 6 0 a I
0 6 1 b I
4 2 2 c I

Here, we see that v2 refers to an integer variable originally named c at bytecode offsets #4–5.

For classes that have been compiled without local variable tables (or which had them stripped out by an obfuscator), we have to generate our own names. There are many strategies for doing this; a clever implementation might look at how a variable is used for hints on an appropriate name.

We then apply a technique called copy propagation to eliminate some of the redundant variables. Copy propagation is a form of inlining in which references to variables are simply replaced with the assigned value, provided the transformation is valid.

But what are the types?
When dealing with stack values, the JVM uses a more simplistic type system than Java source. Specifically, boolean, char, and short values are manipulated using the same instructions as int values. Thus, the comparison v0 != 0, could be interpreted as:
	v0 != false ? v1 : v2

or:

	v0 != 0 ? v1 : v2 
or even:
  v0 != false ? v1 == true : v2 == true 

and so on!

In this case, however, we are fortunate to know the exact type of v0, as it is contained within the method descriptor:

    descriptor: (ZII)I
    flags: ACC_PUBLIC, ACC_STATIC

This tells us the method signature is of the form:

    public static int plus(boolean, int, int)

We can also infer that v3 should be an int (as opposed to a boolean) because it is used as the return value, and the descriptor tells us the return type. We are then left with:

  v3 =  v0 ? v1 : v2
  return v3

As an aside, if v0 were a local variable (and not a formal parameter) then we might not know it represents a boolean value and not an int. Remember that local variable table we mentioned earlier, which tells us the original variable names? It also includes information about the variable types, so if your compiler is configured to emit debug information, we can look to that table for type hints. There is another table called LocalVariableTypeTable [JVMb] that contains similar information; the key difference is that a LocalVariableTypeTable may include detailed information about generic types, whereas a LocalVariableTable cannot. It’s worth noting that these tables are unverified metadata, so they are not necessarily trustworthy. A particularly devious obfuscator might choose to fill these tables with lies, and the resulting bytecode would still be valid! Use them at your own discretion.

What do we mean by ‘valid’? Well, there are some important restrictions. Consider the following:

  0: s0 = v1
  1: v1 = s4
  2: v2 = s0 <-- s0 cannot be replaced with v1

Here, if we were to replace s0 with v1, the behaviour would change, as the value of v1 changes after s0 is assigned, but before it is consumed. To avoid these complications, we only use copy propagation to inline variables that are assigned exactly once.

One way to enforce this might be to track all stores to non-stack variables, i.e., we know that v1 is assigned v10 at #0, and also v11 at #2. Since there is more than one assignment to v1, we cannot perform copy propagation.

Our original example, however, has no such complications, and we end up with a nice, concise result:

  v2 = v0 + v1
  return v2

Stack analysis

In the previous example, we could guarantee which value was on top of the stack at any given point, and so we could name s0, s1, and so on.

So far, dealing with variables has been pretty straightforward because we’ve only explored methods with a single code path. In a real world application, most methods aren’t going to be so accommodating. Each time you add a loop or condition to a method, you increase the number of possible paths a caller might take. Consider this modified version of our earlier example:

  public static int plus(boolean t, int a, int b) {
    int c = t ? a : b;
    return c;
  }

Now we have control flow to complicate things; if we try to perform the same assignments as before, we run into a problem.

Bytecode Stack Variables
0 iload_0 s0 = v0
1 ifeq 8 if (s0 == 0) goto #8
4 iload_1 s1 = v1
5 goto 9 goto #9
8 iload_2 s2 = v2
9 istore_3 v3 = {s1,s2}
10 iload_3 s4 = v3
11 ireturn return s4

We need to be a little smarter about how we assign our stack identifiers. It’s no longer sufficient to consider each instruction on its own; we need to keep track of how the stack looks at any given location, because there are multiple paths we might take to get there.

When we examine #9, we see that istore_3 pops a single value, but that value has two sources: it might have originated at either #5 or #8. The value at the top of the stack at #9 might be either s1 or s2, depending on whether we came from #5 or #8, respectively. Therefore, we need to consider these to be the same variable – we merge them, and all references to s1 or s2 become references to the unambiguous variable s{1,2}. After this ‘relabelling’, we can safely perform copy propagation.

After Relabelling After Copy Propagation
0 s0 = v0
1 if (s0 == 0) goto #8 if (v0 == 0) goto #8
4 s{1,2} = v1 s{1,2} = v1
5 goto #9 goto #9
8 s{1,2} = :v2 s{1,2} = v2
9 v3 = s{1,2} v3 = s{1,2}
10 s4 = v3
11 return s4 return v3

Notice the conditional branch at #1: if the value of s0 is zero, we jump to the else block; otherwise, we continue along the current path. It’s interesting to note that the test condition is negated when compared to the original source. We’ve now covered enough to dive into...

Conditional expressions

At this point, we can determine that our code could be modelled with a ternary operator (?:): we have a conditional, with each branch having a single assignment to the same stack variable s{1,2}, after which the two paths converge. Once we’ve identified this pattern, we can roll the ternary up straight away.

After Copy Prop. Collapse Ternary
0
1 if (v0 == 0) goto #8
4 s{1,2} = v1
5 goto 9
8 s{1,2} = v2
9 v3 = s{1,2} v3 = v0 != 0 ? v1 : v2
10
11 return v3 return v3

Note that, as part of the transformation, we negated the condition at #9. It turns out that javac is fairly predictable in how it negates conditions, so we can more closely match the original source if we flip those conditions back.

Short-Circuit Operators ('&&' and '||')

  public static boolean fn(boolean a,
                           boolean b, boolean c){
    return a || b && c;
  }

How could anything be simpler? The bytecode is, unfortunately, a bit of a pain...

Bytecode Stack Variables After Copy Propagation
0 iload_0 s0 = v0
1 ifne #12 if (s0 != 0) goto #12 if (v0 != 0) goto #12
4 iload_1 s1 = v1
5 ifeq #16 if (s1 == 0) goto #16 if (v1 == 0) goto #16
8 iload_2 s2 = v2
9 ifeq #16 if (s2 == 0) goto #16 if (v2 == 0) goto #16
12 iconst_1 s3 = 1 s{3,4} = 1
13 goto #17 goto 17 goto 17
16 iconst_0 s4 = 0 s{3,4} = 0
17 ireturn return s{3,4} return s{3,4}

The ireturn instruction at #17 could return either s3 or s4, depending on what path is taken. We alias them as described above, and then we perform copy propagation to eliminate s0, s1, and s2.

We are left with three consecutive conditionals at #1, #5 and #7. As we mentioned earlier, conditional branches either jump or fall through to the next instruction.

The bytecode above contains sequences of conditional branches that follow specific and very useful patterns:

Conditional Conjunction (&&) Conditional Disjunction (||)
T1: 

  if (c1) goto L1
  if (c2) goto L2
L1:
  ...

Becomes:

  if (!c1 && c2) goto L2
L2:
  ...
T1:
  if (c1) goto L2
  if (c2) goto L2
L1:
  ...

Becomes:

  if (c1 || c2) goto L2
L1:
  ...

If we consider neighbouring pairs of conditionals above, #1...#5 do not conform to either of these patterns, but #5...#9 is a conditional disjunction (||), so we apply the appropriate transformation:

    1:  if (v0 != 0) goto #12
   5:  if (v1 == 0 || v2 == 0) goto #16
  12:  s{3,4} = 1
  13:  goto #17
  16:  s{3,4} = 0
  17:  return s{3,4}

Note that every transform we apply might create opportunities to perform additional transforms. In this case, applying the || transform restructured our conditions, and now #1...#5 conform to the && pattern! Thus, we can further simplify the method by combining these lines into a single conditional branch:

   1:  if (v0 == 0 && (v1 == 0 || v2 == 0)) goto #16
  12:  s{3,4} = 1
  13:  goto #17
  16:  s{3,4} = 0
  17:  return s{3,4}

Does this look familiar? It should: this bytecode now conforms to the ternary (?:) operator pattern we covered earlier. We can reduce #1...#16 to a single expression, then use copy propagation to inline s{3,4} into the return statement at #17:

  return (v0 == 0 && (v1 == 0 || v2 == 0)) ? 0 : 1;

Using the method descriptor and local variable type tables described earlier, we can infer all the types necessary to reduce this expression to:

  return (v0 == false && (
          v1 == false || v2 == false))
          ? false : true;

Well, this is certainly more concise than our original decompilation, but it’s still pretty jarring. Let’s see what we can do about that. We can start by collapsing comparisons like x == true and x == false to x and !x, respectively. We can also eliminate the ternary operator by reducing x ? false : true as the simple expression !x.

  return !(!v0 && (!v1 || !v2));

Better, but it’s still a bit of a handful. If you remember your high school discrete maths, you can see that De Morgan’s theorem can be applied here:

  !(a || b) --> (!a) && (!b)
  !(a && b) --> (!a) || (!b)

And thus:

  return ! ( !v0 && ( !v1 || !v2 ) )

becomes:

  return ( v0 || !(!v1 || !v2 ) )

and eventually:

  return ( v0 || (v1 && v2) )

Hurrah!

Dealing with method calls

We’ve already seen what it looks like for a method to be called: arguments ‘arrive’ in the locals array. To call a method, arguments must be pushed onto the stack, following a this pointer in the case of instance methods). Calling a method in bytecode is as you’d expect:

  push arg_0
  push arg_1
  invokevirtual METHODREF

We’ve specified invokevirtual above, which is the instruction used to invoke most instance methods. The JVM actually has a handful of instructions used for method calls, each with different semantics:

  1. invokeinterface invokes interface methods.
  2. invokevirtual invokes instance methods using virtual semantics, i.e., the call is dispatched to the proper override based on the runtime type of the target.
  3. invokespecial invokes a specific instance method (without virtual semantics); it is most commonly used for invoking constructors, but is also used for calls like super.method().
  4. invokestatic invokes static methods.
  5. invokedynamic is the least common (in Java), and it uses a ‘bootstrap’ method to invoke a custom call site binder. It was created to improve support for dynamic languages, and has been used in Java 8 to implement lambdas.

The important detail for a decompiler writer is that the class’s constant pool [JVMc] contains details for any method called, including the number and type of its arguments, as well as its return type. Recording this information in the caller class allows the runtime to verify that the intended method exists at runtime, and that it conforms to the expected signature. If the target method is in third party code, and its signature changes, then any code that attempts to call the old version will throw an error (as opposed to producing undefined behaviour).

Going back to our example above, the presence of the invokevirtual opcode tells us that the target method is an instance method, and therefore requires a this pointer as an implicit first argument. The METHODREF in the constant pool tells us that the method has one formal parameter, so we know we need to pop an argument off of the stack in addition to the target instance pointer. We can then rewrite the code as:

  arg_0.METHODREF(arg_1)

Of course, bytecode isn’t always so friendly; there’s no requirement that stack arguments be pushed neatly onto the stack, one after the other. If, for example, one of the arguments was a ternary expression, then there would be intermediate load, store, and branch instructions that will need to be transformed independently. Obfuscators might rewrite methods into a particularly convoluted sequence of instructions. A good decompiler will need to be flexible enough to handle many interesting edge cases that are beyond the scope of this article.

There has to be more to it than this...

So far, we’ve limited ourselves to analysing a single sequence of code, beginning with a list of simple instructions and applying transformations that produce more familiar, higher-level constructs. If you are thinking that this seems a little too simplistic, then you are correct. Java is a highly structured language with concepts like scopes and blocks, as well as more complicated control flow mechanisms. In order to deal with constructs like if/else blocks and loops, we need to perform a more rigorous analysis of the code, paying special attention to the various paths that might be taken. This is called control flow analysis.

We begin by breaking down our code into contiguous blocks that are guaranteed to execute from beginning to end. These are called basic blocks, and we construct them by splitting up our instruction list along places where we might jump to another block, as well as any instructions which might be jump targets themselves.

We then build up a control flow graph (CFG) by creating edges between the blocks to represent all possible branches. Note that these edges may not be explicit branches; blocks containing instructions that might throw exceptions will be connected to their respective exception handlers. We won’t go into detail about constructing CFGs, but some high level knowledge is required to understand how we use these graphs to detect code constructs like loops.

Figure 1 shows some examples of control flow graphs.

Figure 1

The aspects of control flow graphs that we are most interested in are domination relationships:

Basic loops and control flow

Consider the following Java method:

  public static void fn(int n) {
    for (int i = 0; i < n; ++i) {
      System.out.println(i);
    }
  }

and its disassembly:

 0:  iconst_0
 1:  istore_1
 2:  iload_1
 3:  iload_0
 4:  if_icmpge 20
 7:  getstatic #2  // System.out:PrintStream
10:  iload_1
11:  invokevirtual #3  // PrintStream.println:(I)V
14:  iinc 1, 1
17:  goto 2
20:  return

Let’s apply what we discussed above to convert this into a more readable form, first by introducing stack variables, then performing copy propagation.

Bytecode Stack Variables After Copy Propagation
0</CELL> iconst_0 s0 = 0
1 istore_1 v1 = s0 v1 = 0
2 iload_1 s2 = v1
3 iload_0 s3 = v0
4 if_icmpge 20 if (s2 >= s3) goto 20 if (v1 >= v0) goto 20
7 getstatic #2 s4 = System.out
10 iload_1 s5 = v1
11 invokevirtual #3 s4.println(s5) System.out.println(v1)
14 iinc 1, 1 v1 = v1 + 1 v1 = v1 + 1
17 goto 2 goto 2 goto 4
20 return return return

Notice how the conditional branch at #4 and the goto at #17 create a logical loop. We can see this more easily by looking at a control flow graph (Figure 2).

Figure 2

From the graph, it’s obvious that we have a neat cycle, with an edge from the goto back to a conditional branch. In this case, the conditional branch is referred to as a loop header, which can be defined as a dominator with a loop-forming backward edge. A loop header dominates all nodes within the loop’s body.

We can determine whether a condition is a loop header by looking for a loop-forming back edge, but how do we do that? A simple solution is to test whether the condition node is in its own dominance frontier. Once we know we have a loop header, we have to figure out which nodes to pull into the loop body; we can do that by finding all nodes dominated by the header. In pseudocode, our algorithm would look something like Listing 2.

findDominatedNodes(header)
    q := new Queue()
    r := new Set()

    q.enqueue(header)

    while (not q.empty())
        n := q.dequeue()

        if (header.dominates(n))
            r.add(n)

            for (s in n.successors())
                q.enqueue(n)

    return r
			
Listing 2

Once we’ve figured out the loop body, we can transform our code into a loop. Keep in mind that our loop header may be a conditional jump out of the loop, in which case we need to negate the condition.

  v1 = 0
  while (v1 < v0) {
    System.out.println(v1)
    v1 = v1 + 1
  }
  return

And voilà, we have a simple pre-condition loop! Most loops, including while, for and for-each, compile down to the same basic pattern, which we detect as a simple while loop. There’s no way to know for sure what kind of loop the programmer originally wrote, but for and for-each follow pretty specific patterns that we can look for. We won’t go into the details, but if you look at the while loop above, you can see how the original for loop’s initializer (v1 = 0) precedes the loop, and its iterator (v1 = v1 + 1) has been inserted at the end of the loop body. We’ll leave it as an exercise to think about when and how one might transform while loops into for or for-each. It’s also interesting to think about how we might have to adjust our logic to detect post-conditional (do/while) loops.

We can apply a similar technique to decompile if/else statements. The bytecode pattern is pretty straightforward:

  begin:
    iftrue(!condition) goto #else
    // 'if' block begins here
    ...
    goto #end
  else:
    // 'else' block begins here
    ...   
  end:
    // end of 'if/else'

Here, we use iftrue as a pseudo-instruction to represent any conditional branch: test a condition, and if it passes, then branch; otherwise, continue. We know that the if block starts at the instruction following the condition, and the else block starts at the condition’s jump target. Finding the contents of those blocks is as simple as finding the nodes dominated by those starting nodes, which we can do using the same algorithm we described above.

We’ve now covered basic control flow mechanisms, and while there are others (e.g., exception handlers and subroutines), those are beyond the scope of this introductory article.

Wrap-up

Writing a decompiler is no simple task, and the experience easily translates into a book’s worth of material – perhaps even a series of books! Obviously, we could not cover everything in a single article, and you probably wouldn’t want to read it if we’d tried. We hope that by touching on the most common constructs – logical operators, conditionals, and basic control flow – we have given you an interesting glimpse into the world of decompiler development.

Now go write your own! :)

References

This article originally appeared as a Blog post in the Java Advent Calendar, and is released under the Creative Commons 3.0 Attribution license.

(In general, just read the JVM spec end to end!)

[JVMa] JVM Specification §2.6.2 – http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html#jvms-2.6.2

[JVMb] JVM Specification §4.7.14 – http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.14

[JVMc] JVM Specification §4.4 – http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4

Notes: 

More fields may be available via dynamicdata ..