Lab-3: Binary Mystery
For this lab, you need to use a native x86 Linux machine rather than a docker container. First, follow the instructions below to log into such a Linux server maintained by CIMS. Next, clone the lab's repository on this CIMS Linux server. Finally, follow the remaining instructions to do the lab on the CIMS Linux server.
Log into a CIMS Linux server from your laptop
You will do the lab on a CIMS server called snappy2.cims.nyu.edu, which you access via a gateway host called access.cims.nyu.edu. To do so, open a terminal on your laptop (if you use Windows, type wsl in PowerShell).
$ ssh <netid>@access.cims.nyu.edu
[Enter your password. Then select 1 or 2 for further authentication]
Continue to log into snappy2 by typing
[netid@access2 ~]$ ssh snappy2
Cloning the lab repository
First, click on Lab3's github classroom invitation link (posted on Campuswire) and select your NYU netid. Next, clone your repo by typing the following
snappy2$ mkdir -p cso-labs
snappy2$ cd cso-labs
snappy2$ git clone git@github.com:nyu-cso-sp26/lab3-<YourGithubUsername>.git lab3
Do lab3 by uncovering the mystery of x86 assembly
The lab contains two parts. In part-1, you reconstruct semantically equivalent C functions based on their assembly and execution. In part-2, you perform a control flow hijacking attack by exploiting a buffer overflow vulnerability in the given program.
Part-1: Reconstruct equivalent C code from assembly
In this part, you are to reconstruct five C functions by examining their corresponding machine code and running the executable in gdb.
The five C functions you are to reconstruct are named ex[1-5](...). We have compiled their C implementation into object files ex[1-5]_sol.o and withheld the source code. We have also implemented a tester that tests these five functions using various inputs. The tester's binary tester_sol is also given to you. If you make and run tester_sol, you should see that it passes all tests, as expected.
Figure out what each function ex[1-5] does according to the hints, and write the corresponding C code in ex[1-5].c.
No goto's
For this lab, the only files that you should modify are ex[1-5].c. Furthermore, your implementation should not contain any goto statements.
Test your solution
After you've finished each function (remember to remove the assert(0) statement), you compile and create a new tester that links to your implementation ex[1-5].o instead of ex[1-5]_sol.o. You can check whether your reconstructed functions are correct or not by running this new tester:
$ make
$ ./tester
Testing ex1... ex1 passed
Testing ex2... ex2 passed
Testing ex3... ex3 passed
Testing ex4... ex4 passed
Testing ex5... ex5 passed
The above output occurs when all your ex{1-5} functions pass the test.
To test multiple times, run ./tester with the -r option. This runs the tester using a new seed for its random number generator.
Some of you might want to skip around and implement the five ex* functions in arbitrary order. This is a good strategy if you are stuck on some function. To test just ex2, type ./tester -t 2. Ditto with other functions.
Note: Passing the test does not guarantee that your implementation is correct. During grading, we may manually examine your source code to determine its correctness.
Part-1 Hints
Suppose you set out to figure out what function ex1 (implemented in ex1_sol.o) does. There are two approaches to do this. You should use them both to help uncover the mystery.
- Approach 1:
Disassemble the object files. Read the assembly to get some initial understanding of what the function tries to achieve. To disassemble ex1_sol.o, type:
$ objdump -d ex1_sol.o
- Approach 2: (You must master this to do well in test-2)
Run the function ex1 in gdb.
You might wonder how to run ex1 in gdb, since not knowing the function signature of ex1 makes it hard to write your own C code to correctly invoke ex1. How to run ex1 then? It turns out that you can utilize the tester that we have given you to run ex1 and observe how it executes.
To run the tester with the given ex1 function, you need to link the tester's object file tester.o together with the given ex{1-5}_sol.o files. We have made this step easy by including appropriate Makefile rules. When you type make, you will see that it generates two binary executables: tester and tester_sol. The executable file tester links tester.o with object files ex{1-5}.o, which are generated from your ex{1-5}.c files. The executable file tester_sol links tester.o with the given object files ex{1-5}_sol.o. Thus, when you run ./tester_sol, the tester invokes the given functions and, needless to say, all tests should pass.
Run gdb by typing gdb -x ./.gdbinit tester_sol. The command-line option -x ./.gdbinit makes gdb load its configuration from the file ./.gdbinit, which we have given to you in the lab's repository. In general, your workflow in gdb will usually proceed as follows:
- Set a breakpoint to stop the execution whenever the function ex1 is invoked. (gdb) b ex1
- Run the program until the breakpoint is triggered. (gdb) r
- Disassemble the function under investigation. (gdb) disass ex1
- Execute the instructions one by one. (gdb) nexti
- Form a hypothesis on what the function signature is.
Does the function take any arguments? If so, how many? Suppose you want to know if ex1 has at least one argument. We know a function's first argument is stored in register %rdi if there is a first argument. You can examine the function body of ex1 to see if it reads register %rdi without writing to it first. If so, ex1 must have a first argument. You can also examine the instructions leading up to the call of ex1. Did the caller write to %rdi before the call instruction to set up the argument? If so, that is also an indication that ex1 has a first argument.
What is the type of each function argument? Is it an integer or a pointer (aka memory address)? Suppose you want to deduce the type of ex1's first argument. You can examine the content of register %rdi while the execution is inside ex1. Type (gdb) info registers, which gives you the content of all registers. You can also print the content of a specific register with (gdb) p /x $rdi (note that register names start with '$' in gdb instead of '%', and the /x option prints the value in hex). If you hypothesize that the argument is a memory address, you can examine the memory contents starting from that address with (gdb) x/20xb $rdi (this prints out 20 bytes in hex starting from the memory address contained in %rdi).
- Execute the function body instruction by instruction and figure out what it does. Verify your hypothesis during execution by examining register values and memory contents.
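To make the register reasoning in the workflow above concrete, here is a minimal sketch of a hypothetical function with one pointer argument and one integer argument. The name sum_bytes and its signature are illustrative assumptions, not one of the lab's ex functions. Under the x86-64 Linux calling convention, the first two integer or pointer arguments arrive in %rdi and %rsi and the return value leaves in %rax, so inside such a function you would expect (gdb) x/20xb $rdi to show readable memory and (gdb) p $rsi to show a count.

```c
#include <stddef.h>

/* Hypothetical example: NOT one of the lab's ex[1-5] functions.
 * buf arrives in %rdi (a pointer), n arrives in %rsi (an integer),
 * and the result is returned in %rax. */
long sum_bytes(const unsigned char *buf, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += buf[i];   /* a 1-byte load from the address held in %rdi */
    }
    return total;
}
```

If the disassembly of a function reads %rdi only through memory operands such as (%rdi), that is a strong hint that the first argument is a pointer, as buf is here.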
Do not try to match assembly
It is not the right approach to try to match the object code of your C function line-by-line to that contained in ex{1-5}_sol.o. Doing so is painful and unnecessary. Differences in compiler versions, compilation flags, and small differences in C code all result in different object code, although they do not affect the code's semantics. Therefore, trying to find a C function that generates the same object code is likely futile.
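As an illustration of why matching object code is the wrong goal, the sketch below shows two hypothetical functions (the names are made up for this example) that return the same result for every input yet are written differently. A compiler may well emit different instructions for each, and either one would be an acceptable reconstruction of the other.

```c
/* Two hypothetical functions with the same semantics: both return
 * 1 + 2 + ... + n, or 0 when n <= 0. */
int sum_to_n_for(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;        /* counts upward from 1 to n */
    }
    return total;
}

int sum_to_n_while(int n)
{
    int total = 0;
    while (n > 0) {
        total += n;        /* counts downward from n to 1 */
        n--;
    }
    return total;
}
```

Neither version uses goto: the conditional jumps you see in the assembly correspond to ordinary loops and if statements in C.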
Explanations on some unfamiliar assembly and others
For this lab, you need to review the lecture notes and textbook to refresh your understanding of x86 assembly. Below is some additional information not covered in the lecture notes that is also helpful for this lab.
- One can explicitly refer to the lower-order bits of a register. The names that you may find used in this lab are:
register : names that refer to its lower-order portions
%rax : %eax (lower 32 bits), %ax (lower 16 bits), %al (lower 8 bits)
%rcx : %ecx (lower 32 bits), %cx (lower 16 bits), %cl (lower 8 bits)
%rdx : %edx (lower 32 bits), %dx (lower 16 bits), %dl (lower 8 bits)
%rbx : %ebx (lower 32 bits), %bx (lower 16 bits), %bl (lower 8 bits)
%r8 : %r8d (lower 32 bits), %r8w (lower 16 bits), %r8b (lower 8 bits)
...
%r15 : %r15d (lower 32 bits), %r15w (lower 16 bits), %r15b (lower 8 bits)
Note: For some reason, gdb does not recognize %r8b as a valid register name. Please just print register %r8 and manually extract its lower 8 bits to obtain the value of %r8b.
- Often in the disassembled output, you encounter instructions whose mnemonic has no size suffix, for example mov instead of movl or movq (where l or q is the suffix). In these cases, treat the missing suffix as the one that corresponds to the size of the destination register operand. For example, mov $1, %ebx is equivalent to movl $1, %ebx and mov %rax, %rbx is equivalent to movq %rax, %rbx.
- Instruction movzbl moves the 1-byte source operand (the b suffix) to the 4-byte destination operand (the l suffix) with zero extension.
- Instruction movslq moves the 4-byte source operand to the 8-byte destination operand with sign extension. That is, if the source operand is negative in two's complement (i.e. has 1 in its most significant bit), then the instruction pads with 1s (i.e. fills the most significant 4 bytes with 1s). There are more details on zero extension and sign extension on pages 184-185 of the textbook.
- The two-byte instruction "repz retq" behaves identically to the one-byte instruction retq.
- If you disassemble an object file (e.g. "objdump -d ex1_sol.o"), you should not expect valid addresses for functions, because linking has not yet happened. If you want to see valid function addresses (i.e. those that appear as the operand of the call instruction), disassemble the binary executable (tester or tester_sol) or disassemble in gdb.
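The sub-register names and the movzbl and movslq bullets in the list above correspond to ordinary C conversions between integer types of different widths. The sketch below uses hypothetical helper names chosen for this example. The instructions named in the comments are what gcc commonly emits on x86-64 Linux, but a different compiler version or flag may choose others (for example cltq instead of movslq).

```c
/* Hypothetical helpers illustrating width conversions on x86-64 Linux,
 * where long is 8 bytes and int is 4 bytes. */

/* Keeps only the low 8 bits, like reading %al instead of %rax. */
unsigned char low_byte(unsigned long x)
{
    return (unsigned char)x;
}

/* Unsigned 1-byte to 4-byte: zero extension (commonly movzbl). */
unsigned int zero_extend_byte(unsigned char b)
{
    return b;
}

/* Signed 4-byte to 8-byte: sign extension (commonly movslq or cltq). */
long sign_extend_int(int x)
{
    return x;
}
```

When you see one of these instructions in a disassembly, it is a hint about the C types involved: zero extension suggests an unsigned source, and sign extension suggests a signed one.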
For those of you who want to go out in the world to explore other object files, you will find the official Intel instruction set manual useful. Note that in the Intel manual, the source and destination operands are reversed in an instruction (i.e. destination operand first, source operand last). In the lecture notes and gdb/objdump's disassembled output, the destination operand appears last in an instruction. These differences are due to two assembly syntaxes, AT&T syntax and Intel syntax. The GNU software (gcc, gdb etc.) and the lecture notes use AT&T syntax, which puts the destination operand last, and the Intel manual (of course) uses Intel syntax, which puts the destination operand first.
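To see the operand order difference side by side, consider the hypothetical one-line function below (store_long is a name made up for this example). The comment shows the store instruction a compiler would typically emit for its body, written once in each syntax. The exact output depends on the compiler and flags.

```c
/* Hypothetical example function. On x86-64 Linux, dst arrives in %rdi
 * and src arrives in %rsi, so the body is typically a single store:
 *
 *   AT&T syntax  (destination last):   mov    %rsi,(%rdi)
 *   Intel syntax (destination first):  mov    QWORD PTR [rdi],rsi
 */
void store_long(long *dst, long src)
{
    *dst = src;
}
```

If you ever need to compare against the Intel manual directly, objdump accepts the option -M intel and gdb accepts the command set disassembly-flavor intel to print Intel syntax.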
Part-2: Buffer Overflow Hack
You are given a program overflow whose source code has been withheld. This program takes a single command line input and prints some output. Your job is to craft a malicious input such that the program will end up invoking a function called success that it is never supposed to invoke under its normal control flow.
void success()
{
    printf("you successfully hijacked overflow's control flow\n");
    exit(0);
}
Put your malicious argument in a file called bad_arg and run program overflow. If you are successful, you should see the following:
snappy2$ cat bad_arg |xargs ./overflow
you successfully hijacked overflow's control flow
In the above command, cat bad_arg prints the content of file bad_arg to stdout which is connected via a pipe | to the stdin of command xargs ./overflow. xargs reads stuff from stdin and uses them as arguments for command overflow.
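Because the source code of overflow is withheld, the following is only a sketch of the general kind of bug that a buffer overflow attack exploits. Everything in it is an assumption made for illustration: the function name copy_arg, the 16-byte buffer, and the use of strcpy may not match the real program, whose layout you must discover yourself from its disassembly and in gdb.

```c
#include <string.h>

/* Hypothetical vulnerable pattern: NOT the actual source of overflow.
 * strcpy copies until the terminating '\0' without checking the size of
 * buf, so an argument of 16 or more characters writes past the end of
 * buf and can overwrite data stored above it on the stack, such as the
 * saved return address. */
size_t copy_arg(const char *arg)
{
    char buf[16];
    strcpy(buf, arg);   /* safe only when strlen(arg) <= 15 */
    return strlen(buf);
}
```

Calling copy_arg with a short string such as "hello" is well defined. Longer inputs are undefined behavior in C, and that undefined behavior is what a control flow hijacking attack relies on.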
