Fixing a Memory Corruption Bug in Go

A tale of two gophers

Purego started as a small pipe dream of an idea but has now grown to 1.6K⭐️ on GitHub, with many contributors, including names like Netgate. I'm excited to see the future of this project. It has evolved and grown as I have learned more about low-level programming. However, while testing a new release of Ebitengine (which uses Purego) on iOS 17, an issue was discovered:

objc[700]: autorelease pool page 0x103560000 corrupted
  magic     0x035647c0 0x00000001 0x03564800 0x00000001
  should be 0xa1a1a1a1 0x4f545541 0x454c4552 0x21455341
  pthread   0x103564880
  should be 0x16d1c3000

So... yeah, corruption is not good. This looked very similar to an issue reported for the macOS Sonoma beta while trying to use Oto (ebitengine/oto#220). It was originally chalked up to beta software. But now that a released version of iOS was showing the same kind of problem, it was time to find a fix.
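
As an aside, the "should be" words in that log are not arbitrary. Read as little-endian bytes, the last three spell out an ASCII string, which makes it easier to see that the page header had been overwritten with unrelated data. A minimal sketch that decodes them (the helper name decodeMagic is made up for this illustration):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeMagic is a hypothetical helper: it lays out each 32-bit word
// as little-endian bytes and returns the result as a string.
func decodeMagic(words []uint32) string {
	buf := make([]byte, 4*len(words))
	for i, w := range words {
		binary.LittleEndian.PutUint32(buf[4*i:], w)
	}
	return string(buf)
}

func main() {
	// The last three "should be" words from the log above.
	expected := []uint32{0x4f545541, 0x454c4552, 0x21455341}
	fmt.Println(decodeMagic(expected)) // prints AUTORELEASE!
}
```

The words actually found in the header (0x035647c0, 0x00000001, ...) do not decode to anything like that.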

During this time user @jwijenbergh was working on getting Linux callback support working. I had tried months ago and only succeeded in getting amd64 working. The qsort test would always fail on arm64. I suspected that this iOS issue was the same thing that was blocking Linux arm64 callbacks from working but I didn't have proof yet.

My action plan was to figure out the Linux issue in the hope that it would also solve the iOS one; both are arm64, so I was hopeful. I used Purego's qsort callback test for all my experiments. It was the only test that failed while porting callbacks to Linux, so it was a good candidate for investigation.

The first important thing to notice was that the Linux test failed with both CGO_ENABLED=1 and CGO_ENABLED=0. These environment flags affect which part of Purego gets built into the binary. So if the issue occurred either way, then I was pretty confident that my problem was not inside the internal/fakecgo package.

Here's the test I was working with.

func Test_qsort(t *testing.T) {
    // ...library loading removed for brevity
    data := []int{88, 56, 100, 2, 25}
    sorted := []int{2, 25, 56, 88, 100}
    compare := func(a, b *int) int {
        return *a - *b
    }
    var qsort func(data []int, nitms uintptr, size uintptr, compar func(a, b *int) int)
    purego.RegisterLibFunc(&qsort, libc, "qsort")
    qsort(data, uintptr(len(data)), unsafe.Sizeof(int(0)), compare)
    for i := range data {
        if data[i] != sorted[i] {
            t.Errorf("got %d wanted %d at %d", data[i], sorted[i], i)
        }
    }
}

I needed to determine if the problem was in the calling convention code (RegisterLibFunc) or the callback code (the compare function). I did this by replacing the compare Go function with one written in C.

//int compare(const void * a, const void * b) {
//   return ( *(int*)a - *(int*)b );
//}
import "C"

func Test_qsort(t *testing.T) {
    // ...
    var qsort func(data []int, nitms uintptr, size uintptr, compar uintptr)
    purego.RegisterLibFunc(&qsort, libc, "qsort")
    // C.compare evaluates to an unsafe.Pointer to the C function, so convert it to uintptr.
    qsort(data, uintptr(len(data)), unsafe.Sizeof(int(0)), uintptr(C.compare))
    // ...
}

This code passed on my Linux arm64 machine! Since RegisterLibFunc worked fine with a C comparator, the problem had to be in the callback code, the part that lets C call back into Go. I was a little nervous now because this code is entirely assembly and far from trivial.

I started by removing all the parts that weren't necessary to make the qsort test pass. This included floating point registers and all the general registers after the first two. Each time I made a change I tested on my macOS M1 machine to make sure it still passed and then tried the same code on Linux. I removed as much as I could while still passing on macOS but the code still failed on Linux. I finally decided to nuke the code and write the callback entirely in assembly.

TEXT callbackasm1(SB), NOSPLIT|NOFRAME, $0    
    MOVD    (R0), R2   // *x -> R2
    MOVD    (R1), R1   // *y -> R1
    SUB     R1, R2, R0 // R2 - R1 -> R0
    RET

This code passed on Linux! 🥳 However, I wanted to write Go code, not assembly, so I still had to backtrack and find the root cause. Since I now had a passing test, I added the lines of the original code back one at a time until the test stopped working again. Strangely, the test stopped passing on this line.

TEXT callbackasm1(SB), NOSPLIT|NOFRAME, $0
    //...
    MOVD ·callbackWrap_call(SB), R0 // <----

If you can't read assembly, all this line does is load the value stored in the variable callbackWrap_call into R0. Now that's very strange. How can a load into a scratch register cause memory corruption?

I used a tool called Lensm. It shows the source code alongside the disassembled instructions that the machine actually executes. I did this because Go's assembler is a pseudo-assembler. This means that an instruction in the source does not necessarily map one-to-one to what the machine will run, so there can be differences (see The Design of the Go Assembler).

After looking at the disassembly, I noticed something quite odd. What is this R27 register doing here? Looking at the calling convention for arm64 (AAPCS64), it states that R27 is a callee-saved register. This means that any function you call must leave R27 holding the same value when it returns as it held on entry. I confirmed this was the problem by putting 0xdeadbeef into R27, which caused a segfault at that address. I had found the culprit!

But why did Go do this? I don't know for sure, yet. The thing is, Go's ABI is different from the system one. In Go, the calling function is required to save most registers instead of the callee. So it would always be legal to overwrite the value of R27 in Go, but this assembly code is being called by C, so it's important not to change the value. I think the assembler should have only used the R0 register and not introduced this other register. But maybe there's a good reason, so if anyone knows please tell me.

After all this searching, all I had to do now was save R27 before the move instruction and restore it before returning to the caller. Doing this made the Linux arm64 test pass. @HajimeHoshi went ahead and tested this code on iOS 17 and it indeed fixed the problem. Thanks to HajimeHoshi and jwijenbergh for testing and experimenting - they are my two superhero gophers!

As of commit #163, Purego now supports Linux callbacks on amd64 and arm64!


If you enjoyed this article please consider supporting me by donating here on Hashnode or GitHub.
