Saturday, May 18, 2013

The headaches of NDK debugging

Short story: if you're sharing libraries between different NDK projects and you're installing some of those libraries in /system/lib, and you're seeing weird stack-trace truncation (see below), make sure that

  1. You're installing the stripped version of the library in /system/lib, matching the one installed with your APK; and
  2. You really only have one version of the library installed, and it matches your app's target ABI.
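As a rough way to check the second point, you can copy every installed instance of the library off the device (for example with adb pull) and compare the copies byte for byte. The following is a minimal sketch, assuming the files are already on your workstation; the function names and the file paths in the usage comment are hypothetical, not part of any NDK tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_digest(paths):
    """Group file paths by the SHA-256 digest of their contents."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return dict(groups)


def all_identical(paths):
    """Return True if every file in paths has the same contents."""
    return len(group_by_digest(paths)) <= 1


# Hypothetical usage, with copies pulled from the device and from the build:
#   all_identical(["pulled/system_lib/libbroken.so",
#                  "libs/armeabi-v7a/libbroken.so"])
```

If the copies fall into more than one group, at least two different builds of the library are in play, which is the situation the checklist above is meant to rule out.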

Longer story follows.

More times than I can count, I've been stymied by an almost useless crash message such as:

I/DEBUG ( 9198): pid: 25709, tid: 25717 >>> edu.umich.bdh.broken_app <<<
I/DEBUG ( 9198): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 00000020
<snip>
I/DEBUG ( 9198): #00 pc 000294ce /system/lib/libbroken.so
(silence)
(memory dump in hex)

…and that's it -- the whole backtrace.  That's not a stack.  A stack implies layers; none exist there. If I was lucky, the ndk-stack tool would give me line number information, but that was hit or miss.
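For what it's worth, even a single frame like that carries the two things a symbolizer needs: a pc value and a library path. Below is a small sketch of pulling them out of a log line and building an addr2line invocation against an unstripped copy of the library. It assumes the pc is an offset within the library, which is my understanding of this log format; the function names, the unstripped library path, and the bare `addr2line` tool name are placeholders (the NDK ships its own toolchain-prefixed binary).

```python
import re

# Matches the frame part of a line such as:
#   I/DEBUG ( 9198): #00 pc 000294ce /system/lib/libbroken.so
FRAME_RE = re.compile(r"#(\d+)\s+pc\s+([0-9a-fA-F]+)\s+(\S+)")


def parse_frame(line):
    """Return (frame_number, pc, library_path), or None if no frame is found."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16), m.group(3)


def addr2line_command(pc, unstripped_lib, tool="addr2line"):
    """Build (but do not run) an addr2line command line for one address.

    -f prints the function name, -C demangles C++ names, and -e names
    the unstripped binary to look the address up in.
    """
    return [tool, "-f", "-C", "-e", unstripped_lib, hex(pc)]
```

The resulting list could be handed to `subprocess.run`; the answer is only meaningful if the unstripped file comes from the same build as the stripped one on the device.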

If you're looking carefully, though, you may have noticed something clearly wrong about this "trace" -- the location of my broken library. It's sitting in /system/lib, but if I'm really using the NDK (which I am), it should be somewhere in /data/.../libs, because that's where NDK libraries go when you install the APK that contains them.

This betrays the fact that I am clearly doing something strange with the NDK -- which is (1) true; and (2) the focus of the rest of this post. Let me explain.

When I started porting Intentional Networking to Android, the NDK was in revision 3, and it lacked support for something that I relied on -- namely, C++ exceptions. At that time I was using the CrystaX version of the NDK, which did support exceptions. Besides this, I was also using it in a strange way. I've always had Intentional Networking implemented as a system-wide shared library, as a lot of my tests and tools were based on command-line tools. So, when I started, I actually used the AOSP's build system instead of the NDK, but I pointed it at the gcc binaries from the CrystaX NDK -- or something like that. (I swear I'm not making this up, but the details are a bit hazy.)

In this state, debugging native code was a huge pain. The ndk-gdb script didn't exist yet, so I found, through Google, the arcane sequences of GDB commands that set the search paths for debug symbols so that GDB was somewhat useful. The only problem was that the build system generated stripped libraries by default -- not so useful for debugging. At this point, I assumed that I had to grab the non-stripped version and install that instead. Later, when the Google NDK added support for the C++ features that I needed, I switched to using it, and I carried over this practice of installing unstripped libraries.

Trouble was, I still saw nonsense like the above when not running GDB. More than that, today I saw a lovely stack trace starting at a non-executable line of my code and disappearing up into garbage. (In this case, the stack wasn't corrupted; it was just a garden-variety segfault that I couldn't trace back at all.)

What I hadn't realized all this time is that ndk-gdb expects the stripped version of the library to be installed on the device, while ndk-build also generates a non-stripped version, which ndk-gdb uses to find the symbols. Remembering the crazy hackery I'd perpetrated upon the recommended NDK development steps, I decided to change my build to install the stripped version. I also noticed that GDB was showing a (truncated) backtrace ending in libc.so. The strange thing was that its path contained "armeabi", but I thought I had built my library for armeabi-v7a (for some floating-point optimizations).
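If you're not sure which flavor of a library you're holding, one rough heuristic is to look for a full symbol table section in the ELF file: stripping normally removes it and leaves only the dynamic symbol table. The sketch below checks for that in a 32-bit little-endian ELF image (which is what these ARM builds are). The function name is my own, and the check says nothing about whether DWARF debug sections are present, so treat it as a first-pass test only.

```python
import struct

SHT_SYMTAB = 2  # section type of the full (non-dynamic) symbol table


def has_symtab(data):
    """Return True if a 32-bit little-endian ELF image has a SHT_SYMTAB section."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 1 or data[5] != 1:
        raise ValueError("this sketch only handles 32-bit little-endian ELF")

    # ELF32 header fields: e_shoff at offset 32, e_shentsize at 46, e_shnum at 48.
    (e_shoff,) = struct.unpack_from("<I", data, 32)
    e_shentsize, e_shnum = struct.unpack_from("<HH", data, 46)

    for i in range(e_shnum):
        # sh_type is the second 32-bit word of each section header.
        (sh_type,) = struct.unpack_from("<I", data, e_shoff + i * e_shentsize + 4)
        if sh_type == SHT_SYMTAB:
            return True
    return False


# Hypothetical usage:
#   with open("libbroken.so", "rb") as f:
#       print("unstripped" if has_symtab(f.read()) else "stripped")
```

By this heuristic, the copy that goes onto the device should report no symbol table, and the copy GDB reads symbols from should report one.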

I'm not 100% sure of the causality involved here, but I do know that when I changed my app to only build for armeabi-v7a instead of both, and when I cleared out the old build files, rebuilt, and installed the stripped libraries, I immediately got a full stack trace in GDB on the next run. Again, I'm not sure this was the reason, or which parts of it actually solved the problem, but it might be something to try if you have perpetrated such hackery and you're in the same hole.