Monday, September 22, 2014

Aggressively Unreliable Transport Protocol

I always knew that UDP is an unreliable transport protocol, but Today I Learned that it is far less reliable than that.

To wit, consider the following code, which, trivial though it is, demonstrates my misunderstanding:

#!/usr/bin/env python

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('127.0.0.1', 0))
sock.settimeout(0.5)
addr = sock.getsockname()

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
msg = "hello how are you today"
print "Sending data, len {0}".format(len(msg))
s.sendto(msg, addr)
s.close()

first_chunk = 4
print "Getting first {0} bytes".format(first_chunk)
recvd = sock.recvfrom(first_chunk)[0]
print "Got {0} bytes ({1})".format(len(recvd), recvd)
if len(msg) > len(recvd):
    left = len(msg) - len(recvd)
    print "Waiting for {0} more bytes".format(left)
    recvd += sock.recvfrom(left)[0]
sock.close()
assert msg == recvd

I expected the first recvfrom call to leave the remaining 19 bytes in the buffer, to be returned by the second recvfrom call. No such luck. Looking at the manpage for recvfrom, I can more or less piece together the reason. The sender sent a 23-byte datagram, the receiver asked for 4 bytes, and the kernel said to itself, "Huh, I've got all these bytes hanging around, but you only asked for 4. I GUESS YOU'LL NEVER EVER BE ABLE TO HANDLE ALL THIS DATA" and discarded the remaining 19 bytes rather than leaving them in the buffer. Turns out the manpage only says that excess bytes may be discarded depending on the type of socket; for datagram sockets such as UDP, they are. (I tested this on OS X 10.9.4.)
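
For what it's worth, the usual way around this seems to be to ask for the whole datagram in one call and slice it up afterwards. Here is a minimal sketch of that idea, written for Python 3 (so bytes rather than str); the helper names and the 65535-byte buffer size are my own choices for illustration, not anything the script above used.

```python
import socket

# Assumption: no UDP datagram we care about is larger than this.
MAX_DGRAM = 65535


def recv_whole_datagram(sock, bufsize=MAX_DGRAM):
    """Read one entire datagram in a single call; slice it afterwards."""
    data, sender = sock.recvfrom(bufsize)
    return data, sender


def demo_whole_datagram():
    """Send one datagram over loopback and split it after receiving it."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(('127.0.0.1', 0))
    receiver.settimeout(0.5)

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    msg = b"hello how are you today"
    sender.sendto(msg, receiver.getsockname())
    sender.close()

    data, _ = recv_whole_datagram(receiver)
    receiver.close()
    # The "first chunk" and "the rest" are now just slices of one read.
    return msg, data[:4], data[4:]
```

The point is only that the chunking moves out of the socket calls and into ordinary slicing, so nothing is left behind for the kernel to throw away.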

This is more explicit if you're using sendmsg/recvmsg; there's a flag (MSG_TRUNC) that recvmsg sets in the returned message flags if this kind of truncation occurs. Again, I get that it's not a stream, but wow, I didn't expect to lose most of the data by calling recvfrom twice for one sent datagram.
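
To make the truncation flag concrete, here is a small sketch using socket.recvmsg, which exists in Python 3.3+ on Unix-like systems (Python 2, which the script above uses, doesn't have it). The function names are hypothetical; the only real API calls are recvmsg and the MSG_TRUNC constant.

```python
import socket


def recv_and_check_truncation(sock, bufsize):
    """Receive one datagram and report whether the kernel truncated it."""
    # recvmsg returns (data, ancillary data, message flags, sender address).
    data, _ancdata, flags, _sender = sock.recvmsg(bufsize)
    truncated = bool(flags & socket.MSG_TRUNC)
    return data, truncated


def demo_truncation():
    """Send 23 bytes over loopback, but only ask for 4 of them."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(('127.0.0.1', 0))
    receiver.settimeout(0.5)

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"hello how are you today", receiver.getsockname())
    sender.close()

    result = recv_and_check_truncation(receiver, 4)
    receiver.close()
    return result
```

If the flag comes back set, the rest of that datagram is already gone; all it tells you is that your buffer was too small.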

I guess this outs me as never having written a proper UDP application. Oops. It's true; I spent my grad school (and before) days writing atop TCP and getting to know its quirks instead. Learn something new every day.

Saturday, May 18, 2013

The headaches of NDK debugging

Short story: if you're sharing libraries between different NDK projects and you're installing some of those libraries in /system/lib, and you're seeing weird stack-trace truncation (see below), make sure that

  1. You're installing the stripped version of the library in /system/lib, matching the one installed with your APK; and
  2. You really only have one version of the library installed, and it matches your app's target ABI.

Longer story follows.

More times than I can count, I've been stymied by an almost useless crash message such as:

I/DEBUG ( 9198): pid: 25709, tid: 25717 >>> edu.umich.bdh.broken_app <<<
I/DEBUG ( 9198): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 00000020
<snip>
I/DEBUG ( 9198): #00 pc 000294ce /system/lib/libbroken.so
(silence)
(memory dump in hex)

…and that's it -- the whole backtrace.  That's not a stack.  A stack implies layers; none exist there. If I was lucky, the ndk-stack tool would give me line number information, but that was hit or miss.

If you're looking carefully, though, you may have noticed something clearly wrong about this "trace" -- the location of my broken library. It's sitting in /system/lib, but if I'm really using the NDK (which I am), it should be somewhere in /data/.../libs, because that's where NDK libraries go when you install the APK that contains them.
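
If you'd rather not eyeball this, here is a rough sketch that pulls the frame lines out of a crash dump and picks out the ones that live under /system/lib. The regular expression is an assumption based only on the shape of the single frame line shown above; real dumps may well format frames differently, and the function names are made up for illustration.

```python
import re

# Assumption: a frame looks like "#00 pc 000294ce /system/lib/libbroken.so".
FRAME_RE = re.compile(r'#(\d+)\s+pc\s+([0-9a-fA-F]+)\s+(\S+)')


def parse_frames(log_text):
    """Return (frame number, pc, library path) for each frame line found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            number = int(match.group(1))
            pc = int(match.group(2), 16)
            frames.append((number, pc, match.group(3)))
    return frames


def system_lib_frames(frames):
    """Keep only the frames whose library sits under /system/lib."""
    return [frame for frame in frames if frame[2].startswith('/system/lib/')]
```

Feeding it the dump above would flag libbroken.so as a /system/lib frame, which is exactly the oddity in question.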

This betrays the fact that I am clearly doing something strange with the NDK -- which is (1) true; and (2) the focus of the rest of this post. Let me explain.

When I started porting Intentional Networking to Android, the NDK was in revision 3, and it lacked support for something that I relied on -- namely, C++ exceptions. At that time I was using the CrystaX version of the NDK, which did support exceptions. Besides this, I was also using it in a strange way. I've always had Intentional Networking implemented as a system-wide shared library, as a lot of my tests and tools were based on command-line tools. So, when I started, I actually used the AOSP's build system instead of the NDK, but I pointed it at the gcc binaries from the CrystaX NDK -- or something like that. (I swear I'm not making this up, but the details are a bit hazy.)

In this state, debugging native code was a huge pain. The ndk-gdb script didn't exist yet, so I found the arcane sequences of GDB commands through Google, which set the search paths for debug symbols so that GDB was somewhat useful. The only problem was that the build system generated stripped libraries by default -- not so useful for debugging. At this point, I assumed that I had to grab the non-stripped version and install that instead. Later, when the Google NDK added support for the C++ features that I needed, I switched to use it instead, and I carried over this practice of installing unstripped libraries.

Trouble was, I still saw nonsense like the above when not running GDB. More than that, today I saw a lovely stack trace starting at a non-executable line of my code and disappearing up into garbage. (In this case, the stack wasn't corrupted; it was just a garden-variety segfault that I couldn't trace back at all.)

What I hadn't realized all this time is that ndk-gdb expects the stripped version of the library to be installed, and ndk-build generates a non-stripped version which ndk-gdb uses to find the symbols. Remembering the crazy hackery I'd perpetrated upon the recommended NDK development steps, I decided to change my build to install the stripped version. I also noticed that GDB was showing a (truncated) backtrace ending in libc.so. The strange thing was that its path contained "armeabi", but I thought I had built my library for armeabi-v7a (for some floating-point optimizations).

I'm not 100% sure of the causality involved here, but I do know that when I changed my app to only build for armeabi-v7a instead of both, and when I cleared out the old build files, rebuilt, and installed the stripped libraries, I immediately got a full stack trace in GDB on the next run. Again, I'm not sure this was the reason, or which parts of it actually solved the problem, but it might be something to try if you have perpetrated such hackery and you're in the same hole.

Monday, October 3, 2011

Android NDK, shared libs, and C++ exceptions

So, I was getting all set to write a nice long post about an issue I've noticed a couple times in different manifestations, the net result of which is that you can't throw a C++ exception across a shared-library boundary in Android NDK code.  However, I just noticed that the reasons for this are summed up quite nicely here:

http://groups.google.com/group/android-ndk/browse_thread/thread/c1c001a95f478400

The one thing I'll add is that I've seen this manifest in two very different ways, but with two different NDK toolchains.  Back before the NDK supported C++ exceptions and STL containers, I saw a similar issue using r3 of CrystaX's custom NDK.  The result was a segfault when such an exception was thrown.  This appeared to be a known issue, so I didn't pursue it further and changed the project I was working on to build with a static library, as both discussions suggest as a temporary workaround.

Later, when the official NDK added support for C++ exceptions and STL containers, I moved to using it instead, and eventually I wrote some code that tried to throw an exception across a shared-library boundary.  The results were somewhat different, though.  Instead of a segfault, the result was that the exception was just not caught, resulting in many minutes of staring at code, wondering if I was losing my mind.  I knew one function was throwing an exception, and I was looking right at the block that caught that same exception.  So, I added a catch (...) {} block and copied some of the code from libstdc++'s implementation of terminate to print out the exception info before dying.  This time the exception was caught.  Even weirder, it still printed out the correct type of the exception, though it didn't seem to be able to use that information to pass the exception to the correct catch block.

Anyways.  As you can see if you read the above link(s), this is all due to libstdc++ (the GNU version, in my case) being linked statically, each shared library having its own copy, and that breaking exception handling somehow.  Static linking of the exception-throwing library indeed solves this for me, but I hope the new version of the NDK with support for the GNU C++ shared libraries comes soon.

In the meantime, if you're going to build a static library with the NDK, you'll need to trick the NDK build system into actually doing something.  Check out this StackOverflow question.

Monday, June 27, 2011

Replacing bits of the Android system code

If you've been googling around trying to figure out why your Android device doesn't seem to want to load your custom-built framework.jar (or other file in /system/framework), you may have come across this reply from Android framework engineer Dianne Hackborn:

"You need to flash the entire device with your own build.  You can't just selectively replace pieces of a user build."

That didn't sound right to me, since I'd done exactly that on the HTC Dream and was just now having trouble doing it on a Nexus One.  Then again, I was running Cupcake on the old phone, so maybe this was an effect of a newer version of Android?

Nah, I was just missing a couple steps.

The first clue that something was different was that /system/framework was littered with a bunch of extra .odex files, one for each jar/apk file.  These files are optimized versions of the classes.dex files that would otherwise live inside each jar/apk file, generated at first boot by dexopt.  There's a great article over at AddictiveTips that gives an overview of why those files exist.

One nasty misconception I got from reading that article, though, is that the .odex files inside /system/framework are re-generated on boot if they're missing.  This is not the case!  Or at the very least, it's not as simple as that.  If you do as I did (deleting framework.odex and replacing framework.jar with your rebuilt version), you may be greeted by an unbootable phone.  The reason for this is that the .odex files also store dependency information, the unfortunate result of which is that if you touch one, you probably have to regenerate them all.  This process is scriptable, sure, but it seems easy to get wrong.

So, rather than mucking about with generating .odex files on the phone, I decided to just track down a deodexed ROM, flash it, and then ignore .odex entirely from that point forward.  Luckily, it didn't take me too long to find a custom deodexed ROM of Android 2.3.4 for the Nexus One.  Turns out that deodexing is something that ROM makers do all the time, probably for similar reasons to mine.

Once you've flashed a deodexed ROM, the .odex files are gone, and replacing any of the files in /system/framework is as simple as replacing the file with your custom-built copy (making sure to build with the right device-specific setup and backing up the original files first).

Saturday, June 25, 2011

Building for the right device

I don't know all the reasons for this, and it may be obvious to more seasoned Android hackers than myself, but:

If you're rebuilding framework.jar for a device, it's important to set up your build environment for that specific device.

Though I'm focusing on framework.jar at the moment, this probably pertains to several other of the core system-level classes.

I'm using a Nexus One (aka HTC Passion), so for me, this meant roughly following the directions discussed in this somewhat outdated, but still useful, post.

Here's the short version (the main steps that resolved things for me):
  1. Make sure your Android source tree contains a device/htc/passion folder.
    (The name will be different depending on your device.)
  2. Change to that folder and run the extract-files.sh script.
    You'll need your device running and connected via USB (with USB debugging enabled, naturally) to do this.
  3. Set up your build environment by running . ./build/envsetup.sh from the top directory of the source tree.
    This should be familiar if you've spent any time at all working with the Android source.
  4. Type lunch full_passion-userdebug.
    This is the key step; as far as I can tell, it sets variables and pulls in the right files to build system classes and JARs specific to the Nexus One. (Again, the exact command will vary depending on your device; try print_lunch_menu to see the available options. Also, you may need to clone the relevant git repository under device/<vendor>/<model>.)
  5. Type make to build the new configuration.

This assumes that, like me, you've already built the generic source tree, tried to replace something like framework.jar on the phone, and had it fail mysteriously. (In my case, there was a NullPointerException while loading resources pertaining to the mobile carrier - something that sounded suspiciously related to something device-specific.) Unfortunately, this does require rebuilding everything from the start, into a new subdirectory of out/target/product/. Once it's done, though, building the individual piece you're interested in changing will be quick.

I get the sense that this is a pretty basic step that Android porters deal with all the time, but when I was building for the HTC Dream, the generic build (that is, lunch 1 or lunch generic-eng) was sufficient for replacing services.jar. Maybe it's more that framework.jar has some device-specific components that I wasn't aware of.

Friday, June 24, 2011

File System Navigator

So, I think I had actually heard this a while ago, but I was still amused to (re-)discover today that my new blog's namesake was actually a real (albeit demo) tool developed by SGI for IRIX. :D

How not to break the world

Turns out, having a backup is a Good Thing™.

This is of course obvious, but the subtler point is that having a backup is a Good Thing™ even when you don't think you need one.

If, like me, you're hacking on pieces of an Android system, you should probably be abiding by this simple rule that I've now (after many painful lessons) set in place for myself: have a backup in place before you do anything that touches anything you need root perms for, in any way.

Fortunately, there are some really easy ways to do this, which make it pretty much second nature. If you've already rooted your phone, you can simply go to the Android Market and install ROM Manager. This is a brilliant little piece of software that vastly simplifies the process of playing with system-level bits.

If you're not willing to go all out and root your phone (e.g. maybe you're using a dev phone that only does root through "adb root"), you can still install ClockworkMod Recovery by itself, using fastboot. This is actually a component of ROM Manager, made by the same developer. For full details on how to install and use it, I refer you to this excellent post. (I must make a note to visit Addictive Tips more often; their guides are extremely thorough and useful. They're like the Ars Technica of random, obscure knowledge.)

ClockworkMod Recovery actually does a lot of other nice things that the stock recovery mode doesn't, like letting you mount /system and /data or giving you a root shell, all without booting the phone. Very useful when you've broken something that stops the phone from booting.

With either of these options in place, you can now do a one-step backup whenever you're about to do something dangerous. The backup is saved to your device's SD card, but it's just a directory with a bunch of flashable images (.img files) and an md5 checksum, so you can copy them off the SD card to store on another machine, if you want. (I think. YMMV; I haven't actually tried this.)
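
If you do copy a backup somewhere else, it's probably worth checking the images against that checksum file afterwards. Here is a minimal sketch of how that could look in Python 3. It assumes the checksum file is in the usual md5sum layout (one "hash  filename" pair per line) and is called nandroid.md5; both of those are assumptions on my part, so check your own backup directory before trusting it.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .img files don't fill up memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir, checksum_name='nandroid.md5'):
    """Return the names of files that are missing or don't match."""
    bad = []
    # Assumed format: "<md5 hex>  <filename>" per line, as md5sum prints.
    with open(os.path.join(backup_dir, checksum_name)) as f:
        for line in f:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected = parts[0].lower()
            name = parts[1].strip().lstrip('*')
            path = os.path.join(backup_dir, name)
            if not os.path.isfile(path) or md5_of_file(path) != expected:
                bad.append(name)
    return bad
```

An empty list back from verify_backup means every listed image matched; anything else names the files to re-copy.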

Now that I'm safe and sound with a simple and reliable backup method, next time I will talk about how to actually make some of the changes that I've just prevented from breaking the world. Here's a teaser: deodexing. If you know what that means, you might already know how to do this. Makers of custom ROMs are doing it all the time. Anyways, more on that next time.