Archive for the 'p.m.o' Category

The DEPTH of me

When adding a new directory to the Mozilla codebase, one usually needs to add a Makefile.in file, with some magic incantations at the beginning of it:

DEPTH = ../..
topsrcdir = @top_srcdir@
srcdir = @srcdir@
VPATH = @srcdir@

include $(DEPTH)/config/autoconf.mk

etc.

Some even add:

relativesrcdir = foo/bar

In the above, DEPTH and relativesrcdir both need to be filled in carefully, depending on where the Makefile.in is located.

As of bug 774032, which landed over the weekend (along with some tree breakage for some people, hopefully fixed by now; sorry for the inconvenience), there are two additional substitution variables for Makefile.in that remove the need to be careful.

Now the boilerplate for a new Makefile.in is:

DEPTH = @DEPTH@
topsrcdir = @top_srcdir@
srcdir = @srcdir@
VPATH = @srcdir@

include $(DEPTH)/config/autoconf.mk

etc.

And, if needed,

relativesrcdir = @relativesrcdir@
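
To make the relationship between the two values concrete, here is a minimal Python sketch that derives both from the location of a Makefile.in. It is not the code from bug 774032; the function name, its arguments and the /src/mozilla path are made up for illustration.

```python
import posixpath


def depth_and_relativesrcdir(topsrcdir, makefile_dir):
    """Return (DEPTH, relativesrcdir) for the directory holding a Makefile.in."""
    # Path of the directory relative to the top of the source tree.
    relativesrcdir = posixpath.relpath(makefile_dir, topsrcdir)
    if relativesrcdir == ".":
        return ".", "."
    # DEPTH climbs back up one ".." per path component.
    depth = "/".join([".."] * len(relativesrcdir.split("/")))
    return depth, relativesrcdir


# Hypothetical tree location, matching the foo/bar example above.
print(depth_and_relativesrcdir("/src/mozilla", "/src/mozilla/foo/bar"))
# ('../..', 'foo/bar')
```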

2012-08-06 18:52:34+0900

p.m.o | No Comments »

Building a Linux kernel module without the exact kernel headers

Imagine you have a Linux kernel image for an Android phone, but you don't have the corresponding source, nor do you have the corresponding kernel headers. Imagine that kernel has module support (fortunately), and that you'd like to build a module for it to load. There are several good reasons why you can't just build a new kernel from source and be done with it (e.g. the resulting kernel lacks support for important hardware, like the LCD or touchscreen). With the ever-changing Linux kernel ABI, and the lack of source and headers, you'd think you're pretty much at a dead end.

As a matter of fact, if you build a kernel module against different kernel headers, the module will fail to load, with errors that depend on how different the headers are. It can complain about bad signatures, a bad version, or other things.

But more on that later.

Configuring a kernel

The first thing is to find a kernel source for something close enough to the kernel image you have. Along with getting a proper configuration, that's probably the trickiest part. Start from the version number you can read from /proc/version. If, like me, you're targeting an Android device, try Android kernels from Code Aurora, Linaro, Cyanogen or Android, whichever is closest to what is in your phone. In my case, it was the msm-3.0 kernel. Note that you don't necessarily need the exact same version. A minor version difference is still likely to work. I've been using a 3.0.21 source, while the kernel image was 3.0.8. However, don't try using e.g. a 3.1 kernel source when the kernel you have is 3.0.x.

If the kernel image you have is kind enough to provide a /proc/config.gz file, you can start from there. Otherwise, you can try starting from the default configuration, but then you need to be extra careful. I won't detail using the default configuration because I was fortunate enough not to have to, but there are some details further below as to why a proper configuration is important.

Assuming arm-eabi-gcc is in your PATH, and that you have a shell opened in the kernel source directory, you need to start by configuring the kernel and installing headers and scripts:

$ mkdir build
$ gunzip -c config.gz > build/.config # Or whatever you need to prepare a .config
$ make silentoldconfig prepare headers_install scripts ARCH=arm CROSS_COMPILE=arm-eabi- O=build KERNELRELEASE=`adb shell uname -r`

The silentoldconfig target is likely to ask you some questions about whether you want to enable some things. You may want to opt for the default, but that may also not work properly.

You may use something different for KERNELRELEASE, but it needs to exactly match the version of the kernel you'll be loading the module into.

A simple module

To create a dummy module, you need to create two files: a source file, and a Makefile.

Place the following content in a hello.c file, in some dedicated directory:

#include <linux/module.h>       /* Needed by all modules */
#include <linux/kernel.h>       /* Needed for KERN_INFO */
#include <linux/init.h>         /* Needed for the macros */
static int __init hello_start(void)
{
  printk(KERN_INFO "Hello world\n");
  return 0;
}
static void __exit hello_end(void)
{
  printk(KERN_INFO "Goodbye world\n");
}
module_init(hello_start);
module_exit(hello_end);

Place the following content in a Makefile under the same directory:

obj-m = hello.o

Building such a module is pretty straightforward, but at this point, it won't work yet. Let me go into some details first.

The building of a module

When you normally build the above module, the kernel build system creates a hello.mod.c file, whose content can cause several kinds of problems:

MODULE_INFO(vermagic, VERMAGIC_STRING);

VERMAGIC_STRING is derived from the UTS_RELEASE macro defined in include/generated/utsrelease.h, generated by the kernel build system. By default, its value is derived from the actual kernel version, and git repository status. This is what setting KERNELRELEASE when configuring the kernel above modified. If VERMAGIC_STRING doesn't match the kernel version, loading the module will lead to the following kind of message in dmesg:

hello: version magic '3.0.21-perf-ge728813-00399-gd5fa0c9' should be '3.0.8-perf'
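
Before copying a module to the device, it can be handy to check what release string ended up in it. The sketch below assumes that the vermagic is stored in the module file as a NUL-terminated "vermagic=..." string and that the release is its first space-separated word; the sample bytes (including the flags after the release) are made up for illustration.

```python
def find_vermagic(module_bytes):
    """Return the vermagic string embedded in a module image, or None."""
    key = b"vermagic="
    start = module_bytes.find(key)
    if start < 0:
        return None
    start += len(key)
    end = module_bytes.find(b"\0", start)
    if end < 0:
        end = len(module_bytes)
    return module_bytes[start:end].decode("ascii", "replace")


def kernel_release(vermagic):
    """The release is taken to be the first space-separated word."""
    return vermagic.split(" ", 1)[0]


# Made-up stand-in for the contents of a hello.ko file.
sample = (b"\0depends=\0vermagic=3.0.21-perf-ge728813-00399-gd5fa0c9 "
          b"SMP preempt mod_unload ARMv7 \0")
print(kernel_release(find_vermagic(sample)))
```

Comparing the printed value with the output of uname -r on the device shows whether KERNELRELEASE was set as intended.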

Then, there's the module definition.

struct module __this_module
__attribute__((section(".gnu.linkonce.this_module"))) = {
 .name = KBUILD_MODNAME,
 .init = init_module,
#ifdef CONFIG_MODULE_UNLOAD
 .exit = cleanup_module,
#endif
 .arch = MODULE_ARCH_INIT,
};

In itself, this looks benign, but struct module, defined in include/linux/module.h, comes with an unpleasant surprise:

struct module
{
        (...)
#ifdef CONFIG_UNUSED_SYMBOLS
        (...)
#endif
        (...)
        /* Startup function. */
        int (*init)(void);
        (...)
#ifdef CONFIG_GENERIC_BUG
        (...)
#endif
#ifdef CONFIG_KALLSYMS
        (...)
#endif
        (...)
(... plenty more ifdefs ...)
#ifdef CONFIG_MODULE_UNLOAD
        (...)
        /* Destruction function. */
        void (*exit)(void);
        (...)
#endif
        (...)
}

This means that for the init pointer to be at the right place, CONFIG_UNUSED_SYMBOLS needs to be defined according to what the kernel image uses. And for the exit pointer, the same goes for CONFIG_GENERIC_BUG, CONFIG_KALLSYMS, CONFIG_SMP, CONFIG_TRACEPOINTS, CONFIG_JUMP_LABEL, CONFIG_TRACING, CONFIG_EVENT_TRACING, CONFIG_FTRACE_MCOUNT_RECORD and CONFIG_MODULE_UNLOAD.
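
The effect of a single option on the layout can be illustrated with a toy structure. The sketch below uses Python's ctypes with made-up fields; it is not the real struct module, only a stand-in showing how fields guarded by an #ifdef shift everything that follows them.

```python
import ctypes


def make_struct(unused_symbols):
    """Build a toy 'struct module', with or without the optional fields."""
    fields = [("name", ctypes.c_char * 60)]
    if unused_symbols:  # stands in for #ifdef CONFIG_UNUSED_SYMBOLS
        fields += [("unused_syms", ctypes.c_void_p),
                   ("num_unused_syms", ctypes.c_uint)]
    fields.append(("init", ctypes.c_void_p))
    return type("module", (ctypes.Structure,), {"_fields_": fields})


without = make_struct(False)
with_unused = make_struct(True)
# The exact numbers depend on the platform running the script.
print("init offset without the option:", without.init.offset)
print("init offset with the option:   ", with_unused.init.offset)
```

A module built with one layout and loaded into a kernel using the other would have its init pointer read from the wrong offset.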

Starting to understand why you're supposed to use the exact kernel headers matching your kernel?

Then, the symbol version definitions:

static const struct modversion_info ____versions[]
__used
__attribute__((section("__versions"))) = {
	{ 0xsomehex, "module_layout" },
	{ 0xsomehex, "__aeabi_unwind_cpp_pr0" },
	{ 0xsomehex, "printk" },
};

These come from the Module.symvers file you get with your kernel headers. Each entry represents a symbol the module requires, and the signature it is expected to have. The first symbol, module_layout, varies depending on what struct module looks like, i.e. depending on which of the config options mentioned above are enabled. The second, __aeabi_unwind_cpp_pr0, is an ARM ABI-specific function, and the last is for our printk function calls.
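
For reference, here is a small sketch of reading and writing such a file. It assumes the layout used by kernels of that era: one symbol per line, with tab-separated signature, symbol name, providing module and export type. The signature values below are made up.

```python
def format_symvers(entries):
    """Render (crc, symbol, module, export_type) tuples as Module.symvers text."""
    lines = ["0x%08x\t%s\t%s\t%s" % entry for entry in entries]
    return "\n".join(lines) + "\n"


def parse_symvers(text):
    """Return {symbol: (crc, module, export_type)} from Module.symvers text."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        crc, symbol, module, export_type = line.split("\t")[:4]
        table[symbol] = (int(crc, 16), module, export_type)
    return table


# Made-up signatures, for illustration only.
entries = [
    (0x12345678, "module_layout", "vmlinux", "EXPORT_SYMBOL"),
    (0x9abcdef0, "printk", "vmlinux", "EXPORT_SYMBOL"),
]
text = format_symvers(entries)
print(text, end="")
```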

The signature for each function symbol may vary depending on the kernel code for that function, and the compiler used to compile the kernel. This means that if you have a kernel you built from source, modules built for that kernel, and rebuild the kernel after modifying e.g. the printk function, even in a compatible way, the modules you built initially won't load with the new kernel.

So, if we were to build a kernel from the hopefully close enough source code, with the hopefully close enough configuration, chances are we wouldn't get the same signatures as the binary kernel we have, and it would complain as follows, when loading our module:

hello: disagrees about version of symbol symbol_name

Which means we need a proper Module.symvers corresponding to the binary kernel, which, at the moment, we don't have.

Inspecting the kernel

Conveniently, since the kernel has to do these verifications when loading modules, it actually contains a list of the symbols it exports, and the corresponding signatures. When the kernel loads a module, it goes through all the symbols the module requires, in order to find them in its own symbol table (or other modules' symbol table when the module uses symbols from other modules), and check the corresponding signature.
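
That verification can be modelled in a few lines. The following is a simplified sketch, not the kernel's code: both tables are plain dictionaries mapping symbol names to signatures, the signature values are made up, and the "unknown symbol" wording is only an approximation of what the kernel reports.

```python
def check_versions(module_name, required, exported):
    """Return the complaints a loader would have about a module's symbols."""
    messages = []
    for symbol, crc in required.items():
        if symbol not in exported:
            messages.append("%s: unknown symbol %s" % (module_name, symbol))
        elif exported[symbol] != crc:
            messages.append("%s: disagrees about version of symbol %s"
                            % (module_name, symbol))
    return messages


# Made-up signatures: the module was built against a different printk.
kernel_table = {"module_layout": 0x11111111, "printk": 0x22222222}
module_needs = {"module_layout": 0x11111111, "printk": 0x33333333}

for message in check_versions("hello", module_needs, kernel_table):
    print(message)
```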

The kernel uses the following function to search in its symbol table (in kernel/module.c):

bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
                                    struct module *owner,
                                    void *data),
                         void *data)
{
        struct module *mod;
        static const struct symsearch arr[] = {
                { __start___ksymtab, __stop___ksymtab, __start___kcrctab,
                  NOT_GPL_ONLY, false },
                { __start___ksymtab_gpl, __stop___ksymtab_gpl,
                  __start___kcrctab_gpl,
                  GPL_ONLY, false },
                { __start___ksymtab_gpl_future, __stop___ksymtab_gpl_future,
                  __start___kcrctab_gpl_future,
                  WILL_BE_GPL_ONLY, false },
#ifdef CONFIG_UNUSED_SYMBOLS
                { __start___ksymtab_unused, __stop___ksymtab_unused,
                  __start___kcrctab_unused,
                  NOT_GPL_ONLY, true },
                { __start___ksymtab_unused_gpl, __stop___ksymtab_unused_gpl,
                  __start___kcrctab_unused_gpl,
                  GPL_ONLY, true },
#endif
        };

        if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
                return true;

        (...)

The struct used in this function is defined in include/linux/module.h as follows:

struct symsearch {
        const struct kernel_symbol *start, *stop;
        const unsigned long *crcs;
        enum {
                NOT_GPL_ONLY,
                GPL_ONLY,
                WILL_BE_GPL_ONLY,
        } licence;
        bool unused;
};

Note: this kernel code hasn't changed significantly in the past four years.

What we have above is three (or five, when CONFIG_UNUSED_SYMBOLS is defined) entries, each of which contains the start of a symbol table, the end of that symbol table, the start of the corresponding signature table, and two flags.

The data is static and constant, which means it will appear as is in the kernel binary. By scanning the kernel for three consecutive sequences of three pointers within the kernel address space followed by two integers with the values from the definitions in each_symbol_section, we can deduce the location of the symbol and signature tables, and regenerate a Module.symvers from the kernel binary.
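
The scan described above could look roughly like the following. This is a minimal sketch, not the script mentioned later in this post: it assumes a 32-bit little-endian image, assumes the bool and its padding read as one zero word, and only reports the first match. The function and variable names are mine.

```python
import struct

# One struct symsearch entry as it appears in a 32-bit little-endian image:
# start, stop, crcs, licence, unused (bool plus padding read as one word).
RECORD = struct.Struct("<5I")


def find_symsearch(image, base):
    """Return [(start, stop, crcs), ...] for the first three entries, or None."""
    limit = base + len(image)

    def in_image(address):
        return base <= address <= limit

    last = len(image) - 3 * RECORD.size
    for offset in range(0, last + 1, 4):
        entries = [RECORD.unpack_from(image, offset + i * RECORD.size)
                   for i in range(3)]
        if all(in_image(start) and in_image(stop) and in_image(crcs)
               and start <= stop
               # licence: NOT_GPL_ONLY (0), GPL_ONLY (1), WILL_BE_GPL_ONLY (2)
               and licence == i and unused == 0
               for i, (start, stop, crcs, licence, unused)
               in enumerate(entries)):
            return [entry[:3] for entry in entries]
    return None


# Tiny synthetic "kernel" containing the three entries, for demonstration.
base = 0xc0008000
records = b"".join(
    RECORD.pack(base + 16 * i, base + 16 * i + 8, base + 64 + 4 * i, i, 0)
    for i in range(3))
image = b"\0" * 64 + records + b"\0" * 64
found = find_symsearch(image, base)
print([tuple(hex(value) for value in entry) for entry in found])
```

Each returned triple gives the address range of one symbol table and the start of its signature table; subtracting the base address turns them into offsets in the image.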

Unfortunately, most kernels these days are compressed (zImage), so a simple search is not possible. A compressed kernel is actually a small bootstrap binary followed by a compressed stream. It is possible to scan the kernel zImage to look for the compressed stream, and decompress it from there.
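
As an illustration, here is a minimal sketch of that scan, assuming the stream is gzip-compressed (kernels can use other compressors, which this does not handle). The fake image in the example is made up; the function name is mine.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip signature followed by the deflate method


def extract_gzip_payload(zimage):
    """Decompress the first complete gzip stream found in zimage, or None."""
    position = zimage.find(GZIP_MAGIC)
    while position >= 0:
        # wbits=31 makes zlib expect a gzip header and trailer.
        decompressor = zlib.decompressobj(31)
        try:
            data = decompressor.decompress(zimage[position:])
        except zlib.error:
            data = None  # the magic bytes were not the start of a real stream
        if data is not None and decompressor.eof:
            return data
        position = zimage.find(GZIP_MAGIC, position + 1)
    return None


# Fake "zImage": made-up bootstrap bytes, a gzip stream, then trailing bytes.
payload = b"pretend this is the kernel " * 100
fake_zimage = b"\x00" * 32 + gzip.compress(payload) + b"trailing data"
print(len(extract_gzip_payload(fake_zimage)) == len(payload))
```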

I wrote a script to do decompression and extraction of the symbols info automatically. It should work on any recent kernel, provided it is not relocatable and you know the base address where it is loaded. It takes options for the number of bits and endianness of the architecture, but defaults to values suitable for ARM. The base address, however, always needs to be provided. It can be found, on ARM kernels, in dmesg:

$ adb shell dmesg | grep "\.init"
<5>[01-01 00:00:00.000] [0: swapper]      .init : 0xc0008000 - 0xc0037000   ( 188 kB)

The base address in the example above is 0xc0008000.
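
If you want to pick that value out of the dmesg output automatically, a regular expression along these lines should do. It is a small sketch that assumes the ".init : start - end" layout shown above; the function name is mine.

```python
import re

INIT_RE = re.compile(r"\.init\s*:\s*(0x[0-9a-fA-F]+)\s*-\s*(0x[0-9a-fA-F]+)")


def base_address(dmesg_text):
    """Return the start of the .init range found in dmesg output, or None."""
    match = INIT_RE.search(dmesg_text)
    return int(match.group(1), 16) if match else None


line = ("<5>[01-01 00:00:00.000] [0: swapper]      "
        ".init : 0xc0008000 - 0xc0037000   ( 188 kB)")
print(hex(base_address(line)))
# 0xc0008000
```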

If, like me, you're interested in loading the module on an Android device, then what you have as a binary kernel is probably a complete boot image. A boot image contains other things besides the kernel, so you can't use it directly with the script, unless the kernel in that boot image is compressed, in which case the part of the script that looks for the compressed image will find it anyway.

If the kernel is not compressed, you can use the unbootimg program as outlined in this old post of mine to get the kernel image out of your boot image. Once you have the kernel image, the script can be invoked as follows:

$ python extract-symvers.py -B 0xc0008000 kernel-filename > Module.symvers

Symbols and signature info could also be extracted from binary modules, but I was not interested in that information so the script doesn't handle that.

Building our module

Now that we have a proper Module.symvers for the kernel we want to load our module in, we can finally build the module:

(again, assuming arm-eabi-gcc is in your PATH, and that you have a shell opened in the kernel source directory)

$ cp /path/to/Module.symvers build/
$ make M=/path/to/module/source ARCH=arm CROSS_COMPILE=arm-eabi- O=build modules

And that's it. You can now copy the resulting hello.ko onto the device, load it, and enjoy:

$ adb shell
# insmod hello.ko
# dmesg | grep insmod
<6>[mm-dd hh:mm:ss.xxx] [id: insmod]Hello world
# lsmod
hello 586 0 - Live 0xbf008000 (P)
# rmmod hello
# dmesg | grep rmmod
<6>[mm-dd hh:mm:ss.xxx] [id: rmmod]Goodbye world

2012-08-06 15:11:41+0900

p.d.o, p.m.o | 15 Comments »

What is a Web App?

Is it this, this or that?

2012-07-20 11:04:59+0900

p.d.o, p.m.o | 6 Comments »

Comment spam

Three weeks ago, I slightly modified the comment system on this blog for an experiment. This blog is a standard WordPress installation. Comments are normally directed to the wp-comments-post.php script by the HTML form. What I did was:

  • Create a comments-post.php wrapper script that just includes wp-comments-post.php (this allows things to still work properly after WordPress upgrades),
  • Make the HTML form point to the comments-post.php script,
  • Add a usedForm=1 parameter to the HTML form action, such that comments-post.php is supposed to always be called with it,
  • Add a simple piece of JavaScript that adds a hasJS=1 parameter to the HTML form action when the page is loaded, and a Submit=1 parameter when the form is submitted.

During the past three weeks, on this blog, there were 7170 comments, 8 of which were actual comments. 7162 were spam (~99.9%).

  • 3165 spams (~44.1%) were sent to the original WordPress comment handler (wp-comments-post.php) from 1589 unique IP addresses.
  • 0 spams were sent to the new comment handler without a query string (comments-post.php), but 1 was sent with an empty query string (comments-post.php?).
  • 18 spams were sent to the new comment handler with a lowercased query string (comments-post.php?usedform=1) from 6 unique IP addresses.
  • 3971 spams (~55.4%) were sent to the new comment handler with the form query string (comments-post.php?usedForm=1) from 1153 unique IP addresses.
  • 7 spams (~0.1%) were sent to the new comment handler with the full query string, including what is added through javascript (comments-post.php?usedForm=1&hasJS=1&Submit=1) from 5 unique IP addresses.

This means a large portion of spammers didn't bother checking the comment forms and used the standard WordPress URL, and another large portion don't run JavaScript on their bots, although a very few do.
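
The bucketing behind these numbers can be sketched as follows. This is a reconstruction for illustration, not the code that ran on the blog; the bucket names are mine. The percentages are computed against all 7170 comments, which appears to reproduce the rounded figures quoted above.

```python
from urllib.parse import parse_qs, urlsplit


def classify(request):
    """Sort a comment POST into one of the buckets listed above."""
    parts = urlsplit(request)
    if parts.path.endswith("/wp-comments-post.php"):
        return "original handler"
    params = parse_qs(parts.query)
    if all(key in params for key in ("usedForm", "hasJS", "Submit")):
        return "form, with JavaScript"
    if "usedForm" in params:
        return "form, no JavaScript"
    if "usedform" in params:
        return "lowercased query string"
    return "no query string"


# Spam counts from the list above.
spam = {
    "original handler": 3165,
    "no query string": 1,
    "lowercased query string": 18,
    "form, no JavaScript": 3971,
    "form, with JavaScript": 7,
}
total_comments = 7170
for bucket, count in spam.items():
    print("%-24s %5d  %.1f%%" % (bucket, count, 100.0 * count / total_comments))
```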

2012-07-15 11:35:54+0900

p.d.o, p.m.o, website | 1 Comment »

The tale of a weird crash, and a 2.5 year-old bug

10 days ago, I landed bug 616262, and Windows Mochitest-Other immediately turned perma-orange on an a11y test, in the form of a crash. It hadn't happened when I was testing the patch queue on Try, nor did it happen on PGO and debug builds on mozilla-central.

That looked like a good candidate for a compiler bug.

The first thing I tried to do was to find which particular change had made it orange, since it was green on my earlier attempts on Try. After some bisecting through Try pushes over the weekend, it turned out that the changeset immediately following the one I had based my Try attempts on was making the test crash. Unfortunately, it was a merge changeset, so I had to check the merged branch. After some more bisecting, it turned out that only four consecutive changesets were making the test non-crashy, and the one I had been using was the last of them. Moreover, none of them was a11y-related.

That looked like a good candidate for a compiler bug.

Since I was at a dead end trying to find some changeset that triggered the crash, and since using Try was already a slow process, I went ahead and tried to reproduce locally... which didn't happen. I had never upgraded MSVC on my Windows install, so I was still using 2005, while our build slaves now use 2010. So I upgraded MSVC to 2010, and was finally able to reproduce locally.

That looked like a good candidate for an MSVC 2010 bug.

The stack trace that MSVC was giving me when the crash was occurring was not very useful. The crash was supposedly happening in nsTSubstring_CharT::SetCapacity, called from Accessible::GetARIAName. Sometimes it would happen in arena_malloc, called from the same Accessible::GetARIAName. Unfortunately, the stack trace wasn't going higher, which suggested stack corruption.

Because an unoptimized build would not crash and because optimizations were making things hard to poke from within the debugger, I added some printfs in Accessible::GetARIAName. It didn't crash anymore.

That really looked like a good candidate for an MSVC 2010 bug.

Since the function was called a whole lot before actually crashing, I needed to determine what code in Accessible::GetARIAName was being reached before the crash. One of the things I tested was this patch:

--- a/accessible/src/generic/Accessible.cpp
+++ b/accessible/src/generic/Accessible.cpp
@@ -2429,6 +2429,8 @@ Accessible::GetARIAName(nsAString& aName
   if (NS_SUCCEEDED(rv)) {
     label.CompressWhitespace();
     aName = label;
+  } else {
+    MOZ_CRASH();
   }
 
   if (label.IsEmpty() &&

For those not familiar with the Mozilla codebase, MOZ_CRASH, as its name suggests, triggers a crash. So if the else part were ever reached, the build would crash. It turns out it did... not. At all.

Comparing at assembly level, the functions with and without the patch above were strictly identical, except that one specific jump would go to the crashing code instead of going to the rest of the function. The crashing code wasn't even added within the function, but at the end.

At this point, I was pretty certain that the problem was, in fact, not in the function where the crash was occurring. The base observation was that adding code may un-trigger the crash, suggesting something weird happening depending on where some function appearing after Accessible::GetARIAName is located in the binary. Sure enough, reordering the source files in accessible/src/generic made the crash disappear with an unmodified Accessible::GetARIAName.

So, after validating that adding a small function after an unmodified Accessible::GetARIAName was not triggering the crash, I went on to find the last place where adding the function would not stop triggering the crash. Which I found to be that spot. With a dummy function before DocAccessible::ContentRemoved, the crash wouldn't occur, and with the same dummy function after, it would. But DocAccessible::ContentRemoved is empty, how can adding a function before or after make a difference?!?

Since these DocAccessible functions are good candidates for Identical Code Folding, I tried disabling it, and it surely did make things worse: the addition of the dummy function stopped "fixing" the crash.

That really looked like a good candidate for something that was going to be near impossible to debug.

At this point, I really wished I had a more reliable stack trace, and with fingers crossed (since so many factors were, in fact, "fixing" the crash), I tried again with frame pointers enabled. And fortunately, the crash was still happening. After some self-punishment for not having tried that earlier, I finally got a more meaningful stack trace showing the Accessible::GetARIAName (indirectly) coming from nsTextEquivUtils::AppendFromAccessible, itself being called from nsTextEquivUtils::AppendFromAccessibleChildren, itself called from nsTextEquivUtils::AppendFromAccessible.

Fortunately, the place indirectly calling Accessible::GetARIAName was reached much less often than Accessible::GetARIAName, so it was possible to use a breakpoint there, continue until the nth time, and then step until Accessible::GetARIAName. Finally I knew where and why the crash was happening: really in Accessible::GetARIAName, because mContent was NULL.

Obviously, when the crash doesn't happen, mContent is never NULL. After some fiddling with both builds with the crash and builds without, I found that nsTextEquivUtils::AppendFromAccessible was never called for an Accessible with a NULL mContent when the crash doesn't occur. Which makes sense, but moves the problem further up the stack.

After more fiddling, and finding out that both crashing and non-crashing builds were initializing a nsHTMLWin32ObjectAccessible with a NULL mContent, I had to find why either the Accessible's mContent value changed in one case and not the other, or why nsTextEquivUtils::AppendFromAccessible was called for that nsHTMLWin32ObjectAccessible in one case and not the other.

And in the end, it turned out the difference was that the gRoleToNameRulesMap value for nsHTMLWin32ObjectAccessible's Role was wrong in one case and not the other, which determined whether nsTextEquivUtils::AppendFromAccessible was called.

So the root cause of all this nightmarish chase was that nsHTMLWin32ObjectAccessible's Role is ROLE_EMBEDDED_OBJECT, and that the gRoleToNameRulesMap array stopped at ROLE_GRID_CELL (which happens to be ROLE_EMBEDDED_OBJECT - 1).

The changeset that added ROLE_EMBEDDED_OBJECT but forgot to add the corresponding gRoleToNameRulesMap entry is 2.5 years old, and other additional roles have been added since then.

In conclusion, the problem was not the compiler doing something broken, but the linker laying things out differently in some cases, leading to different values being returned when reading past the end of gRoleToNameRulesMap, depending on what the linker put there.

It looks like we've been pretty (un?)lucky not to have been hit by this earlier. It's now fixed on all branches, and bug 616262 landed again, with a pretty green Windows Mochitest-Other.

As of writing, I still don't know why the stack trace was truncated in the first place, since the stack was, in fact, not corrupted. I think I don't care anymore.

2012-06-25 16:48:24+0900

p.m.o | 6 Comments »

Attempting to close a LinkedIn account

Following the trend, I attempted to close my LinkedIn account. Closing a LinkedIn account involves confirming and confirming and confirming again. Once it's all done, you'd expect to, well, be done with it.

I'm outraged at the result:

  • My public profile is still there. I can't be sure but I guess people with a connection to me can still see the full profile.
  • I'm still receiving LinkedIn connection emails (you know, those "Learn about xxxxxxx, your new connection..." emails; I must have had pending outgoing invitations).
  • I can still reset my password.
  • I can still log in.

The only upside is that after I log in, I can only see a page saying "Your LinkedIn account has been temporarily restricted". "Contact our customer service team to get this resolved as soon as possible."

Update: After contacting their customer service, the account was closed and the public profile is now unavailable.

2012-06-09 14:52:53+0900

p.d.o, p.m.o | 8 Comments »

A new Jemalloc has landed

... disabled by default.

Firefox 3, released close to 4 years ago, came with its own memory allocator: Jemalloc. Jason Evans (original author, and to this day, still upstream maintainer) and our own Stuart Parmenter worked hard to have it work in Firefox for Windows and Linux. Sadly, a lot of this work stayed Firefox-only. Time passed, we added support for OSX, fixed various issues and added various features. Mostly, this all stayed Firefox-only.

By then, the original Jemalloc was a 0.x version. And while we have been busy growing our own fork, Jason has been busy growing Jemalloc. We've sometimes retrofitted new things from the new Jemalloc into ours, but all in all, both grew in very separate ways, and it was hard to benefit from each other's work.

During the past weeks, I've been working on getting the upstream Jemalloc development branch in shape so that we can use it in Firefox. It involved porting that codebase to Windows, fixing it for OSX, and fixing some other issues found along the way. All this work was incorporated upstream and is part of the latest Jemalloc 3.0.0 release.

What landed today on mozilla-central is a pristine copy of Jemalloc 3.0.0 and the necessary bits to build and link it in Firefox. It is however disabled by default until we fix all remaining issues. See the dependencies of bug 762449 for what is left to be done.

If you want to give a hand to make Jemalloc 3 the default, you first need to enable it at build time, which is achieved by adding the following to your .mozconfig:

export MOZ_JEMALLOC=1

Once Jemalloc 3 is the default, it will be straightforward to update our copy to use the latest from upstream.

2012-06-07 17:37:44+0900

p.m.o | 12 Comments »

With a little help from the kernel

[ Disclaimer: simplified, high-level view ahead. ]

When a program reads or writes data at a given virtual address, it uses instructions telling the CPU to do so. When the CPU doesn't know the address, it faults. When it knows the address, but its access rights don't allow the read or write operation the program wanted, it faults, too. Operating systems do trap these faults, and the system kernel handles them, allowing the program to continue.

As a virtual address can point to a wide range of different things, the kernel keeps track of which address ranges are backed by what. The most typical backing is physical memory: a given virtual address corresponds to a given physical RAM address.

Other typical backings include zero-memory (memory full of zero), copy-on-write, file-backed mappings (mmap with a file descriptor), etc. Or a combination of those.

When a file is mapped into memory by a program, the program may access data from that file through "standard" reads/writes to memory, and the kernel does its job of getting the data from disk, putting it in physical memory, and telling the CPU to look there.

When physical memory runs short, the kernel may choose to throw away anything that it can get back into physical memory later, like file-backed mappings, which can be read again from disk when needed. Another strategy is to move parts of physical memory to disk. This is "swapping" or "paging".

Anyways. When faulty.lib loads a library from a Zip archive, it reserves (shared) memory for the uncompressed library, and marks it as non-readable and non-writable. When code or data from the library is accessed, the kernel handles the CPU fault, and ends up throwing a segmentation fault signal (SIGSEGV) to the process. The process handles the signal, and fills the memory buffer with parts of the uncompressed library that are necessary, and flags them with the appropriate access rights. On further accesses to the same location, the already uncompressed data will be accessed directly.
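
The bookkeeping can be modelled with a toy class. The sketch below is only a simulation: there is no mprotect or SIGSEGV involved, and the per-page compression is an assumption made to keep the example short. A dictionary lookup miss plays the role of the fault, and the "handler" decompresses just the page that was touched.

```python
import zlib

PAGE = 4096


class LazyImage:
    """Toy model of a library that is decompressed page by page on demand."""

    def __init__(self, data):
        # Each page is compressed independently so it can be filled alone.
        self.chunks = [zlib.compress(data[i:i + PAGE])
                       for i in range(0, len(data), PAGE)]
        self.pages = {}   # page index -> decompressed bytes ("accessible")
        self.faults = 0   # number of times the "signal handler" ran

    def read(self, offset):
        index = offset // PAGE
        if index not in self.pages:
            # Stands in for the SIGSEGV handler filling the buffer.
            self.faults += 1
            self.pages[index] = zlib.decompress(self.chunks[index])
        return self.pages[index][offset % PAGE]


data = bytes(range(256)) * 64          # four pages of made-up content
image = LazyImage(data)
first = image.read(5000)               # first access: page 1 gets filled
second = image.read(5001)              # same page: no new fault
print(first == data[5000], second == data[5001], image.faults)
```

In this model, filled pages stay in the dictionary forever, which mirrors the downside described next: nothing ever gives the memory back.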

The downside of this approach is that besides paging/swapping, there is no way to get rid of the unused parts in case of memory pressure. And since Android devices don't do paging/swapping, it's effectively wasted memory.

The facility we're using on Android for that shared memory, ashmem (currently in staging for mainline kernel), has a mechanism that could almost help us: a program can "unpin" ashmem ranges, indicating to the kernel memory regions it is allowed to throw away when it is under memory pressure. Further accesses to memory that the kernel threw away are like accesses to anonymous memory for the first time: zeroed-out.

If the program does NULL checks, it can figure out whether the kernel may have thrown data away. But in faulty.lib's case, that's not quite possible. Any part of the code in a library may directly jump into a region that the kernel freed, and the resulting zeroed-out memory will just be executed instead of being filled.

So, in faulty.lib's case, it would be interesting if the kernel had a special backing for such userspace-filled memory regions, where it would consider throwing them away like it does for "unpinned" ashmem. Afterwards, accesses to these memory regions would trigger some signal for the program to fill the memory again.

The current proposal, now part of a plumber's wishlist thanks to Lennart Poettering, involves a new flag for madvise() and would make the kernel send a SIGBUS signal to a process when memory is accessed after the kernel has thrown it away. This proposal has received some interest from Andi Kleen.

And it would be useful for more than just faulty.lib: application caches (images, network, etc.) (although ashmem fulfills that need to some extent), JIT code, live decompression of content other than libraries, you name it.

2012-05-14 18:21:42+0900

faulty.lib | No Comments »

Rebuilding libxul made slightly easier, finally

One of the longstanding problems when modifying code in the Mozilla codebase is that when you change some file under e.g. content/, and you don't want to waste the time it takes to run a complete make -f client.mk, you need to build under content/, then layout/build/, and finally toolkit/library/. And you need to remember that (or use tools that remember for you).

Those days are finally over. After several attempts a year ago (!), and several more attempts during the past weeks, bug 644608 is finally on mozilla-central and is likely to stick this time. There may be some corner cases, in which case please file bugs.

Anyway, now you just need to build under e.g. content/ and toolkit/library/. No need to rebuild layout/build/ anymore.

2012-04-12 19:41:53+0900

p.m.o | 6 Comments »

libgcc.a symbol visibility considered harmful

I recently got to rebuild an Android NDK with a fresh toolchain again, and hit an interesting problem. I had actually hit it before, but only this time did I fully analyze what was going on. [As a side note, if you build such an NDK, don't use mpfr 3.1.0, as there is a bug in the libtool it ships.]

Linking an application or a library pulls in many things that aren't part of the code being built. One of these many things is the libgcc static library. Part of libgcc consists of an implementation of the platform ABI. On Android systems, this means the ARM EABI. GCC, when compiling some instructions, will generate ABI calls. For example, integer divisions may call __aeabi_idiv.

Consider the following minimized real world scenario:

$ echo "int foo(int a) { return 42 % a; }" > foo.c
$ arm-linux-androideabi-gcc -o libfoo.so -shared foo.c -mandroid

GCC will emit a call to __aeabi_idivmod for the % operation. With GCC 4.6.3, this function is in _divsi3.o under libgcc.a. That function itself calls __aeabi_idiv0, which lives in _dvmd_lnx.o under libgcc.a.

When statically linking, ld will thus include foo.o, _divsi3.o and _dvmd_lnx.o, meaning it will include all functions from these object files. That is, foo, __divsi3, __aeabi_idiv, __aeabi_idivmod, __aeabi_idiv0 and __aeabi_ldiv0. And more than being included, these functions are exported, because symbol visibility in libgcc.a is default. So while we expect exporting foo from our library, we're actually exporting much more, including functions that just happened to be near the ones that our code (indirectly) uses.

Now, let's say we want to build another library, using that foo function from libfoo:

$ cat > bar.c <<EOF
extern int foo(int a);
long long bar(long long a) { return foo(a) % a; }
EOF
$ arm-linux-androideabi-gcc -o libbar.so -shared bar.c -mandroid

(The code above has absolutely no meaning, it just triggers the same function calls as what I was getting in the actual real world case)

When statically linking the above code, GCC will generate a call to __aeabi_ldivmod, which calls __aeabi_ldiv0, and many other things, directly or indirectly. When linking as above, nothing particularly nasty is going to happen. However, linking as above is actually wrong: the resulting library has an undefined reference to the foo symbol, and doesn't depend on libfoo. At runtime, if libfoo wasn't already loaded somehow, loading libbar would fail.

The proper way to link is the following:

$ arm-linux-androideabi-gcc -o libbar.so -shared bar.c -mandroid -L. -lfoo

A feature of ELF static linking is that when it resolves undefined symbols, the linker will use the first occurrence of a symbol it finds in the various objects and libraries given on its command line. So with the command line above, for each __aeabi_* symbol, it will first check whether libfoo provides one. And while __aeabi_ldivmod is not in libfoo, __aeabi_ldiv0 is (see above).

So instead of including the code for __aeabi_ldiv0 from libgcc.a, it will call the copy from libfoo.
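
This first-match rule can be modelled in a few lines. The following is a toy model, not how ld is implemented (real archive handling is more involved): each input is reduced to the set of symbols it defines, and only the libgcc.a symbols relevant to this case are listed.

```python
def resolve(undefined, inputs):
    """Map each undefined symbol to the first input that defines it."""
    resolved = {}
    for symbol in undefined:
        for name, defined in inputs:
            if symbol in defined:
                resolved[symbol] = name
                break
    return resolved


# Symbols exported by libfoo.so, as listed earlier in this post.
libfoo = {"foo", "__divsi3", "__aeabi_idiv", "__aeabi_idivmod",
          "__aeabi_idiv0", "__aeabi_ldiv0"}
# Only the two libgcc.a symbols relevant here; the real archive has many more.
libgcc = {"__aeabi_ldivmod", "__aeabi_ldiv0"}

# Inputs in command-line order: -lfoo comes before the implicit libgcc.a.
result = resolve(["foo", "__aeabi_ldivmod", "__aeabi_ldiv0"],
                 [("libfoo.so", libfoo), ("libgcc.a", libgcc)])
for symbol, provider in result.items():
    print(symbol, "->", provider)
```

In this model, __aeabi_ldiv0 ends up bound to libfoo.so even though libgcc.a also defines it, simply because libfoo.so comes first.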

This wouldn't be so much of a problem if __aeabi_ldiv0 wasn't a weak symbol.

Enter faulty.lib. In the real world case, libfoo is loaded by the system dynamic linker, and libbar by faulty.lib. When resolving symbols for libbar, faulty.lib has to resolve libfoo symbols with the system linker, using dlsym(). On Android, dlsym() returns NULL for weak (defined) symbols, so faulty.lib can't resolve __aeabi_ldiv0.

The real world case wasn't a problem with GCC 4.4.3 from the vanilla Android NDK because in that GCC version, __aeabi_ldivmod doesn't call __aeabi_ldiv0.

This wouldn't happen if shared libraries didn't expose random platform ABI-specific bits depending on what they use and on which other symbols happen to be in the same object files.

A similar issue happened a little while ago on Debian powerpc because a shared library was exporting ABI specific bits. Even worse, the toolchain was assuming the symbols would come from libgcc.a and generated wrong relocations for these symbols.

Update: Interestingly, the __aeabi_* symbols are hidden in libgcc.a as provided on the Debian armel port.

2012-03-06 17:19:34+0900

faulty.lib, p.d.o, p.m.o | 5 Comments »