Archive for June, 2012

The tale of a weird crash, and a 2.5 year-old bug

10 days ago, I landed bug 616262, and Windows Mochitest-Other immediately turned perma-orange on an a11y test, in the form of a crash. It hadn't happened when I was testing the patch queue on Try, nor did it happen on PGO and debug builds on mozilla-central.

That looked like a good candidate for a compiler bug.

The first thing I tried to do is to find what particular change had made it orange, since it was green on my earlier attempts on Try. After some bisecting through Try pushes over the week-end, it turned out the changeset immediately after the one I had based my attempts on Try on was turning the test crashy. Unfortunately, it was a merge changeset, so I had to check the merged branch. After some more bisecting, it turned out that only four consecutive changesets were making the test non-crashy, and the one I had been using was the last one of them. Moreover, none of them was a11y-related.

That looked like a good candidate for a compiler bug.

Since I was at a dead-end trying to find some changeset that triggered the crash, and since using Try was already a slow process, I went ahead trying to reproduce locally... which didn't happen. I never upgraded MSVC on my Windows install, so I was still using 2005, while our build slaves now use 2010. So I upgraded MSVC to 2010, and finally was able to reproduce locally.

That looked like a good candidate for a MSVC 2010 bug.

The stack trace that MSVC was giving me when the crash was occurring was not very useful. The crash was supposedly happening in nsTSubstring_CharT::SetCapacity, called from Accessible::GetARIAName. Sometimes it would happen in arena_malloc, called from the same Accessible::GetARIAName. Unfortunately, the stack trace wasn't going higher, which suggested stack corruption.

Because an unoptimized build would not crash and because optimizations were making things hard to poke from within the debugger, I added some printfs in Accessible::GetARIAName. It didn't crash anymore.

That really looked like a good candidate for a MSVC 2010 bug.

Since the function was called a whole lot before actually crashing, I needed to determine what code in Accessible::GetARIAName was being reached before the crash. One of the things I tested was this patch:

--- a/accessible/src/generic/Accessible.cpp
+++ b/accessible/src/generic/Accessible.cpp
@@ -2429,6 +2429,8 @@ Accessible::GetARIAName(nsAString& aName
   if (NS_SUCCEEDED(rv)) {
     label.CompressWhitespace();
     aName = label;
+  } else {
+    MOZ_CRASH();
   }
 
   if (label.IsEmpty() &&

For those not familiar with Mozilla codebase, MOZ_CRASH, as its name suggests, triggers a crash. So if the else part is ever reached, the build would crash. It turns out it did... not. At all.

Comparing at assembly level, the functions with and without the patch above were strictly identical, except that one specific jump would go to the crashing code instead of going to the rest of the function. The crashing code wasn't even added within the function, but at the end.

At this point, I was pretty certain that the problem was, in fact, not in the function where the crash was occurring. The base observation was that adding code may un-trigger the crash, suggesting something weird happening depending on where some function appearing after Accessible::GetARIAName is located in the binary. Sure enough, reordering the source files in accessible/src/generic made the crash disappear with an unmodified Accessible::GetARIAName.

So, after validating that adding a small function after an unmodified Accessible::GetARIAName was not triggering the crash, I went on to find the last place where adding the function would not stop triggering the crash. Which I found to be that spot. With a dummy function before DocAccessible::ContentRemoved, the crash wouldn't occur, and with the same dummy function after, it would. But DocAccessible::ContentRemoved is empty, how can adding a function before or after make a difference?!?

Since these DocAccessible functions are good candidates for Identical Code Folding, I tried disabling it, and it surely did make things worse: the addition of the dummy function stopped "fixing" the crash.

That really looked like a good candidate for something that was going to be near impossible to debug.

At this point, I really wished I had a more reliable stack trace, and with fingers crossed (since so many factors were, in fact, "fixing" the crash), I tried again with frame pointers enabled. And fortunately, the crash was still happening. After some self-punishment for not having tried that earlier, I finally got a more meaningful stack trace showing the Accessible::GetARIAName (indirectly) coming from nsTextEquivUtils::AppendFromAccessible, itself being called from nsTextEquivUtils::AppendFromAccessibleChildren, itself called from nsTextEquivUtils::AppendFromAccessible.

Fortunately, the place indirectly calling Accessible::GetARIAName was reached much less often than Accessible::GetARIAName, so it was possible to use a breakpoint there, continue until the nth time, and then step until Accessible::GetARIAName. Finally I knew where and why the crash was happening: really in Accessible::GetARIAName, because mContent was NULL.

Obviously, when the crash doesn't happen, mContent is never NULL. After some fiddling with both builds with the crash and builds without, I found that nsTextEquivUtils::AppendFromAccessible was never called for an Accessible with a NULL mContent when the crash doesn't occur. Which makes sense, but makes the problem upper in the stack.

After more fiddling, and finding out that both crashing and non-crashing builds were initializing a nsHTMLWin32ObjectAccessible with a NULL mContent, I had to find why either the Accessible's mContent value changed in one case and not the other, or why nsTextEquivUtils::AppendFromAccessible was called for that nsHTMLWin32ObjectAccessible in one case and not the other.

And in the end, it turned out the difference was that the gRoleToNameRulesMap value for nsHTMLWin32ObjectAccessible's Role was wrong in one case and not the other, making nsTextEquivUtils::AppendFromAccessible being called or not.

So the root cause of all this nightmarish chase was that nsHTMLWin32ObjectAccessible's Role is ROLE_EMBEDDED_OBJECT, and that the gRoleToNameRulesMap array stopped at ROLE_GRID_CELL (which happens to be ROLE_EMBEDDED_OBJECT - 1).

The changeset that added ROLE_EMBEDDED_OBJECT but forgot to add the corresponding gRoleToNameRulesMap entry is 2.5 years old, and other additional roles have been added since then.

Conclusion, the problem was not the compiler doing something broken, but it was the linker laying things out differently in some cases, leading to different values returned when reading past gRoleToNameRulesMap, depending on what the linker put there.

It looks like we've been pretty (un?)lucky not to have been hit by this earlier. It's now fixed on all branches, and bug 616262 landed again, with a pretty green Windows Mochitest-Other.

As of writing, I still don't know why the stack trace was truncated in the first place, since the stack was, in fact, not corrupted. I think I don't care anymore.

2012-06-25 16:48:24+0900

p.m.o | 6 Comments »

Attempting to close a LinkedIn account

Following the trend, I attempted to close my LinkedIn account. Closing a LinkedIn account involves confirming and confirming and confirming again. Once it's all done, you'd expect to, well, be done with it.

I'm outraged at the result:

  • My public profile is still there. I can't be sure but I guess people with a connection to me can still see the full profile.
  • I'm still receiving LinkedIn connection emails (You, know, those "Learn about xxxxxxx, your new connection..." emails ; I must have had pending outgoing invitations).
  • I can still reset my password.
  • I can still login.

The only upside is that after I login, I can only see a page saying "Your LinkedIn account has been temporarily restricted". "Contact our customer service team to get this resolved as soon as possible."

Update: After contacting their customer service, the account was closed and the public profile is now unavailable.

2012-06-09 14:52:53+0900

p.d.o, p.m.o | 8 Comments »

A new Jemalloc has landed

... disabled by default.

Firefox 3, released close to 4 years ago, came with its own memory allocator: Jemalloc. Jason Evans (original author, and to this day, still upstream maintainer) and our own Stuart Parmenter worked hard to have it work in Firefox for Windows and Linux. Sadly, a lot of this work stayed Firefox-only. Time passed, we added support for OSX, fixed various issues and added various features. Mostly, this all stayed Firefox-only.

By then, the original Jemalloc was a 0.x version. And while we have been busy growing our own fork, Jason has been busy growing Jemalloc. We've sometimes retrofitted new things from the new Jemalloc into ours, but all in all, both grew in very separate ways, and it was hard to benefit from each other's work.

During the past weeks, I've been working on getting the upstream Jemalloc development branch in shape so that we can use it in Firefox. It involved porting that code-base to Windows, fixing it for OSX, and fix some other issues found on the way. All this work was incorporated upstream and is part of the latest Jemalloc 3.0.0 release.

What landed today on mozilla-central is a pristine copy of Jemalloc 3.0.0 and the necessary bits to build and link it in Firefox. It is however disabled by default until we fix all remaining issues. See the dependencies of bug 762449 for what is left to be done.

If you want to give a hand to make Jemalloc 3 the default, you first need to enable it at build time, which is achieved by adding the following to your .mozconfig:

export MOZ_JEMALLOC=1

Once Jemalloc 3 is the default, it will be straightforward to update our copy to use the latest from upstream.

2012-06-07 17:37:44+0900

p.m.o | 12 Comments »

Iceweasel ESR in squeeze-backports

In case this went unnoticed, Iceweasel ESR has been available in squeeze-backports for a few weeks, now. I highly recommend anyone using Iceweasel on the Debian stable release to upgrade to that version, at the very least. Even newer versions are available through the pkg-mozilla archive.

2012-06-02 11:33:08+0900

firefox | 9 Comments »