The tale of a weird crash, and a 2.5 year-old bug
10 days ago, I landed bug 616262, and Windows Mochitest-Other immediately turned perma-orange on an a11y test, in the form of a crash. It hadn't happened when I was testing the patch queue on Try, nor did it happen on PGO and debug builds on mozilla-central.
That looked like a good candidate for a compiler bug.
The first thing I tried to do is to find what particular change had made it orange, since it was green on my earlier attempts on Try. After some bisecting through Try pushes over the week-end, it turned out the changeset immediately after the one I had based my attempts on Try on was turning the test crashy. Unfortunately, it was a merge changeset, so I had to check the merged branch. After some more bisecting, it turned out that only four consecutive changesets were making the test non-crashy, and the one I had been using was the last one of them. Moreover, none of them was a11y-related.
That looked like a good candidate for a compiler bug.
Since I was at a dead-end trying to find some changeset that triggered the crash, and since using Try was already a slow process, I went ahead trying to reproduce locally... which didn't happen. I never upgraded MSVC on my Windows install, so I was still using 2005, while our build slaves now use 2010. So I upgraded MSVC to 2010, and finally was able to reproduce locally.
That looked like a good candidate for a MSVC 2010 bug.
The stack trace that MSVC was giving me when the crash was occurring was not very useful. The crash was supposedly happening in nsTSubstring_CharT::SetCapacity
, called from Accessible::GetARIAName
. Sometimes it would happen in arena_malloc
, called from the same Accessible::GetARIAName
. Unfortunately, the stack trace wasn't going higher, which suggested stack corruption.
Because an unoptimized build would not crash and because optimizations were making things hard to poke from within the debugger, I added some printf
s in Accessible::GetARIAName
. It didn't crash anymore.
That really looked like a good candidate for a MSVC 2010 bug.
Since the function was called a whole lot before actually crashing, I needed to determine what code in Accessible::GetARIAName
was being reached before the crash. One of the things I tested was this patch:
--- a/accessible/src/generic/Accessible.cpp +++ b/accessible/src/generic/Accessible.cpp @@ -2429,6 +2429,8 @@ Accessible::GetARIAName(nsAString& aName if (NS_SUCCEEDED(rv)) { label.CompressWhitespace(); aName = label; + } else { + MOZ_CRASH(); } if (label.IsEmpty() &&
For those not familiar with Mozilla codebase, MOZ_CRASH
, as its name suggests, triggers a crash. So if the else
part is ever reached, the build would crash. It turns out it did... not. At all.
Comparing at assembly level, the functions with and without the patch above were strictly identical, except that one specific jump would go to the crashing code instead of going to the rest of the function. The crashing code wasn't even added within the function, but at the end.
At this point, I was pretty certain that the problem was, in fact, not in the function where the crash was occurring. The base observation was that adding code may un-trigger the crash, suggesting something weird happening depending on where some function appearing after Accessible::GetARIAName
is located in the binary. Sure enough, reordering the source files in accessible/src/generic
made the crash disappear with an unmodified Accessible::GetARIAName
.
So, after validating that adding a small function after an unmodified Accessible::GetARIAName
was not triggering the crash, I went on to find the last place where adding the function would not stop triggering the crash. Which I found to be that spot. With a dummy function before DocAccessible::ContentRemoved
, the crash wouldn't occur, and with the same dummy function after, it would. But DocAccessible::ContentRemoved
is empty, how can adding a function before or after make a difference?!?
Since these DocAccessible
functions are good candidates for Identical Code Folding, I tried disabling it, and it surely did make things worse: the addition of the dummy function stopped "fixing" the crash.
That really looked like a good candidate for something that was going to be near impossible to debug.
At this point, I really wished I had a more reliable stack trace, and with fingers crossed (since so many factors were, in fact, "fixing" the crash), I tried again with frame pointers enabled. And fortunately, the crash was still happening. After some self-punishment for not having tried that earlier, I finally got a more meaningful stack trace showing the Accessible::GetARIAName
(indirectly) coming from nsTextEquivUtils::AppendFromAccessible
, itself being called from nsTextEquivUtils::AppendFromAccessibleChildren
, itself called from nsTextEquivUtils::AppendFromAccessible
.
Fortunately, the place indirectly calling Accessible::GetARIAName
was reached much less often than Accessible::GetARIAName
, so it was possible to use a breakpoint there, continue until the nth time, and then step until Accessible::GetARIAName
. Finally I knew where and why the crash was happening: really in Accessible::GetARIAName
, because mContent
was NULL.
Obviously, when the crash doesn't happen, mContent
is never NULL. After some fiddling with both builds with the crash and builds without, I found that nsTextEquivUtils::AppendFromAccessible
was never called for an Accessible with a NULL mContent
when the crash doesn't occur. Which makes sense, but makes the problem upper in the stack.
After more fiddling, and finding out that both crashing and non-crashing builds were initializing a nsHTMLWin32ObjectAccessible with a NULL mContent
, I had to find why either the Accessible's mContent
value changed in one case and not the other, or why nsTextEquivUtils::AppendFromAccessible
was called for that nsHTMLWin32ObjectAccessible
in one case and not the other.
And in the end, it turned out the difference was that the gRoleToNameRulesMap
value for nsHTMLWin32ObjectAccessible
's Role was wrong in one case and not the other, making nsTextEquivUtils::AppendFromAccessible
being called or not.
So the root cause of all this nightmarish chase was that nsHTMLWin32ObjectAccessible
's Role is ROLE_EMBEDDED_OBJECT, and that the gRoleToNameRulesMap
array stopped at ROLE_GRID_CELL (which happens to be ROLE_EMBEDDED_OBJECT - 1).
The changeset that added ROLE_EMBEDDED_OBJECT but forgot to add the corresponding gRoleToNameRulesMap
entry is 2.5 years old, and other additional roles have been added since then.
Conclusion, the problem was not the compiler doing something broken, but it was the linker laying things out differently in some cases, leading to different values returned when reading past gRoleToNameRulesMap
, depending on what the linker put there.
It looks like we've been pretty (un?)lucky not to have been hit by this earlier. It's now fixed on all branches, and bug 616262 landed again, with a pretty green Windows Mochitest-Other.
As of writing, I still don't know why the stack trace was truncated in the first place, since the stack was, in fact, not corrupted. I think I don't care anymore.
2012-06-25 16:48:24+0900
Responses are currently closed, but you can trackback from your own site.
2012-06-25 17:27:18+0900
Is there a way to add a static check to the build to verify that all roles and role maps correspond (cover all cases)? In the spirit of the compiler warning for a missing case for an enum in a switch statement. The compiler won’t catch this for you, but some static analyzer might.
2012-06-25 17:47:34+0900
Z.T.: See https://bugzilla.mozilla.org/show_bug.cgi?id=716644 and https://bugzilla.mozilla.org/show_bug.cgi?id=766240
2012-06-25 18:20:02+0900
Mike, this is really amazing! Thank you again for your persistence in tracking this one down and finding the root cause! I believe this made sure that none from the accessibility team will ever forget adding new role mappings to that array again! :)
2012-06-25 18:22:35+0900
Marco: I just wanted to land bug 616262 :)
2012-06-25 18:31:43+0900
Thanks again Mike. Great work. (https://bugzilla.mozilla.org/show_bug.cgi?id=765172#c19)
2012-07-04 19:05:35+0900
“As of writing, I still don’t know why the stack trace was truncated in the first place, since the stack was, in fact, not corrupted. I think I don’t care anymore.”
Still would be interesting in that it could affect other crashes being reported wrongly (with incomplete stacks) and we might be able to improve things if we find out what the cause here was.
Having fixed the crash is more important, though, thanks for being persistent and digging this out! :)