{"id":2601,"date":"2012-06-25T16:48:24","date_gmt":"2012-06-25T14:48:24","guid":{"rendered":"http:\/\/glandium.org\/blog\/?p=2601"},"modified":"2019-09-03T15:30:35","modified_gmt":"2019-09-03T06:30:35","slug":"the-tale-of-a-weird-crash-and-a-2-5-year-old-bug","status":"publish","type":"post","link":"https:\/\/glandium.org\/blog\/?p=2601","title":{"rendered":"The tale of a weird crash, and a 2.5 year-old bug"},"content":{"rendered":"<p>10 days ago, I landed <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=616262\">bug 616262<\/a>, and Windows Mochitest-Other immediately turned <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=765172\">perma-orange on an a11y test<\/a>, in the form of a crash. It hadn't happened when I was testing the patch queue on Try, nor did it happen on PGO and debug builds on mozilla-central.<\/p>\n<p>That looked like a good candidate for a compiler bug.<\/p>\n<p>The first thing I tried to do is to find what particular change had made it orange, since it was green on my earlier attempts on Try. After some bisecting through Try pushes over the week-end, it turned out the changeset immediately after the one I had based my attempts on Try on was turning the test crashy. Unfortunately, it was a merge changeset, so I had to check the merged branch. After some more bisecting, it turned out that <em>only four<\/em> consecutive changesets were making the test <em>non<\/em>-crashy, and the one I had been using was the last one of them. Moreover, <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/rev\/8cf5423b0213\">none<\/a> of <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/rev\/b30e903d23a0\">them<\/a> was <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/rev\/7625b37383fe\">a11y<\/a>-<a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/rev\/3f408698a03f\">related<\/a>.<\/p>\n<p>That looked like a good candidate for a compiler bug.<\/p>\n<p>Since I was at a dead-end trying to find some changeset that triggered the crash, and since using Try was already a slow process, I went ahead trying to reproduce locally... which didn't happen. I never upgraded MSVC on my Windows install, so I was still using 2005, while our build slaves now use 2010. So I upgraded MSVC to 2010, and finally was able to reproduce locally.<\/p>\n<p>That looked like a good candidate for a MSVC 2010 bug.<\/p>\n<p>The stack trace that MSVC was giving me when the crash was occurring was not very useful. The crash was supposedly happening in <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/xpcom\/string\/src\/nsTSubstring.cpp#l537\"><code>nsTSubstring_CharT::SetCapacity<\/code><\/a>, called from <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>. Sometimes it would happen in <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/memory\/mozjemalloc\/jemalloc.c#l4190\"><code>arena_malloc<\/code><\/a>, called from the same <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>. Unfortunately, the stack trace wasn't going higher, which suggested stack corruption.<\/p>\n<p>Because an unoptimized build would not crash and because optimizations were making things hard to poke from within the debugger, I added some <code>printf<\/code>s in <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>. It didn't crash anymore.<\/p>\n<p>That really looked like a good candidate for a MSVC 2010 bug.<\/p>\n<p>Since the function was called a whole lot before actually crashing, I needed to determine what code in <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a> was being reached before the crash. One of the things I tested was this patch:<\/p>\n<blockquote>\n<pre>--- a\/accessible\/src\/generic\/Accessible.cpp\r\n+++ b\/accessible\/src\/generic\/Accessible.cpp\r\n@@ -2429,6 +2429,8 @@ Accessible::GetARIAName(nsAString& aName\r\n   if (NS_SUCCEEDED(rv)) {\r\n     label.CompressWhitespace();\r\n     aName = label;\r\n+  } else {\r\n+    MOZ_CRASH();\r\n   }\r\n \r\n   if (label.IsEmpty() &&\r\n<\/pre>\n<\/blockquote>\n<p>For those not familiar with Mozilla codebase, <code>MOZ_CRASH<\/code>, as its name suggests, triggers a crash. So if the <code>else<\/code> part is ever reached, the build would crash. It turns out it did... not. At all.<\/p>\n<p>Comparing at assembly level, the functions with and without the patch above were strictly identical, except that one specific jump would go to the crashing code instead of going to the rest of the function. The crashing code wasn't even added within the function, but at the end.<\/p>\n<p>At this point, I was pretty certain that the problem was, in fact, <em>not<\/em> in the function where the crash was occurring. The base observation was that adding code may un-trigger the crash, suggesting something weird happening depending on where some function appearing after <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a> is located in the binary. Sure enough, reordering the source files in <code>accessible\/src\/generic<\/code> made the crash disappear with an unmodified <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>.<\/p>\n<p>So, after validating that adding a small function after an unmodified <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a> was not triggering the crash, I went on to find the last place where adding the function would not stop triggering the crash. Which I found to be <a href=\"https:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/DocAccessible.cpp#l1297\">that spot<\/a>. With a dummy function before <code>DocAccessible::ContentRemoved<\/code>, the crash wouldn't occur, and with the same dummy function after, it would. But <code>DocAccessible::ContentRemoved<\/code> is empty, how can adding a function before or after make a difference?!?<\/p>\n<p>Since these <code>DocAccessible<\/code> functions are good candidates for <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bxwfs976%28v=vs.100%29.aspx\">Identical Code Folding<\/a>, I tried disabling it, and it surely did make things worse: the addition of the dummy function stopped \"fixing\" the crash.<\/p>\n<p>That really looked like a good candidate for something that was going to be near impossible to debug.<\/p>\n<p>At this point, I really wished I had a more reliable stack trace, and with fingers crossed (since so many factors were, in fact, \"fixing\" the crash), I tried again with frame pointers enabled. And fortunately, the crash was still happening. After some self-punishment for not having tried that earlier, I finally got a more meaningful stack trace showing the <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a> (indirectly) coming from <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l189\"><code>nsTextEquivUtils::AppendFromAccessible<\/code><\/a>, itself being called from <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l173\"><code>nsTextEquivUtils::AppendFromAccessibleChildren<\/code><\/a>, itself called from <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l189\"><code>nsTextEquivUtils::AppendFromAccessible<\/code><\/a>.<\/p>\n<p>Fortunately, the place indirectly <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l205\">calling <code>Accessible::GetARIAName<\/code><\/a> was reached much less often than <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>, so it was possible to use a breakpoint <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l205\">there<\/a>, continue until the <em>n<\/em>th time, and then step until <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2422\"><code>Accessible::GetARIAName<\/code><\/a>. Finally I knew where and why the crash was happening: really <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/generic\/Accessible.cpp#l2435\">in <code>Accessible::GetARIAName<\/code><\/a>, because <code>mContent<\/code> was NULL.<\/p>\n<p>Obviously, when the crash doesn't happen, <code>mContent<\/code> is never NULL. After some fiddling with both builds with the crash and builds without, I found that <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l189\"><code>nsTextEquivUtils::AppendFromAccessible<\/code><\/a> was never called for an Accessible with a NULL <code>mContent<\/code> when the crash doesn't occur. Which makes sense, but makes the problem upper in the stack.<\/p>\n<p>After more fiddling, and finding out that both crashing and non-crashing builds were initializing a <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/msaa\/nsHTMLWin32ObjectAccessible.cpp#l68\">nsHTMLWin32ObjectAccessible<\/a> with a NULL <code>mContent<\/code>, I had to find why either the Accessible's <code>mContent<\/code> value changed in one case and not the other, or why <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l189\"><code>nsTextEquivUtils::AppendFromAccessible<\/code><\/a> was called for that <code>nsHTMLWin32ObjectAccessible<\/code> in one case and not the other.<\/p>\n<p>And in the end, it turned out the difference was that the <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l372\"><code>gRoleToNameRulesMap<\/code><\/a> value for <code>nsHTMLWin32ObjectAccessible<\/code>'s Role was wrong in one case and not the other, making <code>nsTextEquivUtils::AppendFromAccessible<\/code> <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l219\">being called<\/a> or not.<\/p>\n<p>So the root cause of all this nightmarish chase was that <code>nsHTMLWin32ObjectAccessible<\/code>'s Role is <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/public\/nsIAccessibleRole.idl#l746\">ROLE_EMBEDDED_OBJECT<\/a>, and that the <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l372\"><code>gRoleToNameRulesMap<\/code><\/a> array stopped at <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l495\">ROLE_GRID_CELL<\/a> (which happens to be <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/public\/nsIAccessibleRole.idl#l741\">ROLE_EMBEDDED_OBJECT - 1<\/a>).<\/p>\n<p>The changeset that added <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/public\/nsIAccessibleRole.idl#l746\">ROLE_EMBEDDED_OBJECT<\/a> but forgot to add the corresponding <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l372\"><code>gRoleToNameRulesMap<\/code><\/a> entry is <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=485270\">2.5 years old<\/a>, and other additional roles have been added since then.<\/p>\n<p>Conclusion, the problem was not the compiler doing something broken, but it was the linker laying things out differently in some cases, leading to different values returned when reading past <a href=\"http:\/\/hg.mozilla.org\/mozilla-central\/file\/a81526647059\/accessible\/src\/base\/nsTextEquivUtils.cpp#l372\"><code>gRoleToNameRulesMap<\/code><\/a>, depending on what the linker put there.<\/p>\n<p>It looks like we've been pretty (un?)lucky not to have been hit by this earlier. It's now fixed on all branches, and bug 616262 landed again, with a pretty green Windows Mochitest-Other.<\/p>\n<p>As of writing, I still don't know why the stack trace was truncated in the first place, since the stack was, in fact, not corrupted. I think I don't care anymore.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>10 days ago, I landed bug 616262, and Windows Mochitest-Other immediately turned perma-orange on an a11y test, in the form of a crash. It hadn&#8217;t happened when I was testing the patch queue on Try, nor did it happen on PGO and debug builds on mozilla-central. That looked like a good candidate for a compiler [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[23],"class_list":["post-2601","post","type-post","status-publish","format-standard","hentry","category-planet-mozilla","tag-en"],"_links":{"self":[{"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2601","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2601"}],"version-history":[{"count":34,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2601\/revisions"}],"predecessor-version":[{"id":2636,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2601\/revisions\/2636"}],"wp:attachment":[{"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2601"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2601"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/glandium.org\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2601"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}