glandium.org » Blog Archive » Testing shared cache on try

Testing shared cache on try

After some success with the shared cache experiment (read about it, and some more), the next step was to get it working on the Mozilla continuous integration infrastructure, which revealed a couple of issues.

The first issue is that the DNS server used by our AWS build slaves is not the AWS DNS but our in-house DNS, which has two consequences:

- Whatever geolocation S3 does at the DNS level may return an S3 endpoint IP that is not optimal for the AWS region we're in, because it is based on the location of our in-house DNS.
- The round trip to the in-house DNS server was around 80ms, and because every compilation is an independent process, each one does its own DNS request and takes that 80ms hit. Note that, while suboptimal, doing a DNS request for each compilation also means getting different S3 endpoints, because the DNS round robin and geolocation that S3 uses give very different IPs every so often.

The consequence of this is that build times were very unstable, ranging from 11 minutes, as during my experiments, up to 45 minutes for a 99% cache hit build! After importing a DNS resolver in the shared cache script and making it use the AWS DNS, build times became much more stable, at between 11 and 12 minutes. (We do need the in-house DNS for normal operations on the build slaves, so switching /etc/resolv.conf is not an option.)
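To illustrate what "making the script use a specific DNS server" involves, here is a minimal sketch that sends an A record query straight to a chosen nameserver, bypassing /etc/resolv.conf. It is not the resolver the shared cache script imports: it uses only the Python standard library, the function names are made up for this example, and it ignores truncated responses, TCP fallback and retries.

```python
import random
import socket
import struct


def build_query(hostname, query_id):
    """Encode a DNS query for the A record of `hostname`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is a sequence of length-prefixed labels ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def skip_name(packet, offset):
    """Return the offset just past an encoded (possibly compressed) name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer, always 2 bytes
            return offset + 2
        offset += 1 + length


def parse_a_records(packet):
    """Extract the IPv4 addresses from the answer section of a response."""
    _, _, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", packet[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    addresses = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10])
        offset += 10
        # Keep A records only; CNAME and other record types are skipped.
        if rtype == 1 and rclass == 1 and rdlength == 4:
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return addresses


def resolve_with(nameserver, hostname, timeout=2.0):
    """Ask `nameserver` (an IP address) for the A records of `hostname`."""
    query_id = random.randrange(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname, query_id), (nameserver, 53))
        response, _ = sock.recvfrom(512)
    if struct.unpack("!H", response[:2])[0] != query_id:
        raise ValueError("DNS response id does not match the query")
    return parse_a_records(response)
```

A caller would pass the address of the DNS server it wants as `nameserver` and connect to one of the returned IPs. Since each compilation is a separate process, each one would still pay for one such query, only against a much closer server.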

The second issue is that the US Standard region for S3 can have quite high latency depending on the region you're connecting from. Our build slaves are located in Oregon and Northern Virginia, and while the slaves in Northern Virginia could reach S3 US Standard within 3ms, those in Oregon could only reach it within 90ms. Those numbers were unfortunately obtained with the in-house DNS, so geolocation may have affected them, but after switching DNS, build times on Oregon slaves were still much higher than on Northern Virginia slaves (~21 minutes vs. ~11 minutes). This led us to use one S3 bucket per region.
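A bucket per region boils down to a lookup keyed on where the slave runs. The sketch below is only an illustration: the bucket names are hypothetical (the real ones are not given here), and how a slave learns its own availability zone is left out and assumed to be known by the caller.

```python
# Hypothetical bucket names, for illustration only.
BUCKETS_BY_REGION = {
    "us-east-1": "example-shared-cache-us-east-1",  # Northern Virginia
    "us-west-2": "example-shared-cache-us-west-2",  # Oregon
}


def region_from_availability_zone(zone):
    """Turn an availability zone such as 'us-west-2b' into its region.

    An availability zone name is the region name plus a letter suffix.
    """
    return zone.rstrip("abcdefghijklmnopqrstuvwxyz")


def bucket_for_region(region):
    """Return the cache bucket for `region`, failing loudly if unknown."""
    try:
        return BUCKETS_BY_REGION[region]
    except KeyError:
        raise ValueError(
            "no shared cache bucket configured for region %r" % region)
```

Failing on an unknown region, rather than falling back to a default bucket, avoids silently reintroducing the cross-region latency described above.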

With those issues dealt with, we're now ready for more widespread testing, and as such I've enabled the shared cache for Linux opt, Linux debug, Linux64 opt and Linux64 debug builds, on try only, and only if the push contains the relevant setup, which landed in changeset a62bde1d6efe.
See my post on dev-tree-management for a few more details, notably if you hit bugs.

Please note this is only the beginning. More platforms will use the cache soon, including some that aren't currently using ccache. I also got some timing numbers during the initial tests on try that hint at the most immediate performance issues in the script that need addressing. So you can expect builds to get faster and faster as the cache populates, and as the script is improved with feedback from past experiments and the current deployment (I'll be collecting data from your try pushes). Relatedly, I'm working on build system improvements that should make the 'libs' step much faster.

2014-02-13 10:18:45+0900

