Note: Our good friend André Arko, who also happens to be one of the maintainers of Bundler, wrote this great post for us about the state of Rubygems. He is currently working with a grant from Ruby Central to make Rubygems awesome. Yay for community posts, and thanks, André!
Bundler, Rubygems, and rubygems.org are vital infrastructure that every Rubyist uses just about every day. Over the last year, that infrastructure has seen a huge amount of change. This is a review of those changes, an update on where things are now, and an explanation of where we’re going soon.
So, what happened last year?
Playing it a little fast and loose with the definition of "year", last October rubygems.org went down, in a big way. Each time the site was brought back up, it stayed up for only a few seconds before everything went down again. We eventually discovered that the problem was the dependency API used by Bundler to speed up installs. The dependency API is database- and CPU-intensive, and there were so many users that the rubygems.org server couldn't handle the load anymore. I gave a talk at Gotham Ruby with a lot of detail about what the problems were and how we fixed them. The summary is that the Bundler API was rebuilt as a separate Sinatra app, and we throw a lot more CPU and database resources at it now than we used to.
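For the curious, here's a minimal sketch of what a dependency API request looks like from the client side. The hostname and the Marshal-encoded response fields reflect my understanding of the public /api/v1/dependencies endpoint, so treat the details as illustrative rather than definitive:

```ruby
require "net/http"
require "uri"

# Ask the dependency API for every published version of the named gems,
# along with each version's own dependencies. This is the query Bundler
# makes so it can resolve a Gemfile without downloading the full index.
uri = URI("https://bundler.rubygems.org/api/v1/dependencies?gems=rack,rake")
response = Net::HTTP.get(uri)

# The server answers with a Marshal-encoded array of hashes. (Marshal.load
# is only safe here because we trust the endpoint; never use it on
# arbitrary remote data.)
Marshal.load(response).each do |info|
  puts "#{info[:name]} #{info[:number]} => #{info[:dependencies].inspect}"
end
```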
The next major event came at the end of January, when someone exploited a YAML security vulnerability to gain unauthorized access to the server hosting rubygems.org. That meant that, potentially, any gem could have been replaced with a Trojan horse that compromised machines as it was installed. Every gem had to be verified against another copy of that gem from mirrors that were not compromised. Then, the single server hosting all of rubygems.org was decommissioned. New infrastructure was built on Amazon's EC2, with redundant servers for failover, managed by Chef recipes that are open source in the rubygems/rubygems-aws repo. One significant upside to this change is that the community can now contribute fixes and improvements to the servers that rubygems.org runs on, which was never possible before.
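The verification itself boils down to comparing checksums. A hypothetical sketch of that kind of check follows; the mirror hostname is made up, and the real cleanup was considerably more involved:

```ruby
require "digest"
require "open-uri"

# Compare the checksum of a gem on the primary server with the same gem
# fetched from an uncompromised mirror. "mirror.example.com" is a
# placeholder, not a real mirror.
def checksums_match?(gem_file)
  primary = URI.open("https://rubygems.org/gems/#{gem_file}", "rb").read
  mirror  = URI.open("https://mirror.example.com/gems/#{gem_file}", "rb").read
  Digest::SHA256.hexdigest(primary) == Digest::SHA256.hexdigest(mirror)
end

puts checksums_match?("rake-10.1.0.gem") ? "verified" : "POSSIBLY TAMPERED"
```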
The other significant issues this year were more diffuse. Is everybody familiar with Travis, the hosted continuous integration service? Travis runs the tests for many open source projects, and it was experiencing seriously degraded network connections to rubygems.org, which caused a huge number of builds to fail simply because of dropped or failed connections. After a lot of investigation, it turned out that the Travis network issues were a DNS configuration problem: the Travis VMs were hard-coded to use DNS servers on the opposite side of the country from the VMs themselves. As you may already know, gems are hosted on Amazon's S3 storage service and served via Amazon's CloudFront content delivery network. CloudFront uses the location of your DNS servers to decide which edge server it should tell you to download from. That meant Travis jobs were always being told to download every single gem from all the way across the country, instead of from servers in a nearby datacenter. Once that DNS issue was resolved, Travis build reliability shot up, and it has been steady since.
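You can see this resolver-based routing for yourself by asking DNS servers in different places for the same CloudFront hostname and comparing the answers. A small sketch, assuming gems are still served from a hostname like the one below:

```ruby
require "resolv"

# CloudFront hands back edge-server IPs based on where your DNS resolver
# sits, so the same hostname can resolve differently through different
# resolvers. The hostname is an assumption about the distribution that
# served gems at the time; the resolvers are well-known public ones.
hostname = "production.cf.rubygems.org"

["8.8.8.8", "208.67.222.222"].each do |nameserver|
  resolver  = Resolv::DNS.new(nameserver: [nameserver])
  addresses = resolver.getaddresses(hostname).map(&:to_s)
  puts "#{nameserver} => #{addresses.join(', ')}"
end
```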
The final major issues this year were all related to SSL, the protocol used to provide secure HTTP connections. In order to make an HTTPS connection, the client machine must have the certificate authority (CA) certificate needed to verify the certificate presented by the server. While recent Macs had most of those CA certificates built in, many Linux and Windows machines did not. Compounding the problem, some S3 endpoints recently started using new certificates that couldn't be verified by every Mac, either. Making everything more confusing, right around the time the certificate issue happened, a separate issue caused connections to fail just as they started. It looked similar, but had a completely different cause. We solved the certificate issue by including the needed certificates in Rubygems and Bundler directly. The connection failure turned out to be a connection timeout set to only a few seconds, which was not enough time to establish a connection over a laggy internet link. Increasing the timeout resolved that issue, too.
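On the client side, both fixes amount to a couple of lines of configuration on a standard Ruby HTTP client. Here's a sketch using Net::HTTP; the CA bundle path is an assumption, so point it at whatever PEM bundle your system actually provides:

```ruby
require "net/http"
require "openssl"
require "uri"

uri = URI("https://rubygems.org/latest_specs.4.8.gz")

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = true
http.verify_mode  = OpenSSL::SSL::VERIFY_PEER
# Point at an explicit CA bundle so verification doesn't depend on what
# certificates happen to ship with the OS. (Path is an assumption.)
http.ca_file      = "/etc/ssl/certs/ca-certificates.crt"
# Give slow or laggy connections enough time to complete the handshake,
# instead of failing after only a few seconds.
http.open_timeout = 30

response = http.get(uri.path)
puts response.code
```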