Note: There wasn’t much news about Windows servers getting hit by the leap second problem. If this news has made you wary of Linux, try NetHosting’s Windows virtual hosting to get the familiarity of Microsoft.
On Saturday night, a leap second was added to Coordinated Universal Time (UTC), the time standard kept by atomic clocks. The added second sent many Linux servers spiraling into downtime.
Aside from the massive outages at Amazon’s data centers that happened over the weekend, another chunk of the Internet was taken down on Saturday night. Large websites like Reddit, Mozilla, and Gawker all experienced downtime last weekend as a leap second took down a number of services.
Just as a leap day is added to our calendar roughly every four years because the earth’s trip around the sun does not take a whole number of days, a leap second is added, albeit less regularly, to account for the variable speed of the planet’s spin. Leap seconds are not new to modern timekeeping: the first was added in 1972, and the two most recent before this weekend came at the end of 2005 and the end of 2008. This latest leap second was added at the very end of June 30th, 2012 (UTC), just before July 1st began.
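To make the extra second concrete, here is a minimal Python sketch. It assumes only the standard library's `datetime` module; the helper name `try_leap_second` is made up for this illustration. It shows two things: the leap second's official label, 23:59:60, cannot be represented as a `datetime`, and Unix-style timestamps pretend every day has exactly 86,400 seconds, so the extra second simply vanishes from the count.

```python
from datetime import datetime, timezone


def try_leap_second():
    """Try to build 2012-06-30 23:59:60 UTC; return the error text if it fails."""
    try:
        datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
    except ValueError as exc:
        # datetime only accepts seconds in the range 0..59
        return str(exc)
    return None


# The last ordinary second of June 30th and the first second of July 1st (UTC).
before = datetime(2012, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# Two seconds of real time separated these instants (23:59:59, 23:59:60, 00:00:00),
# but the timestamps differ by only one because leap seconds are not counted.
gap = after.timestamp() - before.timestamp()

print("leap second rejected:", try_leap_second() is not None)
print("timestamp gap in seconds:", gap)
```

Because the timestamp has no slot for the extra second, a system clock has to absorb it somehow, typically by repeating or stretching a second, and that adjustment is where software can get confused.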
Note: In the midst of all of these reports of server downtime this weekend, you need to go with a hosting provider that can guarantee you and your servers good uptime. Like 100% uptime. Try NetHosting’s cloud hosting today.
The problem is that a lot of software was built without taking the leap second into consideration. Linus Torvalds, creator of the original Linux kernel, reported that a kernel subsystem called the high-resolution timer (hrtimer for short) was confused by the extra second, which caused hyperactivity on servers and in turn overloaded their central processing units (CPUs). Trouble began even before the leap second actually occurred, on Friday night. The Network Time Protocol (NTP), which servers use to keep their clocks synchronized, announced the coming leap second a day early, but programs relying on NTP didn’t know how to handle the looming shift in time and began crashing.
One Linux kernel programmer named John Stultz foresaw the problem that would arise with the leap second. In March of this year, the developer patched a fix for the leap second bug into the Linux kernel code. However, that update had not yet been pushed out to all versions of Linux by last Saturday. Thankfully, the vast majority of servers in the NetHosting data center did have this update and experienced no downtime due to this glitch.
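The “hyperactivity” can be pictured with a toy model. The sketch below is not kernel code. It assumes, for illustration only, that after the leap second the timer code’s idea of the current time ended up about one second ahead of the wall clock. The function name `simulate_wait_loop` and all of the numbers are hypothetical. Under that assumption, any wait shorter than one second looks as if it has already expired, so a program that loops on short timed waits never actually sleeps and spins at full speed.

```python
def simulate_wait_loop(iterations, timeout_s, timer_skew_s):
    """Return the simulated seconds actually slept across `iterations` timed waits.

    timer_skew_s is how far ahead of the wall clock the timer code believes
    it is (an assumption of this toy model, not a measured value).
    """
    real_now = 0.0
    slept = 0.0
    for _ in range(iterations):
        expiry = real_now + timeout_s        # deadline computed from the wall clock
        timer_now = real_now + timer_skew_s  # the timer code's (possibly stale) view of "now"
        # If the timer believes the deadline has already passed, the wait returns at once.
        sleep_for = max(0.0, expiry - timer_now)
        real_now += sleep_for
        slept += sleep_for
    return slept


# Healthy clock: 1,000 waits of 0.1 s each sleep for roughly 100 s in total.
healthy = simulate_wait_loop(1000, 0.1, timer_skew_s=0.0)

# Skewed clock: every 0.1 s wait appears already expired, so the loop never sleeps.
skewed = simulate_wait_loop(1000, 0.1, timer_skew_s=1.0)

print("healthy:", round(healthy, 3), "seconds slept")
print("skewed:", skewed, "seconds slept")
```

A loop that was meant to wake ten times a second instead runs as fast as the processor allows, which is consistent with the pegged CPUs that administrators saw.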
Some might think an extra second isn’t something that really needs to be accounted for. Proponents of accounting for changes in the earth’s rotation with an extra second here and there claim that making small adjustments now will prevent massive time shifts in the future, like sunsets at ten in the morning. One additional wrench in the problem is that no one can predict far in advance when the next leap second will be. It all depends on how fast or slow the earth spins, which in turn depends on tides, the molten core, and the weather.
The leap second has happened before, and it has caused problems before. When the previous leap second arrived at the turn of 2009, it caused problems for the Solaris operating system (developed by Sun Microsystems), as well as for an Oracle software package. Torvalds was quoted as saying “Almost every time we have a leap second, we find something [to fix]. It’s really annoying, because it’s a classic case of code that is basically never run, and thus not tested by users under normal conditions.”
The problems weren’t limited to Linux itself, however. Servers running the open source database software Cassandra and servers running Java saw some of the same problems on Saturday. Some engineers speculated that Cassandra was unable to pause Java processes that were stuck in loops, and those loops used all of the CPU power on the servers the software was running on. Most websites reported that the problems were resolved after the affected servers were rebooted. Reddit reported having this problem with its servers and said that the site experienced thirty to forty minutes of inoperability followed by another hour and a half of downtime.
Even so, the problem still seems to trace back to Linux. Mozilla reported problems with Hadoop and said its Tomcat servers were getting hung up. Both Tomcat and Hadoop are commonly run on top of Linux.
If you’re interested in reading more about Reddit and how the site usually operates smoothly when leap seconds aren’t shutting it down, check out our case study about the news aggregation and discussion site.