Today, Netflix put its Chaos Monkey software on GitHub for free, under the Apache license.
Every week, Netflix receives over one thousand attacks on its network. Few people know that the streaming video company is doing that to itself. When a network fails, a business fails, so Netflix is trying to pre-empt failure as much as possible. Instead of waiting for unknown attackers to exploit a problem the company never knew existed, it’s hedging its bets and employing software that simulates attacks and outages so developers can practice restoring service as quickly as possible.
The program used to test Netflix employees is called Chaos Monkey, and it is just one piece of what Netflix calls its Simian Army. Other software that helps the company anticipate problems and fix them before they become major includes Chaos Gorilla, Conformity Monkey, Latency Monkey, and more. Although this sounds somewhat novel, Netflix has been doing it for quite some time; Chaos Monkey is in the headlines again today because Netflix just released it to the public. The Chaos Monkey source code is now on GitHub for free, under the Apache license.
The biggest question readers might have (or at least one of the biggest questions I had when reading about this today) is how Netflix can handle these outages when thousands upon thousands of customers are using Netflix all the time. Apparently, Chaos Monkey works at such a level that the outages it causes are small enough that users won’t notice. The idea is that small outages now will prevent huge outages later. Chaos Gorilla, by contrast, was developed to simulate the outage of an entire Amazon region, because practice makes perfect.
Netflix uses the Amazon Web Services cloud to stream its television shows and movies. Despite the extra redundancy of the cloud, it’s still hard to guarantee that data center tenants will never see a failure, which is why being prepared is so important to big corporations like Netflix (and to small business owners as well).
In a blog post today, Netflix explained that Chaos Monkey is “a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables – all the while we continue serving our customers without interruption.”
To be more specific, the randomly disabled “production instances” are virtual machines that Netflix uses. The company also uses Amazon’s Auto Scaling service, however, which is designed to recognize that a virtual machine has gone down and boot up another one to take its place, without client-side users being able to discern a hiccup in service. Over the course of the past year, Chaos Monkey has tallied up over 65,000 shutdowns of virtual machines (not just in production, but in testing environments as well). Each shutdown that exposes a weakness lets engineers find the problem, fix it immediately, and then orchestrate a long-term solution where applicable. Problems have ranged from a bad patch to a bad load balancer.
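The interplay between random shutdowns and automatic replacement can be illustrated with a small simulation. This is only a rough sketch of the idea, not Netflix’s code and not the Amazon Auto Scaling API: the InstanceGroup class, its heal method, and the unleash_monkey function are hypothetical names invented for this example.

```python
import random


class InstanceGroup:
    """Toy stand-in for an auto-scaled group of virtual machines (hypothetical)."""

    def __init__(self, name, desired_size):
        self.name = name
        self.desired_size = desired_size
        self._next_id = 0
        self.instances = set()
        self.heal()  # start the group at its desired size

    def _launch(self):
        # Give every new instance a unique, increasing id.
        instance_id = f"{self.name}-{self._next_id}"
        self._next_id += 1
        self.instances.add(instance_id)
        return instance_id

    def heal(self):
        """Launch replacements until the group is back at its desired size."""
        launched = []
        while len(self.instances) < self.desired_size:
            launched.append(self._launch())
        return launched

    def terminate(self, instance_id):
        self.instances.remove(instance_id)


def unleash_monkey(group, rng):
    """Terminate one randomly chosen instance; return its id, or None if empty."""
    if not group.instances:
        return None
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


# Small demo: kill one instance, then let the group replace it.
demo_rng = random.Random()
demo_group = InstanceGroup("api", desired_size=3)
killed = unleash_monkey(demo_group, demo_rng)
replaced = demo_group.heal()
print(f"terminated {killed}, launched {replaced}, size is {len(demo_group.instances)}")
```

In this toy model the group always returns to its desired size after heal runs. The interesting failures in a real deployment are the ones where it does not, which is what this kind of testing is meant to surface.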
Even though Chaos Monkey was designed to operate on machines in Amazon’s cloud, the program can run on other public clouds as well. By default, however, the software only operates during business hours so that if it breaks something in a big way, there are always engineers around to resolve the problem. Netflix noted that the software can be configured so that it runs at other times of the day as well.
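A business-hours restriction like the one described above comes down to a simple time check before any shutdown is attempted. The following is a minimal sketch under stated assumptions: the function name within_business_hours, the 9:00 to 17:00 window, and the Monday to Friday workweek are illustrative defaults chosen for this example, not values taken from Netflix’s configuration.

```python
from datetime import datetime, time


def within_business_hours(now, start=time(9, 0), end=time(17, 0), workdays=range(0, 5)):
    """Return True if `now` falls on a workday inside the [start, end) window.

    The defaults (Mon-Fri, 09:00-17:00) are assumptions for illustration;
    pass different values to model a tool configured to run at other times.
    """
    return now.weekday() in workdays and start <= now.time() < end


# Only allow a simulated shutdown while engineers are expected to be around.
if within_business_hours(datetime.now()):
    print("Inside the window: a shutdown could be scheduled.")
else:
    print("Outside the window: skipping this run.")
```

Keeping the window and workdays as parameters mirrors the point that the schedule is configurable rather than fixed.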
The release might be a response to the massive Amazon outage that happened just last month when a data center in Virginia lost primary backup and secondary backup power during inclement weather. Having said that, even a system perfected through the fire of Chaos Monkey would have a hard time sustaining operation without power to its servers.
To read more about Netflix, check out our case study about the video streaming service provider.