Distributed System Validation

Last week I read an article about Netflix’s Chaos Monkey and it provoked some thoughts. I had read about Chaos Monkey before, but back then I didn’t appreciate some aspects of it. I guess the difference is that I recently started working on a distributed system project, and an important part of my work is validating what we are building.

So, how do we validate a distributed system? There is no need to reinvent the wheel, so I googled around. That’s how I ended up reading the Netflix article again. The idea is simple: terminate running “instances” and then watch how your distributed system reacts. There are more details to it, but the essence is that Chaos Monkey stops some of your instances/services (making them unavailable) so you can inspect how well your distributed system copes in such scenarios. Note that this approach works well enough that Chaos Monkey runs on both test and production systems, and it is often regarded as a key practice for major distributed systems.
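To make the idea concrete, here is a minimal sketch in Python. It is not how Chaos Monkey itself is implemented; it is a toy simulation I made up for illustration, and all the names in it (`Instance`, `Cluster`, `unleash_monkey`, `run_experiment`) are hypothetical. It kills one randomly chosen instance per round and records whether the "system" can still serve a request.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One simulated server; hypothetical, not a real cloud instance."""
    name: str
    alive: bool = True


class Cluster:
    """Toy stand-in for a service running on several instances."""

    def __init__(self, names):
        self.instances = [Instance(n) for n in names]

    def healthy(self):
        return [i for i in self.instances if i.alive]

    def handle_request(self):
        # Serve from the first healthy instance; return None if none remain.
        healthy = self.healthy()
        return healthy[0].name if healthy else None


def unleash_monkey(cluster, rng, kills=1):
    """Terminate up to `kills` randomly chosen healthy instances."""
    candidates = cluster.healthy()
    victims = rng.sample(candidates, k=min(kills, len(candidates)))
    for victim in victims:
        victim.alive = False
    return [v.name for v in victims]


def run_experiment(cluster, rng, rounds):
    """Kill one instance per round and record whether requests still succeed."""
    report = []
    for _ in range(rounds):
        killed = unleash_monkey(cluster, rng)
        still_serving = cluster.handle_request() is not None
        report.append((killed, still_serving))
    return report


if __name__ == "__main__":
    cluster = Cluster(["web-1", "web-2", "web-3"])
    for killed, ok in run_experiment(cluster, random.Random(), rounds=3):
        print(f"killed={killed} still_serving={ok}")
```

With three instances and one kill per round, the simulated service survives the first two rounds and goes down in the third. In a real system the interesting part is the "watch" step: the check would be your actual monitoring and health checks rather than a single function call.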

Let’s dig in further. Distributed systems, and in fact all large projects, are built from many moving parts. Often, it is practically impossible to anticipate every problem that may occur in your system. Instead of waiting for problems to happen, you can trigger them in a controlled manner and get feedback. Acting on that feedback makes your system more resilient.

This got me thinking about the way we build large projects, and distributed systems in particular. Nowadays agile methodologies are widely accepted and we build software in small iterations. This continuous evolution makes it hard to reason correctly about a project, and it is practically impossible to fully validate an ever-changing system. Often, we build a distributed system much like we play with Lego. We pick this web server, that database server, that network load balancer and so on, glue them together, and add our services on top. During this process we rarely consider each component’s specifics and limitations. This development process leads to heisenbugs and other hard-to-reproduce issues.

Solutions like Chaos Monkey make this easier, as they provide a process that helps you validate your systems. Incorporating such a process into your software development gives you better monitoring and more confidence that your systems are resilient.