5 DevOps Horror Stories
DevOps is key to enterprise transformation, but we know DevOps is a journey, not a destination. In that sense, enterprises are always learning better ways to adopt DevOps, no matter how long they’ve been practicing it. Continually improving DevOps practices is the only way that organizations can advance their enterprise transformations.
Toward that end, this post looks at well-known organizations that made mistakes and faced dire consequences in applying DevOps to their transformation strategies, yet came away with valuable lessons from their failures.
(And no, they're not Contino customers, in case you were wondering! You can check our customer stories out here.)
SlideShare: Restrict Access to Production Environments
SlideShare adopted DevOps in its early days, but as a young organization, its processes were not yet mature. At one point, a developer was analyzing a MySQL database with a new tool and started to change the order of the columns in the database. What he didn’t know was that this also changed the database in production, which resulted in an outage for over 60,000 users.
Sylvain Kalache, who narrated the incident, commented that while DevOps is about empowering everyone on the team, access to production environments should be restricted to the few who can handle it. This is where it’s important to configure role-based access control (RBAC) for teams and individuals. Tools like AWS IAM are essential to ensuring that everyone has the access they need to perform their day-to-day tasks, but not the kind of full access that can damage the user experience. In this case, the developer could have been given access to a staging environment to try the same thing without affecting users.
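To make the idea concrete, here is a minimal sketch of a role-based check in Python. It is not how SlideShare or AWS IAM implements access control; the role names, environments, and the `alter_schema` helper are hypothetical and exist only to show a developer being allowed to write to staging while being refused in production.

```python
# Hypothetical roles mapped to the (environment, action) pairs they may perform.
ROLE_PERMISSIONS = {
    "developer": {
        ("staging", "read"),
        ("staging", "write"),
        ("production", "read"),
    },
    "operator": {
        ("staging", "read"),
        ("staging", "write"),
        ("production", "read"),
        ("production", "write"),
    },
}


def is_allowed(role: str, environment: str, action: str) -> bool:
    """Return True only if the role was explicitly granted the action."""
    return (environment, action) in ROLE_PERMISSIONS.get(role, set())


def alter_schema(role: str, environment: str) -> str:
    """Pretend to apply a schema change, refusing callers without write access."""
    if not is_allowed(role, environment, "write"):
        raise PermissionError(f"{role} may not write to {environment}")
    return f"schema change applied in {environment}"
```

Unknown roles get no permissions at all, so access has to be granted explicitly rather than removed after the fact.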
Knight Capital: The Dark Side of Automation
Knight Capital experienced what was probably the most extraordinary tech failure in recent years. It was a stock trading firm that used automation to make transactions faster and easier for traders. Knight used an internal application called SMARS that handled buy orders in the stock market. This app had been running for many years and had many outdated parts in its codebase. One such part was a feature called Power Peg, which was inactive but had never been removed from the codebase. When Knight deployed new code for the application, that code inadvertently triggered the Power Peg feature, which Knight had overlooked. As a result, Knight’s app placed orders worth billions of dollars in a matter of minutes. The company lost roughly $460M on the resulting trades and was pushed to the brink of bankruptcy almost overnight.
There are many lessons to learn from this horror story. One crucial lesson is that automation is incredibly powerful, and if used carelessly, it can result in major mishaps. Another is that dead code needs to be removed when a feature is retired, and new features need to be introduced carefully, so that old and new code paths don’t conflict.
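One common safeguard for automated systems is a hard limit that halts the automation once it exceeds an expected bound. The Python sketch below illustrates that idea only; it does not describe how Knight’s systems worked, and the `OrderGuard` class and its notional limit are hypothetical.

```python
class OrderGuard:
    """Halts automated order submission once a notional limit is exceeded."""

    def __init__(self, max_notional: float) -> None:
        self.max_notional = max_notional  # assumed limit chosen by a human
        self.submitted = 0.0
        self.halted = False

    def submit(self, quantity: int, price: float) -> bool:
        """Accept the order only if total exposure stays within the limit."""
        if self.halted:
            return False
        notional = quantity * price
        if self.submitted + notional > self.max_notional:
            # Stay halted until someone investigates and resets the guard.
            self.halted = True
            return False
        self.submitted += notional
        return True
```

The guard deliberately stays halted after the first rejected order, so that a runaway process cannot keep retrying until a smaller order happens to fit.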
Workflowy: Decomposing Databases Is a Delicate Affair
Workflowy is a simple yet elegant productivity tool that’s been growing steadily. The staff was making architectural changes to cope with the growth. Their database had become too large, so they decided to decompose the single large database into multiple smaller databases. During the process, they found that it slowed down queries and blocked data access for users. Some users were not able to sync data from their mobile devices, and others couldn’t log in to the web app.
Troubleshooting showed that they had a couple of issues. First, there was a bug affecting their Apache web server, which they resolved, but this didn’t fix the issue. They then realized that their process of decomposing databases was the root cause. By avoiding one particular slow query against the database, they kept the site working. Along the way, they also upgraded their infrastructure to make queries faster.
The lesson here is that decomposing databases can cause performance issues, and even outages, but by isolating the key issues, and resolving them first, you can restore services faster.
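Isolating the key issue usually starts with measuring which queries are actually slow. The following Python sketch assumes you already collect timing samples as `(query, duration_ms)` pairs, for example from a slow query log; the function name and the threshold are hypothetical, and this is not the tooling Workflowy used.

```python
from collections import defaultdict


def rank_slow_queries(samples, threshold_ms):
    """Return (query, mean_ms) pairs whose mean duration exceeds the threshold.

    `samples` is an iterable of (query, duration_ms) pairs. Results are
    sorted with the slowest query first.
    """
    durations = defaultdict(list)
    for query, duration_ms in samples:
        durations[query].append(duration_ms)

    means = {query: sum(d) / len(d) for query, d in durations.items()}
    slow = [(query, mean) for query, mean in means.items() if mean > threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Ranking by mean duration puts the worst offender at the top of the list, which is the query to avoid or fix first while the rest of the service keeps running.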
IRS: Move to the Cloud
The United States Internal Revenue Service (IRS) is not exactly known for its technological prowess. After all, its job is to collect taxes, not advance technological innovation.
This is perhaps why the IRS application that processes tax returns failed in 2016. The problem was caused by a faulty electrical voltage regulator. The IRS had a backup device, but unfortunately, that too failed.
This may sound like a freak incident, but there is an important lesson in it. When you manage your own infrastructure, lots of things can go wrong. The solution is to migrate infrastructure to the cloud, which today is among the most reliable ways to run your applications. Offload the effort of setting up and maintaining physical hardware to a cloud vendor so you can focus on what’s most important: running and improving your applications.
Instapaper: Know Your Cloud Vendor
Instapaper is an offline reading app. It originally started out with SoftLayer as its cloud vendor but, after an acquisition, moved to AWS. However, the team wasn’t familiar with AWS. When they experienced an outage and noticed that one of their MySQL databases on AWS RDS was out of space, they dug deeper and found that all RDS databases created before April 2014 had a size limit of 2TB. Their databases that stored users’ bookmarks had hit this limit.
After much discussion with the AWS team and some help from Pinterest developers, they were able to index the existing data using Aurora, and create new databases in new instances with larger size limits. All this was done in 30 hours, and service was finally restored.
The lesson to learn here is that you need to be aware of the limits and restrictions of the cloud vendor you use. This is especially true if you use multiple cloud vendors, or if you migrate from one provider to another. Knowing these limits lets you prepare for incidents in advance.
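Once a limit is known, it can be turned into an alert that fires well before the limit is reached. The sketch below is a minimal Python illustration; the `storage_alert` function and the 80% warning ratio are assumptions, and in practice the usage figure would come from your vendor’s monitoring service.

```python
def storage_alert(used_gb: float, limit_gb: float, warn_ratio: float = 0.8) -> str:
    """Classify storage usage against a known vendor limit.

    Returns "ok", "warning" (at or above warn_ratio of the limit),
    or "critical" (at or above the limit itself).
    """
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    ratio = used_gb / limit_gb
    if ratio >= 1.0:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

With a 2TB (2048 GB) limit like the one in this story, a database holding 1700 GB would already be classified as "warning", leaving time to plan a migration instead of reacting to an outage.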
Conclusion
Whether it’s databases, hardware infrastructure, legacy code, or cloud vendor limits, there are countless ways things can go wrong as you practice DevOps. But by learning from the experiences of these enterprises, you can better prepare to avoid such incidents.