Posted Sunday, November 20, 2016
In late 2014, the sole system administrator for the Star Tribune Digital department (let’s call him John) turned in his two-weeks’ notice. Thus began a panic as we realized we had no idea how our infrastructure worked. At the time, I was one of two senior software engineers there, and I was in the middle of a large refactoring project for the mobile website. John had, thankfully, kept up an internal wiki that at least roughly documented how some of the legacy infrastructure worked. The new infra on AWS was very new and very undocumented, but at least we had a place to start.

John’s disciple, an eager but more junior developer who had started out as a copy editor, did what he could to absorb as much knowledge as possible from John before John left. But his heart just wasn’t in operations, and he didn’t want to fall on that sword.

In a one-on-one, my boss and I talked about the dilemma facing us. I mentioned that I had some experience with Linux in production environments. He gently suggested that I take on the system administration role, at least until they could hire someone to replace John. I blanched, knowing enough to know that I knew little about the world of operations. I was reluctant, but in the end he convinced me to at least try to hold down the fort until help arrived. I set the condition that I be immediately removed from all development projects I was assigned to. My boss agreed.

At this point, we were about six months from the launch of a scratch-built web platform that would replace our proprietary hosted CMS. None of the developers, myself included, had experience with operational concerns at that scale. John had written some Chef cookbooks to provision the few things living on the new infrastructure, but the only person who actually knew Chef was an outside contractor brought in to help with operations in the interim. We struggled for a short period of time to maintain it.
I learned enough of Ruby and Chef to get by, but it was always a battle to get the tool to do what I expected whenever I needed to do something slightly different from the existing process. I forget who originally suggested it, but we started to switch our orchestration code to Ansible. Ansible was far easier for me to wrap my head around, and at a time when I needed to learn as much as possible as fast as possible, that was a lifesaver.

At this point, I’d been “the operations guy” for about a month. The hiring ad for a “devops engineer” had been open for a month and a half. We had heard from only a few interested parties, and none of them were anything close to what we needed. I saw the writing on the wall, and asked my boss for a meeting. I told him I was willing to permanently become the operations engineer for the team. His relief was obvious.

Shortly after that, the engineer opening changed from operations to developer, and within a short space of time we had an awesome new junior developer on the team. Meanwhile, I dove headlong into the world of system administration and operations. No longer held back by my own thought of “this is only temporary,” I devoured anything and everything I could find on the subject of what Google calls Site Reliability Engineering.

After a couple of short months, all of the Chef code was gone, a lot of our infrastructure had been deployed onto AWS, and we had hired a junior operations engineer to help out. Our on-call rotation was still a manually maintained forwarded phone line, though, and fighting fires was always a panic moment… since usually, it was a problem that had been happening for hours, and we didn’t even know about it until someone from the news floor informed us. But that’s a story for another time.

It was the launch of our web platform that completely changed operations for us, and for me. In early May 2015, the new website launched. And immediately crashed. And stayed down. We reverted to the old CMS.
The next ten days saw both development and operations working feverishly to try to figure out what had gone wrong and how to fix it. We switched from using a clever-but-misleading traffic simulation to using a tool called siege to do load testing, and finally got to a point where we thought we could keep the new site online. At 3:00 AM, ten days after the first attempt, the new platform went online again and stayed online. For the next three hours, the launch team (with me as its sole operations engineer) built and deployed multiple iterations of the new site, and finally at six in the morning, we all left the office. I greeted the post-launch engineering team as we passed each other in the office entrance.

After a couple of hours’ sleep in a downtown hotel, the launch team returned to the office to assess the situation and help out with any problems that might have occurred.

Let me be clear. When I was a developer, I pulled my share of sixteen-hour days to meet deadlines. But never before this launch had I experienced a shift this long or this rewarding. Before then, I had never cheered and exchanged high fives with the team as we saw the site working as expected in the wee hours of the morning. At this point, I was invested in operations. It had acquired my interest and not just my attention.