In this piece I discuss the practice of going deep into a system through a few examples. I believe I do this more often and more thoroughly than the average engineer, giving me an edge when tackling problems and increasing my rate of technical learning.
Microsoft, 2006
Soon after joining Microsoft as an SDET (a software engineer who writes test automation, a misguided concept IMO but a story for another day) I was tasked with fixing a test suite that had been severely damaged by the introduction of IPv6 to the Windows networking stack (including the component I was supposed to test, Microsoft’s DNS Server). The tests were written in a custom scripting language called HAPI and a very large number of them were failing mysteriously. I think my job was to fix them, but also to write tests for other new features that were coming online.
Despite having spent nearly 2 years at that job, I never really understood how HAPI worked. I never made a custom build of the interpreter, and I never even read its source code. In hindsight, I had little hope of fixing the broken tests, because I didn’t understand what they were doing.
Looking back on that time, I feel I wasted much of those 2 years. Not so much by staying in the SDET position (which was a minor mistake), but primarily by failing to go deeper into the system I was working on. I learned a few things here and there, but not enough.
Eventually I interviewed for an SDE position at Microsoft and accelerated my learning speed, and later joined Facebook and accelerated it even more.
Facebook, 2012
After Skype balked at a partnership to support voice calling on Facebook mobile, I had a chance to write it from scratch. I proposed to write a prototype based on WebRTC, then a somewhat nascent technology developed for desktop computers. I started porting WebRTC to mobile and, with the help of two other engineers, got it to work in Facebook’s mobile application. Thanks to open source and a lot of hard, focused work, our small team of 3 people accomplished what a company of thousands wouldn’t do for us. By the time of my performance review, I couldn’t have been prouder.
To my great surprise, despite a promotion and glowing commentary in other areas, my performance feedback said that I needed to go deeper into the system. Having handwritten some assembly to fix an endianness bug that would only manifest on ARM processors, I was outraged at this feedback. Did they want me to go deeper than the actual CPU instructions? At that time I hadn’t yet learned to be wary of my own bias against constructive feedback, so it didn’t immediately sink in.
Having shipped a v1, we were focused at the time on improving quality before rolling out more broadly. A lot of my time was spent aggregating logs and trying to build a model to understand which variables could predict bad calls. Despite some good ideas, this effort didn’t yield the breakthrough I wanted, and I wasted a good quarter on it.
The one thing that probably saved my next review was an investigation that I did into connection establishment delays. We knew that sometimes it took too long to connect a call, so I set out to figure out what was taking time. For signaling (the metadata that controls a call, saying things like “I’m calling you” or “I accept”) we leveraged MQTT, a lightweight messaging protocol that runs over TCP and that Facebook used for chat as well. I traced the code all the way down to the actual system call that opened a socket, and then got curious about how said socket was configured. I realized that we wanted to send bytes out as soon as possible (both for chats and calls), so I just reduced the TCP send buffer size. This was non-controversial enough that I simply committed a one-line code change without wrapping it into an A/B test or anything like that. A month later someone chased me down because they were trying to understand how the latency for all chats had dropped by 10%. I explained, and people were happy. Going deep paid off, and the previous performance feedback started to make sense. Nobody was saying I was incapable of doing this. They were saying I should do more of it!
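For illustration only, here is a minimal Python sketch of what shrinking a TCP send buffer looks like at the socket level. It is not the actual code from that change: the function names and the 16 KiB value are hypothetical, since the text above does not say which size was chosen. Only the `SO_SNDBUF` socket option itself is standard.

```python
import socket

# Hypothetical value: the story does not say what size was actually used.
SEND_BUFFER_BYTES = 16 * 1024


def open_signaling_socket(send_buffer_bytes: int = SEND_BUFFER_BYTES) -> socket.socket:
    """Create an unconnected TCP socket and request a smaller send buffer."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The "one line": ask the kernel for a smaller send buffer via SO_SNDBUF.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buffer_bytes)
    return sock


def effective_send_buffer(sock: socket.socket) -> int:
    """Read back the size the kernel applied, which may differ from the request."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

Reading the option back is worth doing when investigating, because the kernel treats the requested size as a hint. Linux, for example, doubles the requested value to leave room for bookkeeping overhead, and other platforms may clamp it.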
Snap, 2016
There was a time the Snapchat app for Android got so bad that the head of engineering asked my team in Seattle to stop everything we were doing and to help alleviate the problem. This was way before the rewrite (aka Mushroom) and I think no one will disagree when I say that was a wild time for that app.
Among the many problems that plagued the app were the infamous ANRs. ANR is short for “application not responding”, a dialog that Android would show to users after the UI thread had been blocked for a long time, prompting them to terminate the offending application. When looking at the stack traces for our ANRs, the most common had to do with SharedPreferences.
SharedPreferences is Android’s built-in mechanism for an app to save small bits of data, typically some trivial state or app configuration. By then, most of our developers were savvy enough not to update SharedPreferences on the UI thread, since blocking I/O would easily lead to dropped frames and overall user sadness. But somehow we still had a lot of ANRs related to it. Having learned my lesson about going deep, this time I didn’t hesitate. I went beyond our app and started reading Android’s source code. That allowed me to understand that the implementation of SharedPreferences simply queued updates to be executed later on, that each update would completely rewrite the entire preferences file, and that the entire queue would be executed on the UI thread when certain system events occurred. With that knowledge I built a wrapper to avoid that access pattern, which killed one third of all Snapchat’s ANRs when shipped.
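To make that access pattern concrete, here is a toy Python model. It is neither Android code nor the wrapper that shipped: both class names are hypothetical, and coalescing updates into a single write is just one assumed way of avoiding the pattern. `NaiveStore` mimics the behavior described above (every update queues a full rewrite, and the whole queue is drained at once), while `CoalescingStore` keeps updates in memory and writes the file once per flush.

```python
import json


class NaiveStore:
    """Toy model: every update queues a rewrite of the whole file."""

    def __init__(self):
        self.data = {}
        self.pending = []        # queued full-file snapshots
        self.file = ""           # stands in for the preferences file on disk
        self.full_rewrites = 0

    def apply(self, key, value):
        self.data[key] = value
        self.pending.append(dict(self.data))  # one full snapshot per update

    def drain(self):
        # Stands in for the system event that runs the entire queue at once.
        for snapshot in self.pending:
            self.file = json.dumps(snapshot)
            self.full_rewrites += 1
        self.pending.clear()


class CoalescingStore:
    """Assumed alternative: buffer updates, write the file once per flush."""

    def __init__(self):
        self.data = {}
        self.dirty = False
        self.file = ""
        self.full_rewrites = 0

    def apply(self, key, value):
        self.data[key] = value
        self.dirty = True

    def flush(self):
        # Intended to be called off the UI thread, e.g. from a background worker.
        if self.dirty:
            self.file = json.dumps(self.data)
            self.full_rewrites += 1
            self.dirty = False
```

With 100 updates followed by a single drain or flush, the naive model rewrites the file 100 times and the coalescing one rewrites it once, and both end up with the same final contents.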
Snap, 2020
The other day I had a conversation with a more junior engineer in which he asked me how come I knew so much stuff. My first instinct was to say “well, I’ve been doing this for a long time”. But reflecting further I remembered this pattern of going deep. When facing a technical mystery, don’t just shrug and write it off. Try to figure out what is actually going on. There are many tools for that (debuggers, source code, tracing tools, network monitors and so on). If you do a good job, there’s a fairly decent chance that you’ll crack an important problem. And even if you don’t, you’ll come out of your investigation knowing a bit more. You’ll get better and learn something different each time you repeat this cycle, eventually allowing you to solve problems that few others can.