Tuesday, July 4, 2023

Four Theories of What the Hell Happened At Twitter. By Jeremiah Johnson



Four Theories of What the Hell Happened to Twitter

How Dennis Nedry explains your rate limit


On Saturday, Twitter went haywire.



Elon Musk tweeted out that every account on the site was now rate-limited in how many tweets they could view. People immediately began to see ‘Rate Limit Exceeded’ messages, along with more generalized errors, when trying to access the site. The extent of the limits and outages was also unclear: some accounts appeared to be far more thoroughly stymied than others.


To state the obvious: this is an insane thing to do. Twitter is a business that relies on advertising for revenue. Advertising dollars scale linearly with “How many tweets are viewed”. This is cutting off one’s nose to spite one’s face. To my knowledge there’s never been a successful social media site that ever attempted anything like this. Getting users to use the site *more often* is the entire point of running Twitter. Predictably, the collective crowd at Twitter went berserk and reactions, jokes, and discussions of the outage dominated trending topics.


As happens in the wake of any dramatic event, people immediately started theorizing why this was happening. I want to summarize the main theories here, and try to make a contribution of my own to think about how Twitter operates today.


Elon publicly and repeatedly blamed data-scraping bots for the issues Twitter was having. He’s been doing this almost continuously since he first planned to purchase the site, using bots as a scapegoat for every perceived ill Twitter runs into. He tweeted that the rate-limits were “the only way to stop scraping”.


Under this theory, the story is simple. Twitter was under massive pressure from a horde of bots. These bots were scraping data, likely as input for giant AI models. These attacks were so detrimental in terms of downgrading user experience and cost to Twitter that there was no choice but to implement the drastic step of rate-limiting every account on the site to try to stop the bots. Elon told a similar story about the move one day earlier to stop users without accounts from seeing any tweets.


It’s a clean, neat story. The problem is that it’s almost certainly bullshit.


The first thing to note about Elon’s bot theory is that it makes no sense. Here’s a graph of Twitter outages on June 30th and July 1st, when the rate limit happened.



Elon’s theory is that the site was under so much continuous strain that rate limits were necessary to save it. Factually, that’s not true. The site was operating mostly normally until about 9AM on Saturday morning, when everything went to shit. There was a single, specific event around 9AM that kicked everything off. It makes no sense to blame that on a nebulous army of bots, which would look more like continuous strain.


Former Twitter executive Yoel Roth posted on BlueSky that “It just doesn’t pass the sniff test that scraping all of a sudden created such dramatic performance problems”. Virtually everyone with knowledge of the data-scraping industry seemed to agree. Data scraping happens and it’s annoying for many sites. But it’s also been a constant factor for years, and it isn’t something that would or could have increased dramatically in a short time span.


And even if the site were under pressure, it would make no sense whatsoever to rate-limit users. If users are having minor to moderate difficulty loading the site the solution should be to fix your back end systems, not to rate-limit them and guarantee that they’ll have severe difficulty loading the site. The proposed solution is like cutting off one’s foot in response to breaking a toe.


Some Twitter users began to build a different theory: Twitter had stopped paying their Google Cloud bill, and Google was yanking Twitter’s servers offline.


After all, wasn’t it an odd coincidence that Twitter’s contract with Google was set to expire on June 30th, and there had been very public disputes about the Twitter/Google relationship? Elon is famous for stiffing folks - he’s refused to pay former workers, to pay former executives, to pay the rent on his offices, etc. And the outage suspiciously started right around 9AM on July 1st! Very, very odd indeed. Users from all corners began speculating wildly that the rate limit was a cover for Google cutting Twitter off from a giant chunk of its servers.


This theory is also a clean, neat story. The problem is that the Google Cloud theory also has gaping holes in it.


It’s absolutely true that Twitter stopped paying its Google Cloud bill once Elon took control of the company. But it’s equally true (although less noticed) that once Linda Yaccarino came on board as CEO, she quickly repaired the relationship with Google and began paying the bills again. This was reported by Bloomberg, the NYTimes, and Ars Technica, among other sites. There’s no longer any reason for Google to have cut Twitter off.


There’s also the fact that Twitter does not use Google Cloud to serve web pages to viewers. According to reporting from Ars Technica and others, Twitter mostly uses Google servers for back end functions like spam prevention and analytics. Losing Google wouldn’t necessarily interrupt the basic user experience.


And frankly, if the site’s outages were due to an external villain I think Elon would be quick to point the finger. He’s never been shy about starting fights. If this was really Google’s fault, I’d expect him to say so loudly and repeatedly. What would he have to lose?


As the outage continued, some internet detectives weren’t content to trust Elon’s word or to examine business relationships. These sleuths decided to be more direct: why not just examine the details of what was happening on Twitter’s site? There were plenty of odd clues lying around in the open, and sometimes the rate limits didn’t even seem to be working. Users reported being able to get around the rate limits by utilizing third party apps, by switching browsers, and with other interesting tactics. Sheldon Chang on Mastodon was the first to put together an alternative theory - Twitter was DDOSing itself.



According to the most popular form of this theory, the site is experiencing downtime because of a bug in how Twitter decided to restrict logged-out viewing. On Friday Twitter began requiring users to log in with an account in order to see any Tweets. If you tried to view a tweet without logging in, you’d be prompted with a mandatory log-in screen. This was widely seen as an anti-data-scraping move.


But it appears that the code for this log-in wall was poorly written, and when users hit the log-in screen, the page began sending tens of requests per second to Twitter’s servers in an attempt to load. Multiply tens of requests per second by millions of users facing this new log-in screen, and you get a site that is suddenly facing critical outages from too much traffic.
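To put rough numbers on that multiplication, here is a back-of-the-envelope sketch. Every figure in it is an assumption I picked for illustration (the function name, the two million stuck clients, the ten requests per second); none of them come from Twitter or from Sheldon’s post.

```python
def total_requests_per_second(clients: int, requests_per_client_per_second: int) -> int:
    """Aggregate load when every client fires requests at the same steady rate."""
    return clients * requests_per_client_per_second


# Illustrative assumptions only, not measured values.
STUCK_CLIENTS = 2_000_000        # hypothetical: browsers parked on the log-in screen
REQUESTS_PER_STUCK_CLIENT = 10   # hypothetical: "tens of requests per second"

load = total_requests_per_second(STUCK_CLIENTS, REQUESTS_PER_STUCK_CLIENT)
print(f"{load:,} requests per second")  # prints "20,000,000 requests per second"
```

The point is only the shape of the arithmetic: a per-page bug that looks small on one laptop gets multiplied by the size of the audience, with no bots required.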


Sheldon’s post gets into the technical details. I’m not a network engineer, but this story seems by far the most plausible to me thus far. It may even be the case that Elon didn’t realize the site was effectively attacking itself! He could have blamed data scraping bots because, in his anti-scraping zeal, he sincerely believed at first that they were the cause. It may also be the case that something else inside Twitter broke other than the self-DDOS as detailed in this expert’s excellent thread. When it comes to Elon, it’s truly impossible to know.


I think the self-DDOS is the most plausible theory yet - it has some degree of evidence, no outright conflicting facts, and relies on technical argument rather than high-level hand-waving theory. But I think it would be a mistake to stop here, so I want to indulge in a little bit of hand-waving theory myself to explain why this sort of thing happened in the first place.


One of my favorite books as a teen was Jurassic Park by Michael Crichton. It had dinosaurs eating people and that’s obviously awesome, but it also had scientific discussions that go much deeper than what the movie shows. I remember specifically being entranced by the idea of chaos theory, which was presented with far more depth in the book than in the film.


A very quick background: Chaos theory in mathematics deals with unpredictable, non-linear systems. The best way to think of linear vs non-linear systems is how they respond to incremental change. Say you throw a ball with X units of force, and it lands 50 feet away. If you throw with X + 0.01 units of force, it will land 50.01 feet away. Small changes in starting conditions lead to small changes in output, predictably. That’s linear. In non-linear systems, that doesn’t happen. Small changes at the start can cause unpredictability and chaos later on. Weather is a non-linear system - small changes in the atmosphere can lead to wild, unpredictable variations just a few days later. That’s why it’s so hard to predict the weather more than a few days in advance.
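If you want to see that difference for yourself, here is a small sketch. It is my own toy illustration, not anything from the book: the linear "throw" is a made-up proportional rule, and the non-linear system is the logistic map, a standard textbook example of chaos that I am swapping in for the weather. The starting values and step count are arbitrary assumptions.

```python
def linear_throw(force: float) -> float:
    """Toy linear system (hypothetical): distance is simply proportional to force."""
    return 50.0 * force


def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x), a classic chaotic system at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


NUDGE = 1e-9  # a tiny change in the starting condition

# Linear: a tiny nudge in gives a tiny change out.
linear_gap = abs(linear_throw(1.0 + NUDGE) - linear_throw(1.0))

# Non-linear: the same tiny nudge grows until the two runs look unrelated.
a = logistic_orbit(0.2, 60)
b = logistic_orbit(0.2 + NUDGE, 60)
max_gap = max(abs(p - q) for p, q in zip(a, b))

print(f"linear gap:   {linear_gap:.2e}")
print(f"logistic gap: {max_gap:.2f}")
```

The linear gap stays about as small as the nudge, while the two logistic runs drift apart until the gap is a large fraction of the whole range. That is the same flavor of sensitivity that makes the weather hard to forecast.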


Jurassic Park takes the strictly mathematical version of chaos theory and creates a more human version that governs people and systems. The core lesson of Jurassic Park is that complex human systems are chaotic. If you attempt to direct them according to basic rules and with minimal levels of control, you’re inviting disaster. Chaotic systems will always behave in ways you don’t expect, and if you don’t plan for that you’re going to fail.



This man explains why you couldn’t read your favorite dril tweets

In Jurassic Park, an arrogant businessman entered a new venture he didn’t really understand. He attempted to run his new business with a minimal level of staffing and with a focus on cutting costs. Because he didn’t truly understand the business and because complex systems are inherently chaotic, things went wrong in ways nobody anticipated. And because he didn’t hire redundant staff or build any additional resiliency into the system, these things caused the whole system to crash to the ground in catastrophic ways that involved dinosaurs eating people. That’s not mathematical chaos theory, but it’s a sort of ‘Generalized Chaos Theory’.


At Twitter, the same thing happened. Elon Musk entered a new field of business he didn’t really understand. He attempted to run his new business with a minimal level of staffing and with a focus on cutting costs. Because he didn’t truly understand social media and because complex systems are inherently chaotic, things at Twitter went wrong in ways nobody anticipated. And because he didn’t hire redundant staff or build any additional resiliency into the system, these things caused the whole system to crash to the ground in catastrophic ways that involved ~~dinosaurs eating people~~ Twitter failing worldwide.


It’s interesting to try to find the specific technical cause of Twitter’s giant crash. But if you buy into Generalized Chaos Theory, something like this was inevitable. If it wasn’t the log-in prompt, it would eventually be something else. If you attempt to run complex, chaotic systems with minimal staffing, you should expect to see errors and failures cascade in unexpected ways. If you attempt to make frequent and sudden changes to chaotic systems, those failures will get bigger and more frequent.


A final piece of evidence in favor of Generalized Chaos Theory: This post on Blind from a current Twitter employee, begging for former employees to help him debug issues that are currently spiraling out of control.



Change a few words and this could easily be a Jurassic Park employee begging for help to get the dinosaurs back in their cages. Unless Elon stops re-inventing the site every week and starts hiring more engineering support staff, I expect this will not be the last major outage Twitter suffers.


Share this post on Twitter if you’re not rate limited. If you are rate limited, just mass email it to your entire contact list like a boomer.


