Facebook’s biggest outage in history was caused by a mishandled command, resulting in what the social media giant called “an error of our own making.”
“We have done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity but by an error of our own making,” said the post published on Tuesday.
Facebook’s vice president of engineering and infrastructure, Santosh Janardhan, explained in the post why and how the six-hour shutdown occurred and the technical, physical and security challenges the company’s engineers faced in restoring services.
According to Mr Janardhan, the primary cause of the outage was an incorrect command issued during routine maintenance work.
Facebook’s engineers were forced to physically access the data centers that make up its “global backbone network” and overcome several hurdles to fix the error the faulty command had caused.
Once these errors were corrected, however, another challenge awaited them: managing the surge in traffic that would follow as services came back online.
Mr Janardhan explained in the post that the error originated in the system that manages Facebook’s global backbone network capacity.
“Backbone is the network that Facebook built to link all of our computing facilities together, which includes thousands of miles of fiber-optic cables around the world and connects all of our data centers,” the post said.
All user requests to Facebook, including loading news feeds or accessing messages, travel over this network, which links the company’s smaller facilities to its larger data centers.
To manage these centers effectively, engineers perform day-to-day infrastructure maintenance, which can involve taking part of the backbone offline, adding capacity to a router, or updating the software that manages all data traffic.
“It was the source of yesterday’s outage,” Janardhan said.
“During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” he said.
What complicated matters, the post said, was that Facebook’s systems are designed to audit commands like this one to prevent mistakes, but a bug in the company’s audit tool prevented it from stopping the faulty command.
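Facebook has not published its audit tooling, so the following is only a minimal sketch of the kind of pre-flight check such a tool might perform before a maintenance command runs. Every name and pattern here is invented for illustration.

```python
# Hypothetical sketch of a command audit gate: all names, patterns and
# behaviour are invented; Facebook has not published its actual tooling.

DANGEROUS_PATTERNS = ("drain all", "withdraw all", "shutdown")

def audit_command(command: str) -> bool:
    """Return True if the command looks safe to run on the backbone."""
    lowered = command.lower()
    return not any(pattern in lowered for pattern in DANGEROUS_PATTERNS)

def run_maintenance(command: str) -> str:
    """Refuse to execute any command the audit flags as dangerous."""
    if not audit_command(command):
        raise PermissionError(f"audit blocked command: {command!r}")
    return f"executed: {command}"
```

The outage happened because a check of roughly this kind failed to fire: a bug in the audit layer let a capacity-assessment command through that withdrew the backbone connections.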
A “complete disconnection” then occurred between Facebook’s data centers and the Internet, something that “became a second issue that made things worse.”
With the entire backbone removed from operation, the data center locations designated themselves as “unhealthy.”
“The end result was that our DNS servers became unreachable even though they were still operational,” the post said.
The Domain Name System (DNS) is the system that translates the web addresses users type into the Internet Protocol (IP) addresses that machines can read.
“This made it impossible for the rest of the Internet to find our servers.”
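The translation DNS performs can be seen with Python’s standard library; the lookup below is just an illustrative example of the mechanism, not anything specific to Facebook’s infrastructure.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Translate a human-readable hostname into IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})

# When a domain's DNS servers are unreachable, a lookup like this
# raises socket.gaierror: the servers may still be running, but the
# rest of the internet has no way to find their addresses.
```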
Mr Janardhan said this posed two challenges. The first was that due to network disruptions, Facebook engineers could not access the data centers by normal means.
The second was that the internal tools the company normally uses to resolve such issues were themselves broken by the outage.
Engineers were forced to travel to these data centers, where they would have to “debug the problem and restart the systems”.
This did not prove an easy task, however, as Facebook’s data centers are protected by significant physical security measures designed to make them “hard to get into”.
Mr Janardhan explained that the company’s routers and hardware are designed to be difficult to modify even with physical access.
“So it took additional time to activate the secure access protocols needed to enable people to work onsite and on the servers. Only then could we confirm the issue and bring our backbone back online,” he said.
Engineers then faced a final hurdle: they could not restore access for all users worldwide at once, as the resulting surge in traffic could cause further crashes. Reversing the massive drop in the data centers’ power usage could put “everything from power systems to caches at risk”.
The post said the “storm drills” the company had previously conducted meant its engineers knew how to bring the systems back online slowly and safely.
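Facebook has not described its recovery process in code; the sketch below only illustrates the general idea of a staged traffic ramp with health checks between stages. The stage fractions, function names and back-off behaviour are all invented for illustration.

```python
# Hypothetical staged-recovery ramp: restore traffic in increasing
# fractions, easing off if a health check fails. All values invented.
RAMP_STAGES = (0.05, 0.10, 0.25, 0.50, 1.00)

def bring_back_online(set_traffic_fraction, healthy) -> bool:
    """Ramp traffic up stage by stage; back off on an unhealthy check.

    set_traffic_fraction -- callback that routes that share of traffic
    healthy              -- callback returning True if systems are stable
    """
    for fraction in RAMP_STAGES:
        set_traffic_fraction(fraction)
        if not healthy():
            set_traffic_fraction(fraction / 2)  # ease the load back off
            return False
    return True
```

The point of ramping in stages rather than flipping everything on at once is exactly the concern the post describes: a sudden jump in traffic and power usage could trigger a new round of failures.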
Mr Janardhan concluded that he believed such a tradeoff was worth it: greatly increased day-to-day security in exchange for a slower recovery from a rare event like this one.
Facebook’s outage – which affected all of its services, including WhatsApp and Instagram – caused a personal loss of nearly $7bn for chief executive Mark Zuckerberg as the company’s stock price plummeted. Mr Zuckerberg apologised to users for any inconvenience caused by the interruption of service.
Credit: www.independent.co.uk