Most of us know Amazon best for its e-commerce business, which lets us order nearly anything off the internet these days, from food to clothes and furniture, with free Prime shipping and just a few clicks. That business is what made Jeff Bezos the (until recently) richest man in the world, and it still rakes in the most cash; but Amazon does much, much more than retail.
In fact, roughly a third of the cloud runs on Amazon Web Services (AWS) servers, placing the company well ahead of even Google and Microsoft in the lucrative cloud services business.
And last Tuesday, a portion of the internet, along with Amazon.com itself, disappeared for a while when Amazon's servers in Northern Virginia (home to the company's first, and one of its biggest, AWS data center regions) experienced an unexpected crash. The downtime lasted about seven hours, starting at around 7:30 AM PST, with the network fully restored by 2:22 PM PST.
During the prolonged outage, the whole event was shrouded in mystery: few details were shared about what exactly had caused it, or when things would be back to normal. A few days later, however, Amazon released a rather more detailed report on what happened on December 7.
As it turns out, it was a very unusual crash that took out AWS's own monitoring systems, which Amazon says significantly delayed its engineers' ability to understand and diagnose the issue for the first few hours. Moreover, Amazon says that "the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region."
Amazon says it is hard at work updating its systems so that its engineers (and, consequently, AWS customers) won't be left in the dark should future technical issues or outages occur.
Apart from knocking significant portions of the internet offline, the Amazon outage also affected large-scale services such as Netflix, Disney+, Ticketmaster, and others.
Many smart devices that rely on an internet connection to function also stopped working temporarily, including the Alexa smart assistant, Roomba vacuums (via CNBC), security cameras, smart cat litter boxes, and even baby monitors, which, all other annoyances aside, posed a significant safety concern.
Here is part of Amazon's post on its website, published on Friday:
At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.
These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.
Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered. [...]
We have taken several actions to prevent a recurrence of this event. We immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediations. Our systems are scaled adequately so that we do not need to resume these activities in the near-term. Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event.
This code path has been in production for many years but the automated scaling activity triggered a previously unobserved behavior. We are developing a fix for this issue and expect to deploy this change over the next two weeks. We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.
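For context, the "request back-off" Amazon refers to is the standard pattern of spacing out retries so that a crowd of failing clients doesn't pile even more traffic onto an already congested link. A minimal sketch of capped exponential backoff with jitter is below; the function name and parameters are illustrative, not taken from Amazon's report, and this is not AWS's actual client code.

```python
import random
import time

def call_with_backoff(request, max_attempts=8, base_delay=0.1, max_delay=30.0):
    """Retry a failing call with capped exponential backoff and full jitter.

    'request' is any callable that raises an exception on failure.
    All names and defaults here are hypothetical, for illustration only.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait for a random slice of an exponentially growing window,
            # so clients don't all retry in lockstep and re-congest the network.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

When back-off like this works as intended, retry traffic thins out as the congestion persists; the latent bug Amazon describes meant its clients kept retrying too aggressively, feeding the very congestion they were reacting to.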