The US-EAST-1 region of Amazon Web Services (AWS) is encountering another round of service disruptions, just days after a significant outage affected numerous online services. At 3:36 PM PDT on October 28, AWS informed its customers that increased latencies were impacting some EC2 instance launches within the use1-az2 Availability Zone.
To manage the situation, AWS throttled certain EC2 API requests and advised customers that retrying failed requests should resolve the issue. AWS also reported elevated launch failure rates for Elastic Container Service (ECS) tasks, affecting both EC2 and Fargate launch types in the US-EAST-1 region.
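AWS's standing guidance for throttled API calls is to retry with exponential backoff and jitter. A minimal sketch of that pattern is below; the helper name and parameters are illustrative, not part of any AWS SDK (the AWS SDKs, including boto3, apply a similar retry policy automatically):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0,
                       retryable=(Exception,)):
    """Retry fn with exponential backoff and full jitter, the pattern
    AWS documents for throttled API calls (illustrative helper)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Sleep a random interval in [0, base_delay * 2**attempt];
            # jitter spreads retries out so clients don't retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

In practice the `retryable` tuple would be narrowed to throttling errors (for EC2, the `RequestLimitExceeded` error code) rather than all exceptions.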
AWS's status page indicated that ECS operates multiple cells in the US-EAST-1 region, with a small number currently experiencing higher error rates. This led to warnings that customers might see their container instances disconnect from ECS, which could stop tasks in specific scenarios. Despite this, AWS stated that it had identified the underlying problem and was actively working to rectify it.
The disruptions extended to EMR Serverless services, which are utilized for large-scale data processing tools such as Hadoop and Spark. At 5:31 PM PDT, AWS updated its customers, revealing that EMR Serverless maintains a warm pool of ECS clusters to facilitate customer requests, but some of these clusters were located in the impacted ECS cells.
AWS reported that it was in the process of refreshing these warm pools with healthy clusters and noted that recovery efforts were underway for the affected ECS cells, although external visibility of progress was limited. According to AWS, ECS had halted new launches and tasks on the compromised clusters, while some services, such as Glue, were beginning to see recovery in error rates, though they continued to experience increased latency.
As of the latest updates, AWS estimated that full recovery might take an additional 2-3 hours. While the company has not disclosed the root cause of the disruptions, the recent history of issues suggests that internal dependencies within its cloud infrastructure could be contributing to the ongoing fragility of its services.
This incident has affected multiple AWS services, including App Runner, Batch, CodeBuild, Fargate, Glue, EMR Serverless, EC2, ECS, and the Elastic Kubernetes Service. However, reports indicate that there may not be widespread service disruptions, possibly due to the presence of six availability zones in the US-EAST-1 region, which allows customers to access alternative resources.
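Routing around a single impaired zone is the usual mitigation in a multi-AZ region. The sketch below shows the fallback pattern in outline, with an injected `launch` callable standing in for a real call such as boto3's `ec2.run_instances`; the zone names, `LaunchError` class, and error codes are illustrative assumptions, not AWS APIs:

```python
class LaunchError(Exception):
    """Wraps a launch failure with an AWS-style error code (illustrative)."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code

# Placeholder zone names; real names map to zone IDs per account.
FALLBACK_ZONES = ["us-east-1a", "us-east-1c", "us-east-1d"]

def launch_with_az_fallback(launch, zones=FALLBACK_ZONES,
                            retryable=("InsufficientInstanceCapacity",
                                       "RequestLimitExceeded")):
    """Try each Availability Zone in order, skipping past capacity or
    throttling errors so one impaired zone does not block the launch."""
    last_err = None
    for zone in zones:
        try:
            return launch(zone)  # e.g. a wrapper around ec2.run_instances
        except LaunchError as err:
            if err.code not in retryable:
                raise  # a non-capacity error should not be masked
            last_err = err
    raise last_err  # every zone in the list was impaired
```

Note that zone names like `us-east-1a` are shuffled per AWS account, so a production version would resolve them to stable zone IDs (such as the impaired `use1-az2`) before ordering the fallback list.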
As this situation develops, AWS customers are encouraged to monitor the status updates and prepare for potential continued latency and service interruptions in the affected areas.
