Chaos Experiment with Amazon ElastiCache | Amazon Web Services

Amazon Web Services · Advanced ·🏗️ Systems Design & Architecture ·1y ago

Learn how to test high availability with AWS Fault Injection Service and Amazon ElastiCache. Learn more at http://go.aws/46agVIU

What You'll Learn

This video demonstrates how to conduct a chaos experiment using AWS Fault Injection Simulator (FIS) to test the high availability of an Amazon ElastiCache cluster and its connected application or client. The experiment involves injecting a failure into the system and observing how it responds.

Full Transcript

Welcome, everyone. Today we'll discuss how to test the fault tolerance of the ElastiCache cluster you're operating, as well as the connected application or client, and how to monitor a test called a chaos experiment. We will explore the benefits of a chaos experiment and the process itself, and address common questions our customers have when encountering a failure in their in-memory workloads.

So what is a chaos experiment, and why should we do it? Chaos experiments help confirm whether a system is fail-safe and resilient by injecting load or faults into the system within a controlled environment. Failures are certainly something to be avoided, so why intentionally inject faults and check the reactions in advance?

The obvious reason is the potential cost of downtime. According to estimates, 91% of enterprises face over $300,000 per hour of downtime, with 44% hitting $1 million to $5 million per hour. Businesses facing regular outages pay 16 times more for recovery compared to those with less downtime. Downtime risks compliance issues, stock price declines, and, in extreme cases, even business failure.

Furthermore, most modern systems are distributed systems. Distributed systems are complex, so when we actually encounter failures, it's very difficult to identify the actual root cause starting from unfamiliar symptoms. This is why simulated failure training is necessary.

AWS provides a service called the Fault Injection Simulator (FIS) that allows you to conduct resilience testing. Resilience testing with AWS FIS is necessary for the following reasons. First, it can improve reliability and availability. Second, it can uncover hidden issues that we were not aware of beforehand. Third, it can help identify gaps in our team's response procedures. It allows us to assess various aspects such as monitoring, observability, alarms, and logs. Finally, it enables us to enhance our failure-related runbooks and playbooks.
Resilience testing is not a one-time activity in a system's life cycle. It's necessary to continuously identify areas for improvement and refine the team's response procedures through the following steps. Modern systems evolve and change constantly, much like living creatures. The teams operating these systems must also adapt to these evolutions, and they need to periodically conduct resilience testing to maintain visibility into potential failures.

Okay, now let's look at an example of fault injection. For a system configured with Multi-AZ, we can inject a fault that simulates a power interruption in one of the AZs. The following is a typical Multi-AZ architecture. Using AWS FIS, we will inject a power outage fault in one of the Availability Zones. This results in the application being served only from the right AZ, and the ElastiCache and RDS replicas in that AZ will be promoted to primary. Once the fault is induced and availability is reduced, an alert will be triggered and the ELB will start processing traffic only from the right AZ, allowing the service to be served from that AZ. This is the desired recovery process for our system in the event of an AZ failure.

In the demo that I will explain later, we focus on testing ElastiCache and its connected clients, as they are critical components. So, as shown in the diagram, we will use FIS to inject an AZ failure into the ElastiCache cluster and observe how the cluster and its clients respond.

Before that, let's learn about the Valkey GLIDE client library that we will use for the experiment. Valkey GLIDE is an open source client library maintained by the Valkey open source project. It's compatible not only with Valkey but also with Redis. Also, if you look at the high-level architecture on the right, you can see that the core logic is implemented in Rust and wrapped to provide client libraries in other languages. This architecture allows you to use the high-performance Valkey client library in Python, Node.js, Java, and Go with the same core logic.
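The Multi-AZ recovery described above, where a shard that loses its primary to an AZ power interruption promotes a replica in a surviving AZ, can be illustrated with a small in-memory model. This is only a sketch: the `Node` and `Shard` classes, the node names, and the AZ layout of shards 2 and 3 are hypothetical, and the promotion rule (first healthy replica wins) is a simplification, not how ElastiCache actually elects a primary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    az: str
    role: str  # "primary" or "replica"
    healthy: bool = True


@dataclass
class Shard:
    name: str
    nodes: list[Node] = field(default_factory=list)

    def primary(self) -> Node | None:
        """Return the healthy primary, or None if the shard has none."""
        for node in self.nodes:
            if node.role == "primary" and node.healthy:
                return node
        return None


def build_cluster() -> list[Shard]:
    """Three shards, two replicas each, spread over AZs a, b, and c."""
    # Hypothetical layout: the first AZ in each list hosts the primary.
    layout = {
        "shard-1": ["c", "a", "b"],
        "shard-2": ["a", "b", "c"],
        "shard-3": ["b", "c", "a"],
    }
    shards = []
    for shard_name, azs in layout.items():
        nodes = [
            Node(f"{shard_name}-node-{i + 1}", az, "primary" if i == 0 else "replica")
            for i, az in enumerate(azs)
        ]
        shards.append(Shard(shard_name, nodes))
    return shards


def interrupt_az(shards: list[Shard], az: str) -> None:
    """Mark every node in the given AZ as unhealthy (simulated power loss)."""
    for shard in shards:
        for node in shard.nodes:
            if node.az == az:
                node.healthy = False


def fail_over(shards: list[Shard]) -> list[str]:
    """Promote one healthy replica in each shard that lost its primary."""
    promoted = []
    for shard in shards:
        if shard.primary() is not None:
            continue  # this shard still has a working primary
        for node in shard.nodes:
            if node.role == "primary":
                node.role = "replica"  # demote the failed primary
        for node in shard.nodes:
            if node.healthy and node.role == "replica":
                node.role = "primary"
                promoted.append(node.name)
                break
    return promoted


if __name__ == "__main__":
    cluster = build_cluster()
    interrupt_az(cluster, "c")
    print("promoted:", fail_over(cluster))
```

In this model only shard 1 has its primary in AZ c, so it is the only shard that needs a promotion; the other shards lose a replica but keep serving writes from their existing primary.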
We will use the Valkey GLIDE client library as the application for the chaos experiment. Okay, let's proceed with the chaos experiment for ElastiCache. First, let's take a look at the client code installed on the EC2 instance. read.py reads random keys from the replicas, and write.py writes random string keys to the primary node. The two scripts are not significantly different, except that they use the GET and SET commands respectively and the read strategy is set separately. In this demo video, the client is running on only one host, but in a real environment the client will be running on multiple hosts. Therefore, it's a good idea to use AWS Systems Manager along with scripts like run.sh and info.sh to automate the process. Let's start the reading and writing using the script. The clients are running normally, and we can see the load being generated.

Next, let's check the ElastiCache cluster. The cluster consists of three shards with two replicas per shard and uses a total of three AZs: A, B, and C. Additionally, we will use the FIS target tag that has been entered to control which clusters the failure is injected into. This tag is used when creating experiment templates in FIS. We will be injecting a failure into Availability Zone C, and the cluster details show that node 1 of shard 1, which is in Availability Zone C, is a primary. Therefore, I will later check the engine logs of shard 1 through CloudWatch Logs.

Now let's create a template for the chaos experiment in FIS. Enter a short description and click the Next button. Then click the Add action button. Enter a name for the AZ outage action, select ElastiCache, and choose the interrupt AZ power action type. Keep the duration at five minutes and click the Save button. Select the target and enter the FIS target tag that was on the ElastiCache cluster earlier. Also set the AZ to C. If you need to, you can configure the IAM role, stop conditions, report generation, and logging settings. We will skip those for the demo.
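The console steps above (one action that interrupts AZ power for five minutes, one target selected by tag and restricted to a single AZ) can also be written down as a request body. The sketch below only builds a dictionary shaped like the input to the FIS `CreateExperimentTemplate` API; it does not call AWS. The tag key and value, role ARN, AZ code, and the names `cache-cluster` and `az-outage` are hypothetical placeholders. The action ID, resource type, target key, and `availabilityZoneIdentifier` parameter reflect my reading of the FIS actions reference and should be verified against the current documentation before use.

```python
import json


def build_az_power_template(tag_key: str, tag_value: str, az: str,
                            role_arn: str, minutes: int = 5) -> dict:
    """Build a request body shaped like FIS CreateExperimentTemplate input."""
    return {
        "description": "ElastiCache AZ power interruption",
        "roleArn": role_arn,
        # The demo skips stop conditions; "none" means no alarm-based stop.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "cache-cluster": {
                # Assumed resource type for ElastiCache replication groups.
                "resourceType": "aws:elasticache:replicationgroup",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
                # Assumed target parameter restricting the fault to one AZ.
                "parameters": {"availabilityZoneIdentifier": az},
            }
        },
        "actions": {
            "az-outage": {
                # Assumed action ID for the interrupt AZ power action.
                "actionId": "aws:elasticache:replicationgroup-interrupt-az-power",
                # FIS durations use ISO 8601 notation, e.g. PT5M for 5 minutes.
                "parameters": {"duration": f"PT{minutes}M"},
                "targets": {"ReplicationGroups": "cache-cluster"},
            }
        },
    }


if __name__ == "__main__":
    template = build_az_power_template(
        tag_key="FIS-Target",        # hypothetical tag on the cluster
        tag_value="true",
        az="us-east-1c",             # hypothetical AZ code
        role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    )
    print(json.dumps(template, indent=2))
```

With boto3, a body like this would typically be passed as keyword arguments to the FIS client's `create_experiment_template` call together with a `clientToken`; that call is omitted here so the sketch runs without AWS credentials.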
Finally, review the settings and create the template. In one to two minutes, the template for the experiment is generated. We will proceed with the experiment using the template. Start the experiment by clicking the button. Once the experiment status has changed to Running, let's check the status of ElastiCache. If the ElastiCache cluster status is Modifying, the experiment is progressing normally.

Approximately two minutes after the experiment started, let's check ElastiCache metrics such as CPU utilization. We can see that metrics for the nodes in Availability Zone C are not being collected. Next, let's check the Valkey GLIDE logs on the EC2 instance. During the failover, we can see that the keys on the nodes in Availability Zone C experienced a timeout, but then started functioning normally again. Finally, let's look at the engine logs in CloudWatch Logs. The primary node of shard 1 was impacted by the AZ failure, so the other nodes of shard 1 detected the connection failure with the existing primary and performed a failover to elect a new primary node.

Today we've conducted a chaos experiment using AWS FIS. Through this experiment, we've observed how clients like Valkey GLIDE operate, checked the cluster's engine logs, and identified the metrics that need to be collected for monitoring. As demonstrated in the demo, chaos experiments help us anticipate what may happen when failures occur and consider how to identify and address such situations. If you have become more curious after watching the demo, you can refer to the following materials. I hope this video is helpful, and I look forward to your questions. Please feel free to reach out to your account team if you have any follow-up questions on the chaos experiment for ElastiCache. Thank you again.
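The client-side behavior observed in the demo, where requests to the affected nodes time out during the failover and then succeed again, is the kind of situation an application-level retry is meant to absorb. The sketch below shows one common approach, retrying a write with exponential backoff. It is not Valkey GLIDE's API: `FlakyStore` is a hypothetical stand-in that raises `TimeoutError` a fixed number of times, and `set_with_retry`, its defaults, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import time


class FlakyStore:
    """Stand-in for a cache client: the first `failures` writes time out."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.data = {}

    def set(self, key: str, value: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("request timed out")
        self.data[key] = value


def set_with_retry(store, key, value, attempts=5, base_delay=0.01,
                   sleep=time.sleep) -> int:
    """Retry a write with exponential backoff; return the retries used."""
    for attempt in range(attempts):
        try:
            store.set(key, value)
            return attempt
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the last allowed attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    store = FlakyStore(failures=3)  # simulate three timeouts during failover
    retries = set_with_retry(store, "key-1", "value-1")
    print(f"write succeeded after {retries} retries")
```

Counting the retries (or timing them) per request gives a rough client-side view of how long the failover was visible to the application, which can be compared with the engine logs and metrics on the cluster side.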

This video teaches how to conduct a chaos experiment using AWS FIS to test the high availability of an ElastiCache cluster and its connected application or client. The experiment involves injecting a failure into the system and observing how it responds. By conducting this experiment, you can identify potential failures and improve the resilience of your system.

Key Takeaways

Create an ElastiCache cluster
Configure the Valkey GLIDE client library
Inject a failure into the system using AWS FIS
Observe the system's response to the failure
Analyze the metrics and logs to identify potential issues

💡 Conducting Chaos experiments can help identify potential failures in distributed systems and improve their resilience
