Chaos Experiment with Amazon ElastiCache | Amazon Web Services
Learn how to test high availability with AWS Fault Injection Service and Amazon ElastiCache.
Learn more at - http://go.aws/46agVIU
What You'll Learn
This video demonstrates how to conduct a chaos experiment using AWS Fault Injection Simulator (FIS) to test the high availability of an Amazon ElastiCache cluster and its connected application or client. The experiment injects an Availability Zone failure into the cluster and observes how the cluster and its clients respond.
Full Transcript
Welcome, everyone. Today we'll discuss how to test the fault tolerance of the ElastiCache cluster you're operating, as well as the connected application or client, and how to conduct a test called a chaos experiment. We will explore the benefits of a chaos experiment and the process itself, and address common questions our customers have when encountering a failure in their in-memory workloads.
So what is a chaos experiment, and why should we do it? Chaos experiments help confirm whether a system is fail-safe and resilient by injecting load or faults into the system within a controlled environment. Failures are certainly something to be avoided, so why intentionally inject faults and check the reactions in advance? The obvious reason is the potential cost of downtime. According to estimates, for 91% of enterprises an hour of downtime costs over $300,000, with 44% facing $1 million to $5 million per hour. Businesses facing regular outages pay 16 times more for recovery than those with less downtime. Downtime also risks compliance issues, falling stock prices, and, in extreme cases, even business failure.
Furthermore, most modern systems are distributed systems, and distributed systems are complex. When we actually encounter failures, it is very difficult to identify the root cause starting from unfamiliar symptoms. This is why simulated failure training is necessary.
AWS provides a service called AWS Fault Injection Simulator (FIS) that allows you to conduct resilience testing. Resilience testing with AWS FIS is necessary for the following reasons. First, it can improve reliability and availability. Second, it can uncover hidden issues that we are not aware of beforehand. Third, it can help identify gaps in our team's response procedures. It allows us to assess various aspects such as monitoring, observability, alarms, and logs. Finally, it enables us to enhance our failure-related runbooks and playbooks.
Resilience testing is not a one-time activity in a system's life cycle. It's necessary to continuously identify areas for improvement and refine the team's response procedures through the following steps. Modern systems evolve and change constantly, much like living creatures. The teams operating these systems must also adapt to these changes, and they need to periodically conduct resilience testing to maintain visibility into potential failures.
Okay, now let's look at an example of fault injection. For a system configured with Multi-AZ, we can inject a fault that simulates a power interruption in one of the AZs. The following is a typical Multi-AZ architecture. Using AWS FIS, we will inject a power outage fault into one of the Availability Zones. This results in the application being served only from the AZ on the right, and the ElastiCache and RDS nodes there will be promoted to primary. Once the fault is induced and availability is reduced, an alert will be triggered and the ELB will start processing traffic only in the AZ on the right, allowing the service to be served from that AZ. This is the desired recovery process for our system in the event of an AZ failure.
In the demo that I will explain later, we focus on testing ElastiCache and its connected clients, as they are critical components. As shown in the diagram, we will use FIS to inject an AZ failure into the ElastiCache cluster and observe how the cluster and its clients respond.
Before that, let's learn about the Valkey GLIDE client library that we will use for the experiment. Valkey GLIDE is an open source client library maintained by the Valkey open source project. It's compatible not only with Valkey but also with Redis OSS. If you look at the high-level architecture on the right, you can see that the core logic is implemented in Rust and wrapped to provide client libraries in other languages. This architecture allows you to use the high-performance Valkey client library in Python, Node.js, Java, and Go with the same core logic.
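The scripts shown in the video are not reproduced in the transcript, so the following is only a minimal sketch of how a Python client built on the GLIDE library might generate read or write load. The function names `make_client`, `random_key`, and `run_load`, the 500 ms timeout, and the TLS setting are assumptions for illustration; the `glide` imports assume the `valkey-glide` package is installed.

```python
import asyncio
import random
import string


async def make_client(host: str, port: int = 6379, prefer_replica: bool = True):
    """Create a GLIDE cluster client (assumes `pip install valkey-glide`)."""
    # Imported lazily so the load loop below can be exercised without the library.
    from glide import (
        GlideClusterClient,
        GlideClusterClientConfiguration,
        NodeAddress,
        ReadFrom,
    )

    config = GlideClusterClientConfiguration(
        addresses=[NodeAddress(host, port)],
        # Readers prefer replicas; writers always go to the primary.
        read_from=ReadFrom.PREFER_REPLICA if prefer_replica else ReadFrom.PRIMARY,
        request_timeout=500,  # milliseconds; an assumed value
        use_tls=True,  # assumption: in-transit encryption is enabled
    )
    return await GlideClusterClient.create(config)


def random_key(length: int = 8) -> str:
    """Return a random lowercase string to use as a key or value."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


async def run_load(client, mode: str, iterations: int, delay: float = 0.0) -> dict:
    """Issue GET ("read") or SET (any other mode) calls and count the outcomes."""
    stats = {"ok": 0, "failed": 0}
    for _ in range(iterations):
        key = random_key()
        try:
            if mode == "read":
                await client.get(key)
            else:
                await client.set(key, random_key(16))
            stats["ok"] += 1
        except Exception as exc:  # timeouts during a failover are counted here
            stats["failed"] += 1
            print(f"{mode} {key} failed: {exc!r}")
        await asyncio.sleep(delay)
    return stats
```

A reader would call `await run_load(await make_client(endpoint), "read", 1000, delay=0.1)`, and a writer would pass `prefer_replica=False` and a mode of `"write"`. Catching the exception and continuing, rather than exiting, is what lets the client keep reporting during and after a failover.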
We will use the Valkey GLIDE client library as the application for the chaos experiment.
Okay, let's proceed with the chaos experiment for ElastiCache. First, let's take a look at the client code installed on the EC2 instance. read.py reads random keys from the replicas, and write.py writes random string keys to the primary node. The two scripts are not significantly different, except that they use the GET and SET commands respectively and the read strategy is set separately. In this demo video, the client is running on only one host, but in a real environment the client will be running on multiple hosts. Therefore, it's a good idea to use AWS Systems Manager along with scripts like run.sh and info.sh to automate the process. Let's start the reading and writing using the script. The clients are running normally, and we can see the logs being generated.
Next, let's check the ElastiCache cluster. The cluster consists of three shards with two replicas per shard and uses a total of three AZs: A, B, and C. Additionally, we will use the FIS target tag that has been entered to control which clusters the failure is injected into. This tag is used when creating experiment templates in FIS. We will be injecting a failure into Availability Zone C, and the cluster details show that node 1 of shard 1, which is in Availability Zone C, is a primary. Therefore, I will later check the engine logs of shard 1 through CloudWatch Logs.
Now let's create a template for the chaos experiment in FIS. In the description, enter bar exp and click the Next button. Then click the Add action button. For the name, enter AZ outage and select the ElastiCache interrupt AZ power action type. Keep the duration at five minutes and click the Save button. Select the target and enter the FIS target tag that was set on the ElastiCache cluster earlier. Also set the AZ to C. If you need to, you can configure the IAM role, the termination condition, report generation, and logging settings. We will skip those for the demo.
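The console steps above could also be expressed programmatically. The following is a rough sketch, not the template used in the video: the tag key and value, the AZ name, the role ARN, and the helper names are hypothetical, and the action ID, resource type, and `availabilityZoneIdentifier` parameter reflect my reading of the FIS action reference and should be checked against the current documentation. Unlike the console flow, the `create_experiment_template` API expects a role ARN and at least one stop condition, so placeholders are included.

```python
import json
import uuid


def build_az_power_template(tag_key, tag_value, az, role_arn, minutes=5):
    """Build the request for an ElastiCache AZ power interruption experiment."""
    target_name = "elasticache-cluster"  # hypothetical label for the target
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "ElastiCache AZ power interruption",
        "roleArn": role_arn,
        # No stop condition, mirroring the demo, which skips this setting.
        "stopConditions": [{"source": "none"}],
        "targets": {
            target_name: {
                "resourceType": "aws:elasticache:replicationgroup",
                # Only clusters that carry this tag are affected.
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
                "parameters": {"availabilityZoneIdentifier": az},
            }
        },
        "actions": {
            "az-outage": {
                "actionId": "aws:elasticache:replicationgroup-interrupt-az-power",
                "parameters": {"duration": f"PT{minutes}M"},  # ISO 8601 duration
                "targets": {"ReplicationGroups": target_name},
            }
        },
    }


def create_template(template: dict) -> str:
    """Send the template to FIS and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]


if __name__ == "__main__":
    # All values below are placeholders, not values from the video.
    template = build_az_power_template(
        "fis-target", "true", "us-east-1c",
        "arn:aws:iam::123456789012:role/example-fis-role",
    )
    print(json.dumps(template, indent=2))
```

Keeping the builder separate from the API call makes it possible to review or version the template as JSON before anything is created in the account.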
Finally, review the settings and create the template. In one to two minutes, the template for the experiment is generated. We will proceed with the experiment using the template. Start the experiment by clicking the button. Once the experiment status has changed to Running, let's check the status of ElastiCache. If the ElastiCache cluster status is Modifying, the experiment is progressing normally.
Approximately 2 minutes after the experiment started, let's check ElastiCache metrics such as CPU utilization. We can see that metrics for the nodes in Availability Zone C are not being collected. Next, let's check the Valkey GLIDE logs on the EC2 instance. During the failover, we can see that requests for keys on the nodes in Zone C experienced timeouts, but then started functioning normally again. Finally, let's look at the engine logs in CloudWatch Logs. The primary node of shard 1 was impacted by the AZ failure, so the other nodes of shard 1 detected the connection failure with the existing primary and performed a failover to select a new primary node.
Today we've conducted a chaos experiment using AWS FIS. Through this experiment, we've observed how clients like Valkey GLIDE operate, checked the cluster's engine logs, and identified the metrics that need to be collected for monitoring. As demonstrated in the demo, chaos experiments help us anticipate what may happen when failures occur and consider how to identify and address such situations. If you have become more curious after watching the demo, you can refer to the following materials. I hope this video is helpful, and I look forward to your questions. Please feel free to reach out to your account teams if you have any follow-up questions on the chaos experiment for ElastiCache. Thank you again.
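The client-side observation in the demo (timeouts during the failover, then recovery) can be turned into a number. The sketch below is an illustration rather than anything shown in the video: it assumes the client records one `(timestamp_seconds, succeeded)` sample per request, and the `Outage` and `find_outages` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Outage:
    start: float  # timestamp of the first failed request
    end: float  # timestamp of the first successful request afterwards
    failures: int  # number of failed requests in the window

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_outages(samples: Iterable[Tuple[float, bool]]) -> List[Outage]:
    """Group consecutive failed requests into outage windows.

    A window runs from the first failure to the first success that follows it.
    If the samples end while requests are still failing, the window is closed
    at the last failed sample.
    """
    outages: List[Outage] = []
    start = None
    failures = 0
    last_ts = None
    for ts, ok in sorted(samples):
        if not ok:
            if start is None:
                start = ts
            failures += 1
        elif start is not None:
            outages.append(Outage(start, ts, failures))
            start, failures = None, 0
        last_ts = ts
    if start is not None:
        outages.append(Outage(start, last_ts, failures))
    return outages


if __name__ == "__main__":
    # Made-up samples: three failed requests between t=2 and t=5.
    samples = [(0.0, True), (1.0, True), (2.0, False), (3.0, False),
               (4.0, False), (5.0, True), (6.0, True)]
    for outage in find_outages(samples):
        print(f"{outage.failures} failures over {outage.duration:.1f}s")
```

Comparing the measured window with the timestamps in the engine logs is one way to check whether client timeouts line up with the failover itself or extend beyond it.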