aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Cloud-Native
  • Software

How Chaos Mesh Helps Apache APISIX Improve System Stability

  • aster.cloud
  • September 20, 2021
  • 4 minute read
Chaos Mesh helps Apache APISIX improve system stability

Apache APISIX is a cloud-native, high-performance, scaling microservices API gateway. It is one of the Apache Software Foundation’s top-level projects and serves hundreds of companies around the world, processing their mission-critical traffic, including finance, the Internet, manufacturing, retail, and operators. Our customers include NASA, the European Union’s digital factory, China Mobile, and Tencent.

Apache APISIX architecture

Apache APISIX architecture


Partner with aster.cloud
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

As our community grows, Apache APISIX’s features more frequently interact with external components, making our system more complex and increasing the possibility of errors. To identify potential system failures and build confidence in the production environment, we introduced the concept of Chaos Engineering.

In this post, we’ll share how we use Chaos Mesh® to improve our system stability.

Our pain points

Apache APISIX processes tens of billions of requests a day. At that volume level, our users have noticed a couple of issues:

  • Scenario #1: In Apache APISIX’s configuration center, when unexpectedly high network latency occurs between etcd and Apache APISIX, can Apache APISIX still filter and forward traffic normally?
  • Scenario #2: When a node in the etcd cluster fails and the cluster can still run normally, an error is reported for the node’s interaction with the Apache APISIX admin API.

Although Apache APISIX has covered many scenarios through unit, end-to-end (E2E), and fuzz tests in continuous integration (CI), it has not covered the interaction scenario with external components. If the system behaves abnormally, for example, if the network jitters, a hard disk fails, or a process is killed, can Apache APISIX give appropriate error messages? Can it keep running or restore itself to normal operation?

Why we chose Chaos Mesh

To test these user scenarios and to discover similar problems before our product goes into production, our community decided to use Chaos Mesh for chaos testing.

Read More  IBM Study: Vehicles Believed to be Software Defined and AI Powered by 2035

Chaos Mesh is a cloud-native Chaos Engineering platform that features all-around fault injection methods for complex systems on Kubernetes, covering faults in Pod, the network, file system, and even the kernel. It helps users find weaknesses in the system and ensures that the system can resist out-of-control situations in the production environment.

Like Apache APISIX, Chaos Mesh has an active open source community. We know that an active community can ensure stable software use and rapid iteration. This makes Chaos Mesh more attractive.

How we use Chaos Mesh in APISIX

Chaos Engineering has grown beyond simple fault injection and now forms a complete methodology. To create a chaos experiment, we determined what the normal operation or “steady state” of our application should be. We then introduced potential problems to see how the system responded. If the problems knocked the application out of its steady state, we fixed them.

Now, we’ll take the two scenarios we mentioned to show you how we use Chaos Mesh in Apache APISIX.

Scenario #1

We deployed a Chaos Engineering experiment using the following steps:

  1. We found metrics to measure whether Apache APISIX is running normally. In the test, the most important method is to use Grafana to monitor the Apache APISIX’s running metrics. We extracted data from Prometheus in CI for comparison. Here, we used the routing and forwarding requests per second (RPS) and etcd connectivity as evaluation metrics. We analyzed the log. For Apache APISIX, we checked Nginx’s error log to determine whether there was an error and whether the error was in line with our expectations.
  2. We performed a test in the control group. We found that both create route and access route were successful, and we could connect to etcd. We recorded the RPS.
  3. We used network chaos to add a five second network latency and then retested. This time, set route failed, get route succeeded, etcd could be connected to, and RPS had no significant change compared to the previous experiment. The experiment met our expectations.
Read More  Dragonfly Integrates Nydus For Image Acceleration Practice
High network latency occurs between etcd and Apache APISIX

High network latency occurs between etcd and Apache APISIX

Scenario #2

After we conducted the same experiment as above in the control group, we introduced pod-kill chaos and reproduced the expected error. When we randomly deleted a small number of etcd nodes in the cluster, sometimes APISIX could connect to etcd and sometimes not, and the log printed a large number of connection rejection errors.

When we deleted the first or third node in the etcd endpoint list, the set route returned a result normally. However, when we deleted the second node in the list, the set route returned the error “connection refused.”

Our troubleshooting revealed that the etcd Lua API used by Apache APISIX selected the endpoint sequentially, not randomly. Therefore, when we created an etcd client, we bound to only one etcd endpoint. This led to continuous failure.

After we fixed this problem, we added a health check to the etcd Lua API to ensure that a large number of requests would not be sent to the disconnected etcd node. To avoid flooding the log with errors, we added a fallback mechanism when the etcd cluster was completely disconnected.

An error is reported

An error is reported from one etcd node’s interaction with the Apache APISIX admin API

Our future plans

Run a chaos test in E2E simulation scenarios

In Apache APISIX, we manually identify system weaknesses for testing and repair. As in the open source community, we test in CI, so we don’t need to worry about the impact of Chaos Engineering’s failure radius on the production environment. But the test cannot cover complicated and comprehensive application scenarios in the production environment.

Read More  Five Exciting Things About Istio Ambient Mesh

To cover more scenarios, the community plans to use the existing E2E test to simulate more complete scenarios and conduct chaos tests that are more random and cover a larger range.

Add chaos tests to more Apache APISIX projects

In addition to finding more vulnerabilities for Apache APISIX, the community plans to add chaos tests to more projects such as Apache APISIX Dashboard and Apache APISIX Ingress Controller.

Add features to Chaos Mesh

When we deployed Chaos Mesh, some features were temporarily unsupported. For example, we couldn’t select a service as a network latency target or specify container port injection as network chaos. In the future, the Apache APISIX community will assist Chaos Mesh to add related features.

You’re welcome to contribute to the Apache APISIX project on GitHub. If you are interested in Chaos Mesh and would like to improve it, join its Slack channel (#project-chaos-mesh) or submit your pull requests or issues to its GitHub repository. Chaos Engineering

 

Guest post originally published on PingCAP‘s blog by Shuyang Wu


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

aster.cloud

Related Topics
  • Apache APISIX
  • Chaos Mesh
  • CNCF
You May Also Like
men with computer website information and chat bubbles vector illustration
View Post
  • Software
  • Software Engineering

What is an ISV (independent software vendor)?

  • August 27, 2025
aster-cloud-erp-bill_of_materials_2
View Post
  • Software
  • Software Engineering

What is an SBOM (software bill of materials)?

  • July 2, 2025
aster-cloud-sms-pexels-tim-samuel-6697306
View Post
  • Programming
  • Software

Send SMS texts with Amazon’s SNS simple notification service

  • July 1, 2025
aster-cloud-website-pexels-goumbik-574069
View Post
  • Programming
  • Software

Host a static website on AWS with Amazon S3 and Route 53

  • June 27, 2025
View Post
  • Software
  • Technology

Canonical Releases Ubuntu 25.04 Plucky Puffin

  • April 17, 2025
View Post
  • Software
  • Technology

IBM Accelerates Momentum in the as a Service Space with Growing Portfolio of Tools Simplifying Infrastructure Management

  • March 27, 2025
Vehicle manufacturing
View Post
  • Software

IBM Study: Vehicles Believed to be Software Defined and AI Powered by 2035

  • December 12, 2024
aster-cloud-tux-gaming
View Post
  • Computing
  • Gears
  • Software

5 best Linux distributions for gamers in 2024

  • September 11, 2024

Stay Connected!
LATEST
  • digital-nomad-freelancer-worker-2151205464 1
    One paperwork problem – Get your Digital Nomad Visa employment documents fast from UK, EU or Singapore
    • June 16, 2026
  • 2
    Samsung Art Store Brings Art Basel to Homes Worldwide With New Curated Collection
    • June 15, 2026
  • 3
    You Do Not Need to Invest in the IPO of SpaceX, Anthropic, and OpenAI
    • June 10, 2026
  • 4
    The consequences of relying on AI for accurate news
    • June 10, 2026
  • 5
    Connecting AI agents with unstructured data using Google Cloud Storage MCP Servers
    • June 10, 2026
  • 6
    WWDC26: Apple unveils next generation of Apple Intelligence, Siri AI, powerful parental controls, and an expansive set of software improvements
    • June 8, 2026
  • 7
    IBM and Google Cloud Announce Strategic Partnership to Scale AI with Human Expertise and AI‑Powered Delivery
    • June 4, 2026
  • Data center 8
    Data Sovereignty in Spain. It’s Not Just About the Law, It’s About Efficiency
    • June 3, 2026
  • 9
    Ink vs Pixels. What you miss versus what you are actually missing.
    • June 1, 2026
  • 10
    Banks race to patch new cyber vulnerabilities, and other cybersecurity news
    • May 25, 2026
about
Hello World!

We are aster.cloud. We’re created by programmers for programmers.

Our site aims to provide guides, programming tips, reviews, and interesting materials for tech people and those who want to learn in general.

We would like to hear from you.

If you have any feedback, enquiries, or sponsorship request, kindly reach out to us at:

[email protected]
Most Popular
  • pope-leo-xiv-cq5dam-1500.844 1
    Pope Leo XIV to Publish First Encyclical on Artificial Intelligence and Human Dignity on 25 May
    • May 22, 2026
  • 2
    Portfolio to Clients, and is Strengthened by Ongoing Project Glasswing Work
    • May 20, 2026
  • reMarkable Paper Pure 3
    Everything The reMarkable Paper Pure Actually Does
    • May 14, 2026
  • 4
    Scaling cloud and AI: Microsoft Azure’s commitment to Europe’s digital future
    • May 11, 2026
  • Anthropic Institute 5
    Introducing The Anthropic Institute
    • March 11, 2026
  • /
  • Technology
  • Tools
  • About
  • Contact Us

Input your search keywords and press Enter.