  • Cloud-Native

Stream Vs. Batch: Leveraging M3 And Thanos For Real-Time Aggregation

  • aster.cloud
  • December 16, 2021
  • 5 minute read

KubeCon North America 2021 – Breakout Session Recap

I gave a breakout session at this year’s KubeCon North America titled “Stream vs. Batch: Leveraging M3 and Thanos for Real-Time Aggregation.” This post recaps the topics and concepts discussed during the session. Visit our KubeCon North America events page for the full session recording.

Why aggregation matters for real-time

Monitoring workflows aim to minimize the time it takes to detect incidents, so real-time insights are critical for maintaining reliable cloud-native applications. But monitoring business-critical applications becomes difficult at scale. How do you keep processing large volumes of real-time data while still getting valuable insights? This is where aggregation can help!



Take the example given during the presentation: querying a high-cardinality metric such as CPU usage can take up to 20 seconds to complete, because the query has to fetch 60,000 time series across all pods and labels.

In most metrics monitoring use cases, however, you don’t need to view metrics at the per-pod or per-label level; an aggregate view is sufficient for understanding how your system is performing at a high level. Continuing with the above example, if you aggregate on only two labels (e.g. container name and namespace) by pre-computing the sum at one-minute intervals, the query returns in near real time (0.4 seconds) and reads roughly 200 time series.
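To make the cardinality reduction concrete, here is a minimal Python sketch of summing per-pod series down to two labels. The sample data, the label names, and the `aggregate_by_labels` helper are all hypothetical and only illustrate the idea; they are not taken from the session.

```python
from collections import defaultdict


def aggregate_by_labels(series, keep):
    """Sum the values of all series that share the same `keep` labels."""
    totals = defaultdict(float)
    for labels, value in series:
        # Every label not listed in `keep` (for example `pod`) is dropped here.
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical per-pod CPU samples as (labels, value) pairs.
raw = [
    ({"namespace": "shop", "container": "api", "pod": "api-1"}, 0.25),
    ({"namespace": "shop", "container": "api", "pod": "api-2"}, 0.5),
    ({"namespace": "shop", "container": "db", "pod": "db-1"}, 1.0),
    ({"namespace": "ops", "container": "api", "pod": "api-9"}, 0.125),
]

rolled_up = aggregate_by_labels(raw, keep=("container", "namespace"))
print(len(raw), "series in,", len(rolled_up), "series out")
```

The number of output series depends only on how many distinct (container, namespace) pairs exist, not on how many pods there are, which is why the aggregate view stays small as the cluster grows.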

Stream vs. batch aggregation

Given the value of aggregation for query performance and real-time results, it’s also important to know the two primary approaches to metrics aggregation: stream and batch.

  • With stream aggregation, metrics data is streaming continuously, and the aggregation is done in memory on the streaming ingest path before writing results to a time series database (TSDB). Because data is aggregated in real-time, streaming aggregation is typically meant for information that’s needed immediately.
  • With batch aggregation, data gets collected over time, and aggregation is done by reading raw metrics from a TSDB before writing back the aggregated metric data. Because data is aggregated in batches over time, batch aggregation is typically meant for large quantities of information that aren’t time sensitive.
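The difference between the two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how M3 or Thanos are implemented: the TSDB is a plain dictionary, samples are hypothetical `(timestamp, group, value)` tuples, and `StreamAggregator` and `batch_aggregate` are made-up names.

```python
from collections import defaultdict

WINDOW = 60  # seconds; one-minute aggregation windows


def window_start(ts):
    return ts - ts % WINDOW


class StreamAggregator:
    """Aggregates samples in memory as they arrive, before anything is stored."""

    def __init__(self):
        self.open_windows = defaultdict(float)

    def ingest(self, ts, group, value):
        self.open_windows[(window_start(ts), group)] += value

    def flush(self, tsdb, now):
        """Write every window that has closed by `now`, then drop it from memory."""
        closed = [key for key in self.open_windows if key[0] + WINDOW <= now]
        for key in closed:
            tsdb[key] = self.open_windows.pop(key)


def batch_aggregate(raw_tsdb, aggregated_tsdb):
    """Reads raw samples back out of storage, then writes the aggregates."""
    totals = defaultdict(float)
    for ts, group, value in raw_tsdb:
        totals[(window_start(ts), group)] += value
    aggregated_tsdb.update(totals)


samples = [(0, "api", 1.0), (30, "api", 2.0), (61, "api", 4.0), (10, "db", 8.0)]

# Stream: only the aggregates ever reach storage.
stream_out = {}
aggregator = StreamAggregator()
for ts, group, value in samples:
    aggregator.ingest(ts, group, value)
aggregator.flush(stream_out, now=120)

# Batch: raw samples are stored first, then read back and aggregated.
raw_store = list(samples)
batch_out = {}
batch_aggregate(raw_store, batch_out)
```

Both paths produce the same aggregates. What differs is when the work happens and whether the raw samples have to be written and read back first.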

Aggregation with M3 and Thanos

M3 and Thanos each have their own approaches to stream and batch aggregation, both of which are based on how Prometheus performs aggregation via recording rules. Prometheus recording rules pre-compute frequently needed or expensive queries and store the results back to a TSDB as new aggregate metrics. They are evaluated in memory by a single process at regular intervals, which makes them especially useful for dashboards. With large-scale metrics monitoring, however, you will typically outgrow a single Prometheus instance and turn to a Prometheus remote storage solution such as M3, Thanos, or Cortex.

Stream aggregation with M3

M3 is an open source metrics engine composed of four main components:

  • M3DB – distributed, custom built TSDB
  • M3 Coordinator – optimized ingest and downsampling tier
  • M3 Aggregator – streaming aggregation tier (optional, depends on use case)
  • M3 Query – distributed query engine

M3’s approach to aggregation uses roll-up rules, which aggregate across multiple time series at regular intervals using the M3 coordinator and, in some use cases, the M3 aggregator. Before writing the newly aggregated series to M3DB, the M3 coordinator reconstitutes the series as a counter, histogram, or gauge metric, all of which are compatible with PromQL (check out our blog on the primary types of Prometheus metrics for more information). Because aggregation is done in memory on the ingest path, the aggregated metrics are immediately available for query with M3.
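As a rough illustration of what rolling up a counter and reconstituting it as a counter can involve, the Python sketch below takes per-series increases, sums them across the series in each group, and accumulates a new running total. The three-step pipeline, the `increase` and `roll_up_counter` helpers, and the sample data are assumptions made for this example; this is not M3's actual code or configuration.

```python
from collections import defaultdict


def increase(samples):
    """Deltas between consecutive counter samples; a drop is treated as a reset."""
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        deltas.append((ts, cur - prev if cur >= prev else cur))
    return deltas


def roll_up_counter(series, group_by):
    """Increase per series, sum per group, then rebuild a monotonic counter."""
    per_group = defaultdict(lambda: defaultdict(float))
    for labels, samples in series:
        key = tuple(labels.get(name, "") for name in group_by)
        for ts, delta in increase(samples):
            per_group[key][ts] += delta

    rolled = {}
    for key, deltas in per_group.items():
        total, points = 0.0, []
        for ts in sorted(deltas):
            total += deltas[ts]
            points.append((ts, total))
        rolled[key] = points
    return rolled


# Hypothetical request counters as (labels, [(timestamp, value), ...]).
series = [
    ({"status_code": "200", "pod": "a"}, [(0, 10.0), (60, 15.0), (120, 18.0)]),
    # Pod b restarts before the last sample, so its counter drops.
    ({"status_code": "200", "pod": "b"}, [(0, 100.0), (60, 101.0), (120, 2.0)]),
    ({"status_code": "500", "pod": "a"}, [(0, 1.0), (60, 1.0), (120, 3.0)]),
]

rolled = roll_up_counter(series, group_by=("status_code",))
```

Summing raw counter values directly would produce a series that jumps backwards whenever one pod restarts. Working on increases and re-accumulating keeps the rolled-up series monotonic, so counter functions in PromQL such as `rate()` still behave as expected on it.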

Simplified architectural diagram to demonstrate M3 streaming aggregation

Batch Aggregation with Thanos

Similar to M3, Thanos is an open source metrics monitoring solution. It has several main components, including:

  • Store / Store API – gateway to object store
  • Querier – horizontally scalable and stateless query, aggregation, and deduplication tier
  • Sidecar – proxy for Prometheus via remote write/read
  • Compactor – downsampling and block compaction tier
  • Ruler – evaluates Prometheus recording and alerting rules

With the Thanos sidecar setup, Prometheus metrics are scraped and stored inside each Prometheus instance. From there, the Thanos query tier pulls data from the instances via the sidecars before aggregating and deduplicating the metrics. Once these metrics have been processed inside the querier, the query results are available for display inside your dashboards (e.g. Grafana). However, for larger-scale queries, especially those needed on a regular basis, the ruler can evaluate Prometheus recording rules by querying the collected metrics through the querier. Once the rules have been evaluated, the ruler sends the newly aggregated time series to the Thanos object store (e.g. S3) for query and/or longer-term storage.
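The deduplication step in the querier can be pictured with the simplified Python sketch below. It assumes that replicated Prometheus instances differ only in a `replica` label and that their samples share exact timestamps. Both are simplifications made for illustration, and the `deduplicate` helper is hypothetical rather than Thanos's actual algorithm.

```python
def deduplicate(series, replica_label="replica"):
    """Merge series that are identical apart from the replica label."""
    merged = {}
    for labels, samples in series:
        # The identity of a series is its label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            # The first replica to report a timestamp wins; others fill gaps.
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two hypothetical replicas scraping the same target; replica-0 missed one scrape.
replicated = [
    ({"job": "api", "replica": "replica-0"}, [(0, 1.0), (60, 3.0)]),
    ({"job": "api", "replica": "replica-1"}, [(0, 1.0), (30, 2.0), (60, 3.0)]),
]

deduped = deduplicate(replicated)
```

The result is a single series per target with the gap from one replica filled in by the other, which is what a dashboard or a recording rule then queries.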

Simplified architectural diagram to demonstrate Thanos batch aggregation

How to choose: pros and cons of each approach

Let’s now take a look at the various benefits and tradeoffs of these two approaches – streaming with M3 and batch with Thanos – and how they compare to one another:

M3 Pros: With M3, metrics are aggregated in memory on the ingest path, making them immediately available for query. Additionally, with roll-up rules, only the aggregated metrics need to be persisted to M3DB and all other raw data can be dropped. Because this reduces the query load on M3DB, you are able to scale to a higher number of alerts and recording rules.


M3 Cons: In terms of tradeoffs, M3 aggregation is more complex to deploy and operate than Thanos. It also does not support arbitrary PromQL, but instead reconstitutes the aggregate metrics as counters, histograms, and gauges.

Thanos Pros: Compared to M3, Thanos is simpler to operate, especially when scaling resources up or down. It is also fully PromQL compatible, allowing for arbitrary PromQL queries and aggregation via Prometheus recording rules.

Thanos Cons: In terms of tradeoffs, Thanos aggregation adds a step compared to streaming aggregation, as you need to re-query, read, and then write metrics to storage over the network. This can lead to high resource consumption as well as slow queries. Because aggregation is performed against the query tier, larger-scale queries may also take a while to request metrics from each Prometheus instance and, in some cases, will miss the evaluation intervals set by the recording rules, leading to skipped data points.
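To see how slow evaluations turn into skipped data points, consider the toy Python scheduler below. It assumes a rule is due at every multiple of its interval and that a tick is skipped if the previous evaluation is still running. This is a deliberately simplified model with a hypothetical `simulate_rule_schedule` helper, not the scheduler used by Prometheus or the Thanos ruler.

```python
def simulate_rule_schedule(interval, durations):
    """Return (evaluated_ticks, skipped_ticks) for the given evaluation durations."""
    evaluated, skipped = [], []
    remaining = list(durations)
    busy_until = 0.0
    tick = 0
    while remaining:
        start = tick * interval
        if start >= busy_until:
            # The previous evaluation has finished, so this tick runs.
            busy_until = start + remaining.pop(0)
            evaluated.append(start)
        else:
            # Still evaluating the previous tick: no data point for this one.
            skipped.append(start)
        tick += 1
    return evaluated, skipped


# One-minute rule interval; the second evaluation takes 130 seconds.
evaluated, skipped = simulate_rule_schedule(interval=60, durations=[20, 130, 20])
print("evaluated at:", evaluated)
print("skipped at:", skipped)
```

In this model a single 130-second evaluation costs two consecutive data points in the aggregated series, which shows up as a gap in any dashboard built on it.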

That’s a wrap!

By focusing on M3 and Thanos, we’re able to compare some of the major benefits and tradeoffs of using stream and batch processing for large-scale metrics monitoring. If you’re interested in learning more, check out the full session recording or visit the M3 and Thanos documentation.

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. To learn more, visit https://chronosphere.io/ or request a demo.

 

 

Guest post originally published on Chronosphere’s blog by Gibbs Cullens
Source CNCF

