General | July 19, 2021

Getting started with Observability

Are you observing your systems?

Most of the tech giants, including Amazon and Netflix, started out building their systems on a monolithic architecture because, at the time, it was much faster to set up a monolith and get the business moving. But as a product matures or grows fast, the system grows with it and the code gets more and more complicated. They all faced this problem and looked to microservices as a solution. One of the biggest benefits of microservices is that each microservice can be developed, scaled, and deployed independently.

With great power comes great responsibility, and that is what organizations found when they switched from monolithic application architectures to microservices. They gained significant benefits in delivery speed and scalability, but on the flip side they now have to deal with the operational complexity of managing, monitoring, and securing the new distributed architecture.

One of the benefits of working with older technologies was the limited set of defined failure modes. Yes, things broke, but you would pretty much know what broke at any given time, or you could find out quickly because a lot of older systems failed in pretty much the same three ways over and over again.


Microservice interactions at Amazon and Netflix (Image by divante.com)

Adopting a single deployment platform can address some of the concerns regarding operational complexity, but it goes against the philosophy that makes microservice architectures effective. Using APIs to expose core business functionality and facilitate service-to-service communication gives us several control points and makes it easier to deal with complex modern applications. API-driven applications come with their own issues, such as design complexity, visibility, communication, and security, which we discussed in detail in this blog post.

In a nutshell, operating distributed systems is hard, not only because of the inherent complexity that comes from the number of components and their distribution, but also because of the unpredictability of their failure modes: there are plenty of unknown unknowns. We are left with an imperative to build systems that can be debugged, armed with evidence instead of conjecture.

With the growing complexity of systems and the fast-moving software delivery trains of modern cloud-native architectures, the possible failure modes have become more abundant. Monitoring tools helped us for a while in keeping track of application and infrastructure performance, but they aren’t well suited to modern distributed applications. As discussed above, developers these days don’t know all the ways their software can fail, and more unknown unknowns means problems go unfixed because we don’t know they exist in the first place. Standard monitoring can only help you track known unknowns, and even that is relative: your monitoring is only as useful as your system is monitorable.

This monitor-ableness of your modern applications is what we call "Observability".

In control theory, observability is defined as a measure of how well the internal states of a system can be inferred from knowledge of that system’s external outputs. Simply put, observability is how well you can understand your complex system.

Metrics, events, logs, and traces (MELT) are at the core of Observability. But Observability is about a whole lot more than just data.

Observability is all about the ability to ask abstract questions of your system and find the answers without needing to open a black box. For example, suppose placing an order on Amazon failed because of a query timeout. What characteristics did the queries that timed out at 500ms have in common? Service versions? Browser plugins? Here, instrumentation produces data, which is what we call telemetry, and querying that data answers our questions.
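To make the question above concrete, here is a minimal sketch in Python of querying telemetry for what the timed-out queries have in common. Everything in it is an assumption for illustration: the event fields (`service_version`, `browser_plugin`, `duration_ms`), the sample values, and the helper name `shared_characteristics` are hypothetical, and a real system would run this kind of query against its telemetry backend and not against an in-memory list.

```python
from collections import Counter

# Hypothetical telemetry events emitted by instrumentation.
# Field names and values are made up for illustration only.
EVENTS = [
    {"service_version": "v1.4.2", "browser_plugin": "none", "duration_ms": 120},
    {"service_version": "v1.5.0", "browser_plugin": "adblock", "duration_ms": 512},
    {"service_version": "v1.5.0", "browser_plugin": "none", "duration_ms": 530},
    {"service_version": "v1.4.2", "browser_plugin": "adblock", "duration_ms": 95},
    {"service_version": "v1.5.0", "browser_plugin": "none", "duration_ms": 501},
    {"service_version": "v1.4.2", "browser_plugin": "none", "duration_ms": 210},
]

TIMEOUT_MS = 500  # the timeout threshold from the example question


def shared_characteristics(events, timeout_ms=TIMEOUT_MS,
                           fields=("service_version", "browser_plugin")):
    """Count the values of each field among events that hit the timeout."""
    timed_out = [e for e in events if e["duration_ms"] >= timeout_ms]
    return {field: Counter(e[field] for e in timed_out) for field in fields}


if __name__ == "__main__":
    for field, counts in shared_characteristics(EVENTS).items():
        print(field, dict(counts))
```

In this made-up data set every timed-out query comes from the same service version, while the browser plugin varies, so the version would be the first thing to investigate. The point is that the question was not planned for in advance: it is answered by slicing the telemetry after the fact.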

Whenever we talk about Observability, we also talk about metrics, logs, and traces, which are the three pillars of Observability.

Metrics: Aggregated summary statistics.
Logs: Detailed debugging information emitted by processes.
Distributed Tracing: Provides insight into the full lifecycle of requests to a system (their traces), allowing you to pinpoint failures and performance issues.
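As a rough illustration of how the three pillars differ, the sketch below shows one request handler emitting all three signals using only the Python standard library. It is a simplified sketch under stated assumptions: the `checkout` logger name, the `handle_order` function, the in-memory `request_count` and `spans` stores, and the rule that a negative order ID fails are all hypothetical. A production system would typically hand these signals to dedicated metrics, logging, and tracing tooling.

```python
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")  # hypothetical service name

# Metric: an aggregated counter keyed by (operation, status).
request_count = defaultdict(int)

# Trace data: one recorded span per handled request (in memory for the sketch).
spans = []


def handle_order(order_id, trace_id=None):
    """Handle a hypothetical order and emit a metric, a log line, and a span."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"

    if order_id < 0:  # assumed failure condition, for illustration only
        status = "error"
        # Log: detailed debugging information for this specific failure.
        log.error("order failed trace_id=%s order_id=%s", trace_id, order_id)

    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: only the aggregate count survives, not the individual request.
    request_count[("handle_order", status)] += 1

    # Span: per-request timing and status, linked to the trace by its ID.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_order",
        "status": status,
        "duration_ms": duration_ms,
    })
    return status
```

Calling `handle_order(1)` and then `handle_order(-1)` leaves two counters in `request_count`, one error line in the log, and two entries in `spans`. The shared `trace_id` is what lets you jump from a log line to the matching span.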

We will discuss all three of these in detail in the upcoming articles in this series.

In a nutshell,

This blog post gave you a brief overview of Observability and why you need it. In the upcoming blog posts, we will talk about metrics, logs, and traces, and look at different applications of modern Observability for modern distributed applications.

Until then, if your organization is using a microservice architecture and exploring Observability solutions, feel free to check out Hypertrace, which is a modern API Observability platform. Join our Slack community to interact with folks who are on the same microservice transition journey and are exploring Observability.

About author
Jayesh is a founding engineer / Product Manager at TraceableAI, and he is building Hypertrace. He loves reading, and you can find him on Twitter and LinkedIn to discuss anything around tech.