Software, and the hardware that supports it, is a constantly changing discipline. Not that long ago, most software ran as a single monolithic system. The 1980s saw the emergence of Modular Architecture, which soon evolved into Service Oriented Architecture. SOA grew in popularity alongside a broader shift in how we build software: Agile. Then, around the 2010s, we started to see the rise of Microservices, containerization, & DevOps.

These innovations have made software products inherently more complex, especially from a quality perspective, due to their distributed & decentralized nature. Building quality software today means the ecosystem of microservices, frameworks, cloud providers, processes, and so on needs to operate as one smooth system to deliver the end product to the user.

Why should we care about this stuff?

  • Today’s software is inherently complex

  • Quality practices should include all parts of the underlying software system

  • Observability enables a deeper understanding of the full system supporting the software product–and the user’s experience

Basics of Observability

Observability is the ability to use external outputs, typically telemetry, to assess the state of the whole application or software system. Or, to take the definition from Dora.dev:

“Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.”

Observability typically focuses on three main telemetry types—Logs, Metrics, & Traces—providing the data about your underlying system.

Logs

Logs are very granular & discrete records. They are time-stamped & immutable, so they are valuable when attempting to debug specific issues or errors. The content of logs can differ across different systems & log types. A log might capture a single string defining an error message, a state change, or a user action. A log might be a structured record that includes detailed metadata about an event that was fired.

Because logs are very precise, they are very valuable when debugging issues in production. There are several types of logs, including event logs, transaction logs, & server logs.
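A structured log is easy to sketch in plain Python. The example below is a minimal, illustrative formatter (the logger name, fields, and values are made up for this example, not from any specific logging stack): each record becomes one time-stamped JSON line carrying metadata about the event.

```python
import json
import logging
from datetime import datetime, timezone

# Minimal structured-log formatter: every record becomes one
# time-stamped JSON line (a common "event log" shape).
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra metadata attached via logging's `extra=` argument.
            **getattr(record, "context", {}),
        })

logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A log capturing a single event plus structured metadata.
logger.info("payment failed", extra={"context": {"order_id": "A-123", "retry": 2}})
```

Because each line is self-describing JSON, these logs can be filtered by `order_id` or `level` when debugging a specific production issue.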

Metrics

Metrics are typically a point-in-time measurement of the state of a system or service. They are used to represent an aggregate value & are usually numeric.

When you think of metrics, you want to think about things like:

  • How much CPU capacity does your application use over a 5-minute window?

  • What’s the number of requests per second?

  • What’s your application error rate per minute?

  • How much memory does your application use over a 5-minute window?

You could also create metrics that are specifically focused on Business KPIs, like session duration, page views, & transaction volume.

Metrics are typically used to assess a system's health over time. They can be extremely valuable as an early signal of system degradation, for tracking trends, or for spotting unexpected changes.

Metrics are a great early detection system for quality teams to utilize. A spike in error rates or a gradual increase in response latency can signal systemic issues before they begin impacting the user experience. And when extended to support business KPIs, they can give your team insight into how users experience your software.
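The counter-style metrics above can be sketched in a few lines of plain Python. This is an illustrative aggregate, not a real metrics library; the 60-second window and the request/error counts are assumptions for the example.

```python
from dataclasses import dataclass

# Tiny counter-based metrics sketch: aggregate numeric values
# over a fixed time window, then derive rates from them.
@dataclass
class WindowMetrics:
    window_seconds: float
    requests: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1

    @property
    def requests_per_second(self) -> float:
        return self.requests / self.window_seconds

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

m = WindowMetrics(window_seconds=60)
for ok in [True] * 95 + [False] * 5:
    m.record(ok)
# 100 requests over a 60 s window; 5% of them errored.
```

The same shape extends to business KPIs: swap the counters for page views or transaction volume and the aggregation logic stays identical.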

Traces

Traces record the path of a request or user interaction across the system. A trace is composed of multiple spans–the individual units of work across microservices, databases, queues, APIs, and anything else that may be a part of the software system. Each span captures one step; linked together, they map the full path of a user's request across the system.

Traces are a great way to understand the flow of data across the different microservices in the application. They can be used to help debug more ‘Why’ questions or pinpoint specific issue areas, like:

  • Why does this user flow perform differently than expected?

  • Where is the latency coming from?

  • Where is the bottleneck?

They are typically used for optimization efforts–they help teams understand where there could be system latency or bottlenecks.

Why Should Quality Teams Care?

The nature of building software is collaborative–we build the best products when we work alongside engineers.

Testers & quality folks are experts in quality–we provide a different perspective & way of approaching problems & solutions than engineers or product managers. It's when we combine these different perspectives & expertise that we build good products for our users.

As quality experts, we are uniquely positioned to provide the most value when we understand both the macro and micro parts of the system. Observability enables quality teams to shift toward being more proactive.

A quality team that understands telemetry can evaluate real system behavior in production. They can contribute to defining alert thresholds, use traces to triage incidents, identify patterns in metrics for continuous improvement, and so much more.
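Defining an alert threshold can be as simple as a predicate over the metrics a team already collects. The rules below are a hedged sketch: the 2% error-rate limit and 20% latency-regression limit are assumed numbers that a real team would tune against its own SLOs.

```python
from statistics import mean

# Illustrative alert rules a quality team might define over telemetry.
def error_rate_alert(errors: int, requests: int, max_rate: float = 0.02) -> bool:
    """Flag when the error rate over a window exceeds the limit."""
    return requests > 0 and errors / requests > max_rate

def latency_trend_alert(recent_ms: list, baseline_ms: list,
                        max_increase: float = 0.20) -> bool:
    """Flag a gradual regression: recent average latency exceeds
    the baseline average by more than max_increase."""
    return mean(recent_ms) > mean(baseline_ms) * (1 + max_increase)

error_rate_alert(3, 100)                               # 3% > 2% -> True
latency_trend_alert([130, 128, 135], [100, 102, 98])   # ~31% slower -> True
```

The second rule is the "gradual increase in response latency" case: no single request fails, but the trend crosses a threshold the team chose deliberately.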

Observability is more than a single tool or a single setup—it’s a continuous improvement practice requiring thoughtful design and cross-collaboration. But observability is a valuable tool in building a holistic quality practice.

Author Bio: 

Sidney Karoffa is a QA engineer & systems designer focused on engineering enablement & just building good software. Passionate about UX, test innovation, & process optimization. Strong believer in democratizing technology & knowledge.

Learn more about Sidney and her work on LinkedIn.
