Observability: why logs, metrics, and traces matter

I’ve spent years working with microservices, Kafka, Spring Boot. What I’d never had to do is set up observability from scratch — in every previous job, another team handled it, or it came preconfigured out of the box. Until we started building an ERP with no platform team and nothing preconfigured. Everything the system needs, we build ourselves.

01 — Monitoring is not observability

The difference matters. Monitoring answers the question “is it broken?” Observability answers “why is it broken and where?”

A dashboard showing average endpoint latency is monitoring. An alert that fires when your Kafka queue goes past 10,000 messages is monitoring. It’s useful — essential, in fact. But it only tells you something is wrong. Understanding what’s happening requires something else.

Observability is the ability to understand what your system is doing internally without having to add new code every time something new breaks. If every incident ends with “I’ll add some logs here to see what happens next time,” you don’t have observability. You have a barely-observable system you’re patching reactively.

An observable system lets you ask questions you didn’t anticipate when you wrote it. And that’s what you need when something strange happens at 3 AM.

02 — The three signals

Modern observability is built on three types of signals: logs, metrics, and traces. Each answers a different question and has different strengths and limitations.

Logs

What we’ve always done. Discrete events in text format: “user X logged in,” “an event was published to Kafka,” “this query timed out.”

Logs are irreplaceable when you need the exact detail of what happened: the full error message, the affected resource ID, the stack trace. But at scale, they’re noise. An API with moderate traffic generates millions of lines a day. Searching through them manually is like looking for a needle in a haystack — and often you don’t even know which needle you’re looking for.

Their main limitation: they tell you what happened at one point in the system, but not how that event connects to everything else. The log of an exception in service A knows nothing about the log of the request that caused it in service B.

Metrics

Numeric values aggregated over time: average latency, throughput, error count per minute, CPU usage, queue depth. They’re cheap to store, cheap to query, easy to visualize in a dashboard, and perfect for alerting.

Metrics give you a global picture of the system: whether everything is within normal bounds or something is drifting out of range.

Their limitation is the flip side of their strength: they’re aggregated. A metric tells you the p99 latency has gone up. It doesn’t tell you why, which specific requests are slow, or where in your system time is being lost.

Traces

This is the paradigm shift. A trace follows a specific request from the moment it enters your system until it exits, crossing every service and component it touches along the way.

Each step the request takes is recorded as a span: the HTTP endpoint entry, the database query, the Kafka publish, the processing in the other service’s consumer — all of it. And every one of those spans shares the same trace_id.

The result is a tree view with the complete timeline of a single request. You can see exactly how long each component took, which calls it made to other services, where the bottleneck was. For distributed systems, nothing is more useful.

03 — Together they’re more than the sum

If you take one idea from this post, make it this: the real value is in correlation.

A metric tells you latency went up. A trace shows you which specific requests are slow. And the logs from the services that appear in that trace tell you exactly what was happening when they slowed down.

The incident flow in a well-observed system looks like this:

An alert fires because a metric crosses its threshold.
You open the dashboard, identify the affected service, and look at the traces from the slow or failing requests.
A specific trace leads you to the span that’s taking too long or returning an error.
You jump to the logs for that span — filtered by trace_id, not by timestamp — and see the exact detail of what happened.

Without correlation, each step is a separate investigation. With correlation, it’s four clicks. The difference between diagnosing in five minutes or four hours is built here.

That’s why a log without a trace_id or span_id stops being useful the moment your system stops being a monolith. If your API publishes events to Kafka that other services process, the log “failed to update balance” with no further context is a riddle. With trace_id, it’s the end of a thread you can follow back.

04 — Before OpenTelemetry

Instrumenting an application meant choosing a vendor first and locking yourself in afterward.

If you wanted to send traces to Datadog, you installed the Datadog agent. New Relic, the New Relic agent. Jaeger, yet another one. Every vendor had its own API, its own instrumentation approach, its own Java agent. Switching backends meant re-instrumenting the application from scratch. And maintaining the same instrumented code for multiple vendors at once was a straight-up nightmare.

The same applied to metrics. For logs, every aggregator had its own format, its own parsers, its own conventions.

The problem wasn’t technical. It was political. Every vendor wanted lock-in, and fragmentation was kept alive artificially. The industry had been asking for an open standard for years.

05 — What OpenTelemetry solves

OpenTelemetry (OTel) is that standard. It’s the result of merging two previous projects (OpenTracing and OpenCensus) into a single specification governed by the CNCF. Today it’s the second most active project in the CNCF, behind only Kubernetes.

The promise is simple and exactly what’s needed: instrument once, choose backend later. Your application speaks OTel. Tomorrow you can switch from Datadog to Grafana, or to Elastic, or to Jaeger, without touching a single line of application code.

OpenTelemetry defines three things:

An API: how your code creates spans, metrics, and structured logs. It’s what you see in your IDE. The API is stable and version-compatible.
An SDK: the implementation of that API for each language. It’s what runs inside your application.
A protocol (OTLP): how telemetry travels from your application to wherever it gets stored. A binary protocol over gRPC or HTTP, efficient and supported by virtually all modern backends.

And one additional piece worth its own chapter:

The Collector: an intermediate process (typically a Go binary) that receives telemetry from your applications, optionally processes it (filters, enriches, transforms), and forwards it to the final backend.

The Collector is what makes the whole system flexible. Your applications always speak OTLP, to a Collector. The Collector decides what goes where. Changing backend = changing the Collector’s config. The applications stay the same.

06 — Why this matters in Spring Boot 4

Spring has always had its own observability story. For years, the reference API has been Micrometer — designed by the Spring team itself as a vendor-neutral facade, long before OpenTelemetry was mature. For us that meant a Spring Actuator exposing Prometheus metrics, an external vendor agent for traces, and logs handled separately.

Spring Boot 4 changes the game by introducing spring-boot-starter-opentelemetry: native OTel integration. Not an external Java agent hooking into bytecode. Not a third-party library with its own configuration. Just another Spring starter, configurable like any other.

This matters because it closes the loop: all three signals (logs, metrics, traces) leave the application speaking OTLP, correlated with each other, without you having to glue pieces together manually. Under the hood, Spring still uses Micrometer for what Micrometer already did well, connected to the OTel SDK so the output is standard.

But talking about how this works in code is what we do in the next part.

07 — What’s coming in Part 2

In the next article we’re going to take everything we’ve described here conceptually and build it for real. The demo isn’t a textbook example — it’s a scaled-down version of what I set up for the ERP. Same architecture, same decisions, same real problems. We’ll take two Spring Boot 4 services connected by Kafka, instrument them with spring-boot-starter-opentelemetry, and look at:

What the starter gives you for free and what you have to configure yourself.
How a trace propagates from an HTTP request, through Kafka, reaches the consumer, and ends up writing to a database. All with the same trace_id.
What happens when you export that telemetry first to a vanilla Collector and then to an ELK stack. Without touching a single line of application code.

That last point is what matters most to us: proving that OpenTelemetry’s promise is real. Instrumentation is one decision, backend is a completely separate one, and that’s what makes it worth it.

If you take one thing from this post, make it this. Look at your most critical application and ask yourself: if something breaks right now, how do you find out? How do you diagnose it? Do you have all three types of signals? Are they correlated?