November 29, 2023

#product, #engineering

OTel semantic conventions deep dive


Author
Dominic Chapman

Head of Product

OpenTelemetry Semantic Conventions: A primer and a quick deep dive

It’s a familiar frustration: engineers trying to connect telemetry data from across their infrastructure find that it’s labeled inconsistently in ways they need to reverse-engineer. The AWS region from which events were logged might be labeled region, region_name, aws_region, or data_center interchangeably. Or region might be used for the AWS region in one set of events, but for geographical request origin regions like EMEA, LATAM, etc., in another. The cost in time and expertise needed to untangle this mess can compete with tool licensing fees as a recurring cost.

Observability tools should be able to correlate across multiple signals for automated analysis and timely resolution of incidents. Engineers’ time and minds should be free to do the actual oversight, analysis, understanding and innovation that telemetry data makes possible.

OpenTelemetry (OTel) semantic conventions, or "semconv" for short, aim to standardize parts of event content, so developers can put data into their logs that’s easy for anyone to get out. By designing logging with all potential users in mind — not just themselves or their company’s customers — developers can facilitate the interconnection of infrastructure components using a common telemetry language. Other engineers can conceive and build new experiences based on the secure knowledge that certain information will be available in a standardized format.
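To make the renaming problem concrete, here is a minimal Python sketch that folds the ad hoc region keys mentioned above into a single cloud.region attribute. The helper name normalize_region and the key list are illustrative assumptions, not part of any OTel SDK. Note that a blind rename like this cannot resolve the second problem described above, where region sometimes means a geographic origin such as EMEA; that still needs human judgment.

```python
# Ad hoc keys taken from the examples in this post (an assumed, incomplete list).
LEGACY_REGION_KEYS = ("region", "region_name", "aws_region", "data_center")
SEMCONV_REGION_KEY = "cloud.region"


def normalize_region(event: dict) -> dict:
    """Return a copy of the event with legacy region keys folded into cloud.region.

    If several legacy keys are present, the first one found (in the order
    of LEGACY_REGION_KEYS) wins; an existing cloud.region is never overwritten.
    """
    normalized = dict(event)  # leave the caller's event untouched
    for key in LEGACY_REGION_KEYS:
        if key in normalized:
            value = normalized.pop(key)
            normalized.setdefault(SEMCONV_REGION_KEY, value)
    return normalized


events = [
    {"aws_region": "us-east-1", "msg": "request served"},
    {"data_center": "eu-west-1"},
]
normalized_events = [normalize_region(e) for e in events]
```

After this step, every event can be queried by the same key, whichever team produced it.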

Why OTel?

OTel’s conventions aren’t the first. Other projects have defined semantic conventions that prescribe how common attributes should be named, such as Elastic Common Schema (ECS) or Splunk’s Common Information Model (CIM). Even the naming in protocols like StatsD was assumed to follow some structure. This research paper from Meta details some of the history in section 4.

Why did OTel emerge as the front-runner? Probably because it’s a collaborative community effort by the Cloud Native Computing Foundation with more than 300 companies signed up, from Google to Lowe’s Home Improvement. The GitHub repository for the semantic conventions includes contributors from Microsoft, Elastic, Grafana Labs, Dynatrace and Google. So the conventions are both extensive and guaranteed to be portable rather than locked or blocked by individual vendors.

What they are, where they go

OTel’s semantic conventions are a standardized set of key:value pairs that should be used in the Attribute fields of metrics, logs, and traces, and also in the resource fields of logs. The OTel spec defines specific conventions to be used for events, logs, metrics, and traces or spans. There are currently about two dozen specified attributes each for Resources, Metrics, and Traces.

Where OTel semantic conventions are used in events

The spec is still evolving and not fully stable, yet it’s worth diving into now. It already lists nearly 100 standardized key/value pairs for common attributes in telemetry signals. The goal is to ease correlation across signals and languages. Consistency across languages comes from auto-generating constants and enumerations from YAML files.

The spec is hosted in the opentelemetry-specification repo.

Some are required, some optional, not all are yet stable

Each convention also has a requirement level: Required, Conditionally Required, Recommended, or Opt-In. These range from the mandatory Required (“all instrumentations MUST populate the attribute”) to Opt-In (“only if the user configures the instrumentation to do so”). Opt-In is for attributes that are expensive to retrieve or might pose a security risk. Most attributes in the spec are Recommended; the conventions are voluntary, but adopting them will make life easier for engineers going forward.
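One way to picture the requirement levels is as a small lookup table that instrumentation code consults. The Python sketch below is an illustration only: the LEVELS table is a hand-written assumption (service.name being Required comes from this post, the service.version level is assumed, and example.expensive_attribute is a made-up name), and both helper functions are hypothetical rather than part of any OTel SDK.

```python
from enum import Enum


class RequirementLevel(Enum):
    REQUIRED = "required"
    CONDITIONALLY_REQUIRED = "conditionally_required"
    RECOMMENDED = "recommended"
    OPT_IN = "opt_in"


# Illustrative table, not generated from the spec.
LEVELS = {
    "service.name": RequirementLevel.REQUIRED,
    "service.version": RequirementLevel.RECOMMENDED,  # assumed level
    "example.expensive_attribute": RequirementLevel.OPT_IN,  # made-up attribute
}


def missing_required(attributes: dict, levels: dict) -> list:
    """List Required attribute keys that the instrumentation failed to set."""
    return sorted(
        key
        for key, level in levels.items()
        if level is RequirementLevel.REQUIRED and key not in attributes
    )


def drop_unrequested_opt_in(attributes: dict, levels: dict, opted_in: set) -> dict:
    """Keep Opt-In attributes only when the user explicitly enabled them."""
    return {
        key: value
        for key, value in attributes.items()
        if levels.get(key) is not RequirementLevel.OPT_IN or key in opted_in
    }


attrs = {"service.version": "1.2.3", "example.expensive_attribute": "x"}
problems = missing_required(attrs, LEVELS)
emitted = drop_unrequested_opt_in(attrs, LEVELS, opted_in=set())
```

Here problems reports that service.name is missing, and emitted leaves out the Opt-In attribute because nobody asked for it.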

Again, the spec is still evolving. But it’s better to start working with it now and make a few course corrections as it matures and stabilizes.

Resource conventions

Resources are immutable assets that describe the entity producing the telemetry — e.g. a service running on a specific port at a specific IP address. In the still-evolving semantic conventions for resources — documented here — the service.* group of attributes is perhaps most important. Only service.name is required, but the spec lists several more that may prove valuable at some point:

OTel semantic conventions for service attributes

There are additional conventions for specific cloud providers, such as AWS. Here’s a sample AWS attribute:

An OTel semantic convention for use in AWS logs

Other important resource convention groups include cloud.*, container.*, host.*, k8s.*, and process.*.
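As a rough illustration of how these groups sit side by side, the Python sketch below writes a resource as a flat dictionary of dotted keys and then groups the keys by their namespace prefix. The attribute values are made-up placeholders (apart from the axiom-client service name used elsewhere in this post), and group_by_namespace is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict

# Placeholder values; only the key names follow the convention groups above.
resource_attributes = {
    "service.name": "axiom-client",
    "service.version": "1.4.2",
    "cloud.provider": "aws",
    "cloud.region": "us-east-1",
    "host.name": "ip-10-0-0-12",
}


def group_by_namespace(attributes: dict) -> dict:
    """Group dotted attribute keys by the text before the first dot."""
    groups = defaultdict(dict)
    for key, value in attributes.items():
        namespace, _, _ = key.partition(".")
        groups[namespace][key] = value
    return dict(groups)


groups = group_by_namespace(resource_attributes)
```

Because every producer uses the same prefixes, a query such as "everything in the cloud group" works the same way across services.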

Tracing conventions

Spans are the most granular of all telemetry signals. The number of attributes per event is usually higher than for metrics and logs, and their telemetry throughput is also much higher. That makes tracing data expensive to transport, store, and process.

But there’s a lot of value to be gleaned from traces. They provide rich debugging context when correlated with other signals, e.g. starting from an interesting metric pattern and then studying the related traces. That makes semconv especially important.

Spans can be decorated with an arbitrary number of attributes. Some are common across languages, applications, and operations, like HTTP or database client calls. This one, for example, matches the conventions already established for HTTP clients and servers:

OTel semantic convention for HTTP response status codes

The conventions are divided into logical groups. Some are general, some are context-specific (http.*, rpc.*, etc.). Beyond attribute keys and values, they also list best practices for span types, names, and events, for example not including the raw URL in an operation name. Instead of /api/user/1234, the value should be /api/user/{user_id}, which can be correlated with user_id info elsewhere.
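The naming advice can be sketched in a few lines of Python. The example below matches a concrete request path against a small route table and returns the low-cardinality template as the operation name, with the extracted IDs kept separately. The ROUTES list and both function names are assumptions made for illustration; in practice a web framework would usually hand the matched route to the instrumentation.

```python
import re

# Hypothetical route table for illustration.
ROUTES = [
    "/api/user/{user_id}",
    "/api/user/{user_id}/orders/{order_id}",
]


def template_to_regex(template: str) -> re.Pattern:
    """Turn '/api/user/{user_id}' into a regex with one named group per placeholder."""
    parts = re.split(r"(\{\w+\})", template)
    pattern = "".join(
        f"(?P<{part[1:-1]}>[^/]+)" if part.startswith("{") else re.escape(part)
        for part in parts
    )
    return re.compile(f"^{pattern}$")


def operation_name(path: str):
    """Return (route template, extracted parameters) for a concrete path.

    Falls back to a fixed placeholder name so unmatched paths never leak
    high-cardinality values into the operation name.
    """
    for template in ROUTES:
        match = template_to_regex(template).match(path)
        if match:
            return template, match.groupdict()
    return "unknown route", {}


name, params = operation_name("/api/user/1234")
```

The template keeps the number of distinct operation names small, while the extracted parameters can still be attached as attributes for correlation.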

Metrics conventions

Metrics give a stable signal for KPIs. They’re good for aggregating numerical data points across regular time intervals, and across spatial, application, and user dimensions. But they’re far less efficient than traces for debugging. Aligning the attribute conventions for metrics with those for traces makes troubleshooting much easier. You can go from a broad, aggregated metric view to a high-granularity understanding of the problem in traces.

The conventions give guidance on naming metrics within a hierarchy based on usage, similar to the attribute groups in trace spans. They also cover aspects like metric units (which belong in the data model, not in the name) and pluralization (avoided unless the value represents a countable quantity, such as errors).

  • Don’t: Put a telemetry producer (service, environment, technology) in the metric name, for example prod.axiom-client.http.duration.

  • Do: Use metric names that are global in scope, such as http.client.duration, and specify the telemetry producer as the value of a separate service.name attribute.

{
  "metric": [
    {
      "name": "http.client.duration"
    }
  ],
  "resource": {
    "attributes": [
      {
        "key": "service.name",
        "value": {
          "stringValue": "axiom-client"
        }
      }
    ]
  }
}
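The payoff of the Do/Don’t rule above shows up at query time. In the Python sketch below, every data point shares the global name http.client.duration and the producer lives in a service.name resource attribute, so one helper can answer both "what is the average overall?" and "what is it for one service?". The data points, the billing-worker service, and the average helper are all invented for illustration.

```python
from statistics import mean

# Made-up data points; the producer is an attribute, not part of the name.
points = [
    {"name": "http.client.duration", "value": 100.0,
     "resource": {"service.name": "axiom-client"}},
    {"name": "http.client.duration", "value": 200.0,
     "resource": {"service.name": "axiom-client"}},
    {"name": "http.client.duration", "value": 300.0,
     "resource": {"service.name": "billing-worker"}},
]


def average(points, metric_name, resource_filter=None):
    """Average a metric, optionally restricted to matching resource attributes."""
    resource_filter = resource_filter or {}
    values = [
        p["value"]
        for p in points
        if p["name"] == metric_name
        and all(p["resource"].get(k) == v for k, v in resource_filter.items())
    ]
    return mean(values) if values else None


overall = average(points, "http.client.duration")
one_service = average(
    points, "http.client.duration", {"service.name": "axiom-client"}
)
```

With a name like prod.axiom-client.http.duration, the overall figure would require knowing and enumerating every producer’s metric name in advance.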

Logs conventions

OTel’s semconv for logs and events are still in their early days, but some have started to materialize:

  • log.* attributes identify the source of a log, and feature_flag.* attributes represent evaluations of feature flags.
  • Both logs and events use the same exception.* attributes, which are compatible with the semconv for the tracing API.
  • Conventions for events include the mandatory attributes required by the events API.
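As an example of the shared exception.* group, the Python sketch below turns a caught exception into a dictionary of exception.type, exception.message, and exception.stacktrace attributes that could be attached to a log record or a span event. The exception_attributes helper is a hypothetical name; OTel SDKs provide their own mechanisms for recording exceptions, and this only illustrates the attribute shape.

```python
import traceback


def exception_attributes(exc: BaseException) -> dict:
    """Describe an exception using the exception.* attribute names."""
    exc_type = type(exc)
    module = exc_type.__module__
    # Built-in exceptions are reported without the 'builtins.' prefix.
    type_name = (
        exc_type.__qualname__
        if module == "builtins"
        else f"{module}.{exc_type.__qualname__}"
    )
    return {
        "exception.type": type_name,
        "exception.message": str(exc),
        "exception.stacktrace": "".join(
            traceback.format_exception(exc_type, exc, exc.__traceback__)
        ),
    }


try:
    int("not a number")
except ValueError as err:
    log_attributes = exception_attributes(err)
```

Because logs and traces use the same keys, a search on exception.type finds the failure in either signal.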

Axiom — more attributes, less stress

Axiom was architected in parallel with the emergence of the OpenTelemetry specs, using every modern cloud-native optimization available. Its hyper-efficient ingest, storage, and query of both structured and unstructured data make Axiom more cost-effective than older architectures. Axiom accepts events with any schema, and provides compatibility with the OTel SDKs to make instrumentation and collection a breeze.

Questions? Ideas? Like to argue with engineers? Talk to us about OTel and semantic conventions at axiom.co/discord.
