- Evals turn vibes into data: "does it work?" becomes a measurable question with real answers
- Treat prompts as source code: CI runs evals on every PR, catching regressions before they merge
- Docs compound: write once for humans, extract into skills for agents, measure with evals
- Skills fail silently: without measurement, you won't know until production breaks
Agents are shipping to production. Prompts are code. Evals are CI. Most teams are still flying blind, shipping AI capabilities on vibes and hoping manual spot-checks catch the failures that matter.
It started with a question from a prospective customer: "We have over 600 saved searches, 300 dashboards, and 145 alerts: six years of Splunk’s Search Processing Language (SPL). Can AI translate them to Axiom?"
We had a hunch from successful manual migrations that it would be possible. But that wasn't good enough for production tooling. A mistranslated query isn't a minor bug; it could represent a blind spot where incidents hide.
So we built AI-powered translation through an open-source skill for Splunk migration, then built the tests to prove it works. Now those evals run on every pull request.
## The migration problem
When companies migrate from Splunk to Axiom, data moves easily. Institutional knowledge doesn't: the SPL queries that catch production issues at 3am, the dashboards that tell you if the system is healthy.
SPL and APL look similar. The differences are small but easy to miss:
- `stats count by status` becomes `summarize count() by status` (parentheses required)
- `dc(user)` becomes `dcount(user)` (different function name)
- Ad-hoc APL queries benefit from explicit time filters like `where _time between (ago(1h) .. now())`, but dashboard queries should omit them to sync with the time picker
Here's what a real translation looks like:
```
# SPL
index=web status>=400 | stats count by status | sort -count

# APL
['web']
| where status >= 400
| summarize count() by status
| order by count() desc
```

Get these wrong and your query silently returns wrong results. Multiply by hundreds of saved queries and migration becomes a thornier problem.
## From months to an afternoon
A prospective customer using Axiom in a structured proof-of-value trial has six years of Splunk behind them: over 2,500 queries when you count dashboard panels. Translation is just the start. Each dashboard needs chart types mapped, layouts rebuilt, and filters configured. Manual migration would take 400+ hours at an absolute minimum, assuming 5 minutes per query, 1 hour per dashboard, no errors, and no variation in complexity. Realistically? One engineer, full-time, for months.
An agent can do it in an afternoon. The human reviews; the agent builds.
A skill is how you teach an agent to complete specific tasks. It's a folder of instructions that agents load on demand, including scripts, templates, and reference materials. The agent reads the skill when the task matches, getting the exact context it needs without bloating every request.
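Concretely, a skill is just files on disk. Here's a hypothetical layout for spl-to-apl; only the `.meta/spl-to-apl.eval.ts` path is taken from the CI workflow later in this post, and the rest is illustrative:

```
spl-to-apl/
├── SKILL.md               # instructions the agent loads when the task matches
├── reference/             # command mappings, function equivalents, time handling
└── .meta/
    └── spl-to-apl.eval.ts # eval suite, run on every PR (see below)
```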
Two skills handle migration:
- spl-to-apl: translates queries, covering command mappings, function equivalents, joins, and time-handling
- building-dashboards: creates dashboards via API, handling chart types, layout, and SmartFilters
Beyond migration, we've shipped axiom-sre for hypothesis-driven debugging: incidents, root cause analysis, log investigation.
All three are open source:
```
amp skill add axiomhq/skills/spl-to-apl
amp skill add axiomhq/skills/building-dashboards
amp skill add axiomhq/skills/sre
```

## The "trust me, it works" problem
AI capabilities are probabilistic. A translation that worked yesterday might produce subtly wrong results today. You tweak a prompt to fix one edge case and break three others.
Most teams ship on vibes. They test manually, form an impression, and hope for the best. Brian Lovin nailed the framing: you are the feedback loop. And you're slow, expensive, and unreliable.
We had opinions about the skill. One of us said "it's bloated." The other said "it's fine, it covers all the edge cases." No data, no shared criteria, just vibes vs vibes.
So we built an eval to settle it.
## Measuring what matters
Axiom's evals feature, part of its AI engineering workflow, lets you test AI capabilities: you define test cases with inputs and expected outputs, create scorers that measure what matters, and run them automatically.
For spl-to-apl, we built test cases covering the most common SPL patterns: basic searches, aggregations, time-series analysis, field extraction, joins, and complex pipelines. We expand coverage as we encounter new patterns in production.
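To make that concrete, here's a minimal sketch of what one test case could look like in an `.eval.ts` file. The import paths, `Eval` signature, and helper names (`translateWithSkill`, the scorers) are illustrative assumptions, not Axiom's exact API:

```ts
// Illustrative sketch only: the import paths, Eval signature, and helpers
// are assumptions, not Axiom's exact API.
import { Eval } from "axiom/ai/evals";
import { skillLoaded, schemaRead, resultsMatch } from "./scorers"; // hypothetical
import { translateWithSkill } from "./agent"; // hypothetical: runs the agent with the skill

Eval("spl-to-apl", {
  // Test cases: an SPL input paired with the APL we expect back.
  data: () => [
    {
      input: "index=web status>=400 | stats count by status | sort -count",
      expected:
        "['web'] | where status >= 400 | summarize count() by status | order by count() desc",
    },
    // ...more cases: aggregations, time-series, field extraction, joins
  ],
  // The task under test: ask the agent, with the skill available, to translate.
  task: async ({ input }) => translateWithSkill(input),
  scorers: [skillLoaded, schemaRead, resultsMatch],
});
```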
The evals run the generated APL against the Axiom Playground and compare actual results with expected ones. Three scorers check what matters:
- skill-loaded: Did the agent load the translation skill?
- schema-read: Did it read the reference documentation?
- results-match: Do the query results match when run against real data?
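The last scorer is the interesting one: rather than string-matching APL text, it executes both queries and compares what comes back. Here's a hedged sketch of how such a scorer could work, assuming Axiom's `POST /v1/datasets/_apl` query endpoint; the scorer signature and the order-sensitive comparison are simplifications:

```ts
// Sketch of a results-based scorer: execute the generated and expected APL
// against real data and compare rows. Signature and helpers are illustrative.
async function resultsMatch({ output, expected }: { output: string; expected: string }) {
  const [got, want] = await Promise.all([runApl(output), runApl(expected)]);
  // Simplified, order-sensitive comparison of the result tables.
  const score = JSON.stringify(got) === JSON.stringify(want) ? 1 : 0;
  return { name: "results-match", score };
}

// Runs an APL query via Axiom's query API and returns the result tables.
async function runApl(apl: string): Promise<unknown> {
  const res = await fetch("https://api.axiom.co/v1/datasets/_apl?format=tabular", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.AXIOM_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ apl }),
  });
  if (!res.ok) throw new Error(`query failed: ${res.status}`);
  return (await res.json()).tables;
}
```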
Now we have a baseline. Every change to the skill runs against the same test suite. If accuracy drops, we know before it ships.
## CI/CD for AI prompts
We wired evals into GitHub Actions. Every PR that touches a skill:
- Detects which skills changed
- Runs the eval suite
- Compares against the baseline from main
- Posts results directly to the PR
```yaml
- name: Run eval
  run: |
    pnpm exec axiom eval ../skills/${{ matrix.skill }}/.meta/${{ matrix.skill }}.eval.ts \
      --baseline ${{ steps.baseline.outputs.baseline_id }}
```

No more "I tested it manually and it seemed fine." Every change is measured against a known baseline. Regressions get caught before they merge.
This is treating prompts as production code.
## What we learned
- **Evals aren't optional.** The moment your AI capability matters for production, you need repeatable testing. Manual spot-checks don't scale.
- **Build confidence in chaos.** Models change constantly. We can't control that. What we can control is measuring our own changes against a known baseline, so we make decisions with data instead of vibes.
- **Docs compound.** We already had APL documentation for humans. Extracting it into a skill made that knowledge available to agents. Adding evals made it measurable. Each layer builds on the last.
- **Dog-food everything.** The eval framework powering our skills testing is the same one available to every Axiom customer: integrated with tracing, built for extreme scale, alongside all your other logs. We're not building tooling we don't use ourselves.
## Get started
### Migrating from Splunk?

```
amp skill add axiomhq/skills/spl-to-apl
amp skill add axiomhq/skills/building-dashboards
```

You bring the SPL, the agent brings the APL. Explore the repo: github.com/axiomhq/skills, which includes skills, eval tooling, and the GitHub Actions workflow, all open source.
### Building AI capabilities?
If you're shipping agents and want to stop guessing, Axiom's AI engineering toolkit gives you the same eval infrastructure we used here: tracing, cost tracking, and evaluation in one platform.
The infrastructure we built (skills, evals, CI automation) is now part of how we ship everything AI-related. We're not going back to vibes.