October 15, 2024

#engineering, #product

Monitoring at Axiom


Mark Ramotowski

Senior Software Engineer

As one of the engineers working on monitoring at Axiom, having just rebuilt the infrastructure, I figured now would be a good time to reflect on what monitoring actually is, what value can be unlocked from it, and how it can benefit your organisation. With that in mind, this is going to be the first part of a multi-part series exploring the different types of monitors we currently offer, and how they can be leveraged to solve observability problems.

What are monitors?

In essence, a monitor is a recurrent execution of a query that provides real-time, actionable insight through an external notification system (be it email, Slack, Opsgenie, etc.). By offloading a question such as “has my error rate increased across my API services?” to a recurrent process, you no longer need to run those queries yourself and can focus elsewhere, only engaging with the problem when it arises. In contrast to a dashboard - which gives a high-level overview of your data - monitors are the set of questions that are recurrently asked of that data as it changes over time.

A good monitor conforms to these 3 basic principles:

  • Precision: It should only trigger when there is something that requires action
  • Context: It should provide all the required context about the event
  • Scalability: It should not depend on the load of the system, i.e. more data does not mean more notifications

Let’s look at the most basic type of monitor we offer: Threshold Monitors.

Threshold Monitors

A threshold monitor notifies when some threshold is met. The range of questions this encapsulates is vast, and a lot of functionality can be unlocked through how the monitor is configured in Axiom. For threshold monitors, the alert state persists through time: if the monitor triggers on one run and notifies, and continues to trigger on subsequent runs, it stays in an open state and does not re-alert until the value goes back to normal, at which point it closes the alert and notifies again.

There are 3 main properties to a monitor:

  • Range of the monitor - how much data am I including?
  • Interval - how often do I run?
  • Binning of the APL query - how am I bucketing the data?

We always suggest using a fixed bin size (e.g. bin(_time, 1m)), as it is not affected by the range of the monitor in the way that bin_auto(_time) is. This is important because the size of the bins changes the magnitude of the numbers produced by aggregations such as count() and sum().
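
To make the difference concrete, here is a minimal sketch of both approaches against the ['my-dataset'] dataset used in the examples below:

['my-dataset']
    // fixed 1-minute buckets: the bucket size (and so the magnitude of count()) stays the same whatever range the monitor queries
    | summarize count() by bin(_time, 1m)

['my-dataset']
    // automatic buckets: the bucket size grows with the queried range, so the same data yields larger per-bucket counts over wider ranges
    | summarize count() by bin_auto(_time)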

Importantly, both the range and the interval affect how frequently the alert state of the monitor changes and notifications are sent. For instance, if the monitor query has a much smaller binning and interval than the range, each run will repeatedly inspect the same buckets of data, and so the monitor will stay in an open alert state for much longer. Running the same query with three different ranges illustrates this: the state changes less frequently as the range of the monitor increases.

Although we do support structured query building, the following examples are written in Axiom’s powerful query language, APL.

Database Error Monitor

Question: “Are there any ‘database connection failed’ errors in the logs?”

  • Threshold: 1
  • Range: 1 minute
  • Interval: 1 minute

The APL may look something like this:

['my-dataset']
    | where ['error.message'] == 'database connection failed'
    | summarize count() by bin(_time, 1m)

Caveat: We recommend a fixed bin size as it helps visualise the threshold chart.

N.B. Monitors with a threshold of 1 can also be defined using Match Event monitors, which give back more information from the matching logs. These will be discussed in the next post.

Failed Login Attempts Monitor

Question: “Are there more than 5 failed login attempts from a single IP address in 15 minutes?”

  • Threshold: 5
  • Range: 15 minutes
  • Interval: 5 minutes

The APL may look something like this:

['my-dataset']
    | where ['error.message'] == "failed_login"
    | summarize count() by bin(_time, 15m), ip_address

This would alert if any IP address has passed the threshold value of 5. However, by setting “Alert By Group” to true, the monitor will track each IP address separately and alert if one or more IP addresses pass the threshold.

Slow API Requests Monitor

Question: “Are there any slow API requests? Alert me if there is no data.”

  • Threshold: 1000ms
  • Range: 5 minutes
  • Interval: 5 minutes
  • Notify by Group: true
  • Alert on no data: true

The APL may look something like this:

['my-dataset']
    // only consider events that record both a request path and a duration
    | where isnotempty(['data.path']) and isnotempty(['data.duration[ms]'])
    // take the slowest request per path in each 5-minute bucket
    | summarize max(['data.duration[ms]']) by bin(_time, 5m), ['data.path']

Mean Error Rate Monitor

Question: “Has my error rate increased past 5%?”

  • Threshold: 5
  • Range: 1 minute
  • Interval: 10 minutes
  • Alert on no data: true

The APL may look something like this:

['my-dataset']
    | extend isError=iff(isnotempty(['data.error']), 1.0, 0.0)
    | summarize avg(isError) by bin(_time, 1m)

This uses the iff() function to segment the data into those logs that contain an error and those that do not, so the average of the isError column is the error rate.
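
Note that avg(isError) returns a fraction between 0 and 1, so a 5% error rate corresponds to a value of 0.05. If you would rather compare against a threshold of 5 directly, one option (a sketch, not necessarily how this monitor is configured) is to scale the result to a percentage:

['my-dataset']
    | extend isError=iff(isnotempty(['data.error']), 1.0, 0.0)
    | summarize avg(isError) by bin(_time, 1m)
    // scale the fractional rate so that a threshold of 5 corresponds to a 5% error rate
    | extend errorRatePct = avg_isError * 100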

Error Rate Spiking Monitor

By leveraging APL, monitors can be as simple or as complex as required. Whereas the previous examples required some prior sense of what “normal” looks like, we can instead write APL that defines a baseline and checks whether new data deviates from it.

We use this internally to track whether there is an increase in monitor error rate.

Question: “Has my error rate increased by 1.96 standard deviations from what it was previously?”

  • Threshold: 1
  • Range: 30 minutes
  • Interval: 5 minutes

The APL we use looks like:

['axiom-monitor-results']
    // segment the time into "before" and "after" buckets
    | extend isCurrent = iff(_time < ago(5m) and _time >= ago(30m), 0, 1), isError = iff(isnotempty(error), 1.0, 0.0)
    // calculate the average and standard deviation for the before and after buckets
    | summarize avg(isError), stdev(isError) by isCurrent
    // calculate a band of ±1.96 standard deviations (roughly a 95% interval) around the average
    | extend meanPlusStd = avg_isError+stdev_isError*1.96, meanMinusStd = avg_isError-stdev_isError*1.96
    // order the groups (so that we can extract the correct values)
    | order by isCurrent asc
    | summarize meanPlusStdList=make_list(meanPlusStd), meanMinusStdList=make_list(meanMinusStd), currentMean=make_list(avg_isError)
    // check whether the current mean falls outside the baseline band
    | extend isOutsideNorm = currentMean[1] > meanPlusStdList[0] or currentMean[1] < meanMinusStdList[0]
    // convert the bool into a number
    | project toreal(isOutsideNorm)

This returns a table with a value of either 1 or 0 indicating whether there has been a shift away from the expected error rate within the last 5 minutes. This is quite an advanced use of APL, and we’ve released an Anomaly Monitor that encapsulates this logic, making it far easier to create.

Summary

The ease and flexibility of creating monitors at Axiom lets key events be broadcast to the relevant notification channels so that business-critical problems can be resolved as quickly as possible. Threshold monitors are just one of a number of monitor types that help surface underlying issues within business systems, and by using Axiom’s powerful query language (APL) they can be tailored to handle complex scenarios, such as detecting spikes in error rates.

Threshold monitors are essential because they allow you to define specific conditions that, when met, trigger alerts. This proactive approach ensures that issues are caught early, minimising downtime and maintaining system performance. For instance, monitoring database errors, failed login attempts, or slow API responses helps identify and address potential problems before they escalate, preventing disruptions.

Moreover, we’ve recently released a new “custom webhook” notifier type, which enables monitors to interoperate with any other system, further reducing the friction between an event occurring and having visibility of it.

Finally, we’re keen to hear about other types of monitors you have or need via Discord, where we actively engage with the community. Your feedback is invaluable in helping us evolve our monitoring capabilities.
