The sample operator in APL pseudo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm is not statistically rigorous, but it provides a quick way to explore and understand a dataset. For statistically rigorous analysis, use summarize instead.

The sample operator is useful when you work with large datasets where processing every row is resource-intensive or unnecessary. It’s ideal for scenarios such as log analysis, performance monitoring, or data quality checks.

For users of other query languages

If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.

Usage

Syntax

| sample ProportionOfRows

Parameters

  • ProportionOfRows: A float greater than 0 and less than 1 that specifies the proportion of rows to return from the dataset. The rows are selected pseudo-randomly.

Returns

The operator returns a table containing the specified proportion of rows, selected pseudo-randomly from the input dataset.

Use case examples

In this use case, you sample a small proportion of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.

Query

['sample-http-logs']
| sample 0.05


Output

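| _time               | req_duration_ms | id    | status | uri       | method | geo.city | geo.country |
|---------------------|-----------------|-------|--------|-----------|--------|----------|-------------|
| 2023-10-16 12:45:00 | 234             | user1 | 200    | /index    | GET    | New York | US          |
| 2023-10-16 12:47:00 | 120             | user2 | 404    | /login    | POST   | Paris    | FR          |
| 2023-10-16 12:48:00 | 543             | user3 | 500    | /checkout | POST   | Tokyo    | JP          |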

This query returns a random subset of 5% of all rows from the HTTP logs, helping you quickly identify potential issues or patterns without analyzing the entire dataset.
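
To build intuition for what sampling at a rate of 0.05 means, here is a minimal Python sketch. It is not the APL implementation, which this page does not describe. It assumes each row is kept independently with probability ProportionOfRows. The function name sample_rows, the seed argument, and the generated rows are all hypothetical and exist only for illustration.

```python
import random


def sample_rows(rows, proportion, seed=None):
    """Keep each row independently with probability `proportion`.

    Assumption: this per-row selection is only a model of rate-based
    sampling; it is not taken from the APL implementation.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must be greater than 0 and less than 1")
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [row for row in rows if rng.random() < proportion]


# Hypothetical stand-in for a dataset of 10,000 log rows.
rows = [{"id": f"user{i}", "req_duration_ms": i % 500} for i in range(10_000)]

subset = sample_rows(rows, 0.05, seed=42)
print(len(subset))  # near 500 under this model, but not necessarily exactly 500
```

Under this assumed model, the size of the result varies slightly from run to run and only averages out to the requested proportion, which is one reason to treat sampled results as exploratory.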

List of related operators

  • take: Use take when you want to return the first N rows in the dataset rather than a random subset.
  • where: Use where to filter rows based on conditions rather than sampling randomly.
  • top: Use top to return the top N rows based on a sorting criterion.