parse
Learn how to extract structured data from string fields using parsing patterns.
The parse operator in APL is a powerful tool for extracting structured data from string fields. It allows you to define parsing patterns using string constants, regular expressions, or a combination of both, and then assign the extracted values to new fields in the result set.
The parse operator is useful when working with log data or other semi-structured data sources where information is embedded within string fields. By using parse, you can efficiently extract and transform the data into a structured format for querying.
Importance of the parse operator
- Data extraction: It allows you to extract structured data from unstructured or semi-structured string fields, enabling you to transform raw data into a more usable format.
- Flexibility: The parse operator supports different parsing modes (simple, relaxed, regex) and provides various options to define parsing patterns, making it adaptable to different data formats and requirements.
- Performance: By extracting only the necessary information from string fields, the parse operator helps optimize query performance by reducing the amount of data processed and enabling more efficient filtering and aggregation.
- Readability: The parse operator provides a clear and concise way to define parsing patterns, making the query code more readable and maintainable.
Syntax
| parse [kind=regex|relaxed] Expression with [*] StringConstant FieldName [: FieldType] [*] ...
- kind: Optional parameter to specify the parsing mode. Its value can be regex for regular expressions or relaxed for relaxed parsing. The default is simple.
- Expression: The string expression to parse.
- StringConstant: A string literal or regular expression pattern to match against.
- FieldName: The name of the field to assign the extracted value.
- FieldType: Optional parameter to specify the data type of the extracted field. The default is string.
- *: Wildcard to match any characters before or after the StringConstant.
- ...: You can specify additional StringConstant and FieldName pairs to extract multiple values.
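As a sketch of how these parameters fit together, the following query combines a leading wildcard, two string constants, and a typed field. It assumes the sample-http-logs dataset and the uri field used in the examples on this page. The version and endpoint field names are illustrative and not part of any schema.

```kusto
// Leading * skips anything before the first string constant.
// version is declared as int, endpoint defaults to string.
['sample-http-logs']
| parse uri with * "/api/v" version:int "/" endpoint
| project uri, version, endpoint
```

For a uri such as /api/v1/ping/user/textdata, the intent is that version captures the number after /api/v as an integer and endpoint captures the rest of the path.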
Returns
The parse operator returns the input dataset with new fields added based on the specified parsing pattern. The new fields will contain the extracted values from the parsed string expression. If the parsing fails for a particular row, the corresponding fields will have null values.
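Because rows that fail to parse stay in the result with empty extracted fields, it is often useful to filter them out before further processing. The following is a minimal sketch, assuming the sample-http-logs dataset and the uri field used in the examples on this page. It uses isnotempty rather than a null check so that both null values and empty strings are dropped.

```kusto
// Keep only rows where the pattern matched and endpoint was extracted.
['sample-http-logs']
| parse uri with "/api/v1/" endpoint
| where isnotempty(endpoint)
| project uri, endpoint
```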
Examples
Parse content type
This example parses the content_type field to extract the datatype and format values separated by a /. The extracted values are projected as separate fields.
Original string:
application/charset=utf-8
Parse operation:
['sample-http-logs']
| parse content_type with datatype "/" format
| project datatype, format
Output:
{
"datatype": "application",
"format": "charset=utf-8"
}
Parse user agent
This example parses the user_agent field to extract the operating system name (os_name) and version (os_version) enclosed within parentheses. The extracted values are projected as separate fields.
Original string:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Parse operation:
['sample-http-logs']
| parse user_agent with * "(" os_name " " os_version ";" * ")" *
| project os_name, os_version
Output:
{
"os_name": "Windows NT 10.0; Win64; x64",
"os_version": "10.0"
}
Parse URI endpoint
This example parses the uri field to extract the endpoint value that appears after /api/v1/. The extracted value is projected as a new field.
Original string:
/api/v1/ping/user/textdata
Parse operation:
['sample-http-logs']
| parse uri with "/api/v1/" endpoint
| project endpoint
Output:
{
"endpoint": "ping/user/textdata"
}
Parse ID into region, tenant, and user ID
This example demonstrates how to parse the id field into three parts: region, tenant, and userId. The id field is structured with these parts separated by hyphens (-). The extracted parts are projected as separate fields.
Original string:
usa-acmeinc-3iou24
Parse operation:
['sample-http-logs']
| parse id with region "-" tenant "-" userId
| project region, tenant, userId
Output:
{
"region": "usa",
"tenant": "acmeinc",
"userId": "3iou24"
}
Parse in relaxed mode
The parse operator supports a relaxed mode that allows for more flexible parsing. In relaxed mode, Axiom treats the parsing pattern as a regular string but tolerates partial matches. If some parts of the pattern are missing or do not match the expected type, Axiom assigns null values to those fields.
This example parses the log field into four separate parts (method, url, status, and responseTime) based on a structured format. The extracted parts are projected as separate fields.
Original string:
GET /home 200 123ms
POST /login 500 nonValidResponseTime
PUT /api/data 201 456ms
DELETE /user/123 404 nonValidResponseTime
Parse operation:
['HttpRequestLogs']
| parse kind=relaxed log with method " " url " " status:int " " responseTime
| project method, url, status, responseTime
Output:
[
{
"method": "GET",
"url": "/home",
"status": 200,
"responseTime": "123ms"
},
{
"method": "POST",
"url": "/login",
"status": 500,
"responseTime": null
},
{
"method": "PUT",
"url": "/api/data",
"status": 201,
"responseTime": "456ms"
},
{
"method": "DELETE",
"url": "/user/123",
"status": 404,
"responseTime": null
}
]
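When null values are inconvenient downstream, one option is to substitute a default after parsing. The following is a minimal sketch that builds on the query above. The responseTimeOrDefault field name and the "unknown" placeholder are illustrative choices, not part of the original example.

```kusto
// Replace missing responseTime values with a placeholder string.
['HttpRequestLogs']
| parse kind=relaxed log with method " " url " " status:int " " responseTime
| extend responseTimeOrDefault = coalesce(responseTime, "unknown")
| project method, url, status, responseTimeOrDefault
```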
Parse in regex mode
The parse operator supports a regex mode that allows you to parse using regular expressions. In regex mode, Axiom treats the parsing pattern as a regular expression and matches results based on the specified regex pattern.
This example demonstrates how to parse Kubernetes pod log entries using regex mode to extract various fields such as podName, namespace, phase, startTime, nodeName, hostIP, and podIP. The parsing pattern is treated as a regular expression, and the extracted values are assigned to the respective fields.
Original string:
Log: PodStatusUpdate (podName=nginx-pod, namespace=default, phase=Running, startTime=2023-05-14 08:30:00, nodeName=node-1, hostIP=192.168.1.1, podIP=10.1.1.1)
Parse operation:
['PodLogs']
| parse kind=regex AppName with @"Log: PodStatusUpdate \(podName=" podName: string @", namespace=" namespace: string @", phase=" phase: string @", startTime=" startTime: datetime @", nodeName=" nodeName: string @", hostIP=" hostIP: string @", podIP=" podIP: string @"\)"
| project podName, namespace, phase, startTime, nodeName, hostIP, podIP
Output:
{
"podName": "nginx-pod",
"namespace": "default",
"phase": "Running",
"startTime": "2023-05-14 08:30:00",
"nodeName": "node-1",
"hostIP": "192.168.1.1",
"podIP": "10.1.1.1"
}
Best practices
When using the parse operator, consider the following best practices:
- Use appropriate parsing modes: Choose the parsing mode (simple, relaxed, regex) based on the complexity and variability of the data being parsed. Simple mode is suitable for fixed patterns, while relaxed and regex modes offer more flexibility.
- Handle missing or invalid data: Consider how to handle scenarios where the parsing pattern does not match or the extracted values do not conform to the expected types. Use the relaxed mode or provide default values to handle such cases.
- Project only necessary fields: After parsing, use the project operator to select only the fields that are relevant for further querying. This helps reduce the amount of data transferred and improves query performance.
- Use parse in combination with other operators: Combine parse with other APL operators like where, extend, and summarize to filter, transform, and aggregate the parsed data effectively.
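The practices above can be sketched in a single query. This is an illustration only, assuming the sample-http-logs dataset and the hyphen-separated id field from the earlier example. The tenantKey and requests names are hypothetical.

```kusto
// Parse, drop rows that did not match, derive a field, then aggregate.
['sample-http-logs']
| parse id with region "-" tenant "-" userId
| where isnotempty(region)
| extend tenantKey = strcat(region, "/", tenant)
| summarize requests = count() by tenantKey
```

Filtering directly after parse keeps unmatched rows out of the aggregation, and summarizing by the derived field means only the fields needed for the result are carried forward.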
By following these best practices and understanding the capabilities of the parse operator, you can effectively extract and transform data from string fields in APL, enabling powerful querying and insights.