diff --git a/docs/user/ppl/cmd/ad.md b/docs/user/ppl/cmd/ad.md index 6d18396506..dc532544ea 100644 --- a/docs/user/ppl/cmd/ad.md +++ b/docs/user/ppl/cmd/ad.md @@ -1,37 +1,76 @@ -# ad (deprecated by ml command) -## Description +# ad (Deprecated) -The `ad` command applies Random Cut Forest (RCF) algorithm in the ml-commons plugin on the search result returned by a PPL command. Based on the input, the command uses two types of RCF algorithms: fixed-in-time RCF for processing time-series data, batch RCF for processing non-time-series data. -## Syntax +> **Warning**: The `ad` command is deprecated in favor of the [`ml` command](./ml.md). -## Fixed In Time RCF For Time-series Data +The `ad` command applies the Random Cut Forest (RCF) algorithm in the ML Commons plugin to the search results returned by a PPL command. The command provides two anomaly detection approaches: -ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] \ [date_format] [time_zone] [category_field] -* number_of_trees: optional. Number of trees in the forest. **Default:** 30. -* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. **Default:** 8. -* sample_size: optional. The sample size used by stream samplers in this forest. **Default:** 256. -* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32. -* time_decay: optional. The decay factor used by stream samplers in this forest. **Default:** 0.0001. -* anomaly_rate: optional. The anomaly rate. **Default:** 0.005. -* time_field: mandatory. Specifies the time field for RCF to use as time-series data. -* date_format: optional. Used for formatting time_field. **Default:** "yyyy-MM-dd HH:mm:ss". -* time_zone: optional. Used for setting time zone for time_field. **Default:** "UTC". -* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted. +- [Anomaly detection for time-series data](#anomaly-detection-for-time-series-data) using the fixed-in-time RCF algorithm +- [Anomaly detection for non-time-series data](#anomaly-detection-for-non-time-series-data) using the batch RCF algorithm + +> **Note**: To use the `ad` command, `plugins.calcite.enabled` must be set to `false`. + +## Syntax + +The `ad` command has two different syntax variants, depending on the algorithm type. + +### Anomaly detection for time-series data + +Use this syntax to detect anomalies in time-series data. This method uses the fixed-in-time RCF algorithm, which is optimized for sequential data patterns. + +The fixed-in-time RCF `ad` command has the following syntax: + +```syntax +ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] time_field [date_format] [time_zone] [category_field] +``` + +### Parameters + +The fixed-in-time RCF algorithm supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `time_field` | Required | The time field for RCF to use as time-series data. | +| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. | +| `shingle_size` | Optional | The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is `8`. | +| `sample_size` | Optional | The sample size used by the stream samplers in this forest. Default is `256`. | +| `output_after` | Optional | The number of points required by the stream samplers before results are returned. 
Default is `32`. | +| `time_decay` | Optional | The decay factor used by the stream samplers in this forest. Default is `0.0001`. | +| `anomaly_rate` | Optional | The anomaly rate. Default is `0.005`. | +| `date_format` | Optional | The format used for the `time_field` field. Default is `yyyy-MM-dd HH:mm:ss`. | +| `time_zone` | Optional | The time zone for the `time_field` field. Default is `UTC`. | +| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. | -## Batch RCF For Non-time-series Data +### Anomaly detection for non-time-series data + +Use this syntax to detect anomalies in data where the order doesn't matter. This method uses the batch RCF algorithm, which is optimized for independent data points. + +The batch RCF `ad` command has the following syntax: + +```syntax ad [number_of_trees] [sample_size] [output_after] [training_data_size] [anomaly_score_threshold] [category_field] -* number_of_trees: optional. Number of trees in the forest. **Default:** 30. -* sample_size: optional. Number of random samples given to each tree from the training data set. **Default:** 256. -* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32. -* training_data_size: optional. **Default:** size of your training data set. -* anomaly_score_threshold: optional. The threshold of anomaly score. **Default:** 1.0. -* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted. +``` + +### Parameters + +The batch RCF algorithm supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. | +| `sample_size` | Optional | The number of random samples provided to each tree from the training dataset. Default is `256`. | +| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. | +| `training_data_size` | Optional | The size of the training dataset. Default is the full dataset size. | +| `anomaly_score_threshold` | Optional | The anomaly score threshold. Default is `1.0`. | +| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. | -## Example 1: Detecting events in New York City from taxi ridership data with time-series data -This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data. +## Example 1: Detecting events in New York City taxi ridership time-series data + +The following examples use the `nyc_taxi` dataset, which contains New York City taxi ridership data with fields including `value` (number of rides), `timestamp` (time of measurement), and `category` (time period classifications such as 'day' and 'night'). 
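+
+A minimal fixed-in-time invocation might look like the following sketch. This is an illustration only: the shape of the `AD` clause and the `anomaly_grade` output field name are assumptions inferred from the result tables that follow.
+
+```ppl ignore
+source=nyc_taxi
+| fields value, timestamp
+| AD time_field='timestamp'
+| where anomaly_grade > 0 // anomaly_grade is an assumed output field; see the result tables below
+```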
+ +This example trains an RCF model and uses it to detect anomalies in time-series ridership data: ```ppl ignore source=nyc_taxi @@ -40,7 +79,7 @@ source=nyc_taxi | where value=10844.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -51,9 +90,10 @@ fetched rows / total rows = 1/1 +---------+---------------------+-------+---------------+ ``` -## Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category -This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values. +## Example 2: Detecting events in New York City taxi ridership time-series data by category + +This example trains an RCF model and uses it to detect anomalies in time-series ridership data across multiple category values: ```ppl ignore source=nyc_taxi @@ -62,7 +102,7 @@ source=nyc_taxi | where value=10844.0 or value=6526.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -74,9 +114,10 @@ fetched rows / total rows = 2/2 +----------+---------+---------------------+-------+---------------+ ``` -## Example 3: Detecting events in New York City from taxi ridership data with non-time-series data -This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data. +## Example 3: Detecting events in New York City taxi ridership non-time-series data + +This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data: ```ppl ignore source=nyc_taxi @@ -85,7 +126,7 @@ source=nyc_taxi | where value=10844.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -96,9 +137,10 @@ fetched rows / total rows = 1/1 +---------+-------+-----------+ ``` -## Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category -This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values. +## Example 4: Detecting events in New York City taxi ridership non-time-series data by category + +This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data across multiple category values: ```ppl ignore source=nyc_taxi @@ -107,7 +149,7 @@ source=nyc_taxi | where value=10844.0 or value=6526.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -119,6 +161,4 @@ fetched rows / total rows = 2/2 +----------+---------+-------+-----------+ ``` -## Limitations -The `ad` command can only work with `plugins.calcite.enabled=false`. \ No newline at end of file diff --git a/docs/user/ppl/cmd/addcoltotals.md b/docs/user/ppl/cmd/addcoltotals.md index bcc089859e..51f511d70f 100644 --- a/docs/user/ppl/cmd/addcoltotals.md +++ b/docs/user/ppl/cmd/addcoltotals.md @@ -1,21 +1,32 @@ -# AddColTotals - -# Description +# addcoltotals -The `addcoltotals` command computes the sum of each column and add a summary event at the end to show the total of each column. This command works the same way `addtotals` command works with row=false and col=true option. This is useful for creating summary reports with subtotals or grand totals. The `addcoltotals` command only sums numeric fields (integers, floats, doubles). 
Non-numeric fields in the field list are ignored even if its specified in field-list or in the case of no field-list specified. +The `addcoltotals` command computes the sum of each column and adds a summary row showing the total for each column. This command is equivalent to using `addtotals` with `row=false` and `col=true`, making it useful for creating summary reports with column totals. -# Syntax +The command only processes numeric fields (integers, floats, doubles). Non-numeric fields are ignored regardless of whether they are explicitly specified in the field list. -`addcoltotals [field-list] [label=] [labelfield=]` -- `field-list`: Optional. Comma-separated list of numeric fields to sum. If not specified, all numeric fields are summed. -- `labelfield=`: Optional. Field name to place the label. If it specifies a non-existing field, adds the field and shows label at the summary event row at this field. -- `label=`: Optional. Custom text for the totals row labelfield\'s label. Default is \"Total\". +## Syntax -# Example 1: Basic Example +The `addcoltotals` command has the following syntax: -The example shows placing the label in an existing field. +```syntax +addcoltotals [field-list] [label=<string>] [labelfield=<field>] +``` + +## Parameters + +The `addcoltotals` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<field-list>` | Optional | A comma-separated list of numeric fields to sum. By default, all numeric fields are summed. | +| `labelfield` | Optional | The field in which the label is placed. If the field does not exist, it is created and the label is shown in the summary row (last row) of the new field. | +| `label` | Optional | The text that appears in the summary row (last row) to identify the computed totals. When used with `labelfield`, this text is placed in the specified field in the summary row. Default is `Total`. | + +## Example 1: Basic example + +The following query places the label in an existing field: ```ppl source=accounts @@ -24,7 +35,7 @@ source=accounts | addcoltotals labelfield='firstname' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -38,9 +49,9 @@ fetched rows / total rows = 4/4 +-----------+---------+ ``` -# Example 2: Adding column totals and adding a summary event with label specified. +## Example 2: Adding column totals with a custom summary label -The example shows adding totals after a stats command where final summary event label is \'Sum\' and row=true value was used by default when not specified. It also added new field specified by labelfield as it did not match existing field. +The following query adds totals after a `stats` command where the final summary event label is `Sum`. It also creates a new field specified by `labelfield` because this field does not exist in the data: ```ppl source=accounts @@ -48,7 +59,7 @@ source=accounts | addcoltotals `count()` label='Sum' labelfield='Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -61,9 +72,9 @@ fetched rows / total rows = 3/3 +---------+--------+-------+ ``` -# Example 3: With all options +## Example 3: Using all options -The example shows using addcoltotals with all options set.
+The following query uses the `addcoltotals` command with all options set: ```ppl source=accounts @@ -73,7 +84,7 @@ source=accounts | addcoltotals avg_balance, count label='Sum' labelfield='Column Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 diff --git a/docs/user/ppl/cmd/addtotals.md b/docs/user/ppl/cmd/addtotals.md index 745b1ae750..ba916172b3 100644 --- a/docs/user/ppl/cmd/addtotals.md +++ b/docs/user/ppl/cmd/addtotals.md @@ -1,33 +1,44 @@ -# AddTotals +# addtotals -## Description +The `addtotals` command computes the sum of numeric fields and can create both column totals (summary row) and row totals (new field). This command is useful for creating summary reports with subtotals or grand totals. + +The command only processes numeric fields (integers, floats, doubles). Non-numeric fields are ignored regardless of whether they are explicitly specified in the field list. -The `addtotals` command computes the sum of numeric fields and appends a row with the totals to the result. The command can also add row totals and add a field to store row totals. This is useful for creating summary reports with subtotals or grand totals. The `addtotals` command only sums numeric fields (integers, floats, doubles). Non-numeric fields in the field list are ignored even if it\'s specified in field-list or in the case of no field-list specified. ## Syntax -`addtotals [field-list] [label=] [labelfield=] [row=] [col=] [fieldname=]` +The `addtotals` command has the following syntax: + +```syntax +addtotals [field-list] [label=<string>] [labelfield=<field>] [row=<bool>] [col=<bool>] [fieldname=<field>] +``` + +## Parameters -- `field-list`: Optional. Comma-separated list of numeric fields to sum. If not specified, all numeric fields are summed. -- `row=`: Optional. Calculates total of each row and add a new field with the total. Default is true. -- `col=`: Optional. Calculates total of each column and add a new event at the end of all events with the total. Default is false. -- `labelfield=`: Optional. Field name to place the label. If it specifies a non-existing field, adds the field and shows label at the summary event row at this field. This is applicable when col=true. -- `label=`: Optional. Custom text for the totals row labelfield\'s label. Default is \"Total\". This is applicable when col=true. This does not have any effect when labelfield and fieldname parameter both have same value. -- `fieldname=`: Optional. Calculates total of each row and add a new field to store this total. This is applicable when row=true. +The `addtotals` command supports the following parameters. -## Example 1: Basic Example +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<field-list>` | Optional | A comma-separated list of numeric fields to sum. By default, all numeric fields are summed. | +| `row` | Optional | Calculates the total of each row and adds a new field to store the row total. Default is `true`. | +| `col` | Optional | Calculates the total of each column and adds a summary event at the end with the column totals. Default is `false`. | +| `labelfield` | Optional | The field in which the label is placed. If the field does not exist, it is created and the label is shown in the summary row (last row) of the new field. Applicable when `col=true`. | +| `label` | Optional | The text that appears in the summary row (last row) to identify the computed totals. When used with `labelfield`, this text is placed in the specified field in the summary row. Default is `Total`. 
Applicable when `col=true`. This parameter has no effect when the `labelfield` and `fieldname` parameters specify the same field name. | +| `fieldname` | Optional | The field used to store row totals. Applicable when `row=true`. | -The example shows placing the label in an existing field. +## Example 1: Basic example + +The following query places the label in an existing field: ```ppl source=accounts | head 3 -|fields firstname, balance +| fields firstname, balance | addtotals col=true labelfield='firstname' label='Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -41,9 +52,10 @@ fetched rows / total rows = 4/4 +-----------+---------+-------+ ``` -## Example 2: Adding column totals and adding a summary event with label specified. +## Example 2: Adding column totals with a custom summary label + +The following query adds totals after a `stats` command, with the final summary event labeled `Sum`. It also creates a new field specified by `labelfield` because the field does not exist in the data: -The example shows adding totals after a stats command where final summary event label is \'Sum\'. It also added new field specified by labelfield as it did not match existing field. ```ppl source=accounts @@ -51,7 +63,7 @@ source=accounts | addtotals col=true row=false label='Sum' labelfield='Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 5/5 @@ -66,7 +78,8 @@ fetched rows / total rows = 5/5 +----------------+-----------+---------+-----+-------+ ``` -if row=true in above example, there will be conflict between column added for column totals and column added for row totals being same field \'Total\', in that case the output will have final event row label null instead of \'Sum\' because the column is number type and it cannot output String in number type column. +If you set `row=true` in the preceding example, both row totals and column totals try to use the same field name (`Total`), creating a conflict. When this happens, the summary row label displays as `null` instead of `Sum` because the field becomes numeric (for row totals) and cannot display string values: + ```ppl source=accounts @@ -74,7 +87,7 @@ source=accounts | addtotals col=true row=true label='Sum' labelfield='Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 5/5 @@ -89,9 +102,9 @@ fetched rows / total rows = 5/5 +----------------+-----------+---------+-----+-------+ ``` -## Example 3: With all options +## Example 3: Using all options -The example shows using addtotals with all options set. +The following query uses the `addtotals` command with all options set: ```ppl source=accounts @@ -101,7 +114,7 @@ source=accounts | addtotals avg_balance, count row=true col=true fieldname='Row Total' label='Sum' labelfield='Column Total' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 diff --git a/docs/user/ppl/cmd/append.md b/docs/user/ppl/cmd/append.md index 6c765286c6..641f436b20 100644 --- a/docs/user/ppl/cmd/append.md +++ b/docs/user/ppl/cmd/append.md @@ -1,27 +1,35 @@ -# append -## Description +# append -The `append` command appends the result of a sub-search and attaches it as additional rows to the bottom of the input search results (The main search). -The command aligns columns with the same field names and types. 
For different column fields between the main search and sub-search, NULL values are filled in the respective rows. -## Syntax +The `append` command appends the results of a subsearch as additional rows to the end of the input search results (the main search). -append \ -* sub-search: mandatory. Executes PPL commands as a secondary search. - -## Limitations +The command aligns columns that have the same field names and types. For columns that exist in only the main search or subsearch, `NULL` values are inserted into the missing fields for the respective rows. -* **Schema Compatibility**: When fields with the same name exist between the main search and sub-search but have incompatible types, the query will fail with an error. To avoid type conflicts, ensure that fields with the same name have the same data type, or use different field names (e.g., by renaming with `eval` or using `fields` to select non-conflicting columns). - -## Example 1: Append rows from a count aggregation to existing search result +## Syntax + +The `append` command has the following syntax: + +```syntax +append <sub-search> +``` + +## Parameters + +The `append` command supports the following parameters. -This example appends rows from "count by gender" to "sum by gender, state". +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<sub-search>` | Required | Executes PPL commands as a secondary search. | + +## Example 1: Append rows from a count aggregation to existing search results + +The following query appends rows from `count by gender` to `sum by gender, state`: ```ppl source=accounts | stats sum(age) by gender, state | sort -`sum(age)` | head 5 | append [ source=accounts | stats count(age) by gender ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 6/6 @@ -37,15 +45,16 @@ fetched rows / total rows = 6/6 +----------+--------+-------+------------+ ``` -## Example 2: Append rows with merged column names -This example appends rows from "sum by gender" to "sum by gender, state" with merged column of same field name and type. +## Example 2: Append rows with merged column names + +The following query appends rows from `sum by gender` to `sum by gender, state`, merging columns that have the same field name and type: ```ppl source=accounts | stats sum(age) as sum by gender, state | sort -sum | head 5 | append [ source=accounts | stats sum(age) as sum by gender ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 6/6 @@ -60,4 +69,9 @@ fetched rows / total rows = 6/6 | 101 | M | null | +-----+--------+-------+ ``` - \ No newline at end of file + +## Limitations + +The `append` command has the following limitations: + +* **Schema compatibility**: When fields with the same name exist in both the main search and the subsearch but have incompatible types, the query fails with an error. To avoid type conflicts, ensure that fields with the same name share the same data type. Alternatively, use different field names. You can rename the conflicting fields using `eval` or select non-conflicting columns using `fields`. \ No newline at end of file diff --git a/docs/user/ppl/cmd/appendcol.md b/docs/user/ppl/cmd/appendcol.md index fb879c1b6f..e0557293bb 100644 --- a/docs/user/ppl/cmd/appendcol.md +++ b/docs/user/ppl/cmd/appendcol.md @@ -1,17 +1,30 @@ -# appendcol -## Description +# appendcol -The `appendcol` command appends the result of a sub-search and attaches it alongside with the input search results (The main search). 
-## Syntax +The `appendcol` command appends the result of a subsearch as additional columns to the input search results (the main search). -appendcol [override=\] \ -* override=: optional. Boolean field to specify should result from main-result be overwritten in the case of column name conflict. **Default:** false. -* sub-search: mandatory. Executes PPL commands as a secondary search. The sub-search uses the same data specified in the source clause of the main search results as its input. +## Syntax + +The `appendcol` command has the following syntax: + +```syntax +appendcol [override=<boolean>] <sub-search> +``` + +## Parameters + +The `appendcol` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<sub-search>` | Required | Executes PPL commands as a secondary search. The subsearch uses the data specified in the `source` clause of the main search results as its input. | +| `override` | Optional | Specifies whether the results of the main search should be overwritten when column names conflict. Default is `false`. | -## Example 1: Append a count aggregation to existing search result + + +## Example 1: Append a count aggregation to existing search results -This example appends "count by gender" to "sum by gender, state". +This example appends `count by gender` to `sum by gender, state`: ```ppl source=accounts @@ -20,7 +33,7 @@ source=accounts | head 10 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 10/10 @@ -40,9 +53,10 @@ fetched rows / total rows = 10/10 +--------+-------+----------+------------+ ``` -## Example 2: Append a count aggregation to existing search result with override option -This example appends "count by gender" to "sum by gender, state" with override option. +## Example 2: Append a count aggregation to existing search results, overriding the main search results + +This example appends `count by gender` to `sum by gender, state` and overrides the main search results: ```ppl source=accounts @@ -51,7 +65,7 @@ source=accounts | head 10 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 10/10 @@ -71,9 +85,10 @@ fetched rows / total rows = 10/10 +--------+-------+----------+------------+ ``` -## Example 3: Append multiple sub-search results -This example shows how to chain multiple appendcol commands to add columns from different sub-searches. +## Example 3: Append multiple subsearch results + +The following query chains multiple `appendcol` commands to add columns from different subsearches: ```ppl source=employees @@ -82,7 +97,7 @@ source=employees | appendcol [ stats max(age) as max_age ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 9/9 @@ -101,9 +116,10 @@ fetched rows / total rows = 9/9 +------+-------------+-----+------------------+---------+ ``` -## Example 4: Override case of column name conflict -This example demonstrates the override option when column names conflict between main search and sub-search. 
+ +## Example 4: Resolve column name conflicts using the override parameter + +The following query shows how to use `appendcol` with the `override` option when column names in the main search and subsearch conflict: ```ppl source=employees @@ -111,7 +127,7 @@ source=employees | appendcol override=true [ stats max(age) as agg by dept ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 diff --git a/docs/user/ppl/cmd/appendpipe.md b/docs/user/ppl/cmd/appendpipe.md index f2dc71a2ab..935f6a34ca 100644 --- a/docs/user/ppl/cmd/appendpipe.md +++ b/docs/user/ppl/cmd/appendpipe.md @@ -1,17 +1,30 @@ -# appendpipe -## Description +# appendpipe -The `appendpipe` command appends the result of the subpipeline to the search results. Unlike a subsearch, the subpipeline is not run first.The subpipeline is run when the search reaches the appendpipe command. -The command aligns columns with the same field names and types. For different column fields between the main search and sub-search, NULL values are filled in the respective rows. -## Syntax +The `appendpipe` command appends the results of a subpipeline to the search results. Unlike a subsearch, the subpipeline is not executed first; it runs only when the search reaches the `appendpipe` command. -appendpipe [\] -* subpipeline: mandatory. A list of commands that are applied to the search results from the commands that occur in the search before the `appendpipe` command. +The command aligns columns that have the same field names and types. For columns that exist in only the main search or subpipeline, `NULL` values are inserted into the missing fields for the respective rows. + +## Syntax + +The `appendpipe` command has the following syntax: + +```syntax +appendpipe [<subpipeline>] +``` + +## Parameters + +The `appendpipe` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<subpipeline>` | Required | A list of commands applied to the search results produced by the commands that precede the `appendpipe` command. | -## Example 1: Append rows from a total count to existing search result -This example appends rows from "total by gender" to "sum by gender, state" with merged column of same field name and type. +## Example 1: Append rows from a total count to existing search results + +This example appends rows from `total by gender` to `sum by gender, state`, merging columns that have the same field name and type: ```ppl source=accounts @@ -21,7 +34,7 @@ source=accounts | appendpipe [ stats sum(part) as total by gender ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 6/6 @@ -37,9 +50,10 @@ fetched rows / total rows = 6/6 +------+--------+-------+-------+ ``` + ## Example 2: Append rows with merged column names -This example appends rows from "count by gender" to "sum by gender, state". +This example appends rows from `count by gender` to `sum by gender, state`: ```ppl source=accounts @@ -49,7 +63,7 @@ source=accounts | appendpipe [ stats sum(total) as total by gender ] ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 6/6 @@ -65,6 +79,9 @@ fetched rows / total rows = 6/6 +----------+--------+-------+ ``` -## Limitations -* **Schema Compatibility**: Same as command `append`, when fields with the same name exist between the main search and sub-search but have incompatible types, the query will fail with an error. 
To avoid type conflicts, ensure that fields with the same name have the same data type, or use different field names (e.g., by renaming with `eval` or using `fields` to select non-conflicting columns). \ No newline at end of file +## Limitations + +The `appendpipe` command has the following limitations: + +* **Schema compatibility**: When fields with the same name exist in both the main search and the subpipeline but have incompatible types, the query fails with an error. To avoid type conflicts, ensure that fields with the same name share the same data type. Alternatively, use different field names. You can rename the conflicting fields using `eval` or select non-conflicting columns using `fields`. diff --git a/docs/user/ppl/cmd/bin.md b/docs/user/ppl/cmd/bin.md index 7f8ef389bd..e245214502 100644 --- a/docs/user/ppl/cmd/bin.md +++ b/docs/user/ppl/cmd/bin.md @@ -1,48 +1,75 @@ -# bin - -## Description - -The `bin` command groups numeric values into buckets of equal intervals, making it useful for creating histograms and analyzing data distribution. It takes a numeric or time-based field and generates a new field with values that represent the lower bound of each bucket. -## Syntax - -bin \ [span=\] [minspan=\] [bins=\] [aligntime=(earliest \| latest \| \)] [start=\] [end=\] -* field: mandatory. The field to bin. Accepts numeric or time-based fields. -* span: optional. The interval size for each bin. Cannot be used with bins or minspan parameters. - * Supports numeric (e.g., `1000`), logarithmic (e.g., `log10`, `2log10`), and time intervals - * Available time units: - * microsecond (us) - * millisecond (ms) - * centisecond (cs) - * decisecond (ds) - * second (s, sec, secs, second, seconds) - * minute (m, min, mins, minute, minutes) - * hour (h, hr, hrs, hour, hours) - * day (d, day, days) - * month (M, mon, month, months) -* minspan: optional. The minimum interval size for automatic span calculation. Cannot be used with span or bins parameters. -* bins: optional. The maximum number of equal-width bins to create. Cannot be used with span or minspan parameters. The bins parameter must be between 2 and 50000 (inclusive). - - **Limitation**: The bins parameter on timestamp fields has the following requirements: - - 1. **Pushdown must be enabled**: Controlled by ``plugins.calcite.pushdown.enabled`` (enabled by default). When pushdown is disabled, use the ``span`` parameter instead (e.g., ``bin @timestamp span=5m``). - 2. **Timestamp field must be used as an aggregation bucket**: The binned timestamp field must be used in a ``stats`` aggregation (e.g., ``source=events | bin @timestamp bins=3 | stats count() by @timestamp``). Using bins on timestamp fields outside of aggregation buckets is not supported. -* aligntime: optional. Align the bin times for time-based fields. Valid only for time-based discretization. Options: - * earliest: Align bins to the earliest timestamp in the data - * latest: Align bins to the latest timestamp in the data - * \: Align bins to a specific epoch time value or time modifier expression -* start: optional. The starting value for binning range. **Default:** minimum field value. -* end: optional. The ending value for binning range. **Default:** maximum field value. - -**Parameter Behavior** -When multiple parameters are specified, priority order is: span > minspan > bins > start/end > default. -**Special Behaviors:** -* Logarithmic span (`log10`, `2log10`, etc.) 
creates logarithmic bin boundaries instead of linear -* Daily/monthly spans automatically align to calendar boundaries and return date strings (YYYY-MM-DD) instead of timestamps -* aligntime parameter only applies to time spans excluding days/months -* start/end parameters expand the range (never shrink) and affect bin width calculation + +# bin + +The `bin` command groups numeric values into buckets of equal intervals, which is useful for creating histograms and analyzing data distribution. It accepts a numeric or time-based field and generates a new field containing values that represent the lower bound of each bucket. + +## Syntax + +The `bin` command has the following syntax: + +```syntax +bin <field> [span=<interval>] [minspan=<interval>] [bins=<count>] [aligntime=(earliest | latest | <time>)] [start=<value>] [end=<value>] +``` + +## Parameters + +The `bin` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<field>` | Required | The field to group into buckets. Accepts numeric or time-based fields. | +| `span` | Optional | The interval size for each bin. Cannot be used with the `bins` or `minspan` parameters. Supports numeric, logarithmic (`log10`, `2log10`), and time intervals. See [Time units](#time-units). | +| `minspan` | Optional | The minimum interval size for automatic span calculation. Cannot be used with the `span` or `bins` parameters. | +| `bins` | Optional | The maximum number of equal-width bins to create. Must be between `2` and `50000` (inclusive). Cannot be used with the `span` or `minspan` parameters. See [The bins parameter for timestamp fields](#the-bins-parameter-for-timestamp-fields). | +| `aligntime` | Optional | Align the bin times for time-based fields. Valid only for time-based discretization. Valid values are `earliest`, `latest`, or a specific time. See [Align time options](#align-time-options). | +| `start` | Optional | The starting value of the interval range. Default is the minimum value of the field. | +| `end` | Optional | The ending value of the interval range. Default is the maximum value of the field. | + +### The bins parameter for timestamp fields + +The `bins` parameter for timestamp fields has the following requirements: + +- **Pushdown must be enabled**: Enable pushdown by setting `plugins.calcite.pushdown.enabled` to `true` (enabled by default). If pushdown is disabled, use the `span` parameter instead (for example, `bin @timestamp span=5m`). +- **The timestamp field must be used as an aggregation bucket**: The binned timestamp field must be included in a `stats` aggregation (for example, `source=events | bin @timestamp bins=3 | stats count() by @timestamp`). Using `bins` on timestamp fields outside of aggregation buckets is not supported. + + +### Time units + +The following time units are available for the `span` parameter: + +* Microseconds (`us`) +* Milliseconds (`ms`) +* Centiseconds (`cs`) +* Deciseconds (`ds`) +* Seconds (`s`, `sec`, `secs`, `second`, or `seconds`) +* Minutes (`m`, `min`, `mins`, `minute`, or `minutes`) +* Hours (`h`, `hr`, `hrs`, `hour`, or `hours`) +* Days (`d`, `day`, or `days`) +* Months (`M`, `mon`, `month`, or `months`) + +### Align time options + +The following options are available for the `aligntime` parameter: + +* `earliest` -- Align bins to the earliest timestamp in the data. +* `latest` -- Align bins to the latest timestamp in the data. +* `<time>` -- Align bins to a specific epoch time value or time modifier expression. 
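+
+For example, the following sketch (an assumption-based illustration that reuses the `time_test` dataset from the examples below) aligns 2-hour bins to the earliest timestamp in the data:
+
+```ppl ignore
+source=time_test
+| bin @timestamp span=2h aligntime=earliest
+| fields @timestamp, value
+| head 3
+```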
+### Parameter behavior + +When multiple parameters are specified, the priority order is: `span` > `minspan` > `bins` > `start`/`end` > default. + +### Special parameter types + +The `bin` command has the following special handling for certain parameter types: + +* Logarithmic spans (for example, `log10` or `2log10`) create logarithmic bin boundaries instead of linear ones. +* Daily or monthly spans automatically align to calendar boundaries and return date strings (`YYYY-MM-DD`) instead of timestamps. +* The `aligntime` parameter applies only to time spans shorter than a day (excluding daily or monthly spans). +* The `start` and `end` parameters expand the range (they never reduce it) and affect bin width calculations. + ## Example 1: Basic numeric span - + ```ppl source=accounts | bin age span=10 @@ -50,7 +77,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -63,8 +90,9 @@ fetched rows / total rows = 3/3 +-------+----------------+ ``` + ## Example 2: Large numeric span - + ```ppl source=accounts | bin balance span=25000 @@ -72,7 +100,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -84,8 +112,9 @@ fetched rows / total rows = 2/2 +-------------+ ``` + ## Example 3: Logarithmic span (log10) - + ```ppl source=accounts | bin balance span=log10 @@ -93,7 +122,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -105,8 +134,9 @@ fetched rows / total rows = 2/2 +------------------+ ``` + ## Example 4: Logarithmic span with coefficient - + ```ppl source=accounts | bin balance span=2log10 @@ -114,7 +144,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -127,8 +157,9 @@ fetched rows / total rows = 3/3 +------------------+ ``` + ## Example 5: Basic bins parameter - + ```ppl source=time_test | bin value bins=5 @@ -136,7 +167,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -149,8 +180,9 @@ fetched rows / total rows = 3/3 +------------+ ``` + ## Example 6: Low bin count - + ```ppl source=accounts | bin age bins=2 @@ -158,7 +190,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -169,8 +201,9 @@ fetched rows / total rows = 1/1 +-------+ ``` + ## Example 7: High bin count - + ```ppl source=accounts | bin age bins=21 @@ -178,7 +211,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -191,8 +224,9 @@ fetched rows / total rows = 3/3 +-------+----------------+ ``` + ## Example 8: Basic minspan - + ```ppl source=accounts | bin age minspan=5 @@ -200,7 +234,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -213,8 +247,9 @@ fetched rows / total rows = 3/3 +-------+----------------+ ``` + ## Example 9: Large minspan - + ```ppl source=accounts | bin age minspan=101 @@ -222,7 +257,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -233,8 +268,9 @@ fetched rows / total rows = 1/1 +--------+ ``` + ## Example 10: Start and end range - + ```ppl 
source=accounts | bin age start=0 end=101 @@ -242,7 +278,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -253,8 +289,9 @@ fetched rows / total rows = 1/1 +-------+ ``` + ## Example 11: Large end range - + ```ppl source=accounts | bin balance start=0 end=100001 @@ -262,7 +299,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -273,8 +310,9 @@ fetched rows / total rows = 1/1 +----------+ ``` + ## Example 12: Span with start/end - + ```ppl source=accounts | bin age span=1 start=25 end=35 @@ -282,7 +320,7 @@ source=accounts | head 6 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -296,8 +334,9 @@ fetched rows / total rows = 4/4 +-------+ ``` + ## Example 13: Hour span - + ```ppl source=time_test | bin @timestamp span=1h @@ -305,7 +344,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -318,8 +357,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 14: Minute span - + ```ppl source=time_test | bin @timestamp span=45minute @@ -327,7 +367,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -340,8 +380,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 15: Second span - + ```ppl source=time_test | bin @timestamp span=30seconds @@ -349,7 +390,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -362,8 +403,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 16: Daily span - + ```ppl source=time_test | bin @timestamp span=7day @@ -371,7 +413,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -384,8 +426,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 17: Aligntime with time modifier - + ```ppl source=time_test | bin @timestamp span=2h aligntime='@d+3h' @@ -393,7 +436,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -406,8 +449,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 18: Aligntime with epoch timestamp - + ```ppl source=time_test | bin @timestamp span=2h aligntime=1500000000 @@ -415,7 +459,7 @@ source=time_test | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -428,8 +472,9 @@ fetched rows / total rows = 3/3 +---------------------+-------+ ``` + ## Example 19: Default behavior (no parameters) - + ```ppl source=accounts | bin age @@ -437,7 +482,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -450,8 +495,9 @@ fetched rows / total rows = 3/3 +-----------+----------------+ ``` + ## Example 20: Binning with string fields - + ```ppl source=accounts | eval age_str = CAST(age AS STRING) @@ -460,7 +506,7 @@ source=accounts | sort age_str ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -471,4 +517,3 @@ fetched rows / total rows = 2/2 | 3 | 30-40 | 
+---------+---------+ ``` - \ No newline at end of file diff --git a/docs/user/ppl/cmd/chart.md b/docs/user/ppl/cmd/chart.md index 829afdedb7..257f0de235 100644 --- a/docs/user/ppl/cmd/chart.md +++ b/docs/user/ppl/cmd/chart.md @@ -1,57 +1,50 @@ -# chart - -## Description - -The `chart` command transforms search results by applying a statistical aggregation function and optionally grouping the data by one or two fields. The results are suitable for visualization as a two-dimension chart when grouping by two fields, where unique values in the second group key can be pivoted to column names. -## Syntax - -chart [limit=(top\|bottom) \] [useother=\] [usenull=\] [nullstr=\] [otherstr=\] \ [ by \ \ ] \| [over \ ] [ by \] -* limit: optional. Specifies the number of categories to display when using column split. Each unique value in the column split field represents a category. **Default:** top10. - * Syntax: `limit=(top|bottom)` or `limit=` (defaults to top) - * When `limit=K` is set, the top or bottom K categories from the column split field are retained; the remaining categories are grouped into an "OTHER" category if `useother` is not set to false. - * Set limit to 0 to show all categories without any limit. - * Use `limit=topK` or `limit=bottomK` to specify whether to retain the top or bottom K column categories. The ranking is based on the sum of aggregated values for each column category. For example, `chart limit=top3 count() by region, product` keeps the 3 products with the highest total counts across all regions. If not specified, top is used by default. - * Only applies when column split is present (by 2 fields or over...by... coexists). -* useother: optional. Controls whether to create an "OTHER" category for categories beyond the limit. **Default:** true - * When set to false, only the top/bottom N categories (based on limit) are shown without an "OTHER" category. - * When set to true, categories beyond the limit are grouped into an "OTHER" category. - * Only applies when using column split and when there are more categories than the limit. -* usenull: optional. Controls whether to group events without a column split (i.e. whose column split is null) into a separate "NULL" category. **Default:** true - * `usenull` only applies to column split. - * Row split should always be non-null value. Documents with null values in row split will be ignored. - * When `usenull=false`, events with a null column split are excluded from results. - * When `usenull=true`, events with a null column split are grouped into a separate "NULL" category. -* nullstr: optional. Specifies the category name for rows that do not contain the column split value. **Default:** "NULL" - * Only applies when `usenull` is set to true. -* otherstr: optional. Specifies the category name for the "OTHER" category. **Default:** "OTHER" - * Only applies when `useother` is set to true and there are values beyond the limit. -* aggregation_function: mandatory. The aggregation function to apply to the data. - * Currently, only a single aggregation function is supported. - * Available functions: aggregation functions supported by the stats command. -* by: optional. Groups the results by either one field (row split) or two fields (row split and column split) - * `limit`, `useother`, and `usenull` apply to the column split - * Results are returned as individual rows for each combination. - * If not specified, the aggregation is performed across all documents. -* over...by...: optional. Alternative syntax for grouping by multiple fields. 
- * `over by ` groups the results by both fields. - * Using `over` alone on one field is equivalent to `by ` -## Notes - -* The fields generated by column splitting are converted to strings so that they are compatible with `nullstr` and `otherstr` and can be used as column names once pivoted. -* Documents with null values in fields used by the aggregation function are excluded from aggregation. For example, in `chart avg(balance) over deptno, group`, documents where `balance` is null are excluded from the average calculation. -* The aggregation metric appears as the last column in the result. Result columns are ordered as: [row-split] [column-split] [aggregation-metrics]. - + +# chart + +The `chart` command transforms search results by applying a statistical aggregation function and optionally grouping the data by one or two fields. When grouped by two fields, the results are suitable for two-dimensional chart visualizations, with unique values in the second group key pivoted into column names. + +## Syntax + +The `chart` command has the following syntax: + +```syntax +chart [limit=(top|bottom) <int>] [useother=<bool>] [usenull=<bool>] [nullstr=<string>] [otherstr=<string>] <aggregation_function> [by <row_split> [<column_split>]] | [over <row_split>] [by <column_split>] +``` + +## Parameters + +The `chart` command supports the following parameters. + +| Parameter | Required/Optional | Description | Default | +| --- | --- | --- | --- | +| `<aggregation_function>` | Required | The aggregation function to apply to the data. Only a single aggregation function is supported. Available functions are the aggregation functions supported by the [`stats`](./stats.md) command. | N/A | +| `by` | Optional | Groups the results by either one field (row split) or two fields (row split and column split). The parameters `limit`, `useother`, and `usenull` apply to the column split. Results are returned as individual rows for each combination. | Aggregate across all documents | +| `over <row_split> by <column_split>` | Optional | Alternative syntax for grouping by multiple fields. `over <row_split> by <column_split>` groups the results by both fields. Using `over` alone on one field is equivalent to `by <row_split>`. | N/A | +| `limit` | Optional | The number of categories to display when using column split. `limit=N` or `limit=topN` returns the top N categories. `limit=bottomN` returns the bottom N categories. When the limit is exceeded, remaining categories are grouped into an `OTHER` category (unless `useother=false`). Set to `0` to show all categories without a limit. The ranking is based on the sum of aggregated values for each column category. For example, `limit=top3` keeps the three categories with the highest total values. Only applies when grouping by two fields. | `top10` | +| `useother` | Optional | Controls whether to create an `OTHER` category for categories beyond the `limit`. When set to `false`, only the top or bottom N categories (based on `limit`) are shown without an `OTHER` category. When set to `true`, categories beyond the `limit` are grouped into an `OTHER` category. This parameter only applies when using column split and when there are more categories than the `limit`. | `true` | +| `usenull` | Optional | Controls whether to group documents that have null values in the column split field into a separate `NULL` category. This parameter only applies to column split. Documents with null values in the row split field are ignored; only documents with non-null values in the row split field are included in the results. When `usenull=false`, documents with null values in the column split field are excluded from the results. 
When `usenull=true`, documents with null values in the column split field are grouped into a separate `NULL` category. | `true` | +| `nullstr` | Optional | Specifies the category name for documents that have null values in the column split field. This parameter only applies when `usenull` is `true`. | `NULL` | +| `otherstr` | Optional | Specifies the category name for the `OTHER` category. This parameter only applies when `useother` is `true` and there are values beyond the `limit`. | `OTHER` | + + +## Notes + +The following considerations apply when using the `chart` command: + +* Fields generated by column splitting are converted to strings. This ensures compatibility with `nullstr` and `otherstr` and allows the fields to be used as column names after pivoting. +* Documents with null values in fields used by the aggregation function are excluded from aggregation. For example, in `chart avg(balance) over deptno, group`, documents where `balance` is null are excluded from the average calculation. +* The aggregation metric appears as the last column in the results. Result columns are ordered as follows: `[row split] [column split] [aggregation metrics]`. + ## Example 1: Basic aggregation without grouping -This example calculates the average balance across all accounts. +This example calculates the average balance across all accounts: ```ppl source=accounts | chart avg(balance) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -62,16 +55,17 @@ fetched rows / total rows = 1/1 +--------------+ ``` -## Example 2: Group by single field -This example calculates the count of accounts grouped by gender. +## Example 2: Group by a single field + +This example calculates the count of accounts grouped by gender: ```ppl source=accounts | chart count() by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -83,16 +77,17 @@ fetched rows / total rows = 2/2 +--------+---------+ ``` -## Example 3: Using over and by for multiple field grouping -This example shows average balance grouped by both gender and age fields. Note that the age column in the result is converted to string type. +## Example 3: Using over and by to group by multiple fields + +The following query calculates average balance grouped by both the `gender` and `age` fields: ```ppl source=accounts | chart avg(balance) over gender by age ``` -Expected output: +The query returns the following results. The `age` column in the result is converted to the string type: ```text fetched rows / total rows = 4/4 @@ -106,16 +101,17 @@ fetched rows / total rows = 4/4 +--------+-----+--------------+ ``` + ## Example 4: Using basic limit functionality -This example limits the results to show only the top 1 age group. Note that the age column in the result is converted to string type. +This example limits the results to show only the single top age group: ```ppl source=accounts | chart limit=1 count() over gender by age ``` -Expected output: +The query returns the following results. The `age` column in the result is converted to the string type: ```text fetched rows / total rows = 3/3 @@ -128,16 +124,17 @@ fetched rows / total rows = 3/3 +--------+-------+---------+ ``` + ## Example 5: Using limit with other parameters -This example shows using limit with useother and custom otherstr parameters. 
+The following query uses the `chart` command with the `limit`, `useother`, and custom `otherstr` parameters: ```ppl source=accounts | chart limit=top1 useother=true otherstr='minor_gender' count() over state by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -151,16 +148,17 @@ fetched rows / total rows = 4/4 +-------+--------------+---------+ ``` + ## Example 6: Using null parameters -This example shows using limit with usenull and custom nullstr parameters. +The following query uses the `chart` command with the `limit`, `usenull`, and custom `nullstr` parameters: ```ppl source=accounts | chart usenull=true nullstr='employer not specified' count() over firstname by employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -174,16 +172,17 @@ fetched rows / total rows = 4/4 +-----------+------------------------+---------+ ``` -## Example 7: Using chart command with span -This example demonstrates using span for grouping age ranges. +## Example 7: Using span + +The following query uses the `chart` command with `span` for grouping age ranges: ```ppl source=accounts | chart max(balance) by age span=10, gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -195,6 +194,9 @@ fetched rows / total rows = 2/2 +-----+--------+--------------+ ``` -## Limitations -* Only a single aggregation function is supported per chart command. \ No newline at end of file +## Limitations + +The `chart` command has the following limitations: + +* Only a single aggregation function is supported per `chart` command. \ No newline at end of file diff --git a/docs/user/ppl/cmd/dedup.md b/docs/user/ppl/cmd/dedup.md index 59dfcf63dd..92d1410997 100644 --- a/docs/user/ppl/cmd/dedup.md +++ b/docs/user/ppl/cmd/dedup.md @@ -1,19 +1,31 @@ -# dedup -## Description +# dedup The `dedup` command removes duplicate documents defined by specified fields from the search result. -## Syntax -dedup [int] \ [keepempty=\] [consecutive=\] -* int: optional. The `dedup` command retains multiple events for each combination when you specify \. The number for \ must be greater than 0. All other duplicates are removed from the results. **Default:** 1 -* keepempty: optional. If set to true, keep the document if the any field in the field-list has NULL value or field is MISSING. **Default:** false. -* consecutive: optional. If set to true, removes only events with duplicate combinations of values that are consecutive. **Default:** false. -* field-list: mandatory. The comma-delimited field list. At least one field is required. +## Syntax + +The `dedup` command has the following syntax: + +```syntax +dedup [int] <field-list> [keepempty=<bool>] [consecutive=<bool>] +``` + +## Parameters + +The `dedup` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `<field-list>` | Required | A comma-delimited list of fields to use for deduplication. At least one field is required. | +| `<int>` | Optional | The number of duplicate documents to retain for each combination. Must be greater than `0`. Default is `1`. | +| `keepempty` | Optional | When set to `true`, keeps documents in which any field in the field list has a `NULL` value or is missing. Default is `false`. | +| `consecutive` | Optional | When set to `true`, removes only consecutive duplicate documents. Default is `false`. Requires the legacy SQL engine (`plugins.calcite.enabled=false`). 
| -## Example 1: Dedup by one field -This example shows deduplicating documents by gender field. +## Example 1: Remove duplicates based on a single field + +The following query deduplicates documents based on the `gender` field: ```ppl source=accounts @@ -22,7 +34,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -34,9 +46,10 @@ fetched rows / total rows = 2/2 +----------------+--------+ ``` -## Example 2: Keep 2 duplicates documents -This example shows deduplicating documents by gender field while keeping 2 duplicates. +## Example 2: Retain multiple duplicate documents + +The following query removes duplicate documents based on the `gender` field while keeping two duplicate documents: ```ppl source=accounts @@ -45,7 +58,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -58,9 +71,10 @@ fetched rows / total rows = 3/3 +----------------+--------+ ``` -## Example 3: Keep or Ignore the empty field by default -This example shows deduplicating documents while keeping null values. +## Example 3: Handle documents with empty field values + +The following query removes duplicate documents while keeping documents with `null` values in the specified field: ```ppl source=accounts @@ -69,7 +83,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -83,7 +97,7 @@ fetched rows / total rows = 4/4 +----------------+-----------------------+ ``` -This example shows deduplicating documents while ignoring null values. +The following query removes duplicate documents while ignoring documents with empty values in the specified field: ```ppl source=accounts @@ -92,7 +106,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -105,9 +119,10 @@ fetched rows / total rows = 3/3 +----------------+-----------------------+ ``` -## Example 4: Dedup in consecutive document -This example shows deduplicating consecutive documents. +## Example 4: Deduplicate consecutive documents + +The following query removes duplicate consecutive documents: ```ppl source=accounts @@ -116,7 +131,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -129,6 +144,4 @@ fetched rows / total rows = 3/3 +----------------+--------+ ``` -## Limitations -The `dedup` with `consecutive=true` command can only work with `plugins.calcite.enabled=false`. \ No newline at end of file diff --git a/docs/user/ppl/cmd/describe.md b/docs/user/ppl/cmd/describe.md index d6efffc9d5..3c7be026c9 100644 --- a/docs/user/ppl/cmd/describe.md +++ b/docs/user/ppl/cmd/describe.md @@ -1,24 +1,35 @@ -# describe -## Description +# describe -Use the `describe` command to query metadata of the index. `describe` command can only be used as the first command in the PPL query. -## Syntax +The `describe` command queries index metadata. The `describe` command can only be used as the first command in the PPL query. -describe [dataSource.][schema.]\ -* dataSource: optional. If dataSource is not provided, it resolves to opensearch dataSource. -* schema: optional. If schema is not provided, it resolves to default schema. -* tablename: mandatory. describe command must specify which tablename to query from. 
- -## Example 1: Fetch all the metadata +## Syntax + +The `describe` command has the following syntax. The argument to the command is a dot-separated path to the table consisting of an optional data source, optional schema, and required table name: + +```syntax +describe [.][.] +``` + +## Parameters -This example describes the accounts index. +The `describe` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The table to query. | +| `` | Optional | The data source to use. Default is the OpenSearch `datasource`. | +| `` | Optional | The schema to use. Default is the default schema. | + +## Example 1: Fetch all metadata + +This example describes the `accounts` index: ```ppl describe accounts ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 11/11 @@ -39,9 +50,10 @@ fetched rows / total rows = 11/11 +----------------+-------------+------------+----------------+-----------+-----------+-------------+---------------+----------------+----------------+----------+---------+------------+---------------+------------------+-------------------+------------------+-------------+---------------+--------------+-------------+------------------+------------------+--------------------+ ``` -## Example 2: Fetch metadata with condition and filter -This example retrieves columns with type bigint in the accounts index. +## Example 2: Fetch metadata with a condition and filter + +This example retrieves columns of the type `bigint` from the `accounts` index: ```ppl describe accounts @@ -49,7 +61,7 @@ describe accounts | fields COLUMN_NAME ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -62,6 +74,7 @@ fetched rows / total rows = 3/3 +----------------+ ``` -## Example 3: Fetch metadata for table in Prometheus datasource + +## Example 3: Fetch table metadata for a Prometheus data source See [Fetch metadata for table in Prometheus datasource](../admin/datasources.md) for more context. \ No newline at end of file diff --git a/docs/user/ppl/cmd/eval.md b/docs/user/ppl/cmd/eval.md index d3300cd6b0..635bd0a990 100644 --- a/docs/user/ppl/cmd/eval.md +++ b/docs/user/ppl/cmd/eval.md @@ -1,17 +1,31 @@ -# eval -## Description +# eval -The `eval` command evaluates the expression and appends the result to the search result. -## Syntax +The `eval` command evaluates the specified expression and appends the result of the evaluation to the search results. -eval \=\ ["," \=\ ]... -* field: mandatory. If the field name does not exist, a new field is added. If the field name already exists, it will be overridden. -* expression: mandatory. Any expression supported by the system. +> **Note**: The `eval` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/). It is only executed on the coordinating node. + +## Syntax + +The `eval` command has the following syntax: + +```syntax +eval = ["," = ]... +``` + +## Parameters + +The `eval` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The name of the field to create or update. If the field does not exist, a new field is added. If it already exists, its value is overwritten. | +| `` | Required | The expression to evaluate. | + ## Example 1: Create a new field -This example shows creating a new field doubleAge for each document. 
The new doubleAge field is the result of multiplying age by 2. +The following query creates a new `doubleAge` field for each document: ```ppl source=accounts @@ -19,7 +33,7 @@ source=accounts | fields age, doubleAge ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -33,9 +47,10 @@ fetched rows / total rows = 4/4 +-----+-----------+ ``` + ## Example 2: Override an existing field -This example shows overriding the existing age field by adding 1 to it. +The following query overrides the `age` field by adding `1` to its value: ```ppl source=accounts @@ -43,7 +58,7 @@ source=accounts | fields age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -57,9 +72,10 @@ fetched rows / total rows = 4/4 +-----+ ``` -## Example 3: Create a new field with field defined in eval -This example shows creating a new field ddAge using a field defined in the same eval command. The new field ddAge is the result of multiplying doubleAge by 2, where doubleAge is defined in the same eval command. +## Example 3: Create a new field using a field defined in eval + +The following query creates a new field based on another field defined in the same `eval` expression. In this example, the new `ddAge` field is calculated by multiplying the `doubleAge` field by `2`. The `doubleAge` field itself is defined earlier in the `eval` command: ```ppl source=accounts @@ -67,7 +83,7 @@ source=accounts | fields age, doubleAge, ddAge ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -81,9 +97,10 @@ fetched rows / total rows = 4/4 +-----+-----------+-------+ ``` + ## Example 4: String concatenation -This example shows using the + operator for string concatenation. You can concatenate string literals and field values. +The following query uses the `+` operator for string concatenation. You can concatenate string literals and field values as follows: ```ppl source=accounts @@ -91,7 +108,7 @@ source=accounts | fields firstname, greeting ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -105,15 +122,16 @@ fetched rows / total rows = 4/4 +-----------+---------------+ ``` + ## Example 5: Multiple string concatenation with type casting -This example shows multiple concatenations with type casting from numeric to string. +The following query performs multiple concatenation operations, including type casting from numeric values to strings: ```ppl source=accounts | eval full_info = 'Name: ' + firstname + ', Age: ' + CAST(age AS STRING) | fields firstname, age, full_info ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -127,6 +145,4 @@ fetched rows / total rows = 4/4 +-----------+-----+------------------------+ ``` -## Limitations -The `eval` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node. \ No newline at end of file diff --git a/docs/user/ppl/cmd/eventstats.md b/docs/user/ppl/cmd/eventstats.md index 1cb791d95e..05dd0e585b 100644 --- a/docs/user/ppl/cmd/eventstats.md +++ b/docs/user/ppl/cmd/eventstats.md @@ -1,67 +1,29 @@ -# eventstats -## Description +# eventstats -The `eventstats` command enriches your event data with calculated summary statistics. It operates by analyzing specified fields within your events, computing various statistical measures, and then appending these results as new fields to each original event. 
-Key aspects of `eventstats`: -1. It performs calculations across the entire result set or within defined groups. -2. The original events remain intact, with new fields added to contain the statistical results. -3. The command is particularly useful for comparative analysis, identifying outliers, or providing additional context to individual events. - -Difference between `stats` and `eventstats` -The `stats` and `eventstats` commands are both used for calculating statistics, but they have some key differences in how they operate and what they produce: -* Output Format - * `stats`: Produces a summary table with only the calculated statistics. - * `eventstats`: Adds the calculated statistics as new fields to the existing events, preserving the original data. -* Event Retention - * `stats`: Reduces the result set to only the statistical summary, discarding individual events. - * `eventstats`: Retains all original events and adds new fields with the calculated statistics. -* Use Cases - * `stats`: Best for creating summary reports or dashboards. Often used as a final command to summarize results. - * `eventstats`: Useful when you need to enrich events with statistical context for further analysis or filtering. It can be used mid-search to add statistics that can be used in subsequent commands. - -## Syntax - -eventstats [bucket_nullable=bool] \... [by-clause] -* function: mandatory. An aggregation function or window function. -* bucket_nullable: optional. Controls whether the eventstats command consider null buckets as a valid group in group-by aggregations. When set to `false`, it will not treat null group-by values as a distinct group during aggregation. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`. - * When `plugins.ppl.syntax.legacy.preferred=true`, `bucket_nullable` defaults to `true` - * When `plugins.ppl.syntax.legacy.preferred=false`, `bucket_nullable` defaults to `false` -* by-clause: optional. Groups results by specified fields or expressions. Syntax: by [span-expression,] [field,]... **Default:** aggregation over the entire result set. -* span-expression: optional, at most one. Splits field into buckets by intervals. Syntax: span(field_expr, interval_expr). For example, `span(age, 10)` creates 10-year age buckets, `span(timestamp, 1h)` creates hourly buckets. - * Available time units: - * millisecond (ms) - * second (s) - * minute (m, case sensitive) - * hour (h) - * day (d) - * week (w) - * month (M, case sensitive) - * quarter (q) - * year (y) - -## Aggregation Functions - -The eventstats command supports the following aggregation functions: -* COUNT: Count of values -* SUM: Sum of numeric values -* AVG: Average of numeric values -* MAX: Maximum value -* MIN: Minimum value -* VAR_SAMP: Sample variance -* VAR_POP: Population variance -* STDDEV_SAMP: Sample standard deviation -* STDDEV_POP: Population standard deviation -* DISTINCT_COUNT/DC: Distinct count of values -* EARLIEST: Earliest value by timestamp -* LATEST: Latest value by timestamp - -For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md). -## Usage - -Eventstats - -```sql ignore +The `eventstats` command enriches your event data with calculated summary statistics. It analyzes the specified fields within your events, computes various statistical measures, and then appends these results as new fields to each original event. + +The `eventstats` command operates in the following way: + +1. 
It performs calculations across the entire search results or within defined groups. +2. The original events remain intact, with new fields added to contain the statistical results. +3. The command is particularly useful for comparative analysis, identifying outliers, and providing additional context to individual events. + +## Comparing stats and eventstats + +For a comprehensive comparison of `stats`, `eventstats`, and `streamstats` commands, including their differences in transformation behavior, output format, aggregation scope, and use cases, see [Comparing stats, eventstats, and streamstats](streamstats.md/#comparing-stats-eventstats-and-streamstats). + +## Syntax + +The `eventstats` command has the following syntax: + +```syntax +eventstats [bucket_nullable=bool] ... [by-clause] +``` + +The following are examples of the `eventstats` command syntax: + +```syntax source = table | eventstats avg(a) source = table | where a < 50 | eventstats count(c) source = table | eventstats min(c), max(c) by b @@ -69,10 +31,54 @@ source = table | eventstats count(c) as count_by by b | where count_by > 1000 source = table | eventstats dc(field) as distinct_count source = table | eventstats distinct_count(category) by region ``` - -## Example 1: Calculate the average, sum and count of a field by group -This example shows calculating the average age, sum of age, and count of events for all accounts grouped by gender. +## Parameters + +The `eventstats` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | An aggregation function or window function. | +| `bucket_nullable` | Optional | Controls whether the `eventstats` command considers `null` buckets as a valid group in group-by aggregations. When set to `false`, it does not treat `null` group-by values as a distinct group during aggregation. Default is determined by `plugins.ppl.syntax.legacy.preferred`. | +| `` | Optional | Groups results by specified fields or expressions. Syntax: `by [span-expression,] [field,]...` Default is aggregating over the entire search results. | +| `` | Optional | Splits a field into buckets by intervals (at most one). Syntax: `span(field_expr, interval_expr)`. For example, `span(age, 10)` creates 10-year age buckets, while `span(timestamp, 1h)` creates hourly buckets. | + +### Time units + +The following time units are available for span expressions: + +* Milliseconds (`ms`) +* Seconds (`s`) +* Minutes (`m`, case sensitive) +* Hours (`h`) +* Days (`d`) +* Weeks (`w`) +* Months (`M`, case sensitive) +* Quarters (`q`) +* Years (`y`) + +## Aggregation functions + +The `eventstats` command supports the following aggregation functions: + +* `COUNT` -- Count of values +* `SUM` -- Sum of numeric values +* `AVG` -- Average of numeric values +* `MAX` -- Maximum value +* `MIN` -- Minimum value +* `VAR_SAMP` -- Sample variance +* `VAR_POP` -- Population variance +* `STDDEV_SAMP` -- Sample standard deviation +* `STDDEV_POP` -- Population standard deviation +* `DISTINCT_COUNT`/`DC` -- Distinct count of values +* `EARLIEST` -- Earliest value by timestamp +* `LATEST` -- Latest value by timestamp + +For detailed documentation of each function, see [Functions](../functions/aggregations.md). 
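+
+The following sketch is illustrative rather than one of the tested examples. It assumes a hypothetical index named `events` with `host` and `status` fields and uses the `EARLIEST` and `LATEST` functions from the preceding list to enrich each event with the first and last status values observed for its host:
+
+```ppl ignore
+source=events
+| eventstats earliest(status) as first_status, latest(status) as last_status by host
+```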
+ +## Example 1: Calculate the average, sum, and count of a field by group + +The following query calculates the average age, sum of age, and count of events for all accounts grouped by gender: ```ppl source=accounts @@ -81,7 +87,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -95,9 +101,10 @@ fetched rows / total rows = 4/4 +----------------+--------+-----+--------------------+----------+---------+ ``` + ## Example 2: Calculate the count by a gender and span -This example shows counting events by age intervals of 5 years, grouped by gender. +The following query counts events by age intervals of 5 years, grouped by gender: ```ppl source=accounts @@ -106,7 +113,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -120,8 +127,11 @@ fetched rows / total rows = 4/4 +----------------+--------+-----+-----+ ``` -## Example 3: Null buckets handling - + +## Example 3: Null bucket handling + +The following query uses the `eventstats` command with `bucket_nullable=false` to exclude null values from the group-by aggregation: + ```ppl source=accounts | eventstats bucket_nullable=false count() as cnt by employer @@ -129,7 +139,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -142,7 +152,9 @@ fetched rows / total rows = 4/4 | 18 | Dale | null | null | +----------------+-----------+----------+------+ ``` - + +The following query uses the `eventstats` command with `bucket_nullable=true` to include null values in the group-by aggregation: + ```ppl source=accounts | eventstats bucket_nullable=true count() as cnt by employer @@ -150,7 +162,7 @@ source=accounts | sort account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 diff --git a/docs/user/ppl/cmd/expand.md b/docs/user/ppl/cmd/expand.md index 8fddbea7ad..aca295be1e 100644 --- a/docs/user/ppl/cmd/expand.md +++ b/docs/user/ppl/cmd/expand.md @@ -1,38 +1,50 @@ -# expand -## Description +# expand -The `expand` command transforms a single document with a nested array field into multiple documents—each containing one element from the array. All other fields in the original document are duplicated across the resulting documents. -Key aspects of `expand`: -* It generates one row per element in the specified array field. -* The specified array field is converted into individual rows. -* If an alias is provided, the expanded values appear under the alias instead of the original field name. -* If the specified field is an empty array, the row is retained with the expanded field set to null. - -## Syntax +The `expand` command transforms a single document with a nested array field into multiple documents, each containing one element of the array. All other fields in the original document are duplicated across the resulting documents. + +The `expand` command operates in the following way: + +* It generates one row per element in the specified array field. +* The specified array field is converted into individual rows. +* If an alias is provided, the expanded values appear under the alias instead of the original field name. +* If the specified field is an empty array, the row is retained with the expanded field set to `null`. 
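+
+The following is a minimal sketch of the empty-array rule above; the `tickets` index and its nested `comments` field (an array of objects) are hypothetical. The query emits one row per comment, and a document whose `comments` array is empty is kept as a single row with `comment` set to null:
+
+```ppl ignore
+source=tickets
+| expand comments as comment
+```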
+ +## Syntax + +The `expand` command has the following syntax: + +```syntax +expand [as alias] +``` -expand \ [as alias] -* field: mandatory. The field to be expanded (exploded). Currently only nested arrays are supported. -* alias: optional. The name to use instead of the original field name. +## Parameters + +The `expand` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The field to be expanded. Only nested arrays are supported. | +| `` | Optional | The name to use in place of the original field name. | -## Example 1: Expand address field with an alias -Given a dataset `migration` with the following data: +## Example: Expand an address field using an alias + +Given a `migration` dataset with the following data: -```text +```json {"name":"abbas","age":24,"address":[{"city":"New york city","state":"NY","moveInDate":{"dateAndTime":"19840412T090742.000Z"}}]} {"name":"chen","age":32,"address":[{"city":"Miami","state":"Florida","moveInDate":{"dateAndTime":"19010811T040333.000Z"}},{"city":"los angeles","state":"CA","moveInDate":{"dateAndTime":"20230503T080742.000Z"}}]} - ``` -The following query expand the address field and rename it to addr: +The following query expands the `address` field and renames it to `addr`: ```ppl source=migration | expand address as addr ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -45,6 +57,9 @@ fetched rows / total rows = 3/3 +-------+-----+-------------------------------------------------------------------------------------------+ ``` -## Limitations -* The `expand` command currently only supports nested arrays. Primitive fields storing arrays are not supported. E.g. a string field storing an array of strings cannot be expanded with the current implementation. \ No newline at end of file +## Limitations + +The `expand` command has the following limitations: + +* The `expand` command only supports nested arrays. Primitive fields storing arrays are not supported. For example, a string field storing an array of strings cannot be expanded. \ No newline at end of file diff --git a/docs/user/ppl/cmd/explain.md b/docs/user/ppl/cmd/explain.md index 87a2e89bd7..2e21e0eee2 100644 --- a/docs/user/ppl/cmd/explain.md +++ b/docs/user/ppl/cmd/explain.md @@ -1,21 +1,28 @@ -# explain -## Description +# explain -The `explain` command explains the plan of query which is often used for query translation and troubleshooting. The `explain` command can only be used as the first command in the PPL query. -## Syntax +The `explain` command displays the execution plan of a query, which is often used for query translation and troubleshooting. The `explain` command can only be used as the first command in the PPL query. +## Syntax + +The `explain` command has the following syntax: + +```syntax explain queryStatement -* mode: optional. There are 4 explain modes: "simple", "standard", "cost", "extended". **Default:** standard. - * standard: The default mode. Display logical and physical plan with pushdown information (DSL). - * simple: Display the logical plan tree without attributes. - * cost: Display the standard information plus plan cost attributes. - * extended: Display the standard information plus generated code, if the whole plan is able to pushdown, it equals to standard mode. -* queryStatement: mandatory. A PPL query to explain. 
-
-## Example 1: Explain a PPL query in v2 engine
+```
+
+## Parameters
+
+The `explain` command supports the following parameters.
 
-When Calcite is disabled (plugins.calcite.enabled=false), explaining a PPL query will get its physical plan of v2 engine and pushdown information.
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<queryStatement>` | Required | A PPL query to explain. |
+| `<mode>` | Optional | The explain mode. Valid values are:<br>- `standard`: Displays the logical and physical plan along with pushdown information (query domain-specific language [DSL]). Available in both the v2 and v3 engines.<br>- `simple`: Displays the logical plan tree without attributes. Requires the v3 engine (`plugins.calcite.enabled` = `true`).<br>- `cost`: Displays the standard information plus plan cost attributes. Requires the v3 engine (`plugins.calcite.enabled` = `true`).<br>- `extended`: Displays the standard information plus the generated code; if the whole plan can be pushed down, the output is equal to that of the `standard` mode. Requires the v3 engine (`plugins.calcite.enabled` = `true`).<br><br>
Default is `standard`. | + +## Example 1: Explain a PPL query in the v2 engine + +When Apache Calcite is disabled (`plugins.calcite.enabled` is set to `false`), `explain` obtains its physical plan and pushdown information from the v2 engine: ```ppl explain source=state_country @@ -23,7 +30,7 @@ explain source=state_country | stats count() by country ``` -Explain: +The query returns the following results: ```json { @@ -45,9 +52,10 @@ Explain: } ``` -## Example 2: Explain a PPL query in v3 engine -When Calcite is enabled (plugins.calcite.enabled=true), explaining a PPL query will get its logical and physical plan of v3 engine and pushdown information. +## Example 2: Explain a PPL query in the v3 engine + +When Apache Calcite is enabled (`plugins.calcite.enabled` is set to `true`), `explain` obtains its logical and physical plan and pushdown information from the v3 engine: ```ppl explain source=state_country @@ -55,8 +63,8 @@ explain source=state_country | stats count() by country ``` -Explain - +The query returns the following results: + ```json { "calcite": { @@ -72,9 +80,10 @@ Explain } ``` -## Example 3: Explain a PPL query with simple mode -When Calcite is enabled (plugins.calcite.enabled=true), you can explain a PPL query with the "simple" mode. +## Example 3: Explain a PPL query in the simple mode + +The following query uses the `explain` command in the `simple` mode to show a simplified logical plan tree: ```ppl explain simple source=state_country @@ -82,9 +91,9 @@ explain simple source=state_country | stats count() by country ``` -Explain +The query returns the following results: -``` +```json { "calcite": { "logical": """LogicalProject @@ -96,9 +105,10 @@ Explain } ``` -## Example 4: Explain a PPL query with cost mode -When Calcite is enabled (plugins.calcite.enabled=true), you can explain a PPL query with the "cost" mode. +## Example 4: Explain a PPL query in the cost mode + +The following query uses the `explain` command in the `cost` mode to show plan cost attributes: ```ppl explain cost source=state_country @@ -106,8 +116,8 @@ explain cost source=state_country | stats count() by country ``` -Explain - +The query returns the following results: + ```json { "calcite": { @@ -123,16 +133,19 @@ Explain } ``` -## Example 5: Explain a PPL query with extended mode - + +## Example 5: Explain a PPL query in the extended mode + +The following query uses the `explain` command in the `extended` mode to show the generated code: + ```ppl explain extended source=state_country | where country = 'USA' OR country = 'England' | stats count() by country ``` -Explain - +The query returns the following results: + ```json { "calcite": { diff --git a/docs/user/ppl/cmd/fields.md b/docs/user/ppl/cmd/fields.md index 507a8e6903..c7346e82bd 100644 --- a/docs/user/ppl/cmd/fields.md +++ b/docs/user/ppl/cmd/fields.md @@ -1,24 +1,36 @@ -# fields -## Description +# fields -The `fields` command keeps or removes fields from the search result. -## Syntax +The `fields` command specifies the fields that should be included in or excluded from the search results. -fields [+\|-] \ -* +\|-: optional. If the plus (+) is used, only the fields specified in the field list will be kept. If the minus (-) is used, all the fields specified in the field list will be removed. **Default:** +. -* field-list: mandatory. Comma-delimited or space-delimited list of fields to keep or remove. Supports wildcard patterns. 
+## Syntax + +The `fields` command has the following syntax: + +```syntax +fields [+|-] +``` + +## Parameters + +The `fields` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | A comma-delimited or space-delimited list of fields to keep or remove. Supports wildcard patterns. | +| `[+|-]` | Optional | If the plus sign (`+`) is used, only the fields specified in the `field-list` are included. If the minus sign (`-`) is used, all fields specified in the `field-list` are excluded. Default is `+`. | -## Example 1: Select specified fields from result -This example shows selecting account_number, firstname and lastname fields from search results. +## Example 1: Select specified fields from the search result + +The following query shows how to retrieve the `account_number`, `firstname`, and `lastname` fields from the search results: ```ppl source=accounts | fields account_number, firstname, lastname ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -32,9 +44,10 @@ fetched rows / total rows = 4/4 +----------------+-----------+----------+ ``` -## Example 2: Remove specified fields from result -This example shows removing the account_number field from search results. +## Example 2: Remove specified fields from the search results + +The following query shows how to remove the `account_number` field from the search results: ```ppl source=accounts @@ -42,7 +55,7 @@ source=accounts | fields - account_number ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -56,17 +69,17 @@ fetched rows / total rows = 4/4 +-----------+----------+ ``` + ## Example 3: Space-delimited field selection -Fields can be specified using spaces instead of commas, providing a more concise syntax. -**Syntax**: `fields field1 field2 field3` +Fields can be specified using spaces instead of commas, providing a more concise syntax: ```ppl source=accounts | fields firstname lastname age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -80,16 +93,17 @@ fetched rows / total rows = 4/4 +-----------+----------+-----+ ``` + ## Example 4: Prefix wildcard pattern -Select fields starting with a pattern using prefix wildcards. +The following query selects fields starting with a pattern using prefix wildcards: ```ppl source=accounts | fields account* ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -103,16 +117,17 @@ fetched rows / total rows = 4/4 +----------------+ ``` + ## Example 5: Suffix wildcard pattern -Select fields ending with a pattern using suffix wildcards. +The following query selects fields ending with a pattern using suffix wildcards: ```ppl source=accounts | fields *name ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -126,9 +141,10 @@ fetched rows / total rows = 4/4 +-----------+----------+ ``` -## Example 6: Contains wildcard pattern -Select fields containing a pattern using contains wildcards. 
+## Example 6: Wildcard pattern matching + +The following query selects fields containing a pattern using `contains` wildcards: ```ppl source=accounts @@ -136,7 +152,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -147,16 +163,17 @@ fetched rows / total rows = 1/1 +----------------+-----------+-----------------+---------+-------+-----+----------------------+----------+ ``` + ## Example 7: Mixed delimiter syntax -Combine spaces and commas for flexible field specification. +The following query combines spaces and commas for flexible field specification: ```ppl source=accounts | fields firstname, account* *name ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -170,16 +187,17 @@ fetched rows / total rows = 4/4 +-----------+----------------+----------+ ``` + ## Example 8: Field deduplication -Automatically prevents duplicate columns when wildcards expand to already specified fields. +The following query automatically prevents duplicate columns when wildcards expand to already specified fields: ```ppl source=accounts | fields firstname, *name ``` -Expected output: +The query returns the following results. Even though `firstname` is explicitly specified and also matches `*name`, it appears only once because of automatic deduplication: ```text fetched rows / total rows = 4/4 @@ -192,11 +210,10 @@ fetched rows / total rows = 4/4 | Dale | Adams | +-----------+----------+ ``` - -Note: Even though `firstname` is explicitly specified and would also match `*name`, it appears only once due to automatic deduplication. + ## Example 9: Full wildcard selection -Select all available fields using `*` or `` `*` ``. This selects all fields defined in the index schema, including fields that may contain null values. +The following query selects all available fields using `*` or `` `*` ``. This expression selects all fields defined in the index schema, including fields that may contain null values. The `*` wildcard selects fields based on the index schema, not on the data content, so fields with null values are included in the result set. Use backticks (`` `*` ``) if the plain `*` does not return all expected fields: ```ppl source=accounts @@ -204,7 +221,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -214,18 +231,17 @@ fetched rows / total rows = 1/1 | 1 | Amber | 880 Holmes Lane | 39225 | M | Brogan | Pyrami | IL | 32 | amberduke@pyrami.com | Duke | +----------------+-----------+-----------------+---------+--------+--------+----------+-------+-----+----------------------+----------+ ``` - -Note: The `*` wildcard selects fields based on the index schema, not on data content. Fields with null values are included in the result set. Use backticks `` `*` ` if the plain `*`` doesn't return all expected fields. + ## Example 10: Wildcard exclusion -Remove fields using wildcard patterns with the minus (-) operator. 
+The following query removes fields using wildcard patterns containing the minus (`-`) operator:
 
 ```ppl
 source=accounts
 | fields - *name
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 4/4
@@ -239,6 +255,7 @@ fetched rows / total rows = 4/4
 +----------------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+
 ```
 
-## See Also
-- [table](table.md) - Alias command with identical functionality
\ No newline at end of file
+## Related documentation
+
+- [`table`](table.md) -- An alias command with identical functionality
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/fillnull.md b/docs/user/ppl/cmd/fillnull.md
index 40ed91e865..82b8bc72eb 100644
--- a/docs/user/ppl/cmd/fillnull.md
+++ b/docs/user/ppl/cmd/fillnull.md
@@ -1,24 +1,39 @@
-# fillnull
-## Description
+# fillnull
 
-The `fillnull` command fills null values with the provided value in one or more fields in the search result.
-## Syntax
+The `fillnull` command replaces `null` values in one or more fields of the search results with a specified value.
 
-fillnull with \<replacement\> [in \<field-list\>]
-fillnull using \<field\> = \<replacement\> [, \<field\> = \<replacement\>]
-fillnull value=\<replacement\> [\<field-list\>]
-* replacement: mandatory. The value used to replace null values.
-* field-list: optional. List of fields to apply the replacement to. It can be comma-delimited (with `with` or `using` syntax) or space-delimited (with `value=` syntax). **Default:** all fields.
-* field: mandatory when using `using` syntax. Individual field name to assign a specific replacement value.
-* **Syntax variations**
-  * `with <replacement> in <field-list>` - Apply same value to specified fields
-  * `using <field> = <replacement>, ...` - Apply different values to different fields
-  * `value=<replacement> [<field-list>]` - Alternative syntax with optional space-delimited field list
-
-## Example 1: Replace null values with a specified value on one field
+> **Note**: The `fillnull` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/). It is only executed on the coordinating node.
+
+## Syntax
+
+The `fillnull` command has the following syntax:
+
+```syntax
+fillnull with <replacement> [in <field-list>]
+fillnull using <field> = <replacement> [, <field> = <replacement>]
+fillnull value=<replacement> [<field-list>]
+```
+
+The following syntax variations are available:
+
+* `with <replacement> in <field-list>` -- Apply the same value to specified fields.
+* `using <field> = <replacement>, ...` -- Apply different values to different fields.
+* `value=<replacement> [<field-list>]` -- Alternative syntax with an optional space-delimited field list.
+
+## Parameters
+
+The `fillnull` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<replacement>` | Required | The value that replaces null values. |
+| `<field>` | Required (with `using` syntax) | The name of the field to which a specific replacement value is applied. |
+| `<field-list>` | Optional | A list of fields in which null values are replaced. You can specify the list as comma-delimited (using `with` or `using` syntax) or space-delimited (using `value=` syntax). By default, all fields are processed. |
+
+## Example 1: Replace null values in a single field with a specified value
 
-This example shows replacing null values in the email field with '\<not found\>'. 
+The following query replaces null values in the `email` field with `\`: ```ppl source=accounts @@ -26,7 +41,7 @@ source=accounts | fillnull with '' in email ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -40,9 +55,10 @@ fetched rows / total rows = 4/4 +-----------------------+----------+ ``` -## Example 2: Replace null values with a specified value on multiple fields -This example shows replacing null values in both email and employer fields with the same replacement value '\'. +## Example 2: Replace null values in multiple fields with a specified value + +The following query replaces null values in both the `email` and `employer` fields with `\`: ```ppl source=accounts @@ -50,7 +66,7 @@ source=accounts | fillnull with '' in email, employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -64,9 +80,10 @@ fetched rows / total rows = 4/4 +-----------------------+-------------+ ``` -## Example 3: Replace null values with a specified value on all fields -This example shows replacing null values in all fields when no field list is specified. +## Example 3: Replace null values in all fields with a specified value + +The following query replaces null values in all fields when no `field-list` is specified: ```ppl source=accounts @@ -74,7 +91,7 @@ source=accounts | fillnull with '' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -88,9 +105,10 @@ fetched rows / total rows = 4/4 +-----------------------+-------------+ ``` -## Example 4: Replace null values with multiple specified values on multiple fields -This example shows using different replacement values for different fields using the 'using' syntax. +## Example 4: Replace null values in multiple fields with different specified values + +The following query shows how to use the `fillnull` command with different replacement values for multiple fields using the `using` syntax: ```ppl source=accounts @@ -98,7 +116,7 @@ source=accounts | fillnull using email = '', employer = '' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -112,9 +130,10 @@ fetched rows / total rows = 4/4 +-----------------------+---------------+ ``` -## Example 5: Replace null with specified value on specific fields (value= syntax) -This example shows using the alternative 'value=' syntax to replace null values in specific fields. +## Example 5: Replace null values in specific fields using the value= syntax + +The following query shows how to use the `fillnull` command with the `value=` syntax to replace null values in specific fields: ```ppl source=accounts @@ -122,7 +141,7 @@ source=accounts | fillnull value="" email employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -136,9 +155,10 @@ fetched rows / total rows = 4/4 +-----------------------+-------------+ ``` -## Example 6: Replace null with specified value on all fields (value= syntax) -When no field list is specified, the replacement applies to all fields in the result. 
+## Example 6: Replace null values in all fields using the value= syntax + +When no `field-list` is specified, the replacement applies to all fields in the result: ```ppl source=accounts @@ -146,7 +166,7 @@ source=accounts | fillnull value='' ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -160,17 +180,17 @@ fetched rows / total rows = 4/4 +-----------------------+-------------+ ``` -## Limitations -* The `fillnull` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node. -* When applying the same value to all fields without specifying field names, all fields must be the same type. For mixed types, use separate fillnull commands or explicitly specify fields. -* The replacement value type must match ALL field types in the field list. When applying the same value to multiple fields, all fields must be the same type (all strings or all numeric). - - **Example:** - -```sql ignore - # This FAILS - same value for mixed-type fields - source=accounts | fillnull value=0 firstname, age - # ERROR: fillnull failed: replacement value type INTEGER is not compatible with field 'firstname' (type: VARCHAR). The replacement value type must match the field type. -``` +## Limitations + +The `fillnull` command has the following limitations: + +* When applying the same value to all fields without specifying field names, all fields must be of the same type. For mixed types, use separate `fillnull` commands or explicitly specify fields. +* The replacement value type must match all field types in the field list. When applying the same value to multiple fields, all fields must be of the same type (all strings or all numeric). The following query shows the error that occurs when this rule is violated: + + ```sql + # This FAILS - same value for mixed-type fields + source=accounts | fillnull value=0 firstname, age + # ERROR: fillnull failed: replacement value type INTEGER is not compatible with field 'firstname' (type: VARCHAR). The replacement value type must match the field type. + ``` \ No newline at end of file diff --git a/docs/user/ppl/cmd/flatten.md b/docs/user/ppl/cmd/flatten.md index ba4f9077dc..b5fe6172fe 100644 --- a/docs/user/ppl/cmd/flatten.md +++ b/docs/user/ppl/cmd/flatten.md @@ -1,25 +1,37 @@ -# flatten -## Description +# flatten -The `flatten` command flattens a struct or an object field into separate fields in a document. -The flattened fields will be ordered **lexicographically** by their original key names in the struct. For example, if the struct has keys `b`, `c` and `Z`, the flattened fields will be ordered as `Z`, `b`, `c`. -Note that `flatten` should not be applied to arrays. Use the `expand` command to expand an array field into multiple rows instead. However, since an array can be stored in a non-array field in OpenSearch, when flattening a field storing a nested array, only the first element of the array will be flattened. -## Syntax +The `flatten` command converts a struct or object field into individual fields within a document. -flatten \ [as (\)] -* field: mandatory. The field to be flattened. Only object and nested fields are supported. -* alias-list: optional. The names to use instead of the original key names. Names are separated by commas. It is advised to put the alias-list in parentheses if there is more than one alias. The length must match the number of keys in the struct field. 
The provided alias names **must** follow the lexicographical order of the corresponding original keys in the struct. +The resulting flattened fields are ordered lexicographically by their original key names. For example, if a struct contains the keys `b`, `c`, and `Z`, the flattened fields are ordered as `Z`, `b`, `c`. + +> **Important**: `flatten` should not be applied to arrays. To expand an array field into multiple rows, use the `expand` command. Note that arrays can be stored in non-array fields in OpenSearch; when flattening a field that contains a nested array, only the first element of the array is flattened. + +## Syntax + +The `flatten` command has the following syntax: + +```syntax +flatten [as ()] +``` + +## Parameters + +The `flatten` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The field to be flattened. Only object and nested fields are supported. | +| `` | Optional | A list of names to use instead of the original key names, separated by commas. If specifying more than one alias, enclose the list in parentheses. The number of aliases must match the number of keys in the struct, and the aliases must follow the lexicographical order of the corresponding original keys. | -## Example: flatten an object field with aliases -This example shows flattening a message object field and using aliases to rename the flattened fields. -Given the following index `my-index` +## Example: Flatten an object field using aliases + +Given the following index `my-index`: -```text +```json {"message":{"info":"a","author":"e","dayOfWeek":1},"myNum":1} {"message":{"info":"b","author":"f","dayOfWeek":2},"myNum":2} - ``` with the following mapping: @@ -56,19 +68,16 @@ with the following mapping: } } } - - ``` - -The following query flattens the `message` field and renames the keys to -`creator, dow, info`: + +The following query flattens a `message` object field and uses aliases to rename the flattened fields to `creator, dow, info`: ```ppl source=my-index | flatten message as (creator, dow, info) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -80,14 +89,9 @@ fetched rows / total rows = 2/2 +-----------------------------------------+--------+---------+-----+------+ ``` -## Limitations -* `flatten` command may not work as expected when its flattened fields are - - invisible. - For example in query - `source=my-index | fields message | flatten message`, the - `flatten message` command doesn't work since some flattened fields such as - `message.info` and `message.author` after command `fields message` are - invisible. - As an alternative, you can change to `source=my-index | flatten message`. \ No newline at end of file +## Limitations + +The `flatten` command has the following limitations: + +* The `flatten` command may not function as expected if the fields to be flattened are not visible. For example, in the query `source=my-index | fields message | flatten message`, the `flatten message` command fails to execute as expected because some flattened fields, such as `message.info` and `message.author`, are hidden after the `fields message` command. As an alternative, use `source=my-index | flatten message`. 
\ No newline at end of file diff --git a/docs/user/ppl/cmd/grok.md b/docs/user/ppl/cmd/grok.md index c2636b5358..fa585e8bd1 100644 --- a/docs/user/ppl/cmd/grok.md +++ b/docs/user/ppl/cmd/grok.md @@ -1,17 +1,29 @@ -# grok -## Description +# grok -The `grok` command parses a text field with a grok pattern and appends the results to the search result. -## Syntax +The `grok` command parses a text field using a Grok pattern and appends the extracted results to the search results. -grok \ \ -* field: mandatory. The field must be a text field. -* pattern: mandatory. The grok pattern used to extract new fields from the given text field. If a new field name already exists, it will replace the original field. +## Syntax + +The `grok` command has the following syntax: + +```syntax +grok +``` + +## Parameters + +The `grok` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The text field to parse. | +| `` | Required | The Grok pattern used to extract new fields from the specified text field. If a new field name already exists, it overwrites the original field. | -## Example 1: Create the new field -This example shows how to create new field `host` for each document. `host` will be the host name after `@` in `email` field. Parsing a null field will return an empty string. +## Example 1: Create a new field + +The following query shows how to use the `grok` command to create a new field, `host`, for each document. The `host` field captures the hostname following `@` in the `email` field. Parsing a null field returns an empty string: ```ppl source=accounts @@ -19,7 +31,7 @@ source=accounts | fields email, host ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -33,9 +45,10 @@ fetched rows / total rows = 4/4 +-----------------------+------------+ ``` -## Example 2: Override the existing field -This example shows how to override the existing `address` field with street number removed. +## Example 2: Override an existing field + +The following query shows how to use the `grok` command to override the existing `address` field, removing the street number: ```ppl source=accounts @@ -43,7 +56,7 @@ source=accounts | fields address ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -57,9 +70,10 @@ fetched rows / total rows = 4/4 +------------------+ ``` + ## Example 3: Using grok to parse logs -This example shows how to use grok to parse raw logs. +The following query parses raw logs: ```ppl source=apache @@ -67,7 +81,7 @@ source=apache | fields COMMONAPACHELOG, timestamp, response, bytes ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -81,6 +95,9 @@ fetched rows / total rows = 4/4 +-----------------------------------------------------------------------------------------------------------------------------+----------------------------+----------+-------+ ``` -## Limitations -The grok command has the same limitations as the parse command, see [parse limitations](./parse.md#Limitations) for details. \ No newline at end of file +## Limitations + +The `grok` command has the following limitations: + +* The `grok` command has the same [limitations](./parse.md#limitations) as the `parse` command. 
\ No newline at end of file diff --git a/docs/user/ppl/cmd/head.md b/docs/user/ppl/cmd/head.md index 5565c90d78..3392e7854b 100644 --- a/docs/user/ppl/cmd/head.md +++ b/docs/user/ppl/cmd/head.md @@ -1,17 +1,31 @@ -# head -## Description +# head -The `head` command returns the first N number of specified results after an optional offset in search order. -## Syntax +The `head` command returns the first N lines from a search result. -head [\] [from \] -* size: optional integer. Number of results to return. **Default:** 10 -* offset: optional integer after `from`. Number of results to skip. **Default:** 0 +> **Note**: The `head` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/index/). It is only executed on the coordinating node. + +## Syntax + +The `head` command has the following syntax: + +```syntax +head [] [from ] +``` + +## Parameters + +The `head` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Optional | The number of results to return. Must be an integer. Default is `10`. | +| `` | Optional | The number of results to skip (used with the `from` keyword). Must be an integer. Default is `0`. | -## Example 1: Get first 10 results -This example shows getting a maximum of 10 results from accounts index. +## Example 1: Retrieve the first set of results using the default size + +The following query returns the default number of search results (10): ```ppl source=accounts @@ -19,7 +33,7 @@ source=accounts | head ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -33,9 +47,10 @@ fetched rows / total rows = 4/4 +-----------+-----+ ``` -## Example 2: Get first N results -This example shows getting the first 3 results from accounts index. +## Example 2: Retrieve a specified number of results + +The following query returns the first 3 search results: ```ppl source=accounts @@ -43,7 +58,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -56,9 +71,10 @@ fetched rows / total rows = 3/3 +-----------+-----+ ``` -## Example 3: Get first N results after offset M -This example shows getting the first 3 results after offset 1 from accounts index. +## Example 3: Retrieve the first N results after an offset M + +The following query demonstrates how to retrieve the first 3 results starting with the second result from the `accounts` index: ```ppl source=accounts @@ -66,7 +82,7 @@ source=accounts | head 3 from 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -79,6 +95,4 @@ fetched rows / total rows = 3/3 +-----------+-----+ ``` -## Limitations -The `head` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node. \ No newline at end of file diff --git a/docs/user/ppl/cmd/index.md b/docs/user/ppl/cmd/index.md new file mode 100644 index 0000000000..6f6fa30db8 --- /dev/null +++ b/docs/user/ppl/cmd/index.md @@ -0,0 +1,4 @@ + +# Commands + +PPL supports most common [SQL functions](https://docs.opensearch.org/latest/search-plugins/sql/functions/), including [relevance search](https://docs.opensearch.org/latest/search-plugins/sql/full-text/), but also introduces several more functions, called _commands_, which are available in PPL only. 
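+
+For example, the following query (a generic sketch using the `accounts` index referenced throughout these pages) pipes the output of one command into the next:
+
+```ppl ignore
+source=accounts
+| where age > 30
+| stats count() by gender
+```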
diff --git a/docs/user/ppl/cmd/join.md b/docs/user/ppl/cmd/join.md index 39d3f5a24d..983c304560 100644 --- a/docs/user/ppl/cmd/join.md +++ b/docs/user/ppl/cmd/join.md @@ -1,66 +1,23 @@ -# join -## Description +# join -The `join` command combines two datasets together. The left side could be an index or results from a piped commands, the right side could be either an index or a subsearch. -## Syntax +The `join` command combines two datasets. The left side can be an index or the results of piped commands, while the right side can be either an index or a subsearch. +## Syntax -### Basic syntax: +The `join` command supports basic and extended syntax options. -[joinType] join [leftAlias] [rightAlias] (on \| where) \ \ -* joinType: optional. The type of join to perform. Options: `left`, `semi`, `anti`, and performance-sensitive types `right`, `full`, `cross`. **Default:** `inner`. -* leftAlias: optional. The subsearch alias to use with the left join side, to avoid ambiguous naming. Pattern: `left = ` -* rightAlias: optional. The subsearch alias to use with the right join side, to avoid ambiguous naming. Pattern: `right = ` -* joinCriteria: mandatory. Any comparison expression. Must follow `on` or `where` keyword. -* right-dataset: mandatory. Right dataset could be either an `index` or a `subsearch` with/without alias. +### Basic syntax -### Extended syntax: - -join [type=] [overwrite=] [max=n] (\ \| [leftAlias] [rightAlias] (on \| where) \) \ -* type: optional. Join type using extended syntax. Options: `left`, `outer` (alias of `left`), `semi`, `anti`, and performance-sensitive types `right`, `full`, `cross`. **Default:** `inner`. -* overwrite: optional boolean. Only works with `join-field-list`. Specifies whether duplicate-named fields from right-dataset should replace corresponding fields in the main search results. **Default:** `true`. -* max: optional integer. Controls how many subsearch results could be joined against each row in main search. **Default:** 0 (unlimited). -* join-field-list: optional. The fields used to build the join criteria. The join field list must exist on both sides. If not specified, all fields common to both sides will be used as join keys. -* leftAlias: optional. Same as basic syntax when used with extended syntax. -* rightAlias: optional. Same as basic syntax when used with extended syntax. -* joinCriteria: mandatory. Same as basic syntax when used with extended syntax. -* right-dataset: mandatory. Same as basic syntax. - -## Configuration +```syntax +[joinType] join [left = ] [right = ] (on | where) +``` -### plugins.ppl.join.subsearch_maxout +> **Note**: When using aliases, `left` must appear before `right`. -The size configures the maximum of rows from subsearch to join against. The default value is: `50000`. A value of `0` indicates that the restriction is unlimited. 
-Change the join.subsearch_maxout to 5000
-
-```bash ignore
-curl -sS -H 'Content-Type: application/json' \
--X PUT localhost:9200/_plugins/_query/settings \
--d '{"persistent" : {"plugins.ppl.join.subsearch_maxout" : "5000"}}'
-```
-
-```json
-{
-  "acknowledged": true,
-  "persistent": {
-    "plugins": {
-      "ppl": {
-        "join": {
-          "subsearch_maxout": "5000"
-        }
-      }
-    }
-  },
-  "transient": {}
-}
-```
-
-## Usage
+The following are examples of the basic `join` command syntax:
-Basic join syntax:
-
-```
+```syntax
source = table1 | inner join left = l right = r on l.a = r.a table2 | fields l.a, r.a, b, c
source = table1 | inner join left = l right = r where l.a = r.a table2 | fields l.a, r.a, b, c
source = table1 | left join left = l right = r on l.a = r.a table2 | fields l.a, r.a, b, c
@@ -76,10 +33,28 @@ source = table1 as t1 | join left = l right = r on l.a = r.a table2 as t2 | fields t1.a, t2.a
source = table1 as t1 | join left = l right = r on l.a = r.a table2 as t2 | fields t1.a, t2.a
source = table1 | join left = l right = r on l.a = r.a [ source = table2 ] as s | fields l.a, s.a
```
-
-Extended syntax with options:
-
+
+#### Basic syntax parameters
+
+The basic `join` syntax supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<joinCriteria>` | Required | A comparison expression specifying how to join the datasets. Must be placed after the `on` or `where` keyword in the query. |
+| `<right-dataset>` | Required | The right dataset, which can be an index or a subsearch, with or without an alias. |
+| `joinType` | Optional | The type of join to perform. Valid values are `left`, `semi`, `anti`, and performance-sensitive types (`right`, `full`, and `cross`). Default is `inner`. |
+| `left` | Optional | An alias for the left dataset (typically a subsearch) used to avoid ambiguous field names. Specify as `left = <leftAlias>`. |
+| `right` | Optional | An alias for the right dataset (typically a subsearch) used to avoid ambiguous field names. Specify as `right = <rightAlias>`. |
+
+### Extended syntax
+
+```syntax
+join [type=<joinType>] [overwrite=<bool>] [max=n] (<join-field-list> | [left = <leftAlias>] [right = <rightAlias>] (on | where) <joinCriteria>) <right-dataset>
+```
+
+The following are examples of the extended `join` command syntax:
+
+```syntax
source = table1 | join type=outer left = l right = r on l.a = r.a table2 | fields l.a, r.a, b, c
source = table1 | join type=left left = l right = r where l.a = r.a table2 | fields l.a, r.a, b, c
source = table1 | join type=inner max=1 left = l right = r where l.a = r.a table2 | fields l.a, r.a, b, c
@@ -87,12 +62,43 @@ source = table1 | join a table2 | fields a, b, c
source = table1 | join a, b table2 | fields a, b, c
source = table1 | join type=outer a b table2 | fields a, b, c
source = table1 | join type=inner max=1 a, b table2 | fields a, b, c
-source = table1 | join type=left overwrite=false max=0 a, b [source=table2 | rename d as b] | fields a, b, c
+source = table1 | join type=left overwrite=false max=0 a, b [source=table2 | rename d as b] | fields a, b, c
```
+
+#### Extended syntax parameters
+
+The extended `join` syntax supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<joinCriteria>` | Required | A comparison expression specifying how to join the datasets. Must be placed after the `on` or `where` keyword in the query. |
+| `<right-dataset>` | Required | The right dataset, which can be an index or a subsearch, with or without an alias. |
+| `type` | Optional | The join type when using extended syntax. Valid values are `left`, `outer` (same as `left`), `semi`, `anti`, and performance-sensitive types (`right`, `full`, and `cross`). Default is `inner`. |
+| `<join-field-list>` | Optional | A list of fields used to build the join criteria. These fields must exist in both datasets. If not specified, all fields common to both datasets are used as join keys. |
+| `overwrite` | Optional | Applicable only when `join-field-list` is specified. Specifies whether fields from the right dataset with duplicate names should replace corresponding fields in the main search results. Default is `true`. |
+| `max` | Optional | The maximum number of subsearch results to join with each row in the main search. Default is `0` (unlimited). |
+| `left` | Optional | An alias for the left dataset (typically a subsearch) used to avoid ambiguous field names. Specify as `left = <leftAlias>`. |
+| `right` | Optional | An alias for the right dataset (typically a subsearch) used to avoid ambiguous field names. Specify as `right = <rightAlias>`. |
-## Example 1: Two indices join
-This example shows joining two indices using the basic join syntax.
+## Configuration
+
+The `join` command behavior is configured using the `plugins.ppl.join.subsearch_maxout` setting, which specifies the maximum number of rows from the subsearch to join against. Default is `50000`. A value of `0` indicates that the restriction is unlimited.
+
+To update the setting, send the following request:
+
+```bash ignore
+PUT /_plugins/_query/settings
+{
+  "persistent": {
+    "plugins.ppl.join.subsearch_maxout": "5000"
+  }
+}
+```
+
+## Example 1: Join two indexes
+
+The following query uses the basic `join` syntax to join two indexes:
```ppl
source = state_country
@@ -100,7 +106,7 @@ source = state_country
| stats avg(salary) by span(age, 10) as age_span, b.country
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 5/5
@@ -115,9 +121,10 @@ fetched rows / total rows = 5/5
+-------------+----------+-----------+
```
-## Example 2: Join with subsearch
-This example shows joining with a subsearch using the basic join syntax.
+## Example 2: Join with a subsearch
+
+The following query combines a dataset with a subsearch using the basic `join` syntax:
```ppl
source = state_country as a
@@ -130,7 +137,7 @@ source = state_country as a
| stats avg(salary) by span(age, 10) as age_span, b.country
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 3/3
@@ -143,9 +150,10 @@ fetched rows / total rows = 3/3
+-------------+----------+-----------+
```
-## Example 3: Join with field list
-This example shows joining using the extended syntax with field list.
+## Example 3: Join using a field list
+
+The following query uses the extended syntax and specifies a list of fields for the join criteria:
```ppl
source = state_country
@@ -158,7 +166,7 @@ source = state_country
| stats avg(salary) by span(age, 10) as age_span, country
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 3/3
@@ -171,9 +179,10 @@ fetched rows / total rows = 3/3
+-------------+----------+---------+
```
-## Example 4: Join with options
-This example shows joining using the extended syntax with additional options.
+## Example 4: Join with additional options
+
+The following query uses the extended syntax and optional parameters for more control over the join operation:
```ppl
source = state_country
@@ -181,7 +190,7 @@ source = state_country
| stats avg(salary) by span(age, 10) as age_span, country
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 4/4
@@ -195,20 +204,22 @@ fetched rows / total rows = 4/4
+-------------+----------+---------+
```
-## Limitations
-For basic syntax, if fields in the left outputs and right outputs have the same name. Typically, in the join criteria
-`ON t1.id = t2.id`, the names `id` in output are ambiguous. To avoid ambiguous, the ambiguous
-fields in output rename to `<alias>.id`, or else `<tableName>.id` if no alias existing.
+## Limitations
-Assume table1 and table2 only contain field `id`, following PPL queries and their outputs are:
-
-| Query | Output |
-| --- | --- |
-| source=table1 \| join left=t1 right=t2 on t1.id=t2.id table2 \| eval a = 1 | t1.id, t2.id, a |
-| source=table1 \| join on table1.id=table2.id table2 \| eval a = 1 | table1.id, table2.id, a |
-| source=table1 \| join on table1.id=t2.id table2 as t2 \| eval a = 1 | table1.id, t2.id, a |
-| source=table1 \| join right=tt on table1.id=t2.id [ source=table2 as t2 \| eval b = id ] \| eval a = 1 | table1.id, tt.id, tt.b, a |
-
-For extended syntax (join with field list), when duplicate-named fields in output results are deduplicated, the fields in output determined by the value of 'overwrite' option.
-Join types `inner`, `left`, `outer` (alias of `left`), `semi` and `anti` are supported by default. `right`, `full`, `cross` are performance-sensitive join types which are disabled by default. Set config `plugins.calcite.all_join_types.allowed = true` to enable.
\ No newline at end of file
+The `join` command has the following limitations:
+
+* **Field name ambiguity in basic syntax** – When fields from the left and right datasets share the same name, the field names in the output are ambiguous. To resolve this, conflicting fields are renamed to `<alias>.id` (or `<tableName>.id` if no alias is specified).
+
+  The following table demonstrates how field name conflicts are resolved when both `table1` and `table2` contain a field named `id`.
+
+  | Query | Output |
+  | --- | --- |
+  | `source=table1 \| join left=t1 right=t2 on t1.id=t2.id table2 \| eval a = 1` | `t1.id, t2.id, a` |
+  | `source=table1 \| join on table1.id=table2.id table2 \| eval a = 1` | `table1.id, table2.id, a` |
+  | `source=table1 \| join on table1.id=t2.id table2 as t2 \| eval a = 1` | `table1.id, t2.id, a` |
+  | `source=table1 \| join right=tt on table1.id=t2.id [ source=table2 as t2 \| eval b = id ] \| eval a = 1` | `table1.id, tt.id, tt.b, a` |
+
+* **Field deduplication in extended syntax** – When using the extended syntax with a field list, duplicate field names in the output are deduplicated according to the `overwrite` option.
+
+* **Join type availability** – The join types `inner`, `left`, `outer` (alias of `left`), `semi`, and `anti` are enabled by default. The performance-sensitive join types `right`, `full`, and `cross` are disabled by default. To enable these types, set `plugins.calcite.all_join_types.allowed` to `true`, as shown in the sketch following this list.
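+
+  For example, the following sketch enables the performance-sensitive join types. Only the setting name comes from the limitation above; reusing the query settings endpoint shown in the [Configuration](#configuration) section is an assumption, not a documented request:
+
+  ```bash ignore
+  PUT /_plugins/_query/settings
+  {
+    "persistent": {
+      "plugins.calcite.all_join_types.allowed": "true"
+    }
+  }
+  ```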
diff --git a/docs/user/ppl/cmd/kmeans.md b/docs/user/ppl/cmd/kmeans.md
index 247902804d..e2064c9c2b 100644
--- a/docs/user/ppl/cmd/kmeans.md
+++ b/docs/user/ppl/cmd/kmeans.md
@@ -1,18 +1,34 @@
-# kmeans (deprecated by ml command)
-## Description
+# kmeans (Deprecated)
-The `kmeans` command applies the kmeans algorithm in the ml-commons plugin on the search result returned by a PPL command.
-## Syntax
+> **Warning**: The `kmeans` command is deprecated in favor of the [`ml` command](ml.md).
-kmeans \<centroids\> \<iterations\> \<distance_type\>
-* centroids: optional. The number of clusters you want to group your data points into. **Default:** 2.
-* iterations: optional. Number of iterations. **Default:** 10.
-* distance_type: optional. The distance type can be COSINE, L1, or EUCLIDEAN. **Default:** EUCLIDEAN.
+The `kmeans` command applies the k-means algorithm in the ML Commons plugin on the search results returned by a PPL command.
+
+> **Note**: To use the `kmeans` command, `plugins.calcite.enabled` must be set to `false`.
+
+## Syntax
+
+The `kmeans` command has the following syntax:
+
+```syntax
+kmeans <centroids> <iterations> <distance_type>
+```
+
+## Parameters
+
+The `kmeans` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<centroids>` | Optional | The number of clusters to group data points into. Default is `2`. |
+| `<iterations>` | Optional | The number of iterations. Default is `10`. |
+| `<distance_type>` | Optional | The distance type. Valid values are `COSINE`, `L1`, and `EUCLIDEAN`. Default is `EUCLIDEAN`. |
-## Example: Clustering of Iris Dataset
-This example shows how to classify three Iris species (Iris setosa, Iris virginica and Iris versicolor) based on the combination of four features measured from each sample: the length and the width of the sepals and petals.
+## Example: Clustering of the Iris dataset
+
+The following query classifies three Iris species (Iris setosa, Iris virginica, and Iris versicolor) based on the combination of four features measured from each sample (the lengths and widths of sepals and petals):
```ppl
source=iris_data
@@ -20,7 +36,7 @@ source=iris_data
| kmeans centroids=3
```
-Expected output:
+The query returns the following results:
```text
+--------------------+-------------------+--------------------+-------------------+-----------+
@@ -32,6 +48,4 @@ Expected output:
+--------------------+-------------------+--------------------+-------------------+-----------+
```
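+
+For illustration, the following sketch sets all three parameters explicitly. The parameter names come from the table above and the `centroids=3` form from the example; the exact value syntax for `distance_type` is an assumption:
+
+```ppl ignore
+source=iris_data
+| kmeans centroids=3 iterations=10 distance_type='EUCLIDEAN'
+```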
-## Limitations
-The `kmeans` command can only work with `plugins.calcite.enabled=false`.
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/lookup.md b/docs/user/ppl/cmd/lookup.md
index 03683cdc47..745a39a28b 100644
--- a/docs/user/ppl/cmd/lookup.md
+++ b/docs/user/ppl/cmd/lookup.md
@@ -1,23 +1,19 @@
-# lookup
-## Description
+# lookup
-The `lookup` command enriches your search data by adding or replacing data from a lookup index (dimension table). You can extend fields of an index with values from a dimension table, append or replace values when lookup condition is matched. As an alternative of join command, lookup command is more suitable for enriching the source data with a static dataset.
-## Syntax
+The `lookup` command enriches search data by adding or replacing values from a lookup index (dimension table). It allows you to extend fields in your index with values from a dimension table, appending or replacing values when the lookup condition matches. Compared with the `join` command, `lookup` is better suited for enriching source data with a static dataset.
-lookup \<lookupIndex\> (\<lookupMappingField\> [as \<sourceMappingField\>])... [(replace \| append) (\<inputField\> [as \<outputField\>])...]
-* lookupIndex: mandatory. The name of lookup index (dimension table).
-* lookupMappingField: mandatory. A mapping key in `lookupIndex`, analogy to a join key from right table. You can specify multiple `lookupMappingField` with comma-delimited.
-* sourceMappingField: optional. A mapping key from source (left side), analogy to a join key from left side. If not specified, defaults to `lookupMappingField`.
-* inputField: optional. A field in `lookupIndex` where matched values are applied to result output. You can specify multiple `inputField` with comma-delimited. If not specified, all fields except `lookupMappingField` from `lookupIndex` are applied to result output.
-* outputField: optional. A field of output. You can specify zero or multiple `outputField`. If `outputField` has an existing field name in source query, its values will be replaced or appended by matched values from `inputField`. If the field specified in `outputField` is a new field, in replace strategy, an extended new field will be applied to the results, but fail in append strategy.
-* replace \| append: optional. The output strategies. If replace, matched values in `lookupIndex` field overwrite the values in result. If append, matched values in `lookupIndex` field only append to the missing values in result. **Default:** replace.
-
-## Usage
+## Syntax
-Lookup
-
+The `lookup` command has the following syntax:
+
+```syntax
+lookup <lookupIndex> (<lookupMappingField> [as <sourceMappingField>])... [(replace | append) (<inputField> [as <outputField>])...]
```
+
+The following are examples of the `lookup` command syntax:
+
+```syntax
source = table1 | lookup table2 id
source = table1 | lookup table2 id, name
source = table1 | lookup table2 id as cid, name
@@ -26,314 +22,103 @@ source = table1 | lookup table2 id as cid, name replace dept as department, city
source = table1 | lookup table2 id as cid, name append dept as department
source = table1 | lookup table2 id as cid, name append dept as department, city as location
```
-
-## Example 1: Replace strategy
-This example shows using the lookup command with the REPLACE strategy to overwrite existing values.
+## Parameters
+
+The `lookup` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<lookupIndex>` | Required | The name of the lookup index (dimension table). |
+| `<lookupMappingField>` | Required | A key in the lookup index used for matching, similar to a join key in the right table. Specify multiple fields as a comma-separated list. |
+| `<sourceMappingField>` | Optional | A key from the source data (left side) used for matching, similar to a join key in the left table. Default is `lookupMappingField`. |
+| `<inputField>` | Optional | A field in the lookup index whose matched values are applied to the results (output). Specify multiple fields as a comma-separated list. If not specified, all fields except `lookupMappingField` from the lookup index are applied to the results. |
+| `<outputField>` | Optional | The name of the field in the results (output) in which matched values are placed. Specify multiple fields as a comma-separated list. If the `outputField` specifies an existing field in the source query, its values are replaced or appended with matched values from the `inputField`. If the field specified in the `outputField` is not an existing field, a new field is added to the results when using `replace`, or the operation fails when using `append`. |
+| `(replace \| append)` | Optional | Specifies how matched values are applied to the output. `replace` overwrites existing values with matched values from the lookup index.
`append` fills only missing values in the results with matched values from the lookup index. Default is `replace`. | + +## Example 1: Replace existing values + +The following query uses the `lookup` command with the `replace` strategy to overwrite existing values: -```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ - source = worker +```ppl ignore +source = worker | LOOKUP work_information uid AS id REPLACE department | fields id, name, occupation, country, salary, department - """ -}' ``` -Result set - -```json -{ - "schema": [ - { - "name": "id", - "type": "integer" - }, - { - "name": "name", - "type": "string" - }, - { - "name": "occupation", - "type": "string" - }, - { - "name": "country", - "type": "string" - }, - { - "name": "salary", - "type": "integer" - }, - { - "name": "department", - "type": "string" - } - ], - "datarows": [ - [ - 1000, - "Jake", - "Engineer", - "England", - 100000, - "IT" - ], - [ - 1001, - "Hello", - "Artist", - "USA", - 70000, - null - ], - [ - 1002, - "John", - "Doctor", - "Canada", - 120000, - "DATA" - ], - [ - 1003, - "David", - "Doctor", - null, - 120000, - "HR" - ], - [ - 1004, - "David", - null, - "Canada", - 0, - null - ], - [ - 1005, - "Jane", - "Scientist", - "Canada", - 90000, - "DATA" - ] - ], - "total": 6, - "size": 6 -} +The query returns the following results: + +```text ++------+-------+------------+---------+--------+------------+ +| id | name | occupation | country | salary | department | +|------+-------+------------+---------+--------+------------| +| 1000 | Jake | Engineer | England | 100000 | IT | +| 1001 | Hello | Artist | USA | 70000 | null | +| 1002 | John | Doctor | Canada | 120000 | DATA | +| 1003 | David | Doctor | null | 120000 | HR | +| 1004 | David | null | Canada | 0 | null | +| 1005 | Jane | Scientist | Canada | 90000 | DATA | ++------+-------+------------+---------+--------+------------+ ``` -## Example 2: Append strategy -This example shows using the lookup command with the APPEND strategy to fill missing values only. +## Example 2: Append missing values + +The following query uses the `lookup` command with the `append` strategy to append missing values only: -```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ - source = worker +```ppl ignore +source = worker | LOOKUP work_information uid AS id APPEND department | fields id, name, occupation, country, salary, department - """ -}' ``` -## Example 3: No inputField specified -This example shows using the lookup command without specifying inputField, which applies all fields from the lookup index. 
+## Example 3: No input field specified + +The following query uses the `lookup` command without specifying an `inputField`, which adds all fields from the lookup index to the results: -```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ +```ppl ignore source = worker | LOOKUP work_information uid AS id, name | fields id, name, occupation, country, salary, department - """ -}' ``` -Result set - -```json -{ - "schema": [ - { - "name": "id", - "type": "integer" - }, - { - "name": "name", - "type": "string" - }, - { - "name": "country", - "type": "string" - }, - { - "name": "salary", - "type": "integer" - }, - { - "name": "department", - "type": "string" - }, - { - "name": "occupation", - "type": "string" - } - ], - "datarows": [ - [ - 1000, - "Jake", - "England", - 100000, - "IT", - "Engineer" - ], - [ - 1001, - "Hello", - "USA", - 70000, - null, - null - ], - [ - 1002, - "John", - "Canada", - 120000, - "DATA", - "Scientist" - ], - [ - 1003, - "David", - null, - 120000, - "HR", - "Doctor" - ], - [ - 1004, - "David", - "Canada", - 0, - null, - null - ], - [ - 1005, - "Jane", - "Canada", - 90000, - "DATA", - "Engineer" - ] - ], - "total": 6, - "size": 6 -} +The query returns the following results: + +```text ++------+-------+---------+--------+------------+------------+ +| id | name | country | salary | department | occupation | +|------+-------+---------+--------+------------+------------| +| 1000 | Jake | England | 100000 | IT | Engineer | +| 1001 | Hello | USA | 70000 | null | null | +| 1002 | John | Canada | 120000 | DATA | Scientist | +| 1003 | David | null | 120000 | HR | Doctor | +| 1004 | David | Canada | 0 | null | null | +| 1005 | Jane | Canada | 90000 | DATA | Engineer | ++------+-------+---------+--------+------------+------------+ ``` -## Example 4: OutputField as a new field +## Example 4: Add matched values to a new field -This example shows using the lookup command with outputField as a new field name. 
+The following query places matched values into a new field specified by `outputField`: -```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ +```ppl ignore source = worker | LOOKUP work_information name REPLACE occupation AS new_col | fields id, name, occupation, country, salary, new_col - """ -}' ``` -Result set - -```json -{ - "schema": [ - { - "name": "id", - "type": "integer" - }, - { - "name": "name", - "type": "string" - }, - { - "name": "occupation", - "type": "string" - }, - { - "name": "country", - "type": "string" - }, - { - "name": "salary", - "type": "integer" - }, - { - "name": "new_col", - "type": "string" - } - ], - "datarows": [ - [ - 1003, - "David", - "Doctor", - null, - 120000, - "Doctor" - ], - [ - 1004, - "David", - null, - "Canada", - 0, - "Doctor" - ], - [ - 1001, - "Hello", - "Artist", - "USA", - 70000, - null - ], - [ - 1000, - "Jake", - "Engineer", - "England", - 100000, - "Engineer" - ], - [ - 1005, - "Jane", - "Scientist", - "Canada", - 90000, - "Engineer" - ], - [ - 1002, - "John", - "Doctor", - "Canada", - 120000, - "Scientist" - ] - ], - "total": 6, - "size": 6 -} +The query returns the following results: + +```text ++------+-------+------------+---------+--------+-----------+ +| id | name | occupation | country | salary | new_col | +|------+-------+------------+---------+--------+-----------| +| 1003 | David | Doctor | null | 120000 | Doctor | +| 1004 | David | null | Canada | 0 | Doctor | +| 1001 | Hello | Artist | USA | 70000 | null | +| 1000 | Jake | Engineer | England | 100000 | Engineer | +| 1005 | Jane | Scientist | Canada | 90000 | Engineer | +| 1002 | John | Doctor | Canada | 120000 | Scientist | ++------+-------+------------+---------+--------+-----------+ ``` \ No newline at end of file diff --git a/docs/user/ppl/cmd/ml.md b/docs/user/ppl/cmd/ml.md index 38098954bf..9b142c1cc9 100644 --- a/docs/user/ppl/cmd/ml.md +++ b/docs/user/ppl/cmd/ml.md @@ -1,44 +1,89 @@ -# ml -## Description +# ml -Use the `ml` command to train/predict/train and predict on any algorithm in the ml-commons plugin on the search result returned by a PPL command. -## Syntax +The `ml` command applies machine learning (ML) algorithms from the ML Commons plugin to the search results returned by a PPL command. It supports various ML operations, including anomaly detection and clustering. The command can perform train, predict, or combined train-and-predict operations, depending on the algorithm and specified action. -## AD - Fixed In Time RCF For Time-series Data: +> **Note**: To use the `ml` command, `plugins.calcite.enabled` must be set to `false`. -ml action='train' algorithm='rcf' \ \ \ \ \ \ \ \ \ -* number_of_trees: optional integer. Number of trees in the forest. **Default:** 30. -* shingle_size: optional integer. A shingle is a consecutive sequence of the most recent records. **Default:** 8. -* sample_size: optional integer. The sample size used by stream samplers in this forest. **Default:** 256. -* output_after: optional integer. The number of points required by stream samplers before results are returned. **Default:** 32. -* time_decay: optional double. The decay factor used by stream samplers in this forest. **Default:** 0.0001. -* anomaly_rate: optional double. The anomaly rate. **Default:** 0.005. -* time_field: mandatory string. It specifies the time field for RCF to use as time-series data. -* date_format: optional string. It's used for formatting time_field field. **Default:** "yyyy-MM-dd HH:mm:ss". 
-* time_zone: optional string. It's used for setting time zone for time_field field. **Default:** UTC.
-* category_field: optional string. It specifies the category field used to group inputs. Each category will be independently predicted.
-
-## AD - Batch RCF for Non-time-series Data:
+The `ml` command supports the following algorithms:
+
+- **Random Cut Forest (RCF)** for anomaly detection, with support for both time-series and non-time-series data
+
+- **K-means** for clustering data points into groups
+
+## Syntax
-ml action='train' algorithm='rcf' \<number_of_trees\> \<sample_size\> \<output_after\> \<training_data_size\> \<anomaly_score_threshold\> \<category_field\>
-* number_of_trees: optional integer. Number of trees in the forest. **Default:** 30.
-* sample_size: optional integer. Number of random samples given to each tree from the training data set. **Default:** 256.
-* output_after: optional integer. The number of points required by stream samplers before results are returned. **Default:** 32.
-* training_data_size: optional integer. **Default:** size of your training data set.
-* anomaly_score_threshold: optional double. The threshold of anomaly score. **Default:** 1.0.
-* category_field: optional string. It specifies the category field used to group inputs. Each category will be independently predicted.
+The `ml` command supports different syntax options, depending on the algorithm.
+
+### Anomaly detection for time-series data
+
+Use this syntax to detect anomalies in time-series data. This method uses the RCF algorithm optimized for sequential data patterns:
+
+```syntax
+ml action='train' algorithm='rcf' <number_of_trees> <shingle_size> <sample_size> <output_after> <time_decay> <anomaly_rate> <time_field> <date_format> <time_zone> <category_field>
+```
+
+### Parameters
+
+The fixed-in-time RCF algorithm supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
+| `shingle_size` | Optional | The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is `8`. |
+| `sample_size` | Optional | The sample size used by the stream samplers in this forest. Default is `256`. |
+| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
+| `time_decay` | Optional | The decay factor used by the stream samplers in this forest. Default is `0.0001`. |
+| `anomaly_rate` | Optional | The anomaly rate. Default is `0.005`. |
+| `time_field` | Required | The time field for RCF to use as time-series data. |
+| `date_format` | Optional | The format for the `time_field`. Default is `yyyy-MM-dd HH:mm:ss`. |
+| `time_zone` | Optional | The time zone for the `time_field`. Default is `UTC`. |
+| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
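+
+For illustration, a minimal invocation that sets only the required `time_field` parameter might look like the following sketch; the `nyc_taxi` index and its `value` and `timestamp` fields are the ones used in the examples below, and all other parameters fall back to their defaults:
+
+```ppl ignore
+source=nyc_taxi
+| fields value, timestamp
+| ml action='train' algorithm='rcf' time_field='timestamp'
+```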
+### Anomaly detection for non-time-series data
+
+Use this syntax to detect anomalies in data where the order doesn't matter. This method uses the RCF algorithm optimized for independent data points:
+
+```syntax
+ml action='train' algorithm='rcf' <number_of_trees> <sample_size> <output_after> <training_data_size> <anomaly_score_threshold> <category_field>
+```
+
+### Parameters
+
+The batch RCF algorithm supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
+| `sample_size` | Optional | The number of random samples provided to each tree from the training dataset. Default is `256`. |
+| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
+| `training_data_size` | Optional | The size of the training dataset. Default is the full dataset size. |
+| `anomaly_score_threshold` | Optional | The anomaly score threshold. Default is `1.0`. |
+| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
-## KMEANS:
-ml action='train' algorithm='kmeans' \<centroids\> \<iterations\> \<distance_type\>
-* centroids: optional integer. The number of clusters you want to group your data points into. **Default:** 2.
-* iterations: optional integer. Number of iterations. **Default:** 10.
-* distance_type: optional string. The distance type can be COSINE, L1, or EUCLIDEAN. **Default:** EUCLIDEAN.
+### K-means clustering
+
+Use this syntax to group data points into clusters based on similarity:
+
+```syntax
+ml action='train' algorithm='kmeans' <centroids> <iterations> <distance_type>
+```
+
+### Parameters
+
+The k-means clustering algorithm supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `centroids` | Optional | The number of clusters to group data points into. Default is `2`. |
+| `iterations` | Optional | The number of iterations. Default is `10`. |
+| `distance_type` | Optional | The distance type. Valid values are `COSINE`, `L1`, and `EUCLIDEAN`. Default is `EUCLIDEAN`. |
-## Example 1: Detecting events in New York City from taxi ridership data with time-series data
-This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
+## Example 1: Time-series anomaly detection
+
+This example trains an RCF model and uses it to detect anomalies in time-series ridership data:
```ppl
source=nyc_taxi
@@ -47,7 +92,7 @@ source=nyc_taxi
| where value=10844.0
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 1/1
@@ -58,9 +103,10 @@ fetched rows / total rows = 1/1
+---------+---------------------+-------+---------------+
```
-## Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category
-This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.
+## Example 2: Time-series anomaly detection by category
+
+This example trains an RCF model and uses it to detect anomalies in time-series ridership data across multiple category values:
```ppl
source=nyc_taxi
@@ -69,7 +115,7 @@ source=nyc_taxi
| where value=10844.0 or value=6526.0
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 2/2
@@ -81,9 +127,10 @@ fetched rows / total rows = 2/2
+----------+---------+---------------------+-------+---------------+
```
-## Example 3: Detecting events in New York City from taxi ridership data with non-time-series data
-This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.
+## Example 3: Non-time-series anomaly detection + +This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data: ```ppl source=nyc_taxi @@ -92,7 +139,7 @@ source=nyc_taxi | where value=10844.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -103,9 +150,10 @@ fetched rows / total rows = 1/1 +---------+-------+-----------+ ``` -## Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category -This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values. +## Example 4: Non-time-series anomaly detection by category + +This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data across multiple category values: ```ppl source=nyc_taxi @@ -114,7 +162,7 @@ source=nyc_taxi | where value=10844.0 or value=6526.0 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -126,9 +174,10 @@ fetched rows / total rows = 2/2 +----------+---------+-------+-----------+ ``` -## Example 5: KMEANS - Clustering of Iris Dataset -This example shows how to use KMEANS to classify three Iris species (Iris setosa, Iris virginica and Iris versicolor) based on the combination of four features measured from each sample: the length and the width of the sepals and petals. +## Example 5: K-means clustering of the Iris dataset + +This example uses k-means clustering to classify three Iris species (Iris setosa, Iris virginica, and Iris versicolor) based on the combination of four features measured from each sample (the lengths and widths of sepals and petals): ```ppl source=iris_data @@ -136,7 +185,7 @@ source=iris_data | ml action='train' algorithm='kmeans' centroids=3 ``` -Expected output: +The query returns the following results: ```text +--------------------+-------------------+--------------------+-------------------+-----------+ @@ -148,6 +197,4 @@ Expected output: +--------------------+-------------------+--------------------+-------------------+-----------+ ``` -## Limitations -The `ml` command can only work with `plugins.calcite.enabled=false`. \ No newline at end of file diff --git a/docs/user/ppl/cmd/multisearch.md b/docs/user/ppl/cmd/multisearch.md index 0b6e8ae208..b4bee18694 100644 --- a/docs/user/ppl/cmd/multisearch.md +++ b/docs/user/ppl/cmd/multisearch.md @@ -1,41 +1,49 @@ -# multisearch -## Description +# multisearch -Use the `multisearch` command to run multiple search subsearches and merge their results together. The command allows you to combine data from different queries on the same or different sources, and optionally apply subsequent processing to the combined result set. -Key aspects of `multisearch`: -1. Combines results from multiple search operations into a single result set. -2. Each subsearch can have different filtering criteria, data transformations, and field selections. -3. Results are merged and can be further processed with aggregations, sorting, and other PPL commands. -4. Particularly useful for comparative analysis, union operations, and creating comprehensive datasets from multiple search criteria. -5. Supports timestamp-based result interleaving when working with time-series data. 
-
-Use Cases:
-* **Comparative Analysis**: Compare metrics across different segments, regions, or time periods
-* **Success Rate Monitoring**: Calculate success rates by comparing successful vs. total operations
-* **Multi-source Data Combination**: Merge data from different indices or apply different filters to the same source
-* **A/B Testing Analysis**: Combine results from different test groups for comparison
-* **Time-series Data Merging**: Interleave events from multiple sources based on timestamps
-
-## Syntax
-multisearch \<subsearch1\> \<subsearch2\> \<subsearch3\> ...
-* subsearch1, subsearch2, ...: mandatory. At least two subsearches required. Each subsearch must be enclosed in square brackets and start with the `search` keyword. Format: `[search source=index | commands...]`. All PPL commands are supported within subsearches.
-* result-processing: optional. Commands applied to the merged results after the multisearch operation, such as `stats`, `sort`, `head`, etc.
-
-## Usage
+The `multisearch` command runs multiple subsearches and merges their results. It allows you to combine data from different queries on the same or different sources. You can optionally apply subsequent processing, such as aggregation or sorting, to the combined results. Each subsearch can have different filtering criteria, data transformations, and field selections.
+
+Multisearch is particularly useful for comparative analysis, union operations, and creating comprehensive datasets from multiple search criteria. The command supports timestamp-based result interleaving when working with time-series data.
+
+Use multisearch for:
-Basic multisearch
+* **Comparative analysis**: Compare metrics across different segments, regions, or time periods.
+* **Success rate monitoring**: Calculate success rates by comparing successful to total operations.
+* **Multi-source data combination**: Merge data from different indexes or apply different filters to the same source.
+* **A/B testing analysis**: Combine results from different test groups for comparison.
+* **Time-series data merging**: Interleave events from multiple sources based on timestamps.
+
+
+## Syntax
+
+The `multisearch` command has the following syntax:
+
+```syntax
+multisearch <subsearch1> <subsearch2> [<subsearch3> ...]
```
+
+The following are examples of the `multisearch` command syntax:
+
+```syntax
| multisearch [search source=table | where condition1] [search source=table | where condition2]
| multisearch [search source=index1 | fields field1, field2] [search source=index2 | fields field1, field2]
| multisearch [search source=table | where status="success"] [search source=table | where status="error"]
```
-
-## Example 1: Basic Age Group Analysis
-This example combines young and adult customers into a single result set for further analysis.
+## Parameters
+
+The `multisearch` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<subsearch>` | Required | At least two subsearches are required. Each subsearch must be enclosed in square brackets and start with the `search` keyword (`[search source=index \| commands...]`). All PPL commands are supported within subsearches. |
+| `<result-processing>` | Optional | Commands applied to the merged results after the multisearch operation (for example, `stats`, `sort`, or `head`). |
+
+## Example 1: Combining age groups for demographic analysis
+
+This example demonstrates how to merge customers from different age segments into a unified dataset.
It combines `young` and `adult` customers into a single result set and adds categorization labels for further analysis: ```ppl | multisearch [search source=accounts @@ -48,7 +56,7 @@ This example combines young and adult customers into a single result set for fur | sort age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -62,9 +70,10 @@ fetched rows / total rows = 4/4 +-----------+-----+-----------+ ``` -## Example 2: Success Rate Pattern -This example combines high-balance and all valid accounts for comparison analysis. +## Example 2: Segmenting accounts by balance tier + +This example demonstrates how to create account segments based on balance thresholds for comparative analysis. It separates `high_balance` accounts from `regular` accounts and labels them for easy comparison: ```ppl | multisearch [search source=accounts @@ -77,7 +86,7 @@ This example combines high-balance and all valid accounts for comparison analysi | sort balance desc ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -91,9 +100,10 @@ fetched rows / total rows = 4/4 +-----------+---------+--------------+ ``` -## Example 3: Timestamp Interleaving -This example combines time-series data from multiple sources with automatic timestamp-based ordering. +## Example 3: Merging time-series data from multiple sources + +This example demonstrates how to combine time-series data from different sources while maintaining chronological order. The results are automatically sorted by timestamp to create a unified timeline: ```ppl | multisearch [search source=time_data @@ -103,7 +113,7 @@ This example combines time-series data from multiple sources with automatic time | head 5 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 5/5 @@ -118,9 +128,10 @@ fetched rows / total rows = 5/5 +---------------------+----------+-------+---------------------+ ``` -## Example 4: Type Compatibility - Missing Fields -This example demonstrates how missing fields are handled with NULL insertion. +## Example 4: Handling missing fields across subsearches + +This example demonstrates how `multisearch` handles schema differences when subsearches return different fields. When one subsearch includes a field that others don't have, missing values are automatically filled with null values: ```ppl | multisearch [search source=accounts @@ -132,7 +143,7 @@ This example demonstrates how missing fields are handled with NULL insertion. | sort age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -146,7 +157,10 @@ fetched rows / total rows = 4/4 +-----------+-----+------------+ ``` -## Limitations -* **Minimum Subsearches**: At least two subsearches must be specified -* **Schema Compatibility**: When fields with the same name exist across subsearches but have incompatible types, the system automatically resolves conflicts by renaming the conflicting fields. The first occurrence retains the original name, while subsequent conflicting fields are renamed with a numeric suffix (e.g., `age` becomes `age0`, `age1`, etc.). This ensures all data is preserved while maintaining schema consistency. \ No newline at end of file +## Limitations + +The `multisearch` command has the following limitations: + +* At least two subsearches must be specified. 
+* When fields with the same name exist across subsearches but have incompatible types, the system automatically resolves conflicts by renaming the conflicting fields. The first occurrence retains the original name, while subsequent conflicting fields are renamed using a numeric suffix (for example, `age` becomes `age0`, `age1`, and so on). This ensures that all data is preserved while maintaining schema consistency.
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/parse.md b/docs/user/ppl/cmd/parse.md
index 8e151ad888..553addd7d7 100644
--- a/docs/user/ppl/cmd/parse.md
+++ b/docs/user/ppl/cmd/parse.md
@@ -1,20 +1,36 @@
-# parse
-## Description
+# parse
-The `parse` command parses a text field with a regular expression and appends the result to the search result.
-## Syntax
+The `parse` command extracts information from a text field using a regular expression and adds the extracted information to the search results. It uses Java regex patterns. For more information, see the [Java regular expression documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
-parse \<field\> \<pattern\>
-* field: mandatory. The field must be a text field.
-* pattern: mandatory. The regular expression pattern used to extract new fields from the given text field. If a new field name already exists, it will replace the original field.
-
-## Regular Expression
+## The rex and parse commands compared
+
+The `rex` and `parse` commands both extract information from text fields using Java regular expressions with named capture groups. To compare the capabilities of the `rex` and `parse` commands, see the [`rex` command documentation](rex.md).
+
+## Syntax
+
+The `parse` command has the following syntax:
+
+```syntax
+parse <field> <pattern>
+```
+
+## Parameters
+
+The `parse` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<field>` | Required | The text field to parse. |
+| `<pattern>` | Required | The regular expression pattern used to extract new fields from the specified text field. If a field with the same name already exists, its values are replaced. |
+
+## Regular expression
+
+The regular expression pattern is used to match the whole text field of each document based on the [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Each named capture group in the expression becomes a new `STRING` field.
-The regular expression pattern is used to match the whole text field of each document with Java regex engine. Each named capture group in the expression will become a new `STRING` field.
## Example 1: Create a new field
-This example shows how to create a new field `host` for each document. `host` will be the host name after `@` in `email` field. Parsing a null field will return an empty string.
+The following query extracts the hostname from email addresses. The regex pattern `.+@(?<host>.+)` captures all characters after the `@` symbol and creates a new `host` field. When parsing a null field, the result is an empty string:
```ppl
source=accounts
@@ -22,7 +38,7 @@ source=accounts
| fields email, host
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 4/4
@@ -36,9 +52,10 @@ fetched rows / total rows = 4/4
+-----------------------+------------+
```
+
## Example 2: Override an existing field
-This example shows how to override the existing `address` field with street number removed.
+The following query replaces the `address` field with only the street name, removing the street number. The regex pattern `\d+ (?<address>.+)` matches digits followed by a space, then captures the remaining text as the new `address` value:
```ppl
source=accounts
@@ -46,7 +63,7 @@ source=accounts
| fields address
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 4/4
@@ -60,9 +77,10 @@ fetched rows / total rows = 4/4
+------------------+
```
-## Example 3: Filter and sort by casted parsed field
-This example shows how to sort street numbers that are higher than 500 in `address` field.
+## Example 3: Parse, filter, and sort address components
+
+The following query extracts street numbers and names from addresses, then filters for street numbers greater than 500 and sorts them numerically. The regex pattern `(?<streetNumber>\d+) (?<street>.+)` captures the numeric part as `streetNumber` and the remaining text as `street`:
```ppl
source=accounts
@@ -72,7 +90,7 @@ source=accounts
| fields streetNumber, street
```
-Expected output:
+The query returns the following results:
```text
fetched rows / total rows = 3/3
@@ -85,49 +103,39 @@ fetched rows / total rows = 3/3
+--------------+----------------+
```
-## Limitations
-There are a few limitations with parse command:
-- Fields defined by parse cannot be parsed again.
-
-The following command will not work
-
-```
-source=accounts | parse address '\d+ (?<street>.+)' | parse street '\w+ (?<suffix>\w+)' ;
-```
-
-- Fields defined by parse cannot be overridden with other commands.
-
-`where` will not match any documents since `street` cannot be overridden
-
-```
-source=accounts | parse address '\d+ (?<street>.+)' | eval street='1' | where street='1' ;
-```
-
-- The text field used by parse cannot be overridden.
-
-`street` will not be successfully parsed since `address` is overridden
-
-```
-source=accounts | parse address '\d+ (?<street>.+)' | eval address='1' ;
-```
-
-- Fields defined by parse cannot be filtered/sorted after using them in `stats` command.
-
-`where` in the following command will not work
-
-```
-source=accounts | parse email '.+@(?<host>.+)' | stats avg(age) by host | where host=pyrami.com ;
-```
-
-- Fields defined by parse will not appear in the final result unless the original source field is included in the `fields` command.
-
-For example, the following query will not display the parsed fields `host` unless the source field `email` is also explicitly included
-
-```
-source=accounts | parse email '.+@(?<host>.+)' | fields email, host ;
-```
-
-- Named capture group must start with a letter and contain only letters and digits.
-
- For detailed Java regex pattern syntax and usage, refer to the [official Java Pattern documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)
\ No newline at end of file
+## Limitations
+
+The `parse` command has the following limitations:
+
+- Fields created by the `parse` command cannot be parsed again. For example, the following command does not function as intended:
+
+  ```sql
+  source=accounts | parse address '\d+ (?<street>.+)' | parse street '\w+ (?<suffix>\w+)'
+  ```
+
+- Fields created by the `parse` command cannot be overridden by other commands. For example, in the following query, the `where` clause does not match any documents because `street` cannot be overridden:
+
+  ```sql
+  source=accounts | parse address '\d+ (?<street>.+)' | eval street='1' | where street='1'
+  ```
+
+- The source text field used by the `parse` command cannot be overridden. For example, in the following query, the `street` field is not parsed correctly because `address` is overridden:
+
+  ```sql
+  source=accounts | parse address '\d+ (?<street>.+)' | eval address='1'
+  ```
+
+- Fields created by the `parse` command cannot be filtered or sorted after they are used in the `stats` command. For example, in the following query, the `where` clause does not function as intended:
+
+  ```sql
+  source=accounts | parse email '.+@(?<host>.+)' | stats avg(age) by host | where host=pyrami.com
+  ```
+
+- Fields created by the `parse` command do not appear in the final results unless the original source field is included in the `fields` command. For example, the following query does not return the parsed field `host` unless the source field `email` is explicitly included:
+
+  ```sql
+  source=accounts | parse email '.+@(?<host>.+)' | fields email, host
+  ```
+
+- Named capture group names must start with a letter and contain only letters and digits.
+
+
diff --git a/docs/user/ppl/cmd/patterns.md b/docs/user/ppl/cmd/patterns.md
index 7b9cb71889..6941efbe4f 100644
--- a/docs/user/ppl/cmd/patterns.md
+++ b/docs/user/ppl/cmd/patterns.md
@@ -1,50 +1,96 @@
-# patterns
-## Description
+# patterns
-The `patterns` command extracts log patterns from a text field and appends the results to the search result. Grouping logs by their patterns makes it easier to aggregate stats from large volumes of log data for analysis and troubleshooting.
-`patterns` command allows users to select different log parsing algorithms to get high log pattern grouping accuracy. Two pattern methods are supported: `simple_pattern` and `brain`.
-`simple_pattern` algorithm is basically a regex parsing method vs `brain` algorithm is an automatic log grouping algorithm with high grouping accuracy and keeps semantic meaning.
-`patterns` command supports two modes: `label` and `aggregation`. `label` mode returns individual pattern labels. `aggregation` mode returns aggregated results on target field.
-Calcite engine by default labels the variables with '\<\*\>' placeholder. If `show_numbered_token` option is turned on, Calcite engine's `label` mode not only labels pattern of text but also labels variable tokens in map. In `aggregation` mode, it will also output labeled pattern as well as variable tokens per pattern. The variable placeholder is in the format of '\<tokenN\>' instead of '<\*>'.
+The `patterns` command extracts log patterns from a text field and appends the results to the search results. Grouping logs by pattern simplifies aggregating statistics from large volumes of log data for analysis and troubleshooting. You can choose from the following log parsing methods to achieve high pattern-grouping accuracy:
-## Syntax
+* `simple_pattern`: A parsing method that uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
+* `brain`: An automatic log-grouping method that provides high grouping accuracy while preserving semantic meaning.
-patterns \<field\> [by byClause...] [method=simple_pattern \| brain] [mode=label \| aggregation] [max_sample_count=integer] [buffer_limit=integer] [show_numbered_token=boolean] [new_field=\<new_field_name\>] (algorithm parameters...)
-* field: mandatory. The text field to analyze for patterns.
-* byClause: optional. Fields or scalar functions used to group logs for labeling/aggregation.
-* method: optional. Algorithm choice: `simple_pattern` or `brain`. **Default:** `simple_pattern`.
-* mode: optional. Output mode: `label` or `aggregation`. **Default:** `label`.
-* max_sample_count: optional. Max sample logs returned per pattern in aggregation mode. **Default:** 10.
-* buffer_limit: optional. Safeguard parameter for `brain` algorithm to limit internal temporary buffer size (min: 50,000). **Default:** 100,000.
-* show_numbered_token: optional. The flag to turn on numbered token output format. **Default:** false.
-* new_field: optional. Alias of the output pattern field. **Default:** "patterns_field".
-* algorithm parameters: optional. Algorithm-specific tuning:
-  * `simple_pattern`: Define regex via "pattern".
-  * `brain`: Adjust sensitivity with variable_count_threshold and frequency_threshold_percentage.
-    * `variable_count_threshold`: optional integer. Words are split by space. Algorithm counts how many distinct words are at specific position in initial log groups. Adjusting this threshold can determine the sensitivity of constant words. **Default:** 5.
-    * `frequency_threshold_percentage`: optional double. Brain's log pattern is selected based on longest word combination. This sets the lower bound of frequency to ignore low frequency words. **Default:** 0.3.
-
-## Change the default pattern method
-
-To override default pattern parameters, users can run following command
-
-```
- PUT _cluster/settings
- {
-   "persistent": {
-     "plugins.ppl.pattern.method": "brain",
-     "plugins.ppl.pattern.mode": "aggregation",
-     "plugins.ppl.pattern.max.sample.count": 5,
-     "plugins.ppl.pattern.buffer.limit": 50000,
-     "plugins.ppl.pattern.show.numbered.token": true
-   }
- }
-```
+The `patterns` command supports the following modes:
+
+* `label`: Returns individual pattern labels.
+* `aggregation`: Returns aggregated results for the target field.
+
+The command identifies variable parts of log messages (such as timestamps, numbers, IP addresses, and unique identifiers) and replaces them with `<*>` placeholders to create reusable patterns. For example, email addresses like `amberduke@pyrami.com` and `hattiebond@netagy.com` are replaced with the pattern `<*>@<*>.<*>`.
+
+> **Note**: The `patterns` command is not executed on OpenSearch data nodes. It only groups log patterns from log messages that have been returned to the coordinator node.
+
+## Syntax
+
+The `patterns` command supports the following syntax options.
+
+### Simple pattern method syntax
+
+The `patterns` command with the `simple_pattern` method has the following syntax:
+
+```syntax
+patterns <field> [by <byClause>] [method=simple_pattern] [mode=label | aggregation] [max_sample_count=integer] [show_numbered_token=boolean] [new_field=<new_field_name>] [pattern=<regex_pattern>]
+```
+
+### Brain method syntax
+
+The `patterns` command with the `brain` method has the following syntax:
+
+```syntax
+patterns <field> [by <byClause>] [method=brain] [mode=label | aggregation] [max_sample_count=integer] [buffer_limit=integer] [show_numbered_token=boolean] [new_field=<new_field_name>] [variable_count_threshold=integer] [frequency_threshold_percentage=decimal]
+```
+
+## Parameters
+
+The `patterns` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<field>` | Required | The text field that is analyzed to extract log patterns. |
+| `<byClause>` | Optional | The fields or scalar functions used to group logs before labeling or aggregation. |
+| `method` | Optional | The pattern extraction method to use. Valid values are `simple_pattern` and `brain`. Default is `simple_pattern`. |
+| `mode` | Optional | The output mode of the command. Valid values are `label` and `aggregation`. Default is `label`. |
+| `max_sample_count` | Optional | The maximum number of sample log entries returned per pattern in `aggregation` mode. Default is `10`. |
+| `buffer_limit` | Optional | A safeguard setting for the `brain` method that limits the size of its internal temporary buffer. Minimum is `50000`. Default is `100000`. |
+| `show_numbered_token` | Optional | Enables numbered token placeholders in the output instead of the default wildcard token. See [Placeholder behavior](#placeholder-behavior). Default is `false`. |
+| `<new_field_name>` | Optional | An alias for the output field that contains the extracted pattern. Default is `patterns_field`. |
+
+The `simple_pattern` method accepts the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<pattern>` | Optional | A custom [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) pattern that identifies characters or sequences to replace with `<*>` placeholders. When not specified, the method uses a default pattern that automatically removes alphanumeric characters and replaces variable parts with `<*>` placeholders while preserving structural elements. |
+
+The `brain` method accepts the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `variable_count_threshold` | Optional | Controls the algorithm sensitivity to detecting constant words by counting distinct words at specific positions in the initial log groups. Default is `5`. |
+| `frequency_threshold_percentage` | Optional | Sets the minimum word frequency percentage threshold. Words with frequencies below this value are ignored. The `brain` algorithm selects log patterns based on the longest word combination. Default is `0.3`. |
+
+## Placeholder behavior
+
+By default, the Apache Calcite engine labels variables using the `<*>` placeholder. If the `show_numbered_token` option is enabled, the Calcite engine's `label` mode not only labels the text pattern but also assigns numbered placeholders to variable tokens. In `aggregation` mode, it outputs both the labeled pattern and the variable tokens for each pattern. In this case, variable placeholders use the numbered format `<token1>`, `<token2>`, and so on, instead of `<*>`.
+
+## Changing the default pattern method
+
+To override default pattern parameters, run the following command:
+
+```bash ignore
+PUT _cluster/settings
+{
+  "persistent": {
+    "plugins.ppl.pattern.method": "brain",
+    "plugins.ppl.pattern.mode": "aggregation",
+    "plugins.ppl.pattern.max.sample.count": 5,
+    "plugins.ppl.pattern.buffer.limit": 50000,
+    "plugins.ppl.pattern.show.numbered.token": true
+  }
+}
+```
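+
+To verify the updated values, you can read them back using the standard cluster settings API, as in the following sketch (the optional `flat_settings` flag returns settings in the dotted form used above):
+
+```bash ignore
+GET _cluster/settings?flat_settings=true
+```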
+### Example 2: Extract log patterns + +The following query extracts default patterns from a raw log field: ```ppl source=apache @@ -76,7 +123,7 @@ source=apache | fields message, patterns_field ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -90,9 +137,10 @@ fetched rows / total rows = 4/4 +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------+ ``` -## Simple Pattern Example 3: Extract log patterns with custom regex pattern -This example shows how to extract patterns from a raw log field using user defined patterns. +### Example 3: Extract log patterns using a custom regex pattern + +The following query extracts patterns from a raw log field using a custom pattern: ```ppl source=apache @@ -100,7 +148,7 @@ source=apache | fields message, no_numbers ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -114,9 +162,10 @@ fetched rows / total rows = 4/4 +-----------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` -## Simple Pattern Example 4: Return log patterns aggregation result -This example shows how to get aggregated results from a raw log field. +### Example 4: Return a log pattern aggregation result + +The following query aggregates patterns extracted from a raw log field: ```ppl source=apache @@ -124,7 +173,7 @@ source=apache | fields patterns_field, pattern_count, sample_logs ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -138,12 +187,11 @@ fetched rows / total rows = 4/4 +---------------------------------------------------------------------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------+ ``` -## Simple Pattern Example 5: Return log patterns aggregation result with detected variable tokens -This example shows how to get aggregated results with detected variable tokens. -## Configuration +### Example 5: Return aggregated log patterns with detected variable tokens + +The following query returns aggregated results with detected variable tokens. When the `show_numbered_token` option is enabled, the pattern output uses numbered placeholders (for example, ``, ``) and returns a mapping of each placeholder to the values that it represents: -With option `show_numbered_token` enabled, the output can detect numbered variable tokens from the pattern field. ```ppl source=apache @@ -152,7 +200,7 @@ source=apache | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -162,10 +210,15 @@ fetched rows / total rows = 1/1 | ... - - [//::: -] " / /." 
| 1 | {'': ['HTTP'], '': ['users'], '': ['1'], '': ['1'], '': ['9481'], '': ['301'], '': ['28'], '': ['104'], '': ['2022'], '': ['Sep'], '': ['15'], '': ['10'], '': ['57'], '': ['210'], '': ['POST'], '': ['15'], '': ['0700'], '': ['204']} | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` - -## Brain Example 1: Extract log patterns -This example shows how to extract semantic meaningful log patterns from a raw log field using the brain algorithm. The default variable count threshold is 5. + +## Brain pattern examples + +The following are examples of using the `brain` method. + +### Example 1: Extract log patterns + +The following query extracts semantically meaningful log patterns from a raw log field using the `brain` algorithm. This query uses the default `variable_count_threshold` value of `5`: ```ppl source=apache @@ -173,7 +226,7 @@ source=apache | fields message, patterns_field ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -187,9 +240,10 @@ fetched rows / total rows = 4/4 +-----------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+ ``` -## Brain Example 2: Extract log patterns with custom parameters -This example shows how to extract semantic meaningful log patterns from a raw log field using custom parameters of the brain algorithm. +### Example 2: Extract log patterns using custom parameters + +The following query extracts semantically meaningful log patterns from a raw log field using custom parameters of the `brain` algorithm: ```ppl source=apache @@ -197,7 +251,7 @@ source=apache | fields message, patterns_field ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -211,9 +265,10 @@ fetched rows / total rows = 4/4 +-----------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+ ``` -## Brain Example 3: Return log patterns aggregation result -This example shows how to get aggregated results from a raw log field using the brain algorithm. 
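+Before the aggregation example below, note that the `max_sample_count` parameter caps how many sample logs are retained per pattern. The following query is a minimal hedged sketch of that option, assuming the same `apache` dataset used in the surrounding examples; it keeps only one sample log per aggregated pattern:
+
+```ppl
+source=apache
+| patterns message method=brain mode=aggregation max_sample_count=1
+| fields patterns_field, pattern_count, sample_logs
+```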
+### Example 3: Return a log pattern aggregation result + +The following query aggregates patterns extracted from a raw log field using the `brain` algorithm: ```ppl source=apache @@ -221,7 +276,7 @@ source=apache | fields patterns_field, pattern_count, sample_logs ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -232,11 +287,10 @@ fetched rows / total rows = 1/1 +----------------------------------------------------------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` -## Brain Example 4: Return log patterns aggregation result with detected variable tokens -This example shows how to get aggregated results with detected variable tokens using the brain algorithm. +### Example 4: Return aggregated log patterns with detected variable tokens -With option `show_numbered_token` enabled, the output can detect numbered variable tokens from the pattern field. +The following query returns aggregated results with detected variable tokens using the `brain` method. When the `show_numbered_token` option is enabled, the pattern output uses numbered placeholders (for example, ``, ``) and returns a mapping of each placeholder to the values that it represents: ```ppl source=apache @@ -244,7 +298,7 @@ source=apache | fields patterns_field, pattern_count, tokens ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -255,6 +309,5 @@ fetched rows / total rows = 1/1 +----------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` -## Limitations -- Patterns command is not pushed down to OpenSearch data node for now. It will only group log patterns on log messages returned to coordinator node. \ No newline at end of file + \ No newline at end of file diff --git a/docs/user/ppl/cmd/rare.md b/docs/user/ppl/cmd/rare.md index 6ee51c9f96..c3f28eb1a7 100644 --- a/docs/user/ppl/cmd/rare.md +++ b/docs/user/ppl/cmd/rare.md @@ -1,31 +1,41 @@ -# rare -## Description +# rare -The `rare` command finds the least common tuple of values of all fields in the field list. -**Note**: A maximum of 10 results is returned for each distinct tuple of values of the group-by fields. -## Syntax +The `rare` command identifies the least common combination of values across all fields specified in the field list. -rare [rare-options] \ [by-clause] -* field-list: mandatory. Comma-delimited list of field names. -* by-clause: optional. One or more fields to group the results by. -* rare-options: optional. Options for the rare command. 
Supported syntax is [countfield=\<string\>] [showcount=\<bool\>].
-* showcount=\<bool\>: optional. Whether to create a field in output that represent a count of the tuple of values. **Default:** `true`.
-* countfield=\<string\>: optional. The name of the field that contains count. **Default:** `'count'`.
-* usenull=\<bool\>: optional. Whether to output the null value. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`:
-  * When `plugins.ppl.syntax.legacy.preferred=true`, `usenull` defaults to `true`
-  * When `plugins.ppl.syntax.legacy.preferred=false`, `usenull` defaults to `false`
+> **Note**: The command returns up to 10 results for each distinct combination of values in the group-by fields.
+
+> **Note**: The `rare` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/). It is only executed on the coordinating node.
+
+## Syntax
+
+The `rare` command has the following syntax:
+
+```syntax
+rare [rare-options] <field-list> [by-clause]
+```
+
+## Parameters
+
+The `rare` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<field-list>` | Required | A comma-delimited list of field names. |
+| `<by-clause>` | Optional | One or more fields to group the results by. |
+| `rare-options` | Optional | Additional options for controlling output:
- `showcount`: Whether to create a field in the output containing the frequency count for each combination of values. Default is `true`.
- `countfield`: The name of the field that contains the count. Default is `count`.
- `usenull`: Whether to output null values. Default is the value of `plugins.ppl.syntax.legacy.preferred`. | -## Example 1: Find the least common values in a field -This example shows how to find the least common gender of all the accounts. +## Example 1: Find the least common values without showing counts + +The following query uses the `rare` command with `showcount=false` to find the least common gender without displaying frequency counts: ```ppl source=accounts | rare showcount=false gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -37,16 +47,17 @@ fetched rows / total rows = 2/2 +--------+ ``` -## Example 2: Find the least common values organized by gender -This example shows how to find the least common age of all the accounts grouped by gender. +## Example 2: Find the least common values grouped by field + +The following query uses the `rare` command with a `by` clause to find the least common age values grouped by gender: ```ppl source=accounts | rare showcount=false age by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -60,16 +71,17 @@ fetched rows / total rows = 4/4 +--------+-----+ ``` -## Example 3: Rare command -This example shows how to find the least common gender of all the accounts. +## Example 3: Find the least common values with frequency counts + +The following query uses the `rare` command with default settings to find the least common gender values and display their frequency counts: ```ppl source=accounts | rare gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -81,16 +93,17 @@ fetched rows / total rows = 2/2 +--------+-------+ ``` -## Example 4: Specify the count field option -This example shows how to specify the count field. +## Example 4: Customize the count field name + +The following query uses the `rare` command with the `countfield` parameter to specify a custom name for the frequency count field: ```ppl source=accounts | rare countfield='cnt' gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -102,14 +115,17 @@ fetched rows / total rows = 2/2 +--------+-----+ ``` -## Example 5: Specify the usenull field option - + +## Example 5: Specify null value handling + +The following query uses the `rare` command with `usenull=false` to exclude null values from the results: + ```ppl source=accounts | rare usenull=false email ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -121,13 +137,15 @@ fetched rows / total rows = 3/3 | hattiebond@netagy.com | 1 | +-----------------------+-------+ ``` - + +The following query uses `usenull=true` to include null values in the results: + ```ppl source=accounts | rare usenull=true email ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -141,6 +159,4 @@ fetched rows / total rows = 4/4 +-----------------------+-------+ ``` -## Limitations -The `rare` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node. 
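+Because the `usenull` default is driven by the `plugins.ppl.syntax.legacy.preferred` cluster setting referenced in the parameter table and in Example 5, you can pin the legacy-compatible behavior explicitly. The following request is a minimal sketch, assuming your cluster permits persistent updates to PPL settings:
+
+```bash ignore
+PUT _cluster/settings
+{
+  "persistent": {
+    "plugins.ppl.syntax.legacy.preferred": true
+  }
+}
+```
+
+With this setting enabled, a query such as `source=accounts | rare email` includes null values by default, as if `usenull=true` had been specified.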
\ No newline at end of file
+
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/regex.md b/docs/user/ppl/cmd/regex.md
index d108b635ab..80d131032f 100644
--- a/docs/user/ppl/cmd/regex.md
+++ b/docs/user/ppl/cmd/regex.md
@@ -1,29 +1,41 @@
-# regex
-## Description
+# regex
-The `regex` command filters search results by matching field values against a regular expression pattern. Only documents where the specified field matches the pattern are included in the results.
-## Syntax
+The `regex` command filters search results by matching field values against a regular expression pattern. Only documents in which the specified field matches the pattern are included in the results.
-regex \<field\> = \<pattern\>
-regex \<field\> != \<pattern\>
-* field: mandatory. The field name to match against.
-* pattern: mandatory string. The regular expression pattern to match. Supports Java regex syntax including named groups, lookahead/lookbehind, and character classes.
-* = : operator for positive matching (include matches)
-* != : operator for negative matching (exclude matches)
-
-## Regular Expression Engine
+## Syntax
+
+The `regex` command has the following syntax:
+
+```syntax
+regex <field> = <pattern>
+regex <field> != <pattern>
+```
+
+The following operators are supported:
+
+* `=` -- Positive matching (include matches)
+* `!=` -- Negative matching (exclude matches)
+
+The `regex` command uses Java's built-in regular expression engine, which supports:
+
+* **Standard regex features**: Character classes, quantifiers, anchors.
+* **Named capture groups**: `(?<name>pattern)` syntax.
+* **Lookahead/lookbehind**: `(?=...)` and `(?<=...)` assertions.
+* **Inline flags**: Case-insensitive `(?i)`, multiline `(?m)`, dotall `(?s)`, and other modes.
+
+## Parameters
+
+The `regex` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<field>` | Required | The field name to match against. |
+| `<pattern>` | Required | The regular expression pattern to match. Supports [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). |
-The regex command uses Java's built-in regular expression engine, which supports:
-* **Standard regex features**: Character classes, quantifiers, anchors
-* **Named capture groups**: `(?<name>pattern)` syntax
-* **Lookahead/lookbehind**: `(?=...)` and `(?<=...)` assertions
-* **Inline flags**: Case-insensitive `(?i)`, multiline `(?m)`, dotall `(?s)`, and other modes
-
-For complete documentation of Java regex patterns and available modes, see the [Java Pattern documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).

## Example 1: Basic pattern matching

-This example shows how to filter documents where the `lastname` field matches names starting with uppercase letters.
+The following query uses the `regex` command to return any document in which the `lastname` field starts with an uppercase letter:

```ppl
source=accounts
@@ -31,7 +43,7 @@ source=accounts
| fields account_number, firstname, lastname
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
+----------------+-----------+----------+
| account_number | firstname | lastname |
|----------------+-----------+----------|
| 1              | Amber     | Duke     |
| 6              | Hattie    | Bond     |
| 13             | Nanette   | Bates    |
| 18             | Dale      | Adams    |
+----------------+-----------+----------+
```

-## Example 2: Negative matching
-This example shows how to exclude documents where the `lastname` field ends with "son".
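+The inline flags listed earlier compose with ordinary patterns. The following query is a minimal hedged sketch (not part of the original example set) that uses the case-insensitive `(?i)` flag, so the lowercase pattern should also match the uppercase state value `VA`:
+
+```ppl
+source=accounts
+| regex state="(?i)va"
+| fields account_number, state
+```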
- +## Example 2: Negative matching + +The following query excludes documents in which the `lastname` field ends with `ms`: + ```ppl source=accounts -| regex lastname!=".*son$" +| regex lastname!=".*ms$" | fields account_number, lastname ``` -Expected output: +The query returns the following results: ```text -fetched rows / total rows = 4/4 +fetched rows / total rows = 3/3 +----------------+----------+ | account_number | lastname | |----------------+----------| | 1 | Duke | | 6 | Bond | | 13 | Bates | -| 18 | Adams | +----------------+----------+ ``` + ## Example 3: Email domain matching -This example shows how to filter documents by email domain patterns. +The following query filters documents by email domain patterns: ```ppl source=accounts @@ -79,7 +92,7 @@ source=accounts | fields account_number, email ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -90,15 +103,16 @@ fetched rows / total rows = 1/1 +----------------+----------------------+ ``` + ## Example 4: Complex patterns with character classes -This example shows how to use complex regex patterns with character classes and quantifiers. +The following query uses complex regex patterns with character classes and quantifiers: ```ppl source=accounts | regex address="\\d{3,4}\\s+[A-Z][a-z]+\\s+(Street|Lane|Court)" | fields account_number, address ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -112,9 +126,10 @@ fetched rows / total rows = 4/4 +----------------+----------------------+ ``` + ## Example 5: Case-sensitive matching -This example demonstrates that regex matching is case-sensitive by default. +By default, regex matching is case sensitive. The following query searches for the lowercase state name `va`: ```ppl source=accounts @@ -122,7 +137,7 @@ source=accounts | fields account_number, state ``` -Expected output: +The query returns no results because the regex pattern `va` (lowercase) does not match any state values in the data. ```text fetched rows / total rows = 0/0 @@ -131,14 +146,16 @@ fetched rows / total rows = 0/0 |----------------+-------| +----------------+-------+ ``` - + +The following query searches for the uppercase state name `VA`: + ```ppl source=accounts | regex state="VA" | fields account_number, state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -149,7 +166,10 @@ fetched rows / total rows = 1/1 +----------------+-------+ ``` -## Limitations -* **Field specification required**: A field name must be specified in the regex command. Pattern-only syntax (e.g., `regex "pattern"`) is not currently supported -* **String fields only**: The regex command currently only supports string fields. Using it on numeric or boolean fields will result in an error \ No newline at end of file +## Limitations + +The `regex` command has the following limitations: + +* A field name must be specified in the `regex` command. Pattern-only syntax (for example, `regex "pattern"`) is not supported. +* The `regex` command only supports string fields. Using it on numeric or Boolean fields results in an error. \ No newline at end of file diff --git a/docs/user/ppl/cmd/rename.md b/docs/user/ppl/cmd/rename.md index 346513f232..0360a0677d 100644 --- a/docs/user/ppl/cmd/rename.md +++ b/docs/user/ppl/cmd/rename.md @@ -1,24 +1,36 @@ -# rename -## Description +# rename -The `rename` command renames one or more fields in the search result. 
-## Syntax
+The `rename` command renames one or more fields in the search results.
-rename \<source-field\> AS \<target-field\>["," \<source-field\> AS \<target-field\>]...
-* source-field: mandatory. The name of the field you want to rename. Supports wildcard patterns using `*`.
-* target-field: mandatory. The name you want to rename to. Must have same number of wildcards as the source.
-
-## Behavior
+The `rename` command handles non-existent fields as follows:
-The rename command handles non-existent fields as follows:
-* **Renaming a non-existent field to a non-existent field**: No change occurs to the result set.
-* **Renaming a non-existent field to an existing field**: The existing target field is removed from the result set.
-* **Renaming an existing field to an existing field**: The existing target field is removed and the source field is renamed to the target.
-
-## Example 1: Rename one field
+* **Renaming a non-existent field to a non-existent field**: No change occurs to the search results.
+* **Renaming a non-existent field to an existing field**: The existing target field is removed from the search results.
+* **Renaming an existing field to an existing field**: The existing target field is removed and the source field is renamed to the target.
+
+> **Note**: The `rename` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/). It is only executed on the coordinating node.
+
+## Syntax
+
+The `rename` command has the following syntax:
+
+```syntax
+rename <source-field> AS <target-field>["," <source-field> AS <target-field>]...
+```
+
+## Parameters
+
+The `rename` command supports the following parameters.

-This example shows how to rename one field.
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<source-field>` | Required | The name of the field you want to rename. Supports wildcard patterns using `*`. |
+| `<target-field>` | Required | The name you want to rename to. Must contain the same number of wildcards as the source. |
+
+## Example 1: Rename a field
+
+The following query renames one field:

```ppl
source=accounts
@@ -26,7 +38,7 @@ source=accounts
| fields an
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -40,9 +52,10 @@ fetched rows / total rows = 4/4
+----+
```

## Example 2: Rename multiple fields

-This example shows how to rename multiple fields.
+The following query renames multiple fields:

```ppl
source=accounts
@@ -50,7 +63,7 @@ source=accounts
| fields an, emp
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -64,9 +77,10 @@ fetched rows / total rows = 4/4
+----+---------+
```

-## Example 3: Rename with wildcards
-This example shows how to rename multiple fields using wildcard patterns.
+## Example 3: Rename fields using wildcards
+
+The following query renames multiple fields using wildcard patterns:

```ppl
source=accounts
@@ -74,7 +88,7 @@ source=accounts
| fields first_name, last_name
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -88,9 +102,10 @@ fetched rows / total rows = 4/4
+------------+-----------+
```

-## Example 4: Rename with multiple wildcard patterns
-This example shows how to rename multiple fields using multiple wildcard patterns.
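+The non-existent-field rules listed earlier can be exercised directly. The following hedged sketch renames a field that does not exist in the `accounts` dataset to the existing `age` field; per the rules above, the existing `age` column should simply be removed from the results rather than causing an error:
+
+```ppl
+source=accounts
+| rename nonexistent AS age
+| fields account_number
+```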
+## Example 4: Rename fields using multiple wildcard patterns
+
+The following query renames multiple fields using multiple wildcard patterns:

```ppl
source=accounts
@@ -98,7 +113,7 @@ source=accounts
| fields first_name, last_name, accountnumber
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -112,9 +127,10 @@ fetched rows / total rows = 4/4
+------------+-----------+---------------+
```

-## Example 5: Rename existing field to existing field
-This example shows how to rename an existing field to an existing field. The target field gets removed and the source field is renamed to the target field.
+## Example 5: Rename an existing field to another existing field
+
+The following query renames an existing field to another existing field. The target field is removed and the source field is renamed to the target field:

```ppl
source=accounts
@@ -122,7 +138,7 @@ source=accounts
| fields age
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -136,7 +152,9 @@ fetched rows / total rows = 4/4
+---------+
```

-## Limitations
-The `rename` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node.
-Literal asterisk (*) characters in field names cannot be replaced as asterisk is used for wildcard matching.
\ No newline at end of file
+## Limitations
+
+The `rename` command has the following limitations:
+
+* Literal asterisk (`*`) characters in field names cannot be replaced because the asterisk is used for wildcard matching.
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/replace.md b/docs/user/ppl/cmd/replace.md
index 2333f46b3b..b403e072e8 100644
--- a/docs/user/ppl/cmd/replace.md
+++ b/docs/user/ppl/cmd/replace.md
@@ -1,18 +1,29 @@
-# replace
-## Description
+# replace
-The `replace` replaces text in one or more fields in the search result. Supports literal string replacement and wildcard patterns using `*`.
+The `replace` command replaces text in one or more fields in the search results. It supports literal string replacement and wildcard patterns using `*`.
+
+## Syntax
+
+The `replace` command has the following syntax:
+
+```syntax
+replace '<pattern>' WITH '<replacement>' [, '<pattern>' WITH '<replacement>']... IN <field-name>[, <field-name>]...
+```
+
+## Parameters
+
+The `replace` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<pattern>` | Required | The text pattern to be replaced. |
+| `<replacement>` | Required | The text to use as the replacement. |
+| `<field-name>` | Required | One or more fields to which the replacement should be applied. |
-replace '\<pattern\>' WITH '\<replacement\>' [, '\<pattern\>' WITH '\<replacement\>']... IN \<field-name\>[, \<field-name\>]...
-* pattern: mandatory. The text pattern you want to replace.
-* replacement: mandatory. The text you want to replace with.
-* field-name: mandatory. One or more field names where the replacement should occur.
-
## Example 1: Replace text in one field

-This example shows replacing text in one field.
+The following query replaces text in one field:

```ppl
source=accounts
@@ -20,7 +31,7 @@ source=accounts
| fields state
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -34,9 +45,10 @@ fetched rows / total rows = 4/4
+----------+
```

## Example 2: Replace text in multiple fields

-This example shows replacing text in multiple fields.
+The following query replaces text in multiple fields: ```ppl source=accounts @@ -44,7 +56,7 @@ source=accounts | fields state, address ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -58,9 +70,10 @@ fetched rows / total rows = 4/4 +----------+----------------------+ ``` -## Example 3: Replace with other commands in a pipeline -This example shows using replace with other commands in a query pipeline. +## Example 3: Use the replace command in a pipeline + +The following query uses the `replace` command with other commands in a query pipeline: ```ppl source=accounts @@ -69,7 +82,7 @@ source=accounts | fields state, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -82,9 +95,10 @@ fetched rows / total rows = 3/3 +----------+-----+ ``` -## Example 4: Replace with multiple pattern/replacement pairs -This example shows using multiple pattern/replacement pairs in a single replace command. The replacements are applied sequentially. +## Example 4: Replace text using multiple pattern-replacement pairs + +The following query uses the `replace` command with multiple pattern and replacement pairs in a single replace command. The replacements are applied sequentially: ```ppl source=accounts @@ -92,7 +106,7 @@ source=accounts | fields state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -106,9 +120,10 @@ fetched rows / total rows = 4/4 +-----------+ ``` -## Example 5: Pattern matching with LIKE and replace -Since replace command only supports plain string literals, you can use LIKE command with replace for pattern matching needs. +## Example 5: Pattern matching using LIKE + +The following query uses the `LIKE` command with the `replace` command for pattern matching, since the `replace` command only supports plain string literals: ```ppl source=accounts @@ -117,7 +132,7 @@ source=accounts | fields address, state, gender, age, city ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -128,9 +143,10 @@ fetched rows / total rows = 1/1 +-----------------+-------+--------+-----+--------+ ``` -## Example 6: Wildcard suffix match -Replace values that end with a specific pattern. The wildcard `*` matches any prefix. +## Example 6: Wildcard suffix matching + +The following query shows wildcard suffix matching, in which `*` matches any characters before a specific ending pattern: ```ppl source=accounts @@ -138,7 +154,7 @@ source=accounts | fields state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -152,9 +168,10 @@ fetched rows / total rows = 4/4 +----------+ ``` -## Example 7: Wildcard prefix match -Replace values that start with a specific pattern. The wildcard `*` matches any suffix. +## Example 7: Wildcard prefix matching + +The following query shows wildcard prefix matching, in which `*` matches any characters after a specific starting pattern: ```ppl source=accounts @@ -162,7 +179,7 @@ source=accounts | fields state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -176,9 +193,10 @@ fetched rows / total rows = 4/4 +----------+ ``` + ## Example 8: Wildcard capture and substitution -Use wildcards in both pattern and replacement to capture and reuse matched portions. The number of wildcards must match in pattern and replacement. 
+The following query uses wildcards in both the pattern and replacement to capture and reuse matched portions. The number of wildcards must match in the pattern and replacement: ```ppl source=accounts @@ -186,7 +204,7 @@ source=accounts | fields address ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -200,9 +218,10 @@ fetched rows / total rows = 4/4 +----------------------+ ``` + ## Example 9: Multiple wildcards for pattern transformation -Use multiple wildcards to transform patterns. Each wildcard in the replacement substitutes the corresponding captured value. +The following query uses multiple wildcards to transform patterns. Each wildcard in the replacement is substituted with the corresponding captured value: ```ppl source=accounts @@ -210,7 +229,7 @@ source=accounts | fields address ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -224,9 +243,10 @@ fetched rows / total rows = 4/4 +----------------------+ ``` -## Example 10: Wildcard with zero wildcards in replacement -When replacement has zero wildcards, all matching values are replaced with the literal replacement string. +## Example 10: Replace any match with a fixed value + +The following query shows that when the replacement contains zero wildcards, all matching values are replaced with the literal replacement string: ```ppl source=accounts @@ -234,7 +254,7 @@ source=accounts | fields state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -248,9 +268,10 @@ fetched rows / total rows = 4/4 +----------+ ``` + ## Example 11: Matching literal asterisks -Use `\*` to match literal asterisk characters (`\*` = literal asterisk, `\\` = literal backslash). +Use `\*` to match literal asterisk characters and `\\` to match literal backslash characters. The following query uses `\*`: ```ppl source=accounts @@ -259,7 +280,7 @@ source=accounts | fields note ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -272,35 +293,10 @@ fetched rows / total rows = 4/4 | DISCOUNTED | +------------+ ``` - -## Example 12: Wildcard with no replacement wildcards -Use wildcards in pattern but none in replacement to create a fixed output. - -```ppl -source=accounts -| eval test = 'prefix-value-suffix' -| replace 'prefix-*-suffix' WITH 'MATCHED' IN test -| fields test -``` - -Expected output: - -```text -fetched rows / total rows = 4/4 -+---------+ -| test | -|---------| -| MATCHED | -| MATCHED | -| MATCHED | -| MATCHED | -+---------+ -``` - -## Example 13: Escaped asterisks with wildcards +## Example 12: Replace text with literal asterisk symbols -Combine escaped asterisks (literal) with wildcards for complex patterns. 
+The following query shows how to insert literal asterisk symbols into text while using wildcards to preserve other parts of the pattern: ```ppl source=accounts @@ -309,7 +305,7 @@ source=accounts | fields label ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -323,8 +319,11 @@ fetched rows / total rows = 4/4 +----------+ ``` -## Limitations -* Wildcards: `*` matches zero or more characters (case-sensitive) -* Replacement wildcards must match pattern wildcard count, or be zero -* Escape sequences: `\*` (literal asterisk), `\\` (literal backslash) \ No newline at end of file +## Limitations + +The `replace` command has the following limitations: + +* **Wildcards**: The `*` wildcard matches zero or more characters and is case sensitive. +* **Wildcard matching**: Replacement wildcards must match the pattern wildcard count or be zero. +* **Escape sequences**: Use `\*` for literal asterisk and `\\` for literal backslash characters. \ No newline at end of file diff --git a/docs/user/ppl/cmd/reverse.md b/docs/user/ppl/cmd/reverse.md index f63a8f18e9..9505abad93 100644 --- a/docs/user/ppl/cmd/reverse.md +++ b/docs/user/ppl/cmd/reverse.md @@ -1,19 +1,23 @@ -# reverse -## Description +# reverse -The `reverse` command reverses the display order of search results. The same results are returned, but in reverse order. -## Syntax +The `reverse` command reverses the display order of the search results. It returns the same results but in the opposite order. +> **Note**: The `reverse` command processes the entire dataset. If applied directly to millions of records, it consumes significant coordinating node memory resources. Only apply the `reverse` command to smaller datasets, typically after aggregation operations. + +## Syntax + +The `reverse` command has the following syntax: + +```syntax reverse -* No parameters: The reverse command takes no arguments or options. - -## Note +``` + +The `reverse` command takes no parameters. -The `reverse` command processes the entire dataset. If applied directly to millions of records, it will consume significant memory resources on the coordinating node. Users should only apply the `reverse` command to smaller datasets, typically after aggregation operations. ## Example 1: Basic reverse operation -This example shows reversing the order of all documents. +The following query reverses the order of all documents in the results: ```ppl source=accounts @@ -21,7 +25,7 @@ source=accounts | reverse ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -35,9 +39,10 @@ fetched rows / total rows = 4/4 +----------------+-----+ ``` -## Example 2: Reverse with sort -This example shows reversing results after sorting by age in ascending order, effectively giving descending order. +## Example 2: Use the reverse and sort commands + +The following query reverses results after sorting documents by age in ascending order, effectively implementing descending order: ```ppl source=accounts @@ -46,7 +51,7 @@ source=accounts | reverse ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -60,9 +65,10 @@ fetched rows / total rows = 4/4 +----------------+-----+ ``` -## Example 3: Reverse with head -This example shows using reverse with head to get the last 2 records from the original order. 
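+Because the note above recommends applying `reverse` to smaller datasets, typically after aggregation operations, the following hedged sketch reverses a small aggregated result rather than raw documents:
+
+```ppl
+source=accounts
+| stats count() by gender
+| reverse
+```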
+## Example 3: Use the reverse and head commands + +The following query uses the `reverse` command together with the `head` command to retrieve the last two records from the original result order: ```ppl source=accounts @@ -71,7 +77,7 @@ source=accounts | fields account_number, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -83,9 +89,10 @@ fetched rows / total rows = 2/2 +----------------+-----+ ``` + ## Example 4: Double reverse -This example shows that applying reverse twice returns to the original order. +The following query shows that applying `reverse` twice returns documents in the original order: ```ppl source=accounts @@ -94,7 +101,7 @@ source=accounts | fields account_number, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -108,9 +115,10 @@ fetched rows / total rows = 4/4 +----------------+-----+ ``` -## Example 5: Reverse with complex pipeline -This example shows reverse working with filtering and field selection. +## Example 5: Use the reverse command with a complex pipeline + +The following query uses the `reverse` command with filtering and field selection: ```ppl source=accounts @@ -119,7 +127,7 @@ source=accounts | reverse ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -131,4 +139,4 @@ fetched rows / total rows = 3/3 | 1 | 32 | +----------------+-----+ ``` - \ No newline at end of file + diff --git a/docs/user/ppl/cmd/rex.md b/docs/user/ppl/cmd/rex.md index 0f117373d8..b4fe706f48 100644 --- a/docs/user/ppl/cmd/rex.md +++ b/docs/user/ppl/cmd/rex.md @@ -1,27 +1,50 @@ -# rex -## Description +# rex -The `rex` command extracts fields from a raw text field using regular expression named capture groups. -## Syntax +The `rex` command extracts fields from a raw text field using regular expression named capture groups. It uses Java regex patterns. For more information, see the [Java regular expression documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). -rex [mode=\] field=\ \ [max_match=\] [offset_field=\] -* field: mandatory. The field must be a string field to extract data from. -* pattern: mandatory string. The regular expression pattern with named capture groups used to extract new fields. Pattern must contain at least one named capture group using `(?pattern)` syntax. -* mode: optional. Either `extract` or `sed`. **Default:** extract - * **extract mode** (default): Creates new fields from regular expression named capture groups. This is the standard field extraction behavior. - * **sed mode**: Performs text substitution on the field using sed-style patterns - * `s/pattern/replacement/` - Replace first occurrence - * `s/pattern/replacement/g` - Replace all occurrences (global) - * `s/pattern/replacement/n` - Replace only the nth occurrence (where n is a number) - * `y/from_chars/to_chars/` - Character-by-character transliteration - * Backreferences: `\1`, `\2`, etc. reference captured groups in replacement -* max_match: optional integer (default=1). Maximum number of matches to extract. If greater than 1, extracted fields become arrays. The value 0 means unlimited matches, but is automatically capped to the configured limit (default: 10, configurable via `plugins.ppl.rex.max_match.limit`). -* offset_field: optional string. Field name to store the character offset positions of matches. Only available in extract mode. 
-
-## Example 1: Basic Field Extraction
+## The rex and parse commands compared
+
+The `rex` and [`parse`](parse.md) commands both extract information from text fields using Java regular expressions with named capture groups. The following table compares the capabilities of the `rex` and `parse` commands.
+
+| Feature | `rex` | `parse` |
+| --- | --- | --- |
+| Pattern type | Java regex | Java regex |
+| Named groups required | Yes | Yes |
+| Multiple named groups | Yes | No |
+| Multiple matches | Yes | No |
+| Text substitution | Yes | No |
+| Offset tracking | Yes | No |
+| Special characters in group names | No | No |
+
+## Syntax
+
+The `rex` command has the following syntax:
+
+```syntax
+rex [mode=<mode>] field=<field> <pattern> [max_match=<integer>] [offset_field=<field-name>]
+```
+
+## Parameters

-This example shows extracting username and domain from email addresses using named capture groups. Both extracted fields are returned as string type.
+The `rex` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `field` | Required | The field to extract data from. The field must be a string. |
+| `<pattern>` | Required | The regular expression pattern with named capture groups used to extract new fields. The pattern must contain at least one named capture group using the `(?<name>pattern)` syntax. Group names must start with a letter and contain only letters and digits. |
+| `mode` | Optional | The pattern-matching mode. Valid values are `extract` and `sed`. The `extract` mode creates new fields from regular expression named capture groups. The `sed` mode performs text substitution using sed-style patterns (supports `s/pattern/replacement/` with flags, `y/from_chars/to_chars/` transliteration, and backreferences). |
+| `max_match` | Optional | The maximum number of matches to extract. If the value is greater than `1`, the extracted fields are returned as arrays. A value of `0` indicates unlimited matches; however, the effective number of matches is automatically limited by the configured maximum. The default maximum is `10` and can be configured using `plugins.ppl.rex.max_match.limit` (see the following note). Default is `1`. |
+| `offset_field` | Optional | Valid in `extract` mode only. The name of the field in which to store the character offset positions of the matches. |
+
+
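+The `sed` mode's backreference and global-flag forms described in the `mode` row can be combined. The following query is a minimal hedged sketch, assuming the `accounts` dataset used in the examples below; it captures every digit group in the address and rewrites it in brackets by referencing the captured group as `\1` with the global `g` flag:
+
+```ppl
+source=accounts
+| rex mode=sed field=address "s/([0-9]+)/[\1]/g"
+| fields address
+```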

+ +> **Note**: You can set the `max_match` limit in the `plugins.ppl.rex.max_match.limit` cluster setting. For more information, see [SQL settings](../../admin/settings.rst). Setting this limit to a large value is not recommended because it can lead to excessive memory consumption, especially with patterns that match empty strings (for example, `\d*` or `\w*`). + + +## Example 1: Basic text extraction + +The following query extracts the username and domain from email addresses using named capture groups. Both extracted fields are returned as strings: ```ppl source=accounts @@ -30,7 +53,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -42,9 +65,10 @@ fetched rows / total rows = 2/2 +-----------------------+------------+--------+ ``` -## Example 2: Handling Non-matching Patterns -This example shows the rex command returning all events, setting extracted fields to null for non-matching patterns. Extracted fields would be string type when matches are found. +## Example 2: Handle non-matching patterns + +The following query shows that the rex command returns all events, setting extracted fields to null for non-matching patterns. When matches are found, the extracted fields are returned as strings: ```ppl source=accounts @@ -53,7 +77,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -65,9 +89,10 @@ fetched rows / total rows = 2/2 +-----------------------+------+--------+ ``` -## Example 3: Multiple Matches with max_match -This example shows extracting multiple words from address field using max_match parameter. The extracted field is returned as an array type containing string elements. +## Example 3: Extract multiple words using max_match + +The following query uses the `rex` command with the `max_match` parameter to extract multiple words from the `address` field. The extracted field is returned as an array of strings: ```ppl source=accounts @@ -76,7 +101,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -89,9 +114,10 @@ fetched rows / total rows = 3/3 +--------------------+------------------+ ``` -## Example 4: Text Replacement with mode=sed -This example shows replacing email domains using sed mode for text substitution. The extracted field is returned as string type. +## Example 4: Replace text using sed mode + +The following query uses the `rex` command in `sed` mode to replace email domains through text substitution. The extracted field is returned as a string: ```ppl source=accounts @@ -100,7 +126,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -112,9 +138,10 @@ fetched rows / total rows = 2/2 +------------------------+ ``` -## Example 5: Using offset_field -This example shows tracking the character positions where matches occur. Extracted fields are string type, and the offset_field is also string type. +## Example 5: Track match positions using offset_field + +The following query tracks the character positions where matches occur. 
The extracted fields are returned as strings, and the `offset_field` is also returned as a string: ```ppl source=accounts @@ -123,7 +150,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -135,9 +162,10 @@ fetched rows / total rows = 2/2 +-----------------------+------------+--------+---------------------------+ ``` -## Example 6: Complex Email Pattern -This example shows extracting comprehensive email components including top-level domain. All extracted fields are returned as string type. +## Example 6: Extract a complex email pattern + +The following query extracts complete email components, including the top-level domain. All extracted fields are returned as strings: ```ppl source=accounts @@ -146,7 +174,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -158,9 +186,10 @@ fetched rows / total rows = 2/2 +-----------------------+------------+--------+-----+ ``` -## Example 7: Chaining Multiple rex Commands -This example shows extracting initial letters from both first and last names. All extracted fields are returned as string type. +## Example 7: Chain multiple rex commands + +The following query extracts initial letters from both first and last names. All extracted fields are returned as strings: ```ppl source=accounts @@ -170,7 +199,7 @@ source=accounts | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -183,10 +212,12 @@ fetched rows / total rows = 3/3 +-----------+----------+--------------+-------------+ ``` -## Example 8: Named Capture Group Limitations -This example demonstrates naming restrictions for capture groups. Group names cannot contain underscores due to Java regex limitations. -Invalid PPL query with underscores +## Example 8: Capture group naming restrictions + +The following query shows naming restrictions for capture groups. Group names cannot contain underscores because of [Java regex](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) limitations. + +**Invalid PPL query with underscores**: ```ppl source=accounts @@ -194,14 +225,14 @@ source=accounts | fields email, user_name, email_domain ``` -Expected output: +The query returns the following results: ```text {'reason': 'Invalid Query', 'details': "Invalid capture group name 'user_name'. Java regex group names must start with a letter and contain only letters and digits.", 'type': 'IllegalArgumentException'} Error: Query returned no data ``` - -Correct PPL query without underscores + +**Correct PPL query without underscores**: ```ppl source=accounts @@ -210,7 +241,7 @@ source=accounts | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -222,10 +253,12 @@ fetched rows / total rows = 2/2 +-----------------------+------------+-------------+ ``` -## Example 9: Max Match Limit Protection -This example demonstrates the max_match limit protection mechanism. When max_match=0 (unlimited) is specified, the system automatically caps it to prevent memory exhaustion. -PPL query with max_match=0 automatically capped to default limit of 10 +## Example 9: max_match limit enforcement + +The following query shows the `max_match` limit protection mechanism. When `max_match` is set to `0` (unlimited), the system automatically enforces a maximum limit on the number of matches to prevent memory exhaustion. 
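+The enforced ceiling itself is adjustable at the cluster level. The following request is a minimal hedged sketch that raises the limit modestly; as noted above, large values are discouraged because they can lead to excessive memory consumption:
+
+```bash ignore
+PUT _cluster/settings
+{
+  "persistent": {
+    "plugins.ppl.rex.max_match.limit": 20
+  }
+}
+```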
+ +**PPL query with `max_match=0` automatically limited to the default of 10**: ```ppl source=accounts @@ -235,7 +268,7 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -245,8 +278,8 @@ fetched rows / total rows = 1/1 | 880 Holmes Lane | 10 | +-----------------+-------------+ ``` - -PPL query exceeding the configured limit results in an error + +**A PPL query exceeding the configured limit results in an error**: ```ppl source=accounts @@ -255,37 +288,11 @@ source=accounts | head 1 ``` -Expected output: +The query returns the following results: ```text {'reason': 'Invalid Query', 'details': 'Rex command max_match value (100) exceeds the configured limit (10). Consider using a smaller max_match value or adjust the plugins.ppl.rex.max_match.limit setting.', 'type': 'IllegalArgumentException'} Error: Query returned no data ``` - -## Comparison with Related Commands - -| Feature | rex | parse | -| --- | --- | --- | -| Pattern Type | Java Regex | Java Regex | -| Named Groups Required | Yes | Yes | -| Multiple Named Groups | Yes | No | -| Multiple Matches | Yes | No | -| Text Substitution | Yes | No | -| Offset Tracking | Yes | No | -| Special Characters in Group Names | No | No | - -## Limitations - -**Named Capture Group Naming:** -* Group names must start with a letter and contain only letters and digits -* For detailed Java regex pattern syntax and usage, refer to the [official Java Pattern documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) - -**Pattern Requirements:** -* Pattern must contain at least one named capture group -* Regular capture groups `(...)` without names are not allowed - -**Max Match Limit:** -* The `max_match` parameter is subject to a configurable system limit to prevent memory exhaustion -* When `max_match=0` (unlimited) is specified, it is automatically capped at the configured limit (default: 10) -* User-specified values exceeding the configured limit will result in an error -* Users can adjust the limit via the `plugins.ppl.rex.max_match.limit` cluster setting. Setting this limit to a large value is not recommended as it can lead to excessive memory consumption, especially with patterns that match empty strings (e.g., `\d*`, `\w*`) \ No newline at end of file + +For detailed Java regex pattern syntax and usage, refer to the [official Java Pattern documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). diff --git a/docs/user/ppl/cmd/search.md b/docs/user/ppl/cmd/search.md index f05f47aa19..9b5b740d3c 100644 --- a/docs/user/ppl/cmd/search.md +++ b/docs/user/ppl/cmd/search.md @@ -1,103 +1,157 @@ -# search -## Description +# search -The `search` command retrieves document from the index. The `search` command can only be used as the first command in the PPL query. -## Syntax +The `search` command retrieves documents from the index. The `search` command can only be used as the first command in a PPL query. -search source=[\:]\ [search-expression] -* search: search keyword, which could be ignored. -* index: mandatory. search command must specify which index to query from. The index name can be prefixed by "\:" for cross-cluster search. -* search-expression: optional. Search expression that gets converted to OpenSearch [query_string](https://docs.opensearch.org/latest/query-dsl/full-text/query-string/) function which uses [Lucene Query Syntax](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html). 
+## Syntax + +The `search` command has the following syntax: + +```syntax +search source=[:] [] +``` + +## Parameters + +The `search` command supports the following parameters. + +| Parameter | Required/Optional | Description | +| --- | --- | --- | +| `` | Required | The index to query. The index name can be prefixed with `:` (the remote cluster name) for cross-cluster search. | +| `` | Optional | A search expression that is converted to an OpenSearch [query string](https://docs.opensearch.org/latest/query-dsl/full-text/query-string/) query. | -## Search Expression + +## Search expression The search expression syntax supports: -* **Full text search**: `error` or `"error message"` - Searches the default field configured by the `index.query.default_field` setting (defaults to `*` which searches all fields) -* **Field-value comparisons**: `field=value`, `field!=value`, `field>value`, `field>=value`, `field[+<...>]@` - Time offset from current time - -**Relative Time Components**: -* **Time offset**: `+` (future) or `-` (past) -* **Time amount**: Numeric value + time unit (`second`, `minute`, `hour`, `day`, `week`, `month`, `year`, and their variants) -* **Snap to unit**: Optional `@` to round to nearest unit (hour, day, month, etc.) - -**Examples of Time Modifier Values**: -* `earliest=now` - From current time -* `latest='2024-12-31 23:59:59'` - Until a specific date -* `earliest=-7d` - From 7 days ago -* `latest='+1d@d'` - Until tomorrow at start of day -* `earliest='-1month@month'` - From start of previous month -* `latest=1754020061` - Until a unix timestamp (August 1, 2025 03:47:41 at UTC) - -Read more details on time modifiers in the [PPL relative_timestamp documentation](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/functions/ppl-datetime.md#relative_timestamp). -**Notes:** -* **Column name conflicts**: If your data contains columns named "earliest" or "latest", use backticks to access them as regular fields (e.g., `` `earliest`="value"``) to avoid conflicts with time modifier syntax. -* **Time snap syntax**: Time modifiers with chained time offsets must be wrapped in quotes (e.g., `latest='+1d@month-10h'`) for proper query parsing. - -## Default Field Configuration - -When you search without specifying a field, it searches the default field configured by the `index.query.default_field` index setting (defaults to `*` which searches all fields). -You can check or modify the default field setting - GET /accounts/_settings/index.query.default_field - PUT /accounts/_settings - { - "index.query.default_field": "firstname,lastname,email" - } -## Field Types and Search Behavior - -**Text Fields**: Full-text search, phrase search -* `search message="error occurred" source=logs` -* Limitations: Wildcards apply to terms after analysis, not entire field value. 
- -**Keyword Fields**: Exact matching, wildcard patterns -* `search status="ACTIVE" source=logs` -* Limitations: No text analysis, case-sensitive matching - -**Numeric Fields**: Range queries, exact matching, IN operator -* `search age>=18 AND balance<50000 source=accounts` -* Limitations: No wildcard or text search support - -**Date Fields**: Range queries, exact matching, IN operator -* `search timestamp>="2024-01-01" source=logs` -* Limitations: Must use index mapping date format, no wildcards - -**Boolean Fields**: true/false values only, exact matching, IN operator -* `search active=true source=users` -* Limitations: No wildcards or range queries - -**IP Fields**: Exact matching, CIDR notation -* `search client_ip="192.168.1.0/24" source=logs` -* Limitations: No wildcards for partial IP matching. For wildcard search use multi field with keyword: `search ip_address.keyword='1*' source=logs` or WHERE clause: `source=logs | where cast(ip_address as string) like '1%'` - -**Field Type Performance Tips**: - * Each field type has specific search capabilities and limitations. Using the wrong field type during ingestion impacts performance and accuracy - * For wildcard searches on non-keyword fields: Add a keyword field copy for better performance. Example: If you need wildcards on a text field, create `message.keyword` alongside `message` - -## Cross-Cluster Search - -Cross-cluster search lets any node in a cluster execute search requests against other clusters. Refer to [Cross-Cluster Search](../admin/cross_cluster_search.md) for configuration. -## Example 1: Text Search - -**Basic Text Search** (unquoted single term) +* **Full-text search**: `error` or `"error message"` -- Searches the default field configured in the `index.query.default_field` setting (default is `*`, which specifies all fields). For more information, see [Default field configuration](#default-field-configuration). +* **Field-value comparisons**: `field=value`, `field!=value`, `field>value`, `field>=value`, `field][@]` | A time offset relative to the current time. See [Relative time components](#relative-time-components). | `earliest=-7d`, `latest='+1d@d'` | + +#### Relative time components + +Relative time modifiers use multiple components that can be combined. The following table describes each component. + +| Component | Syntax | Description | Examples | +| --- | --- | --- | --- | +| Time offset | `+` or `-` | Direction: `+` (future) or `-` (past) | `+7d`, `-1h` | +| Amount of time | `` | Numeric value + time unit | `7d`, `1h`, `30m` | +| Round to unit | `@` | Round to nearest unit | `@d` (day), `@h` (hour), `@m` (minute) | + +The following are examples of common time modifier patterns: + +* `earliest=now` -- Start from the current time. +* `latest='2024-12-31 23:59:59'` -- End at a specific date and time. +* `earliest=-7d` -- Start from 7 days ago. +* `latest='+1d@d'` -- End at the start of tomorrow. +* `earliest='-1month@month'` -- Start from the beginning of the previous month. +* `latest=1754020061` -- End at the Unix timestamp `1754020061` (August 1, 2025, 03:47:41 UTC). + +The following considerations apply when using time modifiers in the `search` command: + +* **Column name conflicts**: If your data contains columns named `earliest` or `latest`, use backticks to access them as regular fields (for example, `` `earliest`="value"``) to avoid conflicts with time modifier syntax. 
+* **Time round syntax**: Time modifiers with chained time offsets must be wrapped in quotation marks (for example, `latest='+1d@month-10h'`) for proper query parsing.
+
+## Default field configuration
+
+When a search is performed without specifying a field, it uses the default field configured by the `index.query.default_field` index setting. By default, this is set to `*`, which searches all fields.
+
+To retrieve the default field setting, use the following request:
+
+```bash ignore
+GET /accounts/_settings/index.query.default_field
+```
+
+To modify the default field setting, use the following request:
+
+```bash ignore
+PUT /accounts/_settings
+{
+  "index.query.default_field": "firstname,lastname,email"
+}
+```
+
+## Search behavior by field type
+
+Different field types have specific search capabilities and limitations. The following table summarizes how search expressions work with each field type.
+
+| Field type | Supported operations | Example | Limitations |
+| --- | --- | --- | --- |
+| Text | Full-text search, phrase search | `search message="error occurred" source=logs` | Wildcards apply to terms after analysis, not the entire field value |
+| Keyword | Exact matching, wildcard patterns | `search status="ACTIVE" source=logs` | No text analysis; matching is case sensitive |
+| Numeric | Range queries, exact matching, `IN` operator | `search age>=18 AND balance<50000 source=accounts` | No wildcard or text search support |
+| Date | Range queries, exact matching, `IN` operator | `search timestamp>="2024-01-01" source=logs` | Must follow index mapping date format; wildcards not supported |
+| Boolean | Exact matching, `true` and `false` values, `IN` operator | `search active=true source=users` | No wildcards or range queries |
+| IP | Exact matching, CIDR notation | `search client_ip="192.168.1.0/24" source=logs` | Partial IP wildcard matching not supported. For wildcard search, use multi-field with keyword: `search ip_address.keyword='1*' source=logs` or a WHERE clause: `source=logs \| where cast(ip_address as string) like '1%'` |
+
+Consider the following performance optimizations when working with different field types:
+
+* Each field type has specific search capabilities and limitations. Choosing an inappropriate field type during ingestion can negatively affect performance and query accuracy.
+* For wildcard searches on non-keyword fields, create a `keyword` subfield to improve performance. For example, for wildcard searches on a `message` field of type `text`, add a `message.keyword` field.
+
+## Cross-cluster search
+
+Cross-cluster search lets any node in a cluster execute search requests against other clusters. Refer to [Cross-Cluster Search](../admin/cross_cluster_search.md) for configuration.
+
+## Example 1: Fetching all data
+
+Retrieve all documents from an index by specifying only the source without any search conditions. 
This is useful for exploring small datasets or verifying data ingestion: + +```ppl +source=accounts +``` + +The query returns the following results: + +```text +fetched rows / total rows = 4/4 ++----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+ +| account_number | firstname | address | balance | gender | city | employer | state | age | email | lastname | +|----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------| +| 1 | Amber | 880 Holmes Lane | 39225 | M | Brogan | Pyrami | IL | 32 | amberduke@pyrami.com | Duke | +| 6 | Hattie | 671 Bristol Street | 5686 | M | Dante | Netagy | TN | 36 | hattiebond@netagy.com | Bond | +| 13 | Nanette | 789 Madison Street | 32838 | F | Nogal | Quility | VA | 28 | null | Bates | +| 18 | Dale | 467 Hutchinson Court | 4180 | M | Orick | null | MD | 33 | daleadams@boink.com | Adams | ++----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+ +``` + + +## Example 2: Text search + +For basic text search, use an unquoted single term: ```ppl search ERROR source=otellogs @@ -106,7 +160,7 @@ search ERROR source=otellogs | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -117,14 +171,14 @@ fetched rows / total rows = 1/1 +--------------+---------------------------------------------------------+ ``` -**Phrase Search** (requires quotes for multi-word exact match) +Phrase search requires quotation marks for multi-word exact matching: ```ppl search "Payment failed" source=otellogs | fields body ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -134,8 +188,8 @@ fetched rows / total rows = 1/1 | Payment failed: Insufficient funds for user@example.com | +---------------------------------------------------------+ ``` - -**Implicit AND with Multiple Terms** (unquoted literals are combined with AND) + +Multiple search terms (unquoted string literals) are automatically combined using the `AND` operator: ```ppl search user email source=otellogs @@ -144,7 +198,7 @@ search user email source=otellogs | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -155,15 +209,16 @@ fetched rows / total rows = 1/1 +--------------------------------------------------------------------------------------------------------------------+ ``` -Note: `search user email` is equivalent to `search user AND email`. Multiple unquoted terms are automatically combined with AND. -**Enclose in double quotes for terms which contain special characters** +> **Note**: `search user email` is equivalent to `search user AND email`. 
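+
+For example, the following query is a sketch of the same search written with an explicit `AND`; based on the equivalence stated in the preceding note, it should match the same documents as `search user email`:
+
+```ppl ignore
+search user AND email source=otellogs
+| sort @timestamp
+| fields body
+| head 1
+```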
+ +Enclose terms containing special characters in double quotation marks: ```ppl search "john.doe+newsletter@company.com" source=otellogs | fields body ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -174,8 +229,10 @@ fetched rows / total rows = 1/1 +--------------------------------------------------------------------------------------------------------------------+ ``` -### Mixed Phrase and Boolean - +### Combined phrase and Boolean search + +Combine quoted phrases with Boolean operators for more precise searches: + ```ppl search "User authentication" OR OAuth2 source=otellogs | sort @timestamp @@ -183,7 +240,7 @@ search "User authentication" OR OAuth2 source=otellogs | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -194,10 +251,15 @@ fetched rows / total rows = 1/1 +----------------------------------------------------------------------------------------------------------+ ``` -## Example 2: Boolean Logic and Operator Precedence -### Boolean Operators - +## Example 3: Boolean logic and operator precedence + +The following queries demonstrate Boolean operators and precedence. + +### Boolean operators + +Use `OR` to match documents containing any of the specified conditions: + ```ppl search severityText="ERROR" OR severityText="FATAL" source=otellogs | sort @timestamp @@ -205,7 +267,7 @@ search severityText="ERROR" OR severityText="FATAL" source=otellogs | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -217,14 +279,16 @@ fetched rows / total rows = 3/3 | ERROR | +--------------+ ``` - + +Combine conditions with `AND` to require all criteria to match: + ```ppl -search severityText="INFO" AND `resource.attributes.service.name`="cart-service" source=otellogs -| fields body -| head 1; +search severityText="INFO" AND `resource.attributes.service.name`="cart-service" source=otellogs +| fields body +| head 1 ``` -Expected output +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -235,8 +299,16 @@ fetched rows / total rows = 1/1 +----------------------------------------------------------------------------------+ ``` -**Operator Precedence** (highest to lowest): Parentheses → NOT → OR → AND - +### Operator precedence + +The operators are evaluated using the following precedence: + +``` +Parentheses > NOT > OR > AND +``` + +The following query demonstrates operator precedence: + ```ppl search severityText="ERROR" OR severityText="WARN" AND severityNumber>15 source=otellogs | sort @timestamp @@ -244,7 +316,7 @@ search severityText="ERROR" OR severityText="WARN" AND severityNumber>15 source= | head 2 ``` -Expected output: +The preceding expression is evaluated as `(severityText="ERROR" OR severityText="WARN") AND severityNumber>15`. The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -255,17 +327,20 @@ fetched rows / total rows = 2/2 | ERROR | 17 | +--------------+----------------+ ``` - -The above evaluates as `(severityText="ERROR" OR severityText="WARN") AND severityNumber>15` -## Example 3: NOT vs != Semantics -**!= operator** (field must exist and not equal the value) - +## Example 4: NOT compared to != semantics + +Both `!=` and `NOT` operators find documents in which the field value is not equal to the specified value. However, the `!=` operator excludes documents containing null or missing fields, while the `NOT` operator includes them. 
The following query shows this difference.
+
+**`!=` operator**
+
+Find all accounts for which the `employer` field exists and is not `Quility`:
+
 ```ppl
 search employer!="Quility" source=accounts
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 2/2
@@ -277,13 +352,15 @@ fetched rows / total rows = 2/2
 +----------------+-----------+--------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+
 ```
 
-**NOT operator** (excludes matching conditions, includes null fields)
-
+**`NOT` operator**
+
+Find all accounts that do not specify `Quility` as the employer (including those with null employer values):
+
 ```ppl
 search NOT employer="Quility" source=accounts
 ```
 
-Expected output:
+The query returns the following results. Dale Adams appears in the search results because his `employer` field is `null`:
 
 ```text
 fetched rows / total rows = 3/3
@@ -295,22 +372,65 @@ fetched rows / total rows = 3/3
 | 18             | Dale      | 467 Hutchinson Court | 4180    | M      | Orick  | null     | MD    | 33  | daleadams@boink.com   | Adams    |
 +----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+
 ```
-
-**Key difference**: `!=` excludes null values, `NOT` includes them.
-Dale Adams (account 18) has `employer=null`. He appears in `NOT employer="Quility"` but not in `employer!="Quility"`.
-## Example 4: Wildcards
-### Wildcard Patterns
-
+
+## Example 5: Range queries
+
+Use comparison operators (`>`, `<`, `>=`, and `<=`) to filter numeric and date fields within specific ranges. Range queries are particularly useful for filtering by age, price, timestamps, or any numeric metrics:
+
+```ppl
+search severityNumber>15 AND severityNumber<=20 source=otellogs
+| sort @timestamp
+| fields severityNumber
+| head 3
+```
+
+The query returns the following results:
+
+```text
+fetched rows / total rows = 3/3
++----------------+
+| severityNumber |
+|----------------|
+| 17             |
+| 17             |
+| 18             |
++----------------+
+```
+
+The following query filters by decimal values within a specific range:
+
+```ppl
+search `attributes.payment.amount`>=1000.0 AND `attributes.payment.amount`<=2000.0 source=otellogs
+| fields body
+```
+
+The query returns the following results:
+
+```text
+fetched rows / total rows = 1/1
++---------------------------------------------------------+
+| body                                                    |
+|---------------------------------------------------------|
+| Payment failed: Insufficient funds for user@example.com |
++---------------------------------------------------------+
+```
+
+
+## Example 6: Wildcards
+
+The following queries demonstrate wildcard pattern matching. In wildcard patterns, `*` matches zero or more characters, while `?` matches exactly one character. 
+ +Use `*` to match any number of characters at the end of a term: + ```ppl search severityText=ERR* source=otellogs | sort @timestamp | fields severityText | head 3 ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 3/3 +--------------+ @@ -321,16 +441,18 @@ fetched rows / total rows = 3/3 | ERROR2 | +--------------+ ``` - + +Wildcard searches also work within text fields to find partial matches: + ```ppl -search body=user* source=otellogs -| sort @timestamp -| fields body -| head 2; +search body=user* source=otellogs +| sort @timestamp +| fields body +| head 2 ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 2/2 +----------------------------------------------------------------------------------+ @@ -340,22 +462,18 @@ fetched rows / total rows = 2/2 | Payment failed: Insufficient funds for user@example.com | +----------------------------------------------------------------------------------+ ``` - -**Wildcard Rules**: -* `*` - Matches zero or more characters -* `?` - Matches exactly one character - -### Single character wildcard (?) - + +Use `?` to match exactly one character in specific positions: + ```ppl search severityText="INFO?" source=otellogs | sort @timestamp | fields severityText | head 3 ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 3/3 +--------------+ @@ -366,58 +484,22 @@ fetched rows / total rows = 3/3 | INFO4 | +--------------+ ``` - -## Example 5: Range Queries -Use comparison operators (>, <, >=, <=) to filter numeric and date fields within specific ranges. Range queries are particularly useful for filtering by age, price, timestamps, or any numeric metrics. - -```ppl -search severityNumber>15 AND severityNumber<=20 source=otellogs -| sort @timestamp -| fields severityNumber -| head 3 -``` - -Expected output: - -```text -fetched rows / total rows = 3/3 -+----------------+ -| severityNumber | -|----------------| -| 17 | -| 17 | -| 18 | -+----------------+ -``` - -```ppl -search `attributes.payment.amount`>=1000.0 AND `attributes.payment.amount`<=2000.0 source=otellogs -| fields body; -``` - -Expected output: - -```text -fetched rows / total rows = 1/1 -+---------------------------------------------------------+ -| body | -|---------------------------------------------------------| -| Payment failed: Insufficient funds for user@example.com | -+---------------------------------------------------------+ -``` - -## Example 6: Field Search with Wildcards -When searching in text or keyword fields, wildcards enable partial matching. This is particularly useful for finding records where you only know part of the value. Note that wildcards work best with keyword fields, while text fields may produce unexpected results due to tokenization. -**Partial Search in Keyword Fields** - +## Example 7: Wildcard patterns in field searches + +When searching in text or keyword fields, wildcards enable partial matching, which is useful when you only know part of a value. Wildcards work best on keyword fields, for which they match the exact value using patterns. Using wildcards on text fields may produce unexpected results because they apply to individual tokens after analysis, not the entire field value. Wildcards in keyword fields are case sensitive unless normalized at indexing. + +> **Note**: Leading wildcards (for example, `*@example.com`) can decrease query speed compared to trailing wildcards. 
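+
+For example, the following sketch uses a leading wildcard to match an email domain in the `accounts` dataset (this assumes the `email` field supports wildcard matching; expect it to be slower than a trailing-wildcard pattern such as `amber*`):
+
+```ppl ignore
+search email=*@pyrami.com source=accounts
+| fields firstname, email
+```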
+ +Find records for which you only know the beginning of a field value: + ```ppl search employer=Py* source=accounts | fields firstname, employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -427,36 +509,31 @@ fetched rows / total rows = 1/1 | Amber | Pyrami | +-----------+----------+ ``` - -### Combining Wildcards with Field Comparisons - + +Combine wildcard patterns with other conditions for more precise filtering: + ```ppl search firstname=A* AND age>30 source=accounts | fields firstname, age, city ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 +-----------+-----+--------+ | firstname | age | city | -|-----------+-----+--------| -| Amber | 32 | Brogan | -+-----------+-----+--------+ +|-----------+-----+--------| +| Amber | 32 | Brogan | ++-----------+-----+--------+ ``` - -**Important Notes on Wildcard Usage**: -* **Keyword fields**: Best for wildcard searches - exact value matching with pattern support -* **Text fields**: Wildcards apply to individual tokens after analysis, not the entire field value -* **Performance**: Leading wildcards (e.g., `*@example.com`) are slower than trailing wildcards -* **Case sensitivity**: Keyword field wildcards are case-sensitive unless normalized during indexing - -## Example 7: IN Operator and Field Comparisons -The IN operator efficiently checks if a field matches any value from a list. This is cleaner and more performant than chaining multiple OR conditions for the same field. -**IN Operator** - +## Example 8: Field value matching + +The `IN` operator efficiently checks whether a field matches any value in a list, providing a more concise and more performant alternative to chaining multiple `OR` conditions on the same field. + +Check whether a field matches any value from a predefined list: + ```ppl search severityText IN ("ERROR", "WARN", "FATAL") source=otellogs | sort @timestamp @@ -464,7 +541,7 @@ search severityText IN ("ERROR", "WARN", "FATAL") source=otellogs | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -476,9 +553,10 @@ fetched rows / total rows = 3/3 | FATAL | +--------------+ ``` - -### Field Comparison Examples - + + +Filter logs by `severityNumber` to find errors with a specific numeric severity level: + ```ppl search severityNumber=17 source=otellogs | sort @timestamp @@ -486,7 +564,7 @@ search severityNumber=17 source=otellogs | head 1 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -496,13 +574,15 @@ fetched rows / total rows = 1/1 | Payment failed: Insufficient funds for user@example.com | +---------------------------------------------------------+ ``` - + +Search for logs containing a specific user email address in the attributes: + ```ppl -search `attributes.user.email`="user@example.com" source=otellogs -| fields body; +search `attributes.user.email`="user@example.com" source=otellogs +| fields body ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -513,9 +593,10 @@ fetched rows / total rows = 1/1 +---------------------------------------------------------+ ``` -## Example 8: Complex Expressions -Combine multiple conditions using boolean operators and parentheses to create sophisticated search queries. 
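+
+The `IN` operator also accepts numeric values. For example, the following sketch matches specific account numbers in the `accounts` dataset:
+
+```ppl ignore
+search account_number IN (1, 13) source=accounts
+| fields account_number, firstname
+```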
+## Example 9: Complex expressions + +To create sophisticated search queries, combine multiple conditions using Boolean operators and parentheses: ```ppl search (severityText="ERROR" OR severityText="WARN") AND severityNumber>10 source=otellogs @@ -524,7 +605,7 @@ search (severityText="ERROR" OR severityText="WARN") AND severityNumber>10 sourc | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -536,13 +617,15 @@ fetched rows / total rows = 3/3 | ERROR | +--------------+ ``` - + +Combine multiple conditions with `OR` and `AND` operators to search for logs matching either a specific user or high-severity fund errors: + ```ppl -search `attributes.user.email`="user@example.com" OR (`attributes.error.code`="INSUFFICIENT_FUNDS" AND severityNumber>15) source=otellogs -| fields body; +search `attributes.user.email`="user@example.com" OR (`attributes.error.code`="INSUFFICIENT_FUNDS" AND severityNumber>15) source=otellogs +| fields body ``` -Expected output: +The query returns the following results: ``` fetched rows / total rows = 1/1 @@ -553,17 +636,21 @@ fetched rows / total rows = 1/1 +---------------------------------------------------------+ ``` -## Example 9: Time Modifiers + +## Example 10: Time modifiers Time modifiers filter search results by time range using the implicit `@timestamp` field. They support various time formats for precise temporal filtering. -**Absolute Time Filtering** - + +### Absolute time filtering + +Filter logs within a specific time window using absolute timestamps: + ```ppl search earliest='2024-01-15 10:30:05' latest='2024-01-15 10:30:10' source=otellogs | fields @timestamp, severityText ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 6/6 @@ -579,8 +666,10 @@ fetched rows / total rows = 6/6 +-------------------------------+--------------+ ``` -**Relative Time Filtering** (before 30 seconds ago) - +### Relative time filtering + +Filter logs using relative time expressions, such as those that occurred before 30 seconds ago: + ```ppl search latest=-30s source=otellogs | sort @timestamp @@ -588,7 +677,7 @@ search latest=-30s source=otellogs | head 3 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -601,15 +690,17 @@ fetched rows / total rows = 3/3 +-------------------------------+--------------+ ``` -**Time Snapping** (before start of current minute) - +### Time rounding + +Use time rounding expressions to filter events relative to time boundaries, such as those before the start of the current minute: + ```ppl search latest='@m' source=otellogs | fields @timestamp, severityText | head 2 ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -621,14 +712,16 @@ fetched rows / total rows = 2/2 +-------------------------------+--------------+ ``` -### Unix Timestamp Filtering - +### Unix timestamp filtering + +Filter logs using Unix epoch timestamps for precise time ranges: + ```ppl search earliest=1705314600 latest=1705314605 source=otellogs | fields @timestamp, severityText ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 5/5 @@ -643,35 +736,39 @@ fetched rows / total rows = 5/5 +-------------------------------+--------------+ ``` -## Example 10: Special Characters and Escaping -Understand when and how to escape special characters in your search queries. 
There are two categories of characters that need escaping: -**Characters that must be escaped**: -* **Backslashes (\)**: Always escape as `\\` to search for literal backslash -* **Quotes (")**: Escape as `\"` when inside quoted strings - -**Wildcard characters (escape only to search literally)**: -* **Asterisk (*)**: Use as-is for wildcard, escape as `\\*` to search for literal asterisk -* **Question mark (?)**: Use as-is for wildcard, escape as `\\?` to search for literal question mark - +## Example 11: Escaping special characters -| Intent | PPL Syntax | Result | -|--------|------------|--------| -| Wildcard search | `field=user*` | Matches "user", "user123", "userABC" | -| Literal "user*" | `field="user\\*"` | Matches only "user*" | -| Wildcard search | `field=log?` | Matches "log1", "logA", "logs" | -| Literal "log?" | `field="log\\?"` | Matches only "log?" | - +Special characters fall into two categories, depending on whether they must always be escaped or only when you want to search for their literal value: + +- The following characters must always be escaped to be interpreted literally: + * **Backslash (`\`)**: Escape as `\\`. + * **Quotation mark (`"`)**: Escape as `\"` when used inside a quoted string. + +- These characters act as wildcards by default and should be escaped only when you want to match them literally: + * **Asterisk (`*`)**: Use as `*` for wildcard matching; escape as `\\*` for a literal asterisk. + * **Question mark (`?`)**: Use as `?` for wildcard matching; escape as `\\?` for a literal question mark. + +The following table compares wildcard and literal character matching. + +| Intent | PPL syntax | Result | +| ---| --- | --- | +| Wildcard search | `field=user*` | Matches `user`, `user123`, `userABC` | +| Literal `user*` | `field="user\\*"` | Matches only `user*` | +| Wildcard search | `field=log?` | Matches `log1`, `logA`, `logs` | +| Literal `log?` | `field="log\\?"` | Matches only `log?`| + +### Escaping backslash characters + +Each backslash in the search value must be escaped with another backslash. For example, the following query searches for Windows file paths by properly escaping backslashes: -**Backslash in file paths** - ```ppl search `attributes.error.type`="C:\\\\Users\\\\admin" source=otellogs | fields `attributes.error.type` ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 1/1 +-----------------------+ @@ -680,38 +777,43 @@ fetched rows / total rows = 1/1 | C:\Users\admin | +-----------------------+ ``` - -Note: Each backslash in the search value needs to be escaped with another backslash. When using REST API with JSON, additional JSON escaping is required. -**Quotes within strings** - + +> **Note**: When using the REST API with JSON, additional JSON escaping is required. 
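+
+For example, the following sketch shows the same backslash search sent through the PPL REST API (assuming the plugin's standard `_plugins/_ppl` endpoint); each backslash in the PPL query string must be doubled again when encoded as JSON:
+
+```bash ignore
+POST /_plugins/_ppl
+{
+  "query": "search `attributes.error.type`=\"C:\\\\\\\\Users\\\\\\\\admin\" source=otellogs | fields `attributes.error.type`"
+}
+```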
+ +### Quotation marks within strings + +Search for text containing quotation marks by escaping them with backslashes: + ```ppl search body="\"exact phrase\"" source=otellogs | sort @timestamp | fields body | head 1 ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 1/1 +--------------------------------------------------------------------------------------------------------------------------------------------------------+ -| body | +| body | |--------------------------------------------------------------------------------------------------------------------------------------------------------| | Query contains Lucene special characters: +field:value -excluded AND (grouped OR terms) NOT "exact phrase" wildcard* fuzzy~2 /regex/ [range TO search] | +--------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` - -**Text with special characters** - + +### Text containing special characters + +Search for literal text containing wildcard characters by escaping them: + ```ppl search "wildcard\\* fuzzy~2" source=otellogs | fields body | head 1 ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 1/1 +--------------------------------------------------------------------------------------------------------------------------------------------------------+ @@ -719,27 +821,4 @@ fetched rows / total rows = 1/1 |--------------------------------------------------------------------------------------------------------------------------------------------------------| | Query contains Lucene special characters: +field:value -excluded AND (grouped OR terms) NOT "exact phrase" wildcard* fuzzy~2 /regex/ [range TO search] | +--------------------------------------------------------------------------------------------------------------------------------------------------------+ -``` - -## Example 11: Fetch All Data - -Retrieve all documents from an index by specifying only the source without any search conditions. This is useful for exploring small datasets or verifying data ingestion. 
- -```ppl -source=accounts -``` - -Expected output: - -```text -fetched rows / total rows = 4/4 -+----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+ -| account_number | firstname | address | balance | gender | city | employer | state | age | email | lastname | -|----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------| -| 1 | Amber | 880 Holmes Lane | 39225 | M | Brogan | Pyrami | IL | 32 | amberduke@pyrami.com | Duke | -| 6 | Hattie | 671 Bristol Street | 5686 | M | Dante | Netagy | TN | 36 | hattiebond@netagy.com | Bond | -| 13 | Nanette | 789 Madison Street | 32838 | F | Nogal | Quility | VA | 28 | null | Bates | -| 18 | Dale | 467 Hutchinson Court | 4180 | M | Orick | null | MD | 33 | daleadams@boink.com | Adams | -+----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+ -``` - \ No newline at end of file +``` \ No newline at end of file diff --git a/docs/user/ppl/cmd/showdatasources.md b/docs/user/ppl/cmd/showdatasources.md index 10129873aa..1b9a4dd40f 100644 --- a/docs/user/ppl/cmd/showdatasources.md +++ b/docs/user/ppl/cmd/showdatasources.md @@ -1,23 +1,31 @@ -# show datasources -## Description +# show datasources -Use the `show datasources` command to query datasources configured in the PPL engine. The `show datasources` command can only be used as the first command in the PPL query. -## Syntax +The `show datasources` command queries data sources configured in the PPL engine. The `show datasources` command can only be used as the first command in a PPL query. +> **Note**: To use the `show datasources` command, `plugins.calcite.enabled` must be set to `false`. + +## Syntax + +The `show datasources` command has the following syntax: + +```syntax show datasources -## Example 1: Fetch all PROMETHEUS datasources +``` + +The `show datasources` command takes no parameters. -This example shows fetching all the datasources of type prometheus. -PPL query for all PROMETHEUS DATASOURCES +## Example 1: Fetch all Prometheus data sources + +The following query fetches all Prometheus data sources: ```ppl show datasources | where CONNECTOR_TYPE='PROMETHEUS' ``` -Expected output: - +The query returns the following results: + ```text fetched rows / total rows = 1/1 +-----------------+----------------+ @@ -26,7 +34,4 @@ fetched rows / total rows = 1/1 | my_prometheus | PROMETHEUS | +-----------------+----------------+ ``` - -## Limitations -The `show datasources` command can only work with `plugins.calcite.enabled=false`. \ No newline at end of file diff --git a/docs/user/ppl/cmd/sort.md b/docs/user/ppl/cmd/sort.md index a6e5ba1c0e..a6aaaf22a5 100644 --- a/docs/user/ppl/cmd/sort.md +++ b/docs/user/ppl/cmd/sort.md @@ -1,57 +1,51 @@ -# sort -## Description +# sort -The `sort` command sorts all the search results by the specified fields. -## Syntax +The `sort` command sorts the search results by the specified fields. -sort [count] <[+\|-] sort-field \| sort-field [asc\|a\|desc\|d]>... -* count: optional. The number of results to return. Specifying a count of 0 or less than 0 returns all results. **Default:** 0. -* [+\|-]: optional. The plus [+] stands for ascending order and NULL/MISSING first and a minus [-] stands for descending order and NULL/MISSING last. **Default:** ascending order and NULL/MISSING first. -* [asc\|a\|desc\|d]: optional. 
asc/a stands for ascending order and NULL/MISSING first. desc/d stands for descending order and NULL/MISSING last. **Default:** ascending order and NULL/MISSING first.
-* sort-field: mandatory. The field used to sort. Can use `auto(field)`, `str(field)`, `ip(field)`, or `num(field)` to specify how to interpret field values.
-
-> **Note:**
-> You cannot mix +/- and asc/desc in the same sort command. Choose one approach for all fields in a single sort command.
->
->
+## Syntax
 
-## Example 1: Sort by one field
+The `sort` command supports two syntax notations. You must use one notation consistently within a single `sort` command.
 
-This example shows sorting all documents by age field in ascending order.
-
-```ppl
-source=accounts
-| sort age
-| fields account_number, age
+### Prefix notation
+
+The `sort` command has the following syntax in prefix notation:
+
+```syntax
+sort [<count>] [+|-]<sort-field> [, [+|-]<sort-field>]...
 ```
-
-Expected output:
-
-```text
-fetched rows / total rows = 4/4
-+----------------+-----+
-| account_number | age |
-|----------------+-----|
-| 13             | 28  |
-| 1              | 32  |
-| 18             | 33  |
-| 6              | 36  |
-+----------------+-----+
+
+### Suffix notation
+
+The `sort` command has the following syntax in suffix notation:
+
+```syntax
+sort [<count>] <sort-field> [asc|desc|a|d] [, <sort-field> [asc|desc|a|d]]...
 ```
-
-## Example 2: Sort by one field return all the result
-This example shows sorting all documents by age field in ascending order and returning all results.
-
+## Parameters
+
+The `sort` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<sort-field>` | Required | The field used to sort. Use `auto(field)`, `str(field)`, `ip(field)`, or `num(field)` to specify how to interpret field values. Multiple fields can be specified as a comma-separated list. |
+| `<count>` | Optional | The number of results to return. A value of `0` or less returns all results. Default is `0`. |
+| `[+\|-]` | Optional | **Prefix notation only.** The plus sign (`+`) specifies ascending order, and the minus sign (`-`) specifies descending order. Default is ascending order. |
+| `[asc\|desc\|a\|d]` | Optional | **Suffix notation only.** Specifies the sort order: `asc`/`a` for ascending, `desc`/`d` for descending. Default is ascending order. |
+
+## Example 1: Sort by one field
+
+The following query sorts all documents by the `age` field in ascending order. By default, the `sort` command returns all results, which is equivalent to specifying `sort 0 age`:
+
 ```ppl
 source=accounts
-| sort 0 age
+| sort age
 | fields account_number, age
 ```
-
-Expected output:
-
+
+The query returns the following results:
+
 ```text
 fetched rows / total rows = 4/4
 +----------------+-----+
@@ -63,43 +57,28 @@ fetched rows / total rows = 4/4
 | 6              | 36  |
 +----------------+-----+
 ```
-
-## Example 3: Sort by one field in descending order (using -)
-This example shows sorting all documents by age field in descending order.
-
+
+## Example 2: Sort by one field in descending order
+
+The following query sorts all documents by the `age` field in descending order. 
You can use either prefix notation (`- age`) or suffix notation (`age desc`): + ```ppl source=accounts | sort - age | fields account_number, age ``` - -Expected output: - -```text -fetched rows / total rows = 4/4 -+----------------+-----+ -| account_number | age | -|----------------+-----| -| 6 | 36 | -| 18 | 33 | -| 1 | 32 | -| 13 | 28 | -+----------------+-----+ -``` - -## Example 4: Sort by one field in descending order (using desc) -This example shows sorting all the document by the age field in descending order using the desc keyword. - +This query is equivalent to the following query: + ```ppl source=accounts | sort age desc | fields account_number, age ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 4/4 +----------------+-----+ @@ -111,10 +90,11 @@ fetched rows / total rows = 4/4 | 13 | 28 | +----------------+-----+ ``` - -## Example 5: Sort by multiple fields (using +/-) -This example shows sorting all documents by gender field in ascending order and age field in descending order using +/- operators. + +## Example 3: Sort by multiple fields in prefix notation + +The following query uses prefix notation to sort all documents by the `gender` field in ascending order and the `age` field in descending order: ```ppl source=accounts @@ -122,7 +102,7 @@ source=accounts | fields account_number, gender, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -136,9 +116,10 @@ fetched rows / total rows = 4/4 +----------------+--------+-----+ ``` -## Example 6: Sort by multiple fields (using asc/desc) -This example shows sorting all the document by the gender field in ascending order and age field in descending order using asc/desc keywords. +## Example 4: Sort by multiple fields in suffix notation + +The following query uses suffix notation to sort all documents by the `gender` field in ascending order and the `age` field in descending order: ```ppl source=accounts @@ -146,7 +127,7 @@ source=accounts | fields account_number, gender, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -160,9 +141,10 @@ fetched rows / total rows = 4/4 +----------------+--------+-----+ ``` -## Example 7: Sort by field include null value -This example shows sorting employer field by default option (ascending order and null first). The result shows that null value is in the first row. +## Example 5: Sort fields with null values + +The default ascending order lists null values first. The following query sorts the `employer` field in the default order: ```ppl source=accounts @@ -170,7 +152,7 @@ source=accounts | fields employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -184,9 +166,10 @@ fetched rows / total rows = 4/4 +----------+ ``` -## Example 8: Specify the number of sorted documents to return -This example shows sorting all documents and returning 2 documents. +## Example 6: Specify the number of sorted documents to return + +The following query sorts all documents and returns two documents: ```ppl source=accounts @@ -194,7 +177,7 @@ source=accounts | fields account_number, age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -206,41 +189,18 @@ fetched rows / total rows = 2/2 +----------------+-----+ ``` -## Example 9: Sort with desc modifier -This example shows sorting with the desc modifier to reverse sort order. 
-
-```ppl
-source=accounts
-| sort age desc
-| fields account_number, age
-```
-
-Expected output:
-
-```text
-fetched rows / total rows = 4/4
-+----------------+-----+
-| account_number | age |
-|----------------+-----|
-| 6              | 36  |
-| 18             | 33  |
-| 1              | 32  |
-| 13             | 28  |
-+----------------+-----+
-```
-
-## Example 10: Sort with specifying field type
+## Example 7: Sort by specifying field type
+
+The following query uses the `sort` command with `str()` to sort numeric values lexicographically:
 
-This example shows sorting with str() to sort numeric values lexicographically.
-
 ```ppl
 source=accounts
 | sort str(account_number)
 | fields account_number
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 4/4
diff --git a/docs/user/ppl/cmd/spath.md b/docs/user/ppl/cmd/spath.md
index c83afc3a31..d9293113fb 100644
--- a/docs/user/ppl/cmd/spath.md
+++ b/docs/user/ppl/cmd/spath.md
@@ -1,21 +1,33 @@
-# spath
-## Description
+# spath
 
-The `spath` command allows extracting fields from structured text data. It currently allows selecting from JSON data with JSON paths.
-## Syntax
+The `spath` command extracts fields from structured text data by allowing you to select JSON values using JSON paths.
 
-spath input=\<field\> [output=\<field\>] [path=]\<path\>
-* input: mandatory. The field to scan for JSON data.
-* output: optional. The destination field that the data will be loaded to. **Default:** value of `path`.
-* path: mandatory. The path of the data to load for the object. For more information on path syntax, see [json_extract](../functions/json.md#json_extract).
-
-## Note
+> **Note**: The `spath` command is not executed on OpenSearch data nodes. It extracts fields from data after it has been returned to the coordinator node, which is slow on large datasets. We recommend indexing fields needed for filtering directly instead of using `spath` to filter nested fields.
+
+## Syntax
+
+The `spath` command has the following syntax:
+
+```syntax
+spath input=<field> [output=<field>] [path=]<path>
+```
+
+## Parameters
+
+The `spath` command supports the following parameters.
 
-The `spath` command currently does not support pushdown behavior for extraction. It will be slow on large datasets. It's generally better to index fields needed for filtering directly instead of using `spath` to filter nested fields.
-## Example 1: Simple Field Extraction
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `input` | Required | The field containing JSON data to parse. |
+| `output` | Optional | The destination field in which the extracted data is stored. Default is the value of `<path>`. |
+| `<path>` | Required | The JSON path that identifies the data to extract. |
 
-The simplest spath is to extract a single field. This example extracts `n` from the `doc` field of type `text`.
+For more information about path syntax, see [json_extract](../functions/json.md#json_extract).
+
+## Example 1: Basic field extraction
+
+The basic use of `spath` extracts a single field from JSON data. The following query extracts the `n` field from JSON objects in the `doc_n` field:
 
 ```ppl
 source=structured
@@ -23,7 +35,7 @@ source=structured
 | fields doc_n n
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 3/3
@@ -36,9 +48,10 @@ fetched rows / total rows = 3/3
 +----------+---+
 ```
 
-## Example 2: Lists & Nesting
-This example demonstrates more JSON path uses, like traversing nested fields and extracting list elements. 
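+
+The `output` parameter stores the extracted value under a different field name. The following sketch (a variation of the query above; the `extracted_n` name is illustrative) extracts `n` into `extracted_n`:
+
+```ppl ignore
+source=structured
+| spath input=doc_n output=extracted_n path=n
+| fields doc_n extracted_n
+```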
+## Example 2: Lists and nesting
+
+The following query shows how to traverse nested fields and extract list elements:
 
 ```ppl
 source=structured
@@ -48,7 +61,7 @@ source=structured
 | fields doc_list first_element all_elements nested
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 3/3
@@ -61,9 +74,10 @@ fetched rows / total rows = 3/3
 +------------------------------------------------------+---------------+--------------+--------+
 ```
 
+
 ## Example 3: Sum of inner elements
 
-This example shows extracting an inner field and doing statistics on it, using the docs from example 1. It also demonstrates that `spath` always returns strings for inner types.
+The following query shows how to use `spath` to extract the `n` field from JSON data and calculate the sum of all extracted values:
 
 ```ppl
 source=structured
@@ -73,7 +87,7 @@ source=structured
 | fields `sum(n)`
 ```
 
-Expected output:
+The query returns the following results. The `spath` command always returns inner values as strings:
 
 ```text
 fetched rows / total rows = 1/1
@@ -84,9 +98,10 @@ fetched rows / total rows = 1/1
 +--------+
 ```
 
+
 ## Example 4: Escaped paths
 
-`spath` can escape paths with strings to accept any path that `json_extract` does. This includes escaping complex field names as array components.
+Use quoted string syntax to access JSON field names that contain spaces, dots, or other special characters:
 
 ```ppl
 source=structured
@@ -95,7 +110,7 @@ source=structured
 | fields a b
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 3/3
@@ -107,4 +122,4 @@ fetched rows / total rows = 3/3
 | false | 2 |
 +-------+---+
 ```
- 
\ No newline at end of file
+ 
diff --git a/docs/user/ppl/cmd/stats.md b/docs/user/ppl/cmd/stats.md
index 000d910d97..ac8d3cfa5e 100644
--- a/docs/user/ppl/cmd/stats.md
+++ b/docs/user/ppl/cmd/stats.md
@@ -1,93 +1,68 @@
-# stats
-
-## Description
-
-The `stats` command calculates the aggregation from the search result.
-## Syntax
-
-stats [bucket_nullable=bool] \<aggregation\>... [by-clause]
-* aggregation: mandatory. An aggregation function.
-* bucket_nullable: optional. Controls whether the stats command includes null buckets in group-by aggregations. When set to `false`, the aggregation ignores records where the group-by field is null, resulting in faster performance by excluding null bucket. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`.
-    * When `plugins.ppl.syntax.legacy.preferred=true`, `bucket_nullable` defaults to `true`
-    * When `plugins.ppl.syntax.legacy.preferred=false`, `bucket_nullable` defaults to `false`
-* by-clause: optional. Groups results by specified fields or expressions. Syntax: by [span-expression,] [field,]... **Default:** If no by-clause is specified, the stats command returns only one row, which is the aggregation over the entire result set.
-* span-expression: optional, at most one. Splits field into buckets by intervals. Syntax: span(field_expr, interval_expr). The unit of the interval expression is the natural unit by default. If the field is a date/time type field, the aggregation results always ignore null bucket. For example, `span(age, 10)` creates 10-year age buckets, `span(timestamp, 1h)` creates hourly buckets. 
-    * Available time units
-    * millisecond (ms)
-    * second (s)
-    * minute (m, case sensitive)
-    * hour (h)
-    * day (d)
-    * week (w)
-    * month (M, case sensitive)
-    * quarter (q)
-    * year (y)
-
-## Aggregation Functions
-
-The stats command supports the following aggregation functions:
-* COUNT/C: Count of values
-* SUM: Sum of numeric values
-* AVG: Average of numeric values
-* MAX: Maximum value
-* MIN: Minimum value
-* VAR_SAMP: Sample variance
-* VAR_POP: Population variance
-* STDDEV_SAMP: Sample standard deviation
-* STDDEV_POP: Population standard deviation
-* DISTINCT_COUNT_APPROX: Approximate distinct count
-* TAKE: List of original values
-* PERCENTILE/PERCENTILE_APPROX: Percentile calculations
-* PERC\<percent\>/P\<percent\>: Percentile shortcut functions
-* MEDIAN: 50th percentile
-* EARLIEST: Earliest value by timestamp
-* LATEST: Latest value by timestamp
-* FIRST: First non-null value
-* LAST: Last non-null value
-* LIST: Collect all values into array
-* VALUES: Collect unique values into sorted array
-
-For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).
-## Limitations
+# stats
 
-### Bucket aggregation result may be approximate in large dataset
+The `stats` command calculates aggregations on the search results.
 
-In OpenSearch, `doc_count` values for a terms bucket aggregation may be approximate. As a result, any aggregations (such as `sum` and `avg`) on the terms bucket aggregation may also be approximate.
-For example, the following PPL query (find the top 10 URLs) may return an approximate result if the cardinality of `URL` is high.
+## Comparing stats, eventstats, and streamstats
 
-```ppl ignore
-source=hits
-| stats bucket_nullable=false count() as c by URL
-| sort - c
-| head 10
+For a comprehensive comparison of `stats`, `eventstats`, and `streamstats` commands, including their differences in transformation behavior, output format, aggregation scope, and use cases, see [Comparing stats, eventstats, and streamstats](streamstats.md#comparing-stats-eventstats-and-streamstats).
+
+## Syntax
+
+The `stats` command has the following syntax:
+
+```syntax
+stats [bucket_nullable=bool] <aggregation>... [by-clause]
 ```
 
-This query is pushed down to a terms bucket aggregation DSL query with `"order": { "_count": "desc" }`. In OpenSearch, this terms aggregation may throw away some buckets.
+## Parameters
 
-### Sorting by ascending doc_count may produce inaccurate results
+The `stats` command supports the following parameters.
 
-Similar to above PPL query, the following query (find the rare 10 URLs) often produces inaccurate results.
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<aggregation>` | Required | An aggregation function. |
+| `<by-clause>` | Optional | Groups results by specified fields or expressions. Syntax: `by [span-expression,] [field,]...` If no `by-clause` is specified, the stats command returns only one row, which is the aggregation over the entire search results. |
+| `bucket_nullable` | Optional | Controls whether to include `null` buckets in group-by aggregations. When `false`, ignores records in which the `group-by` field is null, resulting in faster performance. Default is the value of `plugins.ppl.syntax.legacy.preferred`. |
-```ppl ignore
-source=hits
-| stats bucket_nullable=false count() as c by URL
-| sort + c
-| head 10
-```
-A term that is globally infrequent might not appear as infrequent on every individual shard or might be entirely absent from the least frequent results returned by some shards. Conversely, a term that appears infrequently on one shard might be common on another. In both scenarios, rare terms can be missed during shard-level aggregation, resulting in incorrect overall results.
+| `<span-expression>` | Optional | Splits a field into buckets by intervals (maximum of one). Syntax: `span(field_expr, interval_expr)`. By default, the interval uses the field's default unit. For date/time fields, aggregation results ignore null values. 
Examples: `span(age, 10)` creates 10-year age buckets, and `span(timestamp, 1h)` creates hourly buckets. Valid time units are millisecond (`ms`), second (`s`), minute (`m`), hour (`h`), day (`d`), week (`w`), month (`M`), quarter (`q`), year (`y`). |
+
+## Aggregation functions
+
+The `stats` command supports the following aggregation functions:
+
+* `COUNT`/`C` -- Count of values
+* `SUM` -- Sum of numeric values
+* `AVG` -- Average of numeric values
+* `MAX` -- Maximum value
+* `MIN` -- Minimum value
+* `VAR_SAMP` -- Sample variance
+* `VAR_POP` -- Population variance
+* `STDDEV_SAMP` -- Sample standard deviation
+* `STDDEV_POP` -- Population standard deviation
+* `DISTINCT_COUNT_APPROX` -- Approximate distinct count
+* `TAKE` -- List of original values
+* `PERCENTILE`/`PERCENTILE_APPROX` -- Percentile calculations
+* `PERC<percent>`/`P<percent>` -- Percentile shortcut functions
+* `MEDIAN` -- 50th percentile
+* `EARLIEST` -- Earliest value by timestamp
+* `LATEST` -- Latest value by timestamp
+* `FIRST` -- First non-null value
+* `LAST` -- Last non-null value
+* `LIST` -- Collect all values into array
+* `VALUES` -- Collect unique values into sorted array
+
+For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).
 
 ## Example 1: Calculate the count of events
 
-This example shows calculating the count of events in the accounts.
+The following query calculates the count of events in the `accounts` index:
 
 ```ppl
 source=accounts
 | stats count()
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 1/1
@@ -98,17 +73,18 @@ fetched rows / total rows = 1/1
 +---------+
 ```
 
+
 ## Example 2: Calculate the average of a field
 
-This example shows calculating the average age of all the accounts.
+The following query calculates the average age for all accounts:
 
 ```ppl
 source=accounts
 | stats avg(age)
 ```
 
-Expected output:
-
+The query returns the following results:
+ 
 ```text
 fetched rows / total rows = 1/1
 +----------+
@@ -118,16 +94,17 @@ fetched rows / total rows = 1/1
 +----------+
 ```
 
+
 ## Example 3: Calculate the average of a field by group
 
-This example shows calculating the average age of all the accounts group by gender.
+The following query calculates the average age for all accounts, grouped by gender:
 
 ```ppl
 source=accounts
 | stats avg(age) by gender
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 2/2
@@ -139,16 +116,17 @@ fetched rows / total rows = 2/2
 +--------------------+--------+
 ```
 
-## Example 4: Calculate the average, sum and count of a field by group
-This example shows calculating the average age, sum age and count of events of all the accounts group by gender. 
+## Example 4: Calculate the average, sum, and count of a field by group + +The following query calculates the average age, sum of ages, and count of events for all accounts, grouped by gender: ```ppl source=accounts | stats avg(age), sum(age), count() by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -160,16 +138,17 @@ fetched rows / total rows = 2/2 +--------------------+----------+---------+--------+ ``` + ## Example 5: Calculate the maximum of a field -The example calculates the max age of all the accounts. +The following query calculates the maximum age for all accounts: ```ppl source=accounts | stats max(age) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -180,16 +159,17 @@ fetched rows / total rows = 1/1 +----------+ ``` + ## Example 6: Calculate the maximum and minimum of a field by group -The example calculates the max and min age values of all the accounts group by gender. +The following query calculates the maximum and minimum ages for all accounts, grouped by gender: ```ppl source=accounts | stats max(age), min(age) by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -201,16 +181,17 @@ fetched rows / total rows = 2/2 +----------+----------+--------+ ``` + ## Example 7: Calculate the distinct count of a field -To get the count of distinct values of a field, you can use `DISTINCT_COUNT` (or `DC`) function instead of `COUNT`. The example calculates both the count and the distinct count of gender field of all the accounts. +To retrieve the count of distinct values of a field, you can use the `DISTINCT_COUNT` (or `DC`) function instead of `COUNT`. The following query calculates both the count and the distinct count of the `gender` field for all accounts: ```ppl source=accounts | stats count(gender), distinct_count(gender) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -221,16 +202,17 @@ fetched rows / total rows = 1/1 +---------------+------------------------+ ``` + ## Example 8: Calculate the count by a span -The example gets the count of age by the interval of 10 years. +The following query retrieves the count of `age` values grouped into 10-year intervals: ```ppl source=accounts | stats count(age) by span(age, 10) as age_span ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -242,16 +224,17 @@ fetched rows / total rows = 2/2 +------------+----------+ ``` + ## Example 9: Calculate the count by a gender and span -The example gets the count of age by the interval of 10 years and group by gender. +The following query retrieves the count of `age` grouped into 5-year intervals and broken down by `gender`: ```ppl source=accounts | stats count() as cnt by span(age, 5) as age_span, gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -264,14 +247,14 @@ fetched rows / total rows = 3/3 +-----+----------+--------+ ``` -Span will always be the first grouping key whatever order you specify. 
+The `span` expression is always treated as the first grouping key, regardless of its position in the `by` clause: ```ppl source=accounts | stats count() as cnt by gender, span(age, 5) as age_span ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -284,16 +267,17 @@ fetched rows / total rows = 3/3 +-----+----------+--------+ ``` -## Example 10: Calculate the count and get email list by a gender and span -The example gets the count of age by the interval of 10 years and group by gender, additionally for each row get a list of at most 5 emails. - +## Example 10: Count and retrieve an email list by gender and age span + +The following query calculates the count of `age` values grouped into 5-year intervals as well as by `gender` and also returns a list of up to 5 emails for each group: + ```ppl source=accounts | stats count() as cnt, take(email, 5) by span(age, 5) as age_span, gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -306,16 +290,17 @@ fetched rows / total rows = 3/3 +-----+--------------------------------------------+----------+--------+ ``` + ## Example 11: Calculate the percentile of a field -This example shows calculating the percentile 90th age of all the accounts. +The following query calculates the 90th percentile of `age` for all accounts: ```ppl source=accounts | stats percentile(age, 90) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -326,16 +311,17 @@ fetched rows / total rows = 1/1 +---------------------+ ``` + ## Example 12: Calculate the percentile of a field by group -This example shows calculating the percentile 90th age of all the accounts group by gender. +The following query calculates the 90th percentile of `age` for all accounts, grouped by `gender`: ```ppl source=accounts | stats percentile(age, 90) by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -347,16 +333,17 @@ fetched rows / total rows = 2/2 +---------------------+--------+ ``` + ## Example 13: Calculate the percentile by a gender and span -The example gets the percentile 90th age by the interval of 10 years and group by gender. +The following query calculates the 90th percentile of `age`, grouped into 10-year intervals as well as by `gender`: ```ppl source=accounts | stats percentile(age, 90) as p90 by span(age, 10) as age_span, gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -368,16 +355,17 @@ fetched rows / total rows = 2/2 +-----+----------+--------+ ``` + ## Example 14: Collect all values in a field using LIST -The example shows how to collect all firstname values, preserving duplicates and order. 
+The following query collects all `firstname` values, preserving duplicates and order: ```ppl source=accounts | stats list(firstname) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -388,14 +376,17 @@ fetched rows / total rows = 1/1 +-----------------------------+ ``` -## Example 15: Ignore null bucket - + +## Example 15: Ignore a null bucket + +The following query excludes null values from grouping by setting `bucket_nullable=false`: + ```ppl source=accounts | stats bucket_nullable=false count() as cnt by email ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -408,16 +399,17 @@ fetched rows / total rows = 3/3 +-----+-----------------------+ ``` + ## Example 16: Collect unique values in a field using VALUES -The example shows how to collect all unique firstname values, sorted lexicographically with duplicates removed. +The following query collects all unique `firstname` values, sorted lexicographically with duplicates removed: ```ppl source=accounts | stats values(firstname) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -428,27 +420,30 @@ fetched rows / total rows = 1/1 +-----------------------------+ ``` -## Example 17: Span on date/time field always ignore null bucket -Index example data: -+-------+--------+------------+ -Name | DEPTNO | birthday | -+=======+========+============+ -Alice | 1 | 2024-04-21 | -+-------+--------+------------+ -Bob | 2 | 2025-08-21 | -+-------+--------+------------+ -Jeff | null | 2025-04-22 | +## Example 17: Date span grouping with null handling + +The following example uses this sample index data: + +```text +-------+--------+------------+ -Adam | 2 | null | +| Name | DEPTNO | birthday | +|-------+--------+------------| +| Alice | 1 | 2024-04-21 | +| Bob | 2 | 2025-08-21 | +| Jeff | null | 2025-04-22 | +| Adam | 2 | null | +-------+--------+------------+ - +``` + +The following query groups data by yearly spans of the `birthday` field, automatically excluding null values: + ```ppl ignore source=example | stats count() as cnt by span(birthday, 1y) as year ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -459,13 +454,15 @@ fetched rows / total rows = 3/3 | 2 | 2025-01-01 | +-----+------------+ ``` - + +Group by both yearly spans and department number (by default, null `DEPTNO` values are included in the results): + ```ppl ignore source=example | stats count() as cnt by span(birthday, 1y) as year, DEPTNO ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -477,13 +474,15 @@ fetched rows / total rows = 3/3 | 1 | 2025-01-01 | null | +-----+------------+--------+ ``` - + +Use `bucket_nullable=false` to exclude null `DEPTNO` values from the grouping: + ```ppl ignore source=example | stats bucket_nullable=false count() as cnt by span(birthday, 1y) as year, DEPTNO ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -495,16 +494,17 @@ fetched rows / total rows = 3/3 +-----+------------+--------+ ``` + ## Example 18: Calculate the count by the implicit @timestamp field -This example demonstrates that if you omit the field parameter in the span function, it will automatically use the implicit `@timestamp` field. 
+If you omit the `field` parameter in the `span` function, it automatically uses the implicit `@timestamp` field: ```ppl ignore source=big5 | stats count() by span(1month) ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -514,4 +514,37 @@ fetched rows / total rows = 1/1 | 1 | 2023-01-01 00:00:00 | +---------+---------------------+ ``` - \ No newline at end of file + +## Limitations + +The following limitations apply to the `stats` command. + +### Bucket aggregation results may be approximate for high-cardinality fields + +In OpenSearch, `doc_count` values for a `terms` bucket aggregation can be approximate. Thus, any aggregations (such as `sum` or `avg`) performed on those buckets may also be approximate. + +For example, the following query retrieves the top 10 URLs: + +```ppl ignore +source=hits +| stats bucket_nullable=false count() as c by URL +| sort - c +| head 10 +``` + +This query is translated into a `terms` aggregation in OpenSearch with `"order": { "_count": "desc" }`. For fields with high cardinality, some buckets may be discarded, so the results may only be approximate. + +### Sorting by doc_count in ascending order may produce inaccurate results + +When retrieving the least frequent terms for high-cardinality fields, results may be inaccurate. Shard-level aggregations can miss globally rare terms or misrepresent their frequency, causing errors in the overall results. + +For example, the following query retrieves the 10 least frequent URLs: + +```ppl ignore +source=hits +| stats bucket_nullable=false count() as c by URL +| sort + c +| head 10 +``` + +A globally rare term might not appear as rare on every shard or could be entirely absent from some shard results. Conversely, a term that is infrequent on one shard might be common on another. In both cases, shard-level approximations can cause rare terms to be missed, leading to inaccurate overall results. diff --git a/docs/user/ppl/cmd/streamstats.md b/docs/user/ppl/cmd/streamstats.md index c7f79b2133..c7ea6b2da9 100644 --- a/docs/user/ppl/cmd/streamstats.md +++ b/docs/user/ppl/cmd/streamstats.md @@ -1,80 +1,33 @@ -# streamstats -## Description +# streamstats -The `streamstats` command is used to calculate cumulative or rolling statistics as events are processed in order. Unlike `stats` or `eventstats` which operate on the entire dataset at once, it computes values incrementally on a per-event basis, often respecting the order of events in the search results. It allows you to generate running totals, moving averages, and other statistics that evolve with the stream of events. -Key aspects of `streamstats`: -1. It computes statistics incrementally as each event is processed, making it suitable for time-series and sequence-based analysis. -2. Supports arguments such as window (for sliding window calculations) and current (to control whether the current event included in calculation). -3. Retains all original events and appends new fields containing the calculated statistics. -4. Particularly useful for calculating running totals, identifying trends, or detecting changes over sequences of events. - -Difference between `stats`, `eventstats` and `streamstats` -All of these commands can be used to generate aggregations such as average, sum, and maximum, but they have some key differences in how they operate and what they produce: -* Transformation Behavior - * `stats`: Transforms all events into an aggregated result table, losing original event structure. 
- * `eventstats`: Adds aggregation results as new fields to the original events without removing the event structure. - * `streamstats`: Adds cumulative (running) aggregation results to each event as they stream through the pipeline. -* Output Format - * `stats`: Output contains only aggregated values. Original raw events are not preserved. - * `eventstats`: Original events remain, with extra fields containing summary statistics. - * `streamstats`: Original events remain, with extra fields containing running totals or cumulative statistics. -* Aggregation Scope - * `stats`: Based on all events in the search (or groups defined by BY clause). - * `eventstats`: Based on all relevant events, then the result is added back to each event in the group. - * `streamstats`: Calculations occur progressively as each event is processed; can be scoped by window. -* Use Cases - * `stats`: When only aggregated results are needed (e.g., counts, averages, sums). - * `eventstats`: When aggregated statistics are needed alongside original event data. - * `streamstats`: When a running total or cumulative statistic is needed across event streams. - -## Syntax - -streamstats [bucket_nullable=bool] [current=\] [window=\] [global=\] [reset_before="("\")"] [reset_after="("\")"] \... [by-clause] -* function: mandatory. A aggregation function or window function. -* bucket_nullable: optional. Controls whether the streamstats command consider null buckets as a valid group in group-by aggregations. When set to `false`, it will not treat null group-by values as a distinct group during aggregation. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`. - * When `plugins.ppl.syntax.legacy.preferred=true`, `bucket_nullable` defaults to `true` - * When `plugins.ppl.syntax.legacy.preferred=false`, `bucket_nullable` defaults to `false` -* current: optional. If true, the search includes the given, or current, event in the summary calculations. If false, the search uses the field value from the previous event. Syntax: current=\. **Default:** true. -* window: optional. Specifies the number of events to use when computing the statistics. Syntax: window=\. **Default:** 0, which means that all previous and current events are used. -* global: optional. Used only when the window argument is set. Defines whether to use a single window, global=true, or to use separate windows based on the by clause. If global=false and window is set to a non-zero value, a separate window is used for each group of values of the field specified in the by clause. Syntax: global=\. **Default:** true. -* reset_before: optional. Before streamstats calculates for an event, reset_before resets all accumulated statistics when the eval-expression evaluates to true. If used with window, the window is also reset. Syntax: reset_before="("\")". **Default:** false. -* reset_after: optional. After streamstats calculations for an event, reset_after resets all accumulated statistics when the eval-expression evaluates to true. This expression can reference fields returned by streamstats. If used with window, the window is also reset. Syntax: reset_after="("\")". **Default:** false. -* by-clause: optional. The by clause could be the fields and expressions like scalar functions and aggregation functions. Besides, the span clause can be used to split specific field into buckets in the same interval, the stats then does the aggregation by these span buckets. Syntax: by [span-expression,] [field,]... 
**Default:** If no \<by-clause\> is specified, all events are processed as a single group and running statistics are computed across the entire event stream.
-* span-expression: optional, at most one. Splits field into buckets by intervals. Syntax: span(field_expr, interval_expr). For example, `span(age, 10)` creates 10-year age buckets, `span(timestamp, 1h)` creates hourly buckets.
- * Available time units
- * millisecond (ms)
- * second (s)
- * minute (m, case sensitive)
- * hour (h)
- * day (d)
- * week (w)
- * month (M, case sensitive)
- * quarter (q)
- * year (y)
-
-## Aggregation Functions
-
-The streamstats command supports the following aggregation functions:
-* COUNT: Count of values
-* SUM: Sum of numeric values
-* AVG: Average of numeric values
-* MAX: Maximum value
-* MIN: Minimum value
-* VAR_SAMP: Sample variance
-* VAR_POP: Population variance
-* STDDEV_SAMP: Sample standard deviation
-* STDDEV_POP: Population standard deviation
-* DISTINCT_COUNT/DC: Distinct count of values
-* EARLIEST: Earliest value by timestamp
-* LATEST: Latest value by timestamp
+The `streamstats` command calculates cumulative or rolling statistics as events are processed in order. Unlike `stats` or `eventstats`, which operate on the entire dataset at once, `streamstats` processes events incrementally, making it suitable for time-series and sequence-based analysis.
+
+Key features include support for the `window` (sliding window calculations) and `current` (whether to include the current event in calculations) parameters and specialized use cases such as identifying trends or detecting changes over sequences of events.
 
-For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md).
-## Usage
 
-Streamstats
+## Comparing stats, eventstats, and streamstats
 
+The `stats`, `eventstats`, and `streamstats` commands can all generate aggregations such as average, sum, and maximum. However, they differ in how they operate and the results they produce. The following table summarizes these differences.
+
+| Aspect | `stats` | `eventstats` | `streamstats` |
+| --- | --- | --- | --- |
+| Transformation behavior | Transforms all events into an aggregated result table, losing original event structure | Adds aggregation results as new fields to the original events without removing the event structure | Adds cumulative (running) aggregation results to each event as it streams through the pipeline |
+| Output format | Output contains only aggregated values. Original raw events are not preserved | Original events remain, with extra fields containing summary statistics | Original events remain, with extra fields containing running totals or cumulative statistics |
+| Aggregation scope | Based on all events in the search (or groups defined by the `by` clause) | Based on all relevant events, then the result is added back to each event in the group | Calculations occur progressively as each event is processed; can be scoped by window |
+| Use cases | When only aggregated results are needed (for example, counts, averages, sums) | When aggregated statistics are needed alongside original event data | When a running total or cumulative statistic is needed across event streams |
+
+## Syntax
+
+The `streamstats` command has the following syntax:
+
+```syntax
+streamstats [bucket_nullable=bool] [current=<boolean>] [window=<window-size>] [global=<boolean>] [reset_before="("<eval-expression>")"] [reset_after="("<eval-expression>")"] <aggregation-function>...
[by-clause]
+```
+
+The following are examples of the `streamstats` command syntax:
+
+```ppl ignore
 source = table | streamstats avg(a)
 source = table | streamstats current = false avg(a)
 source = table | streamstats window = 5 sum(b)
@@ -88,17 +41,53 @@ source = table | streamstats current=false window=2 global=false avg(a) by b
 source = table | streamstats window=2 reset_before=a>31 avg(b)
 source = table | streamstats current=false reset_after=a>31 avg(b) by c
 ```
+
+## Parameters
+
+The `streamstats` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<aggregation-function>` | Required | An aggregation function or window function. |
+| `bucket_nullable` | Optional | Controls whether to consider null buckets as a valid group in group-by aggregations. When `false`, does not treat null group-by values as a distinct group during aggregation. Default is the value of `plugins.ppl.syntax.legacy.preferred`. |
+| `current` | Optional | Whether to include the current event in summary calculations. When `true`, includes the current event; when `false`, uses the field value from the previous event. Default is `true`. |
+| `window` | Optional | The number of events to use when computing statistics. Default is `0` (all previous and current events are used). |
+| `global` | Optional | Used only when `window` is specified. Determines whether to use a single window (`true`) or separate windows for each group defined by the `by` clause (`false`). When `false` and `window` is non-zero, a separate window is used for each group of values of the field specified in the `by` clause. Default is `true`. |
+| `reset_before` | Optional | Resets all accumulated statistics before `streamstats` computes the running metrics for an event when the `eval-expression` evaluates to `true`. If used with `window`, the window is also reset. Syntax: `reset_before="("<eval-expression>")"`. Default is `false`. |
+| `reset_after` | Optional | Resets all accumulated statistics after `streamstats` computes the running metrics for an event when the `eval-expression` evaluates to `true`. The expression can reference fields returned by `streamstats`. If used with `window`, the window is also reset. Syntax: `reset_after="("<eval-expression>")"`. Default is `false`. |
+| `<by-clause>` | Optional | Fields and expressions for grouping, including scalar functions and aggregation functions. The `span` clause can be used to split specific fields into buckets by intervals. Syntax: `by [span-expression,] [field,]...` If not specified, all events are processed as a single group and running statistics are computed across the entire event stream. |
+| `<span-expression>` | Optional | Splits a field into buckets by intervals (maximum of one). Syntax: `span(field_expr, interval_expr)`. By default, the interval uses the field's default unit. For date/time fields, aggregation results ignore null values. Examples: `span(age, 10)` creates 10-year age buckets, and `span(timestamp, 1h)` creates hourly buckets. Valid time units are millisecond (`ms`), second (`s`), minute (`m`, case sensitive), hour (`h`), day (`d`), week (`w`), month (`M`, case sensitive), quarter (`q`), year (`y`).
| + + +## Aggregation functions + +The `streamstats` command supports the following aggregation functions: + +* `COUNT` -- Count of values +* `SUM` -- Sum of numeric values +* `AVG` -- Average of numeric values +* `MAX` -- Maximum value +* `MIN` -- Minimum value +* `VAR_SAMP` -- Sample variance +* `VAR_POP` -- Population variance +* `STDDEV_SAMP` -- Sample standard deviation +* `STDDEV_POP` -- Population standard deviation +* `DISTINCT_COUNT`/`DC` -- Distinct count of values +* `EARLIEST` -- Earliest value by timestamp +* `LATEST` -- Latest value by timestamp +For detailed documentation of each function, see [Aggregation Functions](../functions/aggregations.md). + ## Example 1: Calculate the running average, sum, and count of a field by group -This example calculates the running average age, running sum of age, and running count of events for all the accounts, grouped by gender. +The following query calculates the running average `age`, running sum of `age`, and running count of events for all accounts, grouped by `gender`: ```ppl source=accounts | streamstats avg(age) as running_avg, sum(age) as running_sum, count() as running_count by gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -112,16 +101,17 @@ fetched rows / total rows = 4/4 +----------------+-----------+----------------------+---------+--------+--------+----------+-------+-----+-----------------------+----------+--------------------+-------------+---------------+ ``` -## Example 2: Running maximum age over a 2-row window -This example calculates the running maximum age over a 2-row window, excluding the current event. +## Example 2: Calculate the running maximum over a 2-row window + +The following query calculates the running maximum `age` over a 2-row window, excluding the current event: ```ppl source=state_country | streamstats current=false window=2 max(age) as prev_max_age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 8/8 @@ -139,37 +129,43 @@ fetched rows / total rows = 8/8 +-------+---------+------------+-------+------+-----+--------------+ ``` -## Example 3: Use the global argument to calculate running statistics -The global argument is only applicable when a window argument is set. It defines how the window is applied in relation to the grouping fields: -* global=true: a global window is applied across all rows, but the calculations inside the window still respect the by groups. -* global=false: the window itself is created per group, meaning each group gets its own independent window. - -This example shows how to calculate the running average of age across accounts by country, using global argument. -original data - +-------+---------+------------+-------+------+-----+ - | name | country | state | month | year | age | - - |-------+---------+------------+-------+------+-----+ - | Jake | USA | California | 4 | 2023 | 70 | - | Hello | USA | New York | 4 | 2023 | 30 | - | John | Canada | Ontario | 4 | 2023 | 25 | - | Jane | Canada | Quebec | 4 | 2023 | 20 | - | Jim | Canada | B.C | 4 | 2023 | 27 | - | Peter | Canada | B.C | 4 | 2023 | 57 | - | Rick | Canada | B.C | 4 | 2023 | 70 | - | David | USA | Washington | 4 | 2023 | 40 | +## Example 3: Global compared to group-specific windows + +The `global` parameter takes the following values: + +* `true`: A global window is applied across all rows, but the calculations inside the window still respect the `by` groups. 
+* `false`: The window itself is created per group, meaning each group receives an independent window. - +-------+---------+------------+-------+------+-----+ -* global=true: The window slides across all rows globally (following their input order), but inside each window, aggregation is still computed by country. So we process the data stream row by row to build the sliding window with size 2. We can see that David and Rick are in a window. -* global=false: Each by group (country) forms its own independent stream and window (size 2). So David and Hello are in one window for USA. This time we get running_avg 35 for David, rather than 40 when global is set true. +The following example uses a sample index containing the following data: + +```text ++-------+---------+------------+-------+------+-----+ +| name | country | state | month | year | age | + +|-------+---------+------------+-------+------+-----+ +| Jake | USA | California | 4 | 2023 | 70 | +| Hello | USA | New York | 4 | 2023 | 30 | +| John | Canada | Ontario | 4 | 2023 | 25 | +| Jane | Canada | Quebec | 4 | 2023 | 20 | +| Jim | Canada | B.C | 4 | 2023 | 27 | +| Peter | Canada | B.C | 4 | 2023 | 57 | +| Rick | Canada | B.C | 4 | 2023 | 70 | +| David | USA | Washington | 4 | 2023 | 40 | + ++-------+---------+------------+-------+------+-----+ +``` + +The following examples calculate the running average of `age` across accounts by country, using a different `global` parameter. + +When `global=true`, the window slides across all rows in input order, but aggregation is still computed by `country`. The sliding window size is `2`: ```ppl source=state_country | streamstats window=2 global=true avg(age) as running_avg by country ``` -Expected output: +As a result, `David` and `Rick` are included in the same sliding window when computing `running_avg` across all rows globally: ```text fetched rows / total rows = 8/8 @@ -187,12 +183,14 @@ fetched rows / total rows = 8/8 +-------+---------+------------+-------+------+-----+-------------+ ``` +In contrast, when `global=false`, each `by` group forms an independent stream and window: + ```ppl -source=state_country -| streamstats window=2 global=false avg(age) as running_avg by country ; +source=state_country +| streamstats window=2 global=false avg(age) as running_avg by country ``` -Expected output: +`David` and `Hello` form a window for the `USA` group. As a result, for `David`, the `running_avg` is `35.0` instead of `40.0` in the previous case: ```text fetched rows / total rows = 8/8 @@ -210,16 +208,17 @@ fetched rows / total rows = 8/8 +-------+---------+------------+-------+------+-----+-------------+ ``` -## Example 4: Use the reset_before and reset_after arguments to reset statistics -This example calculates the running average of age across accounts by country, with resets applied. 
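+
+A per-group window can also be combined with `current=false`. The following sketch (a hypothetical query against the same `state_country` index, not an executed example) gives each row the average `age` of the two preceding rows within its own country group, excluding the current row:
+
+```ppl ignore
+source=state_country
+| streamstats current=false window=2 global=false avg(age) as prev_avg by country // per-group window over the 2 previous rows
+```
+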
+## Example 4: Conditional statistics reset + +The following query calculates the running average of `age` across accounts by `country`, with resets applied: ```ppl source=state_country | streamstats current=false reset_before=age>34 reset_after=age<25 avg(age) as avg_age by country ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 8/8 @@ -237,15 +236,18 @@ fetched rows / total rows = 8/8 +-------+---------+------------+-------+------+-----+---------+ ``` -## Example 5: Null buckets handling - + +## Example 5: Null bucket behavior + +When `bucket_nullable=false`, null values are excluded from group-by aggregations: + ```ppl source=accounts | streamstats bucket_nullable=false count() as cnt by employer | fields account_number, firstname, employer, cnt ``` -Expected output: +Rows in which the `by` field is `null` are excluded from aggregation, so the `cnt` for `Dale` is `null`: ```text fetched rows / total rows = 4/4 @@ -259,13 +261,15 @@ fetched rows / total rows = 4/4 +----------------+-----------+----------+------+ ``` +When `bucket_nullable=true`, null values are treated as a valid group: + ```ppl source=accounts | streamstats bucket_nullable=true count() as cnt by employer | fields account_number, firstname, employer, cnt ``` -Expected output: +As a result, the `cnt` for `Dale` is included and calculated normally: ```text fetched rows / total rows = 4/4 diff --git a/docs/user/ppl/cmd/subquery.md b/docs/user/ppl/cmd/subquery.md index aa33fbbb11..5dcc63c85c 100644 --- a/docs/user/ppl/cmd/subquery.md +++ b/docs/user/ppl/cmd/subquery.md @@ -1,80 +1,39 @@ -# subquery -## Description +# subquery -The `subquery` command allows you to embed one PPL query inside another, enabling complex filtering and data retrieval operations. A subquery is a nested query that executes first and returns results that are used by the outer query for filtering, comparison, or joining operations. -Subqueries are useful for: -1. Filtering data based on results from another query -2. Checking for the existence of related data -3. Performing calculations that depend on aggregated values from other tables -4. Creating complex joins with dynamic conditions - -## Syntax +The `subquery` command allows you to embed one PPL query within another, enabling advanced filtering and data retrieval. A subquery is executed first, and its results are used by the outer query for filtering, comparison, or joining. -subquery: [ source=... \| ... \| ... ] +Common use cases for subqueries include: -Subqueries use the same syntax as regular PPL queries but must be enclosed in square brackets. There are four main types of subqueries: +* Filtering data based on the results of another query. +* Checking for the existence of related data. +* Performing calculations that rely on aggregated values from other tables. +* Creating complex joins with dynamic conditions. -**IN Subquery** -Tests whether a field value exists in the results of a subquery: - -```sql ignore -where [not] in [ source=... | ... | ... ] -``` - -**EXISTS Subquery** -Tests whether a subquery returns any results: - -```sql ignore -where [not] exists [ source=... | ... | ... ] -``` - -**Scalar Subquery** -Returns a single value that can be used in comparisons or calculations - -```sql ignore -where = [ source=... | ... | ... ] -``` - -**Relation Subquery** -Used in join operations to provide dynamic right-side data - -```sql ignore -| join ON condition [ source=... | ... | ... 
]
-```
-
-## Configuration
-
-### plugins.ppl.subsearch.maxout
-
-The size configures the maximum of rows to return from subsearch. The default value is: `10000`. A value of `0` indicates that the restriction is unlimited.
-
-Change the subsearch.maxout to unlimited:
-
-```bash ignore
-sh$ curl -sS -H 'Content-Type: application/json' \
-... -X PUT localhost:9200/_plugins/_query/settings \
-... -d '{"persistent" : {"plugins.ppl.subsearch.maxout" : "0"}}'
-{
-  "acknowledged": true,
-  "persistent": {
-    "plugins": {
-      "ppl": {
-        "subsearch": {
-          "maxout": "-1"
-        }
-      }
-    }
-  },
-  "transient": {}
-}
-```
-
-## Usage
+Subqueries use the same syntax as regular PPL queries but must be enclosed in square brackets. There are four main subquery types:
+
+- [`IN`](#in-subquery)
+- [`EXISTS`](#exists-subquery)
+- [Scalar](#scalar-subquery)
+- [Relation](#relation-subquery)
 
-InSubquery:
+### IN subquery
+
+Tests whether a field value exists in the results of a subquery:
 
+```ppl ignore
+where <field> [not] in [ source=... | ... | ... ]
+```
+
+The following are examples of the `IN` subquery syntax:
+
+```ppl ignore
 source = outer | where a in [ source = inner | fields b ]
 source = outer | where (a) in [ source = inner | fields b ]
 source = outer | where (a,b,c) in [ source = inner | fields d,e,f ]
@@ -82,21 +41,29 @@ source = outer | where a not in [ source = inner | fields b ]
 source = outer | where (a) not in [ source = inner | fields b ]
 source = outer | where (a,b,c) not in [ source = inner | fields d,e,f ]
 source = outer a in [ source = inner | fields b ] // search filtering with subquery
-source = outer a not in [ source = inner | fields b ] // search filtering with subquery)
+source = outer a not in [ source = inner | fields b ] // search filtering with subquery
 source = outer | where a in [ source = inner1 | where b not in [ source = inner2 | fields c ] | fields b ] // nested
 source = table1 | inner join left = l right = r on l.a = r.a AND r.a in [ source = inner | fields d ] | fields l.a, r.a, b, c //as join filter
 ```
 
-ExistsSubquery:
+### EXISTS subquery
+
+Tests whether a subquery returns any results:
 
+```ppl ignore
+where [not] exists [ source=... | ... | ...
]
+```
+
+The following are examples of the `EXISTS` subquery syntax:
+
+```ppl ignore
 // Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner, `e`, `f` are fields of table nested
 source = outer | where exists [ source = inner | where a = c ]
 source = outer | where not exists [ source = inner | where a = c ]
 source = outer | where exists [ source = inner | where a = c and b = d ]
 source = outer | where not exists [ source = inner | where a = c and b = d ]
 source = outer exists [ source = inner | where a = c ] // search filtering with subquery
-source = outer not exists [ source = inner | where a = c ] //search filtering with subquery
+source = outer not exists [ source = inner | where a = c ] // search filtering with subquery
 source = table as t1 exists [ source = table as t2 | where t1.a = t2.a ] //table alias is useful in exists subquery
 source = outer | where exists [ source = inner1 | where a = c and exists [ source = nested | where c = e ] ] //nested
 source = outer | where exists [ source = inner1 | where a = c | where exists [ source = nested | where c = e ] ] //nested
@@ -105,9 +72,17 @@ source = outer | where not exists [ source = inner | where c > 10 ] //uncorrelat
 source = outer | where exists [ source = inner ] | eval l = "nonEmpty" | fields l //special uncorrelated exists
 ```
 
-ScalarSubquery:
+### Scalar subquery
+
+Returns a single value that can be used in comparisons or calculations:
 
+```ppl ignore
+where <field> = [ source=... | ... | ... ]
+```
+
+The following are examples of the scalar subquery syntax:
+
+```ppl ignore
 //Uncorrelated scalar subquery in Select
 source = outer | eval m = [ source = inner | stats max(c) ] | fields m, a
 source = outer | eval m = [ source = inner | stats max(c) ] + b | fields m, a
@@ -129,69 +104,90 @@ source = outer [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats
 //Nested scalar subquery
 source = outer | where a = [ source = inner | stats max(c) | sort c ] OR b = [ source = inner | where c = 1 | stats min(d) | sort d ]
 source = outer | where a = [ source = inner | where c = [ source = nested | stats max(e) by f | sort f ] | stats max(d) by c | sort c | head 1 ]
-RelationSubquery
-source = table1 | join left = l right = r on condition [ source = table2 | where d > 10 | head 5 ] //subquery in join right side
-source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1
 ```
 
-## Example 1: TPC-H q20
+### Relation subquery
 
-This example shows a complex TPC-H query 20 implementation using nested subqueries.
+Used in `join` operations to provide dynamic right-side data:
 
+```ppl ignore
+| join ON condition [ source=... | ... | ... ]
+```
+
+The following are examples of the relation subquery syntax:
+
+```ppl ignore
+source = table1 | join left = l right = r on condition [ source = table2 | where d > 10 | head 5 ] //subquery in join right side
+source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1
+```
+
+## Configuration
+
+The `subquery` command behavior is configured using the `plugins.ppl.subsearch.maxout` setting, which specifies the maximum number of rows to return from the subsearch. Default is `10000`. A value of `0` indicates that the restriction is unlimited.
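+
+For example, with the default setting, the inner query in the following sketch contributes at most 10,000 rows to the outer `in` filter (`outer` and `inner` are hypothetical tables used only for illustration):
+
+```ppl ignore
+source = outer | where a in [ source = inner | fields b ] // inner results are capped by plugins.ppl.subsearch.maxout
+```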
+ +To update the setting, send the following request: + ```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ - source = supplier - | join ON s_nationkey = n_nationkey nation - | where n_name = 'CANADA' - and s_suppkey in [ - source = partsupp - | where ps_partkey in [ - source = part - | where like(p_name, 'forest%') - | fields p_partkey - ] - and ps_availqty > [ - source = lineitem - | where l_partkey = ps_partkey - and l_suppkey = ps_suppkey - and l_shipdate >= date('1994-01-01') - and l_shipdate < date_add(date('1994-01-01'), interval 1 year) - | stats sum(l_quantity) as sum_l_quantity - | eval half_sum_l_quantity = 0.5 * sum_l_quantity // Stats and Eval commands can combine when issues/819 resolved - | fields half_sum_l_quantity - ] - | fields ps_suppkey - ] - """ -}' +PUT /_plugins/_query/settings +{ + "persistent": { + "plugins.ppl.subsearch.maxout": "0" + } +} ``` -## Example 2: TPC-H q22 -This example shows a TPC-H query 22 implementation using EXISTS and scalar subqueries. +## Example 1: TPC-H q20 + +The following query demonstrates a complex TPC-H query 20 implementation using nested subqueries: + +```ppl ignore +source = supplier +| join ON s_nationkey = n_nationkey nation +| where n_name = 'CANADA' + and s_suppkey in [ + source = partsupp + | where ps_partkey in [ + source = part + | where like(p_name, 'forest%') + | fields p_partkey + ] + and ps_availqty > [ + source = lineitem + | where l_partkey = ps_partkey + and l_suppkey = ps_suppkey + and l_shipdate >= date('1994-01-01') + and l_shipdate < date_add(date('1994-01-01'), interval 1 year) + | stats sum(l_quantity) as sum_l_quantity + | eval half_sum_l_quantity = 0.5 * sum_l_quantity // Stats and Eval commands can combine when issues/819 resolved + | fields half_sum_l_quantity + ] + | fields ps_suppkey + ] +``` -```bash ignore -curl -H 'Content-Type: application/json' -X POST localhost:9200/_plugins/_ppl -d '{ - "query" : """ - source = [ + +## Example 2: TPC-H q22 + +The following query demonstrates a TPC-H query 22 implementation using `EXISTS` and scalar subqueries: + +```ppl ignore +source = [ + source = customer + | where substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') + and c_acctbal > [ source = customer - | where substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') - and c_acctbal > [ - source = customer - | where c_acctbal > 0.00 - and substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') - | stats avg(c_acctbal) - ] - and not exists [ - source = orders - | where o_custkey = c_custkey - ] - | eval cntrycode = substring(c_phone, 1, 2) - | fields cntrycode, c_acctbal - ] as custsale - | stats count() as numcust, sum(c_acctbal) as totacctbal by cntrycode - | sort cntrycode - """ - }' - ``` \ No newline at end of file + | where c_acctbal > 0.00 + and substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') + | stats avg(c_acctbal) + ] + and not exists [ + source = orders + | where o_custkey = c_custkey + ] + | eval cntrycode = substring(c_phone, 1, 2) + | fields cntrycode, c_acctbal + ] as custsale +| stats count() as numcust, sum(c_acctbal) as totacctbal by cntrycode +| sort cntrycode +``` diff --git a/docs/user/ppl/cmd/syntax.md b/docs/user/ppl/cmd/syntax.md index 32c5ebe89d..cc6fa3c212 100644 --- a/docs/user/ppl/cmd/syntax.md +++ b/docs/user/ppl/cmd/syntax.md @@ -1,18 +1,106 @@ -# Syntax -## Command Order +# PPL syntax -The PPL query starts with either the `search` command to reference 
a table to search from, or the `describe` command to reference a table to get its metadata. All the following command could be in any order. In the following example, `search` command refer the accounts index as the source, then using fields and where command to do the further processing.
+Every PPL query starts with the `search` command. It specifies the index to search and retrieve documents from.
+
+PPL supports exactly one `search` command per query, and it is always the first command. The word `search` can be omitted.
+
+Subsequent commands can follow in any order.
+
+
+## Syntax
+
+```syntax
+search source=<index> [boolean-expression]
+source=<index> [boolean-expression]
+```
+
+## Parameters
+
+The `search` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<index>` | Required | Specifies the index to query. |
+| `<boolean-expression>` | Optional | Specifies an expression that evaluates to a Boolean value. |
+
+
+## Syntax notation conventions
+
+PPL command syntax uses the following notation conventions.
+
+### Placeholders
+
+Placeholders are shown in angle brackets (`< >`). These must be replaced with actual values.
+
+**Example**: `<field>` means you must specify an actual field name like `age` or `firstname`.
+
+### Optional elements
+
+Optional elements are enclosed in square brackets (`[ ]`). These can be omitted from the command.
+
+**Examples**:
+- `[+|-]` means the plus or minus signs are optional.
+- `[<alias>]` means the alias placeholder is optional.
+
+### Required choices
+
+Required choices between alternatives are shown in parentheses and are delimited with pipe separators (`(option1 | option2)`). You must choose exactly one of the specified options.
+
+**Example**: `(on | where)` means you must use either `on` or `where`, but not both.
+
+### Optional choices
+
+Optional choices between alternatives are shown in square brackets with pipe separators (`[option1 | option2]`). You can choose one of the options or omit them entirely.
+
+**Example**: `[asc | desc]` means you can specify `asc`, `desc`, or neither.
+
+### Repetition
+
+An ellipsis (`...`) indicates that the preceding element can be repeated multiple times.
+
+**Examples**:
+- `<field>...` means one or more fields without commas: `field1 field2 field3`
+- `<field>, ...` means comma-separated repetition: `field1, field2, field3`
 
-```text
+
+## Examples
+
+**Example 1: Search through accounts index**
+
+In the following query, the `search` command refers to an `accounts` index as the source and uses the `fields` and `where` commands for the conditions:
+
+```ppl ignore
 search source=accounts
 | where age > 18
 | fields firstname, lastname
 ```
-
-## Required arguments
-Required arguments are shown in angle brackets < >.
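+
+Because the word `search` can be omitted, the following query is equivalent:
+
+```ppl ignore
+source=accounts | where age > 18 | fields firstname, lastname
+```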
-## Optional arguments
+
+**Example 2: Get all documents**
+
+To get all documents from the `accounts` index, specify it as the `source`:
+
+```ppl ignore
+search source=accounts;
+```
+
+
+| account_number | firstname | address | balance | gender | city | employer | state | age | email | lastname |
+:--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :---
+| 1 | Amber | 880 Holmes Lane | 39225 | M | Brogan | Pyrami | IL | 32 | amberduke@pyrami.com | Duke
+| 6 | Hattie | 671 Bristol Street | 5686 | M | Dante | Netagy | TN | 36 | hattiebond@netagy.com | Bond
+| 13 | Nanette | 789 Madison Street | 32838 | F | Nogal | Quility | VA | 28 | null | Bates
+| 18 | Dale | 467 Hutchinson Court | 4180 | M | Orick | null | MD | 33 | daleadams@boink.com | Adams
+
+**Example 3: Get documents that match a condition**
+
+To get all documents from the `accounts` index that either have `account_number` equal to 1 or have `gender` as `F`, use the following query:
+
+```ppl ignore
+search source=accounts account_number=1 or gender="F";
+```
 
-Optional arguments are enclosed in square brackets [ ].
\ No newline at end of file
+| account_number | firstname | address | balance | gender | city | employer | state | age | email | lastname |
+:--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :---
+| 1 | Amber | 880 Holmes Lane | 39225 | M | Brogan | Pyrami | IL | 32 | amberduke@pyrami.com | Duke |
+| 13 | Nanette | 789 Madison Street | 32838 | F | Nogal | Quility | VA | 28 | null | Bates |
diff --git a/docs/user/ppl/cmd/table.md b/docs/user/ppl/cmd/table.md
index 176752ebfb..fafdb23168 100644
--- a/docs/user/ppl/cmd/table.md
+++ b/docs/user/ppl/cmd/table.md
@@ -1,24 +1,35 @@
-# table
-## Description
+# table
 
-The `table` command is an alias for the [`fields`](fields.md) command and provides the same field selection capabilities. It allows you to keep or remove fields from the search result using enhanced syntax options.
-## Syntax
+The `table` command is an alias for the [`fields`](fields.md) command and provides the same field selection capabilities. It allows you to keep or remove fields from the search results using enhanced syntax options.
 
-table [+\|-] \<field-list\>
-* [+\|-]: optional. If the plus (+) is used, only the fields specified in the field list will be kept. If the minus (-) is used, all the fields specified in the field list will be removed. **Default:** +.
-* field-list: mandatory. Comma-delimited or space-delimited list of fields to keep or remove. Supports wildcard patterns.
-
-## Example 1: Basic table command usage
+## Syntax
+
+The `table` command has the following syntax:
+
+```syntax
+table [+|-] <field-list>
+```
+
+## Parameters
 
-This example shows basic field selection using the table command.
+The `table` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `<field-list>` | Required | A comma-delimited or space-delimited list of fields to keep or remove. Supports wildcard patterns. |
+| `[+|-]` | Optional | Specifies the fields to keep or remove. If the plus sign (`+`) is used, only the fields specified in the field list are kept. If the minus sign (`-`) is used, all the fields specified in the field list are removed. Default is `+`.
| + +## Example: Basic table command usage + +The following query shows basic field selection using the `table` command: ```ppl source=accounts | table firstname lastname age ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -32,6 +43,7 @@ fetched rows / total rows = 4/4 +-----------+----------+-----+ ``` -## See Also -- [fields](fields.md) - Alias command with identical functionality \ No newline at end of file +## Related documentation + +- [`fields`](fields.md) -- An alias command with identical functionality \ No newline at end of file diff --git a/docs/user/ppl/cmd/timechart.md b/docs/user/ppl/cmd/timechart.md index da3831c7ae..4e53a26878 100644 --- a/docs/user/ppl/cmd/timechart.md +++ b/docs/user/ppl/cmd/timechart.md @@ -1,83 +1,102 @@ -# timechart - -## Description - -The `timechart` command creates a time-based aggregation of data. It groups data by time intervals and optionally by a field, then applies an aggregation function to each group. The results are returned in an unpivoted format with separate rows for each time-field combination. -## Syntax - -timechart [timefield=\] [span=\] [limit=\] [useother=\] \ [by \] -* timefield: optional. Specifies the timestamp field to use for time interval grouping. **Default**: `@timestamp`. -* span: optional. Specifies the time interval for grouping data. **Default:** 1m (1 minute). - * Available time units: - * millisecond (ms) - * second (s) - * minute (m, case sensitive) - * hour (h) - * day (d) - * week (w) - * month (M, case sensitive) - * quarter (q) - * year (y) -* limit: optional. Specifies the maximum number of distinct values to display when using the "by" clause. **Default:** 10. - * When there are more distinct values than the limit, the additional values are grouped into an "OTHER" category if useother is not set to false. - * The "most distinct" values are determined by calculating the sum of the aggregation values across all time intervals for each distinct field value. The top N values with the highest sums are displayed individually, while the rest are grouped into the "OTHER" category. - * Set to 0 to show all distinct values without any limit (when limit=0, useother is automatically set to false). - * The parameters can be specified in any order before the aggregation function. - * Only applies when using the "by" clause to group results. -* useother: optional. Controls whether to create an "OTHER" category for values beyond the limit. **Default:** true. - * When set to false, only the top N values (based on limit) are shown without an "OTHER" column. - * When set to true, values beyond the limit are grouped into an "OTHER" category. - * Only applies when using the "by" clause and when there are more distinct values than the limit. -* usenull: optional. Controls whether NULL values are placed into a separate category in the chart. **Default:** true. - * When set to true, NULL values are grouped into a separate category with the label specified by nullstr. - * When set to false, NULL values are excluded from the results. -* nullstr: optional. The display label used for NULL values when usenull is true. **Default:** "NULL". - * Specifies the string representation for the NULL category in the chart output. -* aggregation_function: mandatory. The aggregation function to apply to each time bucket. - * Currently, only a single aggregation function is supported. 
- * Available functions: All aggregation functions supported by the [stats](stats.md) command, as well as the timechart-specific aggregations listed below.
-* by: optional. Groups the results by the specified field in addition to time intervals. If not specified, the aggregation is performed across all documents in each time interval.
-
-## PER_SECOND
-
-Usage: per_second(field) calculates the per-second rate for a numeric field within each time bucket.
-The calculation formula is: `per_second(field) = sum(field) / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
-Return type: DOUBLE
-## PER_MINUTE
-
-Usage: per_minute(field) calculates the per-minute rate for a numeric field within each time bucket.
-The calculation formula is: `per_minute(field) = sum(field) * 60 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
-Return type: DOUBLE
-## PER_HOUR
-
-Usage: per_hour(field) calculates the per-hour rate for a numeric field within each time bucket.
-The calculation formula is: `per_hour(field) = sum(field) * 3600 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
-Return type: DOUBLE
-## PER_DAY
-
-Usage: per_day(field) calculates the per-day rate for a numeric field within each time bucket.
-The calculation formula is: `per_day(field) = sum(field) * 86400 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
-Return type: DOUBLE
-## Notes
+
+# timechart
+
+The `timechart` command creates a time-based aggregation of data. It groups data by time intervals and, optionally, by a field, and then applies an aggregation function to each group. The results are returned in an unpivoted format, with separate rows for each time-field combination.
+
+## Syntax
+
+The `timechart` command has the following syntax:
+
+```syntax
+timechart [timefield=<field>] [span=<time-interval>] [limit=<int>] [useother=<boolean>] [usenull=<boolean>] [nullstr=<string>] <aggregation-function> [by <field>]
+```
+
+## Parameters
+
+The `timechart` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `timefield` | Optional | The field to use for time-based grouping. Must be a timestamp field. Default is `@timestamp`. |
+| `span` | Optional | Specifies the time interval for grouping data. Default is `1m` (1 minute). For a complete list of supported time units, see [Time units](#time-units). |
+| `limit` | Optional | Specifies the maximum number of distinct values to display when using the `by` clause. Default is `10`. When there are more distinct values than the limit, additional values are grouped into an `OTHER` category if `useother` is not set to `false`. The "most distinct" values are determined by calculating the sum of aggregation values across all time intervals. Set to `0` to show all distinct values without any limit (when `limit=0`, `useother` is automatically set to `false`). Only applies when using the `by` clause. |
+| `useother` | Optional | Controls whether to create an `OTHER` category for values beyond the `limit`. When set to `false`, only the top N values (based on `limit`) are shown without an `OTHER` category. When set to `true`, values beyond the `limit` are grouped into an `OTHER` category. This parameter only applies when using the `by` clause and when there are more values than the `limit`. Default is `true`. |
+| `usenull` | Optional | Controls whether to group documents that have null values in the `by` field into a separate `NULL` category.
When `usenull=false`, documents with null values in the `by` field are excluded from the results. When `usenull=true`, documents with null values in the `by` field are grouped into a separate `NULL` category. Default is `true`. |
+| `nullstr` | Optional | Specifies the category name for documents that have null values in the `by` field. This parameter only applies when `usenull` is `true`. Default is `"NULL"`. |
+| `<aggregation-function>` | Required | The aggregation function to apply to each time bucket. Only a single aggregation function is supported. Available functions: All aggregation functions supported by the [stats](stats.md) command as well as the timechart-specific aggregations. |
+| `by` | Optional | Groups the results by the specified field in addition to time intervals. If not specified, the aggregation is performed across all documents in each time interval. |
+
+## Notes
+
+The following considerations apply when using the `timechart` command:
 
 * The `timechart` command requires a timestamp field in the data. By default, it uses the `@timestamp` field, but you can specify a different field using the `timefield` parameter.
 * Results are returned in an unpivoted format with separate rows for each time-field combination that has data.
-* Only combinations with actual data are included in the results - empty combinations are omitted rather than showing null or zero values.
-* The "top N" values for the `limit` parameter are selected based on the sum of values across all time intervals for each distinct field value.
-* When using the `limit` parameter, values beyond the limit are grouped into an "OTHER" category (unless `useother=false`).
-* Examples 6 and 7 use different datasets: Example 6 uses the `events` dataset with fewer hosts for simplicity, while Example 7 uses the `events_many_hosts` dataset with 11 distinct hosts.
-* **Null values**: Documents with null values in the "by" field are treated as a separate category and appear as null in the results.
+* Only combinations with actual data are included in the results---empty combinations are omitted rather than showing null or zero values.
+* The top N values for the `limit` parameter are selected based on the sum of values across all time intervals for each distinct field value.
+* When using the `limit` parameter, values beyond the limit are grouped into an `OTHER` category (unless `useother=false`).
+* Documents with null values in the `by` field are treated as a separate category and appear as null in the results.
+
+### Time units
+
+The following time units are available for the `span` parameter:
+
+* Milliseconds (`ms`)
+* Seconds (`s`)
+* Minutes (`m`, case sensitive)
+* Hours (`h`)
+* Days (`d`)
+* Weeks (`w`)
+* Months (`M`, case sensitive)
+* Quarters (`q`)
+* Years (`y`)
+
+## Timechart-specific aggregation functions
+
+The `timechart` command provides specialized rate-based aggregation functions that calculate values per unit of time.
+
+### per_second
+
+**Usage**: `per_second(field)` calculates the per-second rate for a numeric field within each time bucket.
+
+**Calculation formula**: `per_second(field) = sum(field) / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
+
+**Return type**: DOUBLE
+
+### per_minute
+
+**Usage**: `per_minute(field)` calculates the per-minute rate for a numeric field within each time bucket.
+
+**Calculation formula**: `per_minute(field) = sum(field) * 60 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
+
+**Return type**: DOUBLE
+
+### per_hour
+
+**Usage**: `per_hour(field)` calculates the per-hour rate for a numeric field within each time bucket.
+
+**Calculation formula**: `per_hour(field) = sum(field) * 3600 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
+
+**Return type**: DOUBLE
+
+### per_day
+
+**Usage**: `per_day(field)` calculates the per-day rate for a numeric field within each time bucket.
+
+**Calculation formula**: `per_day(field) = sum(field) * 86400 / span_in_seconds`, where `span_in_seconds` is the span interval in seconds.
+
+**Return type**: DOUBLE
 
 ## Example 1: Count events by hour
 
-This example counts events for each hour and groups them by host.
+The following query counts events in each hourly interval and groups the results by `host`:
 
 ```ppl
 source=events
 | timechart span=1h count() by host
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 2/2
@@ -89,16 +108,17 @@ fetched rows / total rows = 2/2
 +---------------------+---------+---------+
 ```
 
+
 ## Example 2: Count events by minute
 
-This example counts events for each minute and groups them by host.
+The following query counts events in each 1-minute interval and groups the results by `host`:
 
 ```ppl
 source=events
 | timechart span=1m count() by host
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 8/8
@@ -116,16 +136,17 @@ fetched rows / total rows = 8/8
 +---------------------+---------+---------+
 ```
 
-## Example 3: Calculate average number of packets by minute
 
-This example calculates the average packets for each minute without grouping by any field.
+## Example 3: Calculate the average number of packets per minute
+
+The following query calculates the average number of packets per minute without grouping by any additional field:
 
 ```ppl
 source=events
 | timechart span=1m avg(packets)
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 8/8
@@ -143,16 +164,17 @@ fetched rows / total rows = 8/8
 +---------------------+--------------+
 ```
 
-## Example 4: Calculate average number of packets by every 20 minutes and status
 
-This example calculates the average number of packets for every 20 minutes and groups them by status.
+## Example 4: Calculate the average number of packets per 20 minutes and status
+
+The following query calculates the average number of packets in each 20-minute interval and groups the results by `status`:
 
 ```ppl
 source=events
 | timechart span=20m avg(packets) by status
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 8/8
@@ -170,16 +192,17 @@ fetched rows / total rows = 8/8
 +---------------------+------------+--------------+
 ```
 
+
 ## Example 5: Count events by hour and category
 
-This example counts events for each second and groups them by category
+The following query counts events in each 1-hour interval and groups the results by `category`:
 
 ```ppl
 source=events
 | timechart span=1h count() by category
 ```
 
-Expected output:
+The query returns the following results:
 
 ```text
 fetched rows / total rows = 2/2
@@ -191,17 +214,20 @@ fetched rows / total rows = 2/2
 +---------------------+----------+---------+
 ```
 
-## Example 6: Using the limit parameter with count() function
+## Example 6: Using the limit parameter with count()
+
+This example uses the `events` dataset with fewer hosts for simplicity.
+ +When there are many distinct values in the `by` field, the `timechart` command displays only the top values according to the `limit` parameter and groups the remaining values into an `OTHER` category. + +The following query displays the top `2` hosts with the highest event counts and groups all remaining hosts into an `OTHER` category: -When there are many distinct values in the "by" field, the timechart command will display the top values based on the limit parameter and group the rest into an "OTHER" category. -This query will display the top 2 hosts with the highest count values, and group the remaining hosts into an "OTHER" category. - ```ppl source=events | timechart span=1m limit=2 count() by host ``` - -Expected output: + +The query returns the following results: ```text fetched rows / total rows = 8/8 @@ -219,16 +245,19 @@ fetched rows / total rows = 8/8 +---------------------+---------+---------+ ``` -## Example 7: Using limit=0 with count() to show all values -To display all distinct values without any limit, set limit=0: +## Example 7: Use limit=0 with count() to show all values + +This example uses the `events_many_hosts` dataset, which contains 11 distinct hosts. + +To display all distinct values without applying any limit, set `limit=0`: ```ppl source=events_many_hosts | timechart span=1h limit=0 count() by host ``` -Expected output: +All 11 hosts are returned as separate rows without an `OTHER` category: ```text fetched rows / total rows = 11/11 @@ -248,18 +277,17 @@ fetched rows / total rows = 11/11 | 2024-07-01 00:00:00 | web-11 | 1 | +---------------------+--------+---------+ ``` - -This shows all 11 hosts as separate rows without an "OTHER" category. -## Example 8: Using useother=false with count() function -Limit to top 10 hosts without OTHER category (useother=false): +## Example 8: Use useother=false with the count() function + +The following query limits the results to the top 10 hosts without creating an `OTHER` category by setting `useother=false`: ```ppl source=events_many_hosts | timechart span=1h useother=false count() by host ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 10/10 @@ -279,16 +307,17 @@ fetched rows / total rows = 10/10 +---------------------+--------+---------+ ``` -## Example 9: Using limit with useother parameter and avg() function -Limit to top 3 hosts with OTHER category (default useother=true): +## Example 9: Use the limit parameter with the useother parameter and the avg() function + +The following query displays the top 3 hosts based on average `cpu_usage` per hour. 
All remaining hosts are grouped into an `OTHER` category (by default, `useother=true`): ```ppl source=events_many_hosts | timechart span=1h limit=3 avg(cpu_usage) by host ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -302,14 +331,14 @@ fetched rows / total rows = 4/4 +---------------------+--------+----------------+ ``` -Limit to top 3 hosts without OTHER category (useother=false): - +The following query displays the top 3 hosts based on average `cpu_usage` per hour without creating an `OTHER` category by setting `useother=false`: + ```ppl source=events_many_hosts | timechart span=1h limit=3 useother=false avg(cpu_usage) by host ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 3/3 @@ -322,17 +351,17 @@ fetched rows / total rows = 3/3 +---------------------+--------+----------------+ ``` -## Example 10: Handling null values in the "by" field -This example shows how null values in the "by" field are treated as a separate category. The dataset events_null has 1 entry that does not have a host field. -It is put into a separate "NULL" category because the defaults for `usenull` and `nullstr` are `true` and `"NULL"` respectively. - +## Example 10: Handling null values in the by field + +The following query demonstrates how null values in the `by` field are treated as a separate category: + ```ppl source=events_null | timechart span=1h count() by host ``` -Expected output: +The `events_null` dataset contains one entry without a `host` value. Because the default settings are `usenull=true` and `nullstr="NULL"`, this entry is grouped into a separate `NULL` category: ```text fetched rows / total rows = 4/4 @@ -346,16 +375,17 @@ fetched rows / total rows = 4/4 +---------------------+--------+---------+ ``` -## Example 11: Calculate packets per second rate -This example calculates the per-second packet rate for network traffic data using the per_second() function. +## Example 11: Calculate the per-second packet rate + +The following query calculates the per-second packet rate for network traffic data using the `per_second()` function: ```ppl source=events | timechart span=30m per_second(packets) by host ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 4/4 @@ -369,7 +399,10 @@ fetched rows / total rows = 4/4 +---------------------+---------+---------------------+ ``` -## Limitations -* Only a single aggregation function is supported per timechart command. -* The `bins` parameter and other bin options are not supported since the `bin` command is not implemented yet. Use the `span` parameter to control time intervals. \ No newline at end of file +## Limitations + +The `timechart` command has the following limitations: + +* Only a single aggregation function is supported per `timechart` command. +* The `bins` parameter and other `bin` options are not supported. To control the time intervals, use the `span` parameter. \ No newline at end of file diff --git a/docs/user/ppl/cmd/top.md b/docs/user/ppl/cmd/top.md index fa644f2a11..4f93c8fd58 100644 --- a/docs/user/ppl/cmd/top.md +++ b/docs/user/ppl/cmd/top.md @@ -1,133 +1,145 @@ -# top - -## Description - -The `top` command finds the most common tuple of values of all fields in the field list. -## Syntax - -top [N] [top-options] \ [by-clause] -* N: optional. number of results to return. **Default**: 10 -* top-options: optional. options for the top command. 
Supported syntax is [countfield=\] [showcount=\].
- * showcount=\: optional. whether to create a field in output that represent a count of the tuple of values. **Default:** true.
- * countfield=\: optional. the name of the field that contains count. **Default:** 'count'.
- * usenull=\: optional (since 3.4.0). whether to output the null value. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`.
-   * When `plugins.ppl.syntax.legacy.preferred=true`, `usenull` defaults to `true`
-   * When `plugins.ppl.syntax.legacy.preferred=false`, `usenull` defaults to `false`
-* field-list: mandatory. comma-delimited list of field names.
-* by-clause: optional. one or more fields to group the results by.
-
-## Example 1: Find the most common values in a field

-This example finds the most common gender of all the accounts.
-
+# top {#top-command}
+
+The `top` command finds the most common combination of values across all fields specified in the field list.
+
+> **Note**: The `top` command is not rewritten to [query domain-specific language (DSL)](https://docs.opensearch.org/latest/query-dsl/). It is only executed on the coordinating node.
+
+## Syntax
+
+The `top` command has the following syntax:
+
+```syntax
+top [N] [top-options] <field-list> [by-clause]
+```
+
+## Parameters
+
+The `top` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `N` | Optional | The number of results to return. Default is `10`. |
+| `top-options` | Optional | `showcount`: Whether to create a field in the output that represents a count of the tuple of values. Default is `true`.<br>
`countfield`: The name of the field that contains the count. Default is `count`.
`usenull`: Whether to output `null` values. Default is the value of `plugins.ppl.syntax.legacy.preferred`. |
+| `field-list` | Required | A comma-delimited list of field names. |
+| `by-clause` | Optional | One or more fields to group the results by. |
+
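+As a quick illustration of how these parameters combine (a sketch, not part of the original example set, assuming the same `accounts` index used in the examples below), the following query returns the two most common ages within each gender group and names the count column `cnt`:
+
+```ppl
+source=accounts
+| top 2 countfield='cnt' age by gender
+```
+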
+## Example 1: Display counts in the default count column
+
+The following query finds the most common gender values:
+
```ppl
source=accounts
-| top showcount=false gender
+| top gender
```
-
-Expected output:
-
+
+By default, the `top` command automatically includes a `count` column showing the frequency of each value:
+
```text
fetched rows / total rows = 2/2
-+--------+
-| gender |
-|--------|
-| M      |
-| F      |
-+--------+
++--------+-------+
+| gender | count |
+|--------+-------|
+| M      | 3     |
+| F      | 1     |
++--------+-------+
```
-
-## Example 2: Limit results to top N values

-This example finds the most common gender and limits results to 1 value.
-
+
+## Example 2: Find the most common values without the count display
+
+The following query uses `showcount=false` to hide the `count` column in the results:
+
```ppl
source=accounts
-| top 1 showcount=false gender
+| top showcount=false gender
```
-
-Expected output:
-
+
+The query returns the following results:
+
```text
-fetched rows / total rows = 1/1
+fetched rows / total rows = 2/2
+--------+
| gender |
|--------|
| M      |
+| F      |
+--------+
```
-
-## Example 3: Find the most common values grouped by field

-This example finds the most common age of all the accounts grouped by gender.
+## Example 3: Rename the count column
+
+The following query uses the `countfield` parameter to specify a custom name (`cnt`) for the count column instead of the default `count`:

```ppl
source=accounts
-| top 1 showcount=false age by gender
+| top countfield='cnt' gender
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 2/2
+--------+-----+
-| gender | age |
+| gender | cnt |
|--------+-----|
-| F      | 28  |
-| M      | 32  |
+| M      | 3   |
+| F      | 1   |
+--------+-----+
```

-## Example 4: Top command with count field
+## Example 4: Limit the number of returned results
+
+The following query returns the top 1 most common gender value:

-This example finds the most common gender of all the accounts and includes the count.
-
```ppl
source=accounts
-| top gender
+| top 1 showcount=false gender
```
-
-Expected output:
-
+
+The query returns the following results:
+
```text
-fetched rows / total rows = 2/2
-+--------+-------+
-| gender | count |
-|--------+-------|
-| M      | 3     |
-| F      | 1     |
-+--------+-------+
+fetched rows / total rows = 1/1
++--------+
+| gender |
+|--------|
+| M      |
++--------+
```
-
-## Example 5: Specify the count field option

-This example specifies a custom name for the count field. <br>
-
+
+## Example 5: Group the results
+
+The following query uses the `by` clause to find the most common age within each gender group and show it separately for each gender:
+
```ppl
source=accounts
-| top countfield='cnt' gender
+| top 1 showcount=false age by gender
```
-
-Expected output:
-
+
+The query returns the following results:
+
```text
fetched rows / total rows = 2/2
+--------+-----+
-| gender | cnt |
+| gender | age |
|--------+-----|
-| M      | 3   |
-| F      | 1   |
+| F      | 28  |
+| M      | 32  |
+--------+-----+
```
-
-## Example 5: Specify the usenull field option
-
+
+## Example 6: Specify null value handling
+
+The following query specifies `usenull=false` to exclude null values:
+
```ppl
source=accounts
| top usenull=false email
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 3/3
@@ -139,13 +151,15 @@ fetched rows / total rows = 3/3
| hattiebond@netagy.com | 1     |
+-----------------------+-------+
```
-
+
+The following query specifies `usenull=true` to include null values in the results:
+
```ppl
source=accounts
| top usenull=true email
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -159,6 +173,4 @@ fetched rows / total rows = 4/4
+-----------------------+-------+
```

-## Limitations

-The `top` command is not rewritten to OpenSearch DSL, it is only executed on the coordination node.
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/trendline.md b/docs/user/ppl/cmd/trendline.md
index ff4c2fcef3..7b01a8a9c1 100644
--- a/docs/user/ppl/cmd/trendline.md
+++ b/docs/user/ppl/cmd/trendline.md
@@ -1,21 +1,32 @@
-# trendline

-## Description
+# trendline

The `trendline` command calculates moving averages of fields.
-## Syntax

-trendline [sort <[+\|-] sort-field>] \[sma\|wma\](number-of-datapoints, field) [as \] [\[sma\|wma\](number-of-datapoints, field) [as \]]...
-* [+\|-]: optional. The plus [+] stands for ascending order and NULL/MISSING first and a minus [-] stands for descending order and NULL/MISSING last. **Default:** ascending order and NULL/MISSING first.
-* sort-field: mandatory when sorting is used. The field used to sort.
-* sma\|wma: mandatory. Simple Moving Average (sma) applies equal weighting to all values, Weighted Moving Average (wma) applies greater weight to more recent values.
-* number-of-datapoints: mandatory. The number of datapoints to calculate the moving average (must be greater than zero).
-* field: mandatory. The name of the field the moving average should be calculated for.
-* alias: optional. The name of the resulting column containing the moving average. **Default:** field name with "_trendline".
-
-## Example 1: Calculate the simple moving average on one field.
+## Syntax
+
+The `trendline` command has the following syntax:
+
+```syntax
+trendline [sort [+|-] <sort-field>] (sma | wma)(<number-of-datapoints>, <field>) [as <alias>] [(sma | wma)(<number-of-datapoints>, <field>) [as <alias>]]...
+```
+
+## Parameters
+
+The `trendline` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `[+\|-]` | Optional | The sort order for the data. `+` specifies ascending order with `NULL`/`MISSING` first; `-` specifies descending order with `NULL`/`MISSING` last. Default is `+`. |
+| `sort-field` | Required | The field used to sort the data. |
+| `(sma \| wma)` | Required | The type of moving average to calculate. `sma` calculates the simple moving average with equal weighting for all values; `wma` calculates the weighted moving average with more weight given to recent values. |
+| `number-of-datapoints` | Required | The number of data points used to calculate the moving average. Must be greater than zero. |
+| `field` | Required | The field for which the moving average is calculated. |
+| `alias` | Optional | The name of the resulting column containing the moving average. Default is the `field` name with `_trendline` appended. |

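+Both moving-average types can be combined in a single command. The following is a quick sketch (not part of the original example set) that assumes the same `accounts` index used in the examples below:
+
+```ppl
+source=accounts
+| trendline sort + account_number sma(2, account_number) as an_sma wma(2, account_number) as an_wma
+| fields account_number, an_sma, an_wma
+```
+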
-This example shows how to calculate the simple moving average on one field.
+## Example 1: Calculate the simple moving average for one field
+
+The following query calculates the simple moving average for one field:

```ppl
source=accounts
@@ -23,7 +34,7 @@ source=accounts
| fields an
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -37,9 +48,10 @@ fetched rows / total rows = 4/4
+------+
```

-## Example 2: Calculate the simple moving average on multiple fields.

-This example shows how to calculate the simple moving average on multiple fields.
+## Example 2: Calculate the simple moving average for multiple fields
+
+The following query calculates the simple moving average for multiple fields:

```ppl
source=accounts
@@ -47,7 +59,7 @@ source=accounts
| fields an, age_trend
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -61,9 +73,10 @@ fetched rows / total rows = 4/4
+------+-----------+
```

-## Example 3: Calculate the simple moving average on one field without specifying an alias.

-This example shows how to calculate the simple moving average on one field.
+## Example 3: Calculate the simple moving average for one field without specifying an alias
+
+The following query calculates the simple moving average for one field without specifying an alias:

```ppl
source=accounts
@@ -71,7 +84,7 @@ source=accounts
| fields account_number_trendline
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -85,9 +98,10 @@ fetched rows / total rows = 4/4
+--------------------------+
```

-## Example 4: Calculate the weighted moving average on one field.

-This example shows how to calculate the weighted moving average on one field.
+## Example 4: Calculate the weighted moving average for one field
+
+The following query calculates the weighted moving average for one field:

```ppl
source=accounts
@@ -95,7 +109,7 @@ source=accounts
| fields account_number_trendline
```

-Expected output:
+The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -109,6 +123,9 @@ fetched rows / total rows = 4/4
+--------------------------+
```

-## Limitations

-The `trendline` command requires all values in the specified `field` to be non-null. Any rows with null values present in the calculation field will be automatically excluded from the command's output.
\ No newline at end of file
+## Limitations
+
+The `trendline` command has the following limitations:
+
+* The `trendline` command requires all values in the specified `field` parameter to be non-null. Any rows with `null` values in this field are automatically excluded from the command's output.
\ No newline at end of file
diff --git a/docs/user/ppl/cmd/where.md b/docs/user/ppl/cmd/where.md
index 6d87ba4946..c34d9567e5 100644
--- a/docs/user/ppl/cmd/where.md
+++ b/docs/user/ppl/cmd/where.md
@@ -1,47 +1,36 @@
-# where

-## Description
+# where

-The `where` command filters the search result. The `where` command only returns the result when the bool-expression evaluates to true.
-## Syntax
+The `where` command filters the search results. It only returns results that match the specified conditions.

-where \
-* bool-expression: optional. Any expression which could be evaluated to boolean value.
-
-## Example 1: Filter result set with condition
+## Syntax

-This example shows fetching all the documents from the accounts index where account_number is 1 or gender is "F".
-
-```ppl
-source=accounts
-| where account_number=1 or gender="F"
-| fields account_number, gender
-```
-
-Expected output:
-
-```text
-fetched rows / total rows = 2/2
-+----------------+--------+
-| account_number | gender |
-|----------------+--------|
-| 1              | M      |
-| 13             | F      |
-+----------------+--------+
+The `where` command has the following syntax:
+
+```syntax
+where <bool-expression>
```
-
-## Example 2: Basic Field Comparison

-The example shows how to filter accounts with balance greater than 30000.
-
+## Parameters
+
+The `where` command supports the following parameters.
+
+| Parameter | Required/Optional | Description |
+| --- | --- | --- |
+| `bool-expression` | Required | The condition used to filter the results. Only rows in which this condition evaluates to `true` are returned. |
+
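+Any expression that evaluates to a boolean value can be used as the condition, including function calls. As a quick sketch (not part of the original example set, assuming the same `accounts` index), the following query combines a comparison with a null check:
+
+```ppl
+source=accounts
+| where balance > 10000 AND isnotnull(employer)
+| fields account_number, balance, employer
+```
+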
+## Example 1: Filter by numeric values
+
+The following query returns accounts in which `balance` is greater than `30000`:
+
```ppl
source=accounts
| where balance > 30000
| fields account_number, balance
```
-
-Expected output:
-
+
+The query returns the following results:
+
```text
fetched rows / total rows = 2/2
+----------------+---------+
@@ -51,20 +40,70 @@ fetched rows / total rows = 2/2
| 13             | 32838   |
+----------------+---------+
```
-
-## Example 3: Pattern Matching with LIKE

-Pattern Matching with Underscore (\_)

-The example demonstrates using LIKE with underscore (\_) to match a single character.
+## Example 2: Filter using combined criteria
+
+The following query combines multiple conditions using an `AND` operator:
+
+```ppl
+source=accounts
+| where age > 30 AND gender = 'M'
+| fields account_number, age, gender
+```
+
+The query returns the following results:
+
+```text
+fetched rows / total rows = 3/3
++----------------+-----+--------+
+| account_number | age | gender |
+|----------------+-----+--------|
+| 1              | 32  | M      |
+| 6              | 36  | M      |
+| 18             | 33  | M      |
++----------------+-----+--------+
+```
+
+
+## Example 3: Filter with multiple possible values
+
+The following query fetches all the documents from the `accounts` index in which `account_number` is `1` or `gender` is `F`:
+
+```ppl
+source=accounts
+| where account_number=1 or gender="F"
+| fields account_number, gender
+```
+
+The query returns the following results:
+
+```text
+fetched rows / total rows = 2/2
++----------------+--------+
+| account_number | gender |
+|----------------+--------|
+| 1              | M      |
+| 13             | F      |
++----------------+--------+
+```
+
+## Example 4: Filter by text patterns
+
+The `LIKE` operator enables pattern matching on string fields using wildcards.
+
+### Matching a single character
+
+The following query uses an underscore (`_`) to match a single character:
+
```ppl
source=accounts
| where LIKE(state, 'M_')
| fields account_number, state
```
-
-Expected output:
-
+
+The query returns the following results:
+
```text
fetched rows / total rows = 1/1
+----------------+-------+
@@ -73,18 +112,19 @@ fetched rows / total rows = 1/1
| 18             | MD    |
+----------------+-------+
```
-
-Pattern Matching with Percent (%)

-The example demonstrates using LIKE with percent (%) to match multiple characters. <br>
- + +### Matching multiple characters + +The following query uses a percent sign (`%`) to match multiple characters: + ```ppl source=accounts | where LIKE(state, 'V%') | fields account_number, state ``` - -Expected output: - + +The query returns the following results: + ```text fetched rows / total rows = 1/1 +----------------+-------+ @@ -93,33 +133,35 @@ fetched rows / total rows = 1/1 | 13 | VA | +----------------+-------+ ``` - -## Example 4: Multiple Conditions -The example shows how to combine multiple conditions using AND operator. +## Example 5: Filter by excluding specific values + +The following query uses a `NOT` operator to exclude matching records: ```ppl source=accounts -| where age > 30 AND gender = 'M' -| fields account_number, age, gender +| where NOT state = 'CA' +| fields account_number, state ``` -Expected output: +The query returns the following results: ```text -fetched rows / total rows = 3/3 -+----------------+-----+--------+ -| account_number | age | gender | -|----------------+-----+--------| -| 1 | 32 | M | -| 6 | 36 | M | -| 18 | 33 | M | -+----------------+-----+--------+ +fetched rows / total rows = 4/4 ++----------------+-------+ +| account_number | state | +|----------------+-------| +| 1 | IL | +| 6 | TN | +| 13 | VA | +| 18 | MD | ++----------------+-------+ ``` -## Example 5: Using IN Operator -The example demonstrates using IN operator to match multiple values. +## Example 6: Filter using value lists + +The following query uses an `IN` operator to match multiple values: ```ppl source=accounts @@ -127,7 +169,7 @@ source=accounts | fields account_number, state ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 2/2 @@ -139,9 +181,10 @@ fetched rows / total rows = 2/2 +----------------+-------+ ``` -## Example 6: NULL Checks -The example shows how to filter records with NULL values. +## Example 7: Filter records with missing data + +The following query returns records in which the `employer` field is `null`: ```ppl source=accounts @@ -149,7 +192,7 @@ source=accounts | fields account_number, employer ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -160,9 +203,10 @@ fetched rows / total rows = 1/1 +----------------+----------+ ``` -## Example 7: Complex Conditions -The example demonstrates combining multiple conditions with parentheses and logical operators. +## Example 8: Filter using grouped conditions + +The following query combines multiple conditions using parentheses and logical operators: ```ppl source=accounts @@ -170,7 +214,7 @@ source=accounts | fields account_number, balance, age, gender ``` -Expected output: +The query returns the following results: ```text fetched rows / total rows = 1/1 @@ -180,28 +224,4 @@ fetched rows / total rows = 1/1 | 6 | 5686 | 36 | M | +----------------+---------+-----+--------+ ``` - -## Example 8: NOT Conditions - -The example shows how to use NOT operator to exclude matching records. 
- -```ppl -source=accounts -| where NOT state = 'CA' -| fields account_number, state -``` - -Expected output: - -```text -fetched rows / total rows = 4/4 -+----------------+-------+ -| account_number | state | -|----------------+-------| -| 1 | IL | -| 6 | TN | -| 13 | VA | -| 18 | MD | -+----------------+-------+ -``` - + \ No newline at end of file diff --git a/docs/user/ppl/functions/aggregations.md b/docs/user/ppl/functions/aggregations.md index b2cabef985..c4a84e3114 100644 --- a/docs/user/ppl/functions/aggregations.md +++ b/docs/user/ppl/functions/aggregations.md @@ -24,7 +24,7 @@ The following table shows how NULL/MISSING values are handled by aggregation fun #### Description Usage: Returns a count of the number of expr in the rows retrieved. The `C()` function, `c`, and `count` can be used as abbreviations for `COUNT()`. To perform a filtered counting, wrap the condition to satisfy in an `eval` expression. -Example +### Example ```ppl source=accounts @@ -64,8 +64,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: SUM(expr). Returns the sum of expr. -Example +Usage: `SUM(expr)`. Returns the sum of expr. +### Example ```ppl source=accounts @@ -88,8 +88,8 @@ fetched rows / total rows = 2/2 #### Description -Usage: AVG(expr). Returns the average value of expr. -Example +Usage: `AVG(expr)`. Returns the average value of expr. +### Example ```ppl source=accounts @@ -112,9 +112,9 @@ fetched rows / total rows = 2/2 #### Description -Usage: MAX(expr). Returns the maximum value of expr. +Usage: `MAX(expr)`. Returns the maximum value of expr. For non-numeric fields, values are sorted lexicographically. -Example +### Example ```ppl source=accounts @@ -154,9 +154,9 @@ fetched rows / total rows = 1/1 #### Description -Usage: MIN(expr). Returns the minimum value of expr. +Usage: `MIN(expr)`. Returns the minimum value of expr. For non-numeric fields, values are sorted lexicographically. -Example +### Example ```ppl source=accounts @@ -196,8 +196,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: VAR_SAMP(expr). Returns the sample variance of expr. -Example +Usage: `VAR_SAMP(expr)`. Returns the sample variance of expr. +### Example ```ppl source=accounts @@ -219,8 +219,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: VAR_POP(expr). Returns the population standard variance of expr. -Example +Usage: `VAR_POP(expr)`. Returns the population standard variance of expr. +### Example ```ppl source=accounts @@ -242,8 +242,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: STDDEV_SAMP(expr). Return the sample standard deviation of expr. -Example +Usage: `STDDEV_SAMP(expr)`. Return the sample standard deviation of expr. +### Example ```ppl source=accounts @@ -265,8 +265,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: STDDEV_POP(expr). Return the population standard deviation of expr. -Example +Usage: `STDDEV_POP(expr)`. Return the population standard deviation of expr. +### Example ```ppl source=accounts @@ -288,9 +288,9 @@ fetched rows / total rows = 1/1 #### Description -Usage: DISTINCT_COUNT(expr), DC(expr). Returns the approximate number of distinct values using the HyperLogLog++ algorithm. Both functions are equivalent. +Usage: `DISTINCT_COUNT(expr)`, `DC(expr)`. Returns the approximate number of distinct values using the HyperLogLog++ algorithm. Both functions are equivalent. 
For details on algorithm accuracy and precision control, see the [OpenSearch Cardinality Aggregation documentation](https://docs.opensearch.org/latest/aggregations/metric/cardinality/#controlling-precision). -Example +### Example ```ppl source=accounts @@ -313,8 +313,8 @@ fetched rows / total rows = 2/2 #### Description -Usage: DISTINCT_COUNT_APPROX(expr). Return the approximate distinct count value of the expr, using the hyperloglog++ algorithm. -Example +Usage: `DISTINCT_COUNT_APPROX(expr)`. Return the approximate distinct count value of the expr, using the hyperloglog++ algorithm. +### Example ```ppl source=accounts @@ -336,11 +336,11 @@ fetched rows / total rows = 1/1 #### Description -Usage: EARLIEST(field [, time_field]). Return the earliest value of a field based on timestamp ordering. -* field: mandatory. The field to return the earliest value for. -* time_field: optional. The field to use for time-based ordering. Defaults to @timestamp if not specified. +Usage: `EARLIEST(field [, time_field])`. Return the earliest value of a field based on timestamp ordering. +* `field`: mandatory. The field to return the earliest value for. +* `time_field`: optional. The field to use for time-based ordering. Defaults to @timestamp if not specified. -Example +### Example ```ppl source=events @@ -384,11 +384,11 @@ fetched rows / total rows = 2/2 #### Description -Usage: LATEST(field [, time_field]). Return the latest value of a field based on timestamp ordering. -* field: mandatory. The field to return the latest value for. -* time_field: optional. The field to use for time-based ordering. Defaults to @timestamp if not specified. +Usage: `LATEST(field [, time_field])`. Return the latest value of a field based on timestamp ordering. +* `field`: mandatory. The field to return the latest value for. +* `time_field`: optional. The field to use for time-based ordering. Defaults to @timestamp if not specified. -Example +### Example ```ppl source=events @@ -432,11 +432,11 @@ fetched rows / total rows = 2/2 #### Description -Usage: TAKE(field [, size]). Return original values of a field. It does not guarantee on the order of values. -* field: mandatory. The field must be a text field. -* size: optional integer. The number of values should be returned. Default is 10. +Usage: `TAKE(field [, size])`. Return original values of a field. It does not guarantee on the order of values. +* `field`: mandatory. The field must be a text field. +* `size`: optional integer. The number of values should be returned. Default is 10. -Example +### Example ```ppl source=accounts @@ -458,11 +458,11 @@ fetched rows / total rows = 1/1 #### Description -Usage: PERCENTILE(expr, percent) or PERCENTILE_APPROX(expr, percent). Return the approximate percentile value of expr at the specified percentage. -* percent: The number must be a constant between 0 and 100. +Usage: `PERCENTILE(expr, percent)` or `PERCENTILE_APPROX(expr, percent)`. Return the approximate percentile value of expr at the specified percentage. +* `percent`: The number must be a constant between 0 and 100. Note: From 3.1.0, the percentile implementation is switched to MergingDigest from AVLTreeDigest. Ref [issue link](https://github.com/opensearch-project/OpenSearch/issues/18122). -Example +### Example ```ppl source=accounts @@ -525,8 +525,8 @@ fetched rows / total rows = 1/1 #### Description -Usage: MEDIAN(expr). Returns the median (50th percentile) value of `expr`. This is equivalent to `PERCENTILE(expr, 50)`. -Example +Usage: `MEDIAN(expr)`. 
Returns the median (50th percentile) value of `expr`. This is equivalent to `PERCENTILE(expr, 50)`. +### Example ```ppl source=accounts @@ -548,10 +548,10 @@ fetched rows / total rows = 1/1 #### Description -Usage: FIRST(field). Return the first non-null value of a field based on natural document order. Returns NULL if no records exist, or if all records have NULL values for the field. -* field: mandatory. The field to return the first value for. +Usage: `FIRST(field)`. Return the first non-null value of a field based on natural document order. Returns NULL if no records exist, or if all records have NULL values for the field. +* `field`: mandatory. The field to return the first value for. -Example +### Example ```ppl source=accounts @@ -574,10 +574,10 @@ fetched rows / total rows = 2/2 #### Description -Usage: LAST(field). Return the last non-null value of a field based on natural document order. Returns NULL if no records exist, or if all records have NULL values for the field. -* field: mandatory. The field to return the last value for. +Usage: `LAST(field)`. Return the last non-null value of a field based on natural document order. Returns NULL if no records exist, or if all records have NULL values for the field. +* `field`: mandatory. The field to return the last value for. -Example +### Example ```ppl source=accounts @@ -600,9 +600,9 @@ fetched rows / total rows = 2/2 #### Description -Usage: LIST(expr). Collects all values from the specified expression into an array. Values are converted to strings, nulls are filtered, and duplicates are preserved. +Usage: `LIST(expr)`. Collects all values from the specified expression into an array. Values are converted to strings, nulls are filtered, and duplicates are preserved. The function returns up to 100 values with no guaranteed ordering. -* expr: The field expression to collect values from. +* `expr`: The field expression to collect values from. * This aggregation function doesn't support Array, Struct, Object field types. Example with string fields @@ -627,7 +627,7 @@ fetched rows / total rows = 1/1 #### Description -Usage: VALUES(expr). Collects all unique values from the specified expression into a sorted array. Values are converted to strings, nulls are filtered, and duplicates are removed. +Usage: `VALUES(expr)`. Collects all unique values from the specified expression into a sorted array. Values are converted to strings, nulls are filtered, and duplicates are removed. The maximum number of unique values returned is controlled by the `plugins.ppl.values.max.limit` setting: * Default value is 0, which means unlimited values are returned * Can be configured to any positive integer to limit the number of unique values diff --git a/docs/user/ppl/functions/collection.md b/docs/user/ppl/functions/collection.md index c37f8390dd..ca9f7015c1 100644 --- a/docs/user/ppl/functions/collection.md +++ b/docs/user/ppl/functions/collection.md @@ -5,9 +5,9 @@ ### Description Usage: `array(value1, value2, value3...)` create an array with input values. Currently we don't allow mixture types. We will infer a least restricted type, for example `array(1, "demo")` -> ["1", "demo"] -Argument type: value1: ANY, value2: ANY, ... -Return type: ARRAY -Example +**Argument type:** `value1: ANY, value2: ANY, ...` +**Return type:** `ARRAY` +### Example ```ppl source=people @@ -50,9 +50,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `array_length(array)` returns the length of input array. 
-Argument type: array:ARRAY -Return type: INTEGER -Example +**Argument type:** `array:ARRAY` +**Return type:** `INTEGER` +### Example ```ppl source=people @@ -78,9 +78,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `forall(array, function)` check whether all element inside array can meet the lambda function. The function should also return boolean. The lambda function accepts one single input. -Argument type: array:ARRAY, function:LAMBDA -Return type: BOOLEAN -Example +**Argument type:** `array:ARRAY, function:LAMBDA` +**Return type:** `BOOLEAN` +### Example ```ppl source=people @@ -105,9 +105,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `exists(array, function)` check whether existing one of element inside array can meet the lambda function. The function should also return boolean. The lambda function accepts one single input. -Argument type: array:ARRAY, function:LAMBDA -Return type: BOOLEAN -Example +**Argument type:** `array:ARRAY, function:LAMBDA` +**Return type:** `BOOLEAN` +### Example ```ppl source=people @@ -132,9 +132,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `filter(array, function)` filter the element in the array by the lambda function. The function should return boolean. The lambda function accepts one single input. -Argument type: array:ARRAY, function:LAMBDA -Return type: ARRAY -Example +**Argument type:** `array:ARRAY, function:LAMBDA` +**Return type:** `ARRAY` +### Example ```ppl source=people @@ -159,9 +159,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `transform(array, function)` transform the element of array one by one using lambda. The lambda function can accept one single input or two input. If the lambda accepts two argument, the second one is the index of element in array. -Argument type: array:ARRAY, function:LAMBDA -Return type: ARRAY -Example +**Argument type:** `array:ARRAY, function:LAMBDA` +**Return type:** `ARRAY` +### Example ```ppl source=people @@ -204,9 +204,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `reduce(array, acc_base, function, )` use lambda function to go through all element and interact with acc_base. The lambda function accept two argument accumulator and array element. If add one more reduce_function, will apply reduce_function to accumulator finally. The reduce function accept accumulator as the one argument. -Argument type: array:ARRAY, acc_base:ANY, function:LAMBDA, reduce_function:LAMBDA -Return type: ANY -Example +**Argument type:** `array:ARRAY, acc_base:ANY, function:LAMBDA, reduce_function:LAMBDA` +**Return type:** `ANY` +### Example ```ppl source=people @@ -248,10 +248,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: mvjoin(array, delimiter) joins string array elements into a single string, separated by the specified delimiter. NULL elements are excluded from the output. Only string arrays are supported. -Argument type: array: ARRAY of STRING, delimiter: STRING -Return type: STRING -Example +Usage: `mvjoin(array, delimiter)` joins string array elements into a single string, separated by the specified delimiter. NULL elements are excluded from the output. Only string arrays are supported. +**Argument type:** `array: ARRAY of STRING, delimiter: STRING` +**Return type:** `STRING` +### Example ```ppl source=people @@ -294,10 +294,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: mvappend(value1, value2, value3...) appends all elements from arguments to create an array. Flattens array arguments and collects all individual elements. 
Always returns an array or null for consistent type behavior. -Argument type: value1: ANY, value2: ANY, ... -Return type: ARRAY -Example +Usage: `mvappend(value1, value2, value3...)` appends all elements from arguments to create an array. Flattens array arguments and collects all individual elements. Always returns an array or null for consistent type behavior. +**Argument type:** `value1: ANY, value2: ANY, ...` +**Return type:** `ARRAY` +### Example ```ppl source=people @@ -465,11 +465,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: split(str, delimiter) splits the string values on the delimiter and returns the string values as a multivalue field (array). Use an empty string ("") to split the original string into one value per character. If the delimiter is not found, returns an array containing the original string. If the input string is empty, returns an empty array. +Usage: `split(str, delimiter)` splits the string values on the delimiter and returns the string values as a multivalue field (array). Use an empty string ("") to split the original string into one value per character. If the delimiter is not found, returns an array containing the original string. If the input string is empty, returns an empty array. -Argument type: str: STRING, delimiter: STRING +**Argument type:** `str: STRING, delimiter: STRING` -Return type: ARRAY of STRING +**Return type:** `ARRAY of STRING` ### Example @@ -567,10 +567,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: mvdedup(array) removes duplicate values from a multivalue array while preserving the order of first occurrence. NULL elements are filtered out. Returns an array with duplicates removed, or null if the input is null. -Argument type: array: ARRAY -Return type: ARRAY -Example +Usage: `mvdedup(array)` removes duplicate values from a multivalue array while preserving the order of first occurrence. NULL elements are filtered out. Returns an array with duplicates removed, or null if the input is null. +**Argument type:** `array: ARRAY` +**Return type:** `ARRAY` +### Example ```ppl source=people @@ -711,10 +711,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: mvindex(array, start, [end]) returns a subset of the multivalue array using the start and optional end index values. Indexes are 0-based (first element is at index 0). Supports negative indexing where -1 refers to the last element. When only start is provided, returns a single element. When both start and end are provided, returns an array of elements from start to end (inclusive). -Argument type: array: ARRAY, start: INTEGER, end: INTEGER (optional) -Return type: ANY (single element) or ARRAY (range) -Example +Usage: `mvindex(array, start, [end])` returns a subset of the multivalue array using the start and optional end index values. Indexes are 0-based (first element is at index 0). Supports negative indexing where -1 refers to the last element. When only start is provided, returns a single element. When both start and end are provided, returns an array of elements from start to end (inclusive). +**Argument type:** `array: ARRAY, start: INTEGER, end: INTEGER (optional)` +**Return type:** `ANY (single element) or ARRAY (range)` +### Example ```ppl source=people @@ -878,7 +878,7 @@ fetched rows / total rows = 1/1 ### Description -Usage: mvzip(mv_left, mv_right, [delim]) combines the values in two multivalue arrays by pairing corresponding elements and joining them into strings. 
The delimiter is used to specify a delimiting character to join the two values. This is similar to the Python zip command. +Usage: `mvzip(mv_left, mv_right, [delim])` combines the values in two multivalue arrays by pairing corresponding elements and joining them into strings. The delimiter is used to specify a delimiting character to join the two values. This is similar to the Python zip command. The values are stitched together combining the first value of mv_left with the first value of mv_right, then the second with the second, and so on. Each pair is concatenated into a string using the delimiter. The function stops at the length of the shorter array. @@ -886,9 +886,9 @@ The delimiter is optional. When specified, it must be enclosed in quotation mark Returns null if either input is null. Returns an empty array if either input array is empty. -Argument type: mv_left: ARRAY, mv_right: ARRAY, delim: STRING (optional) -Return type: ARRAY of STRING -Example +**Argument type:** `mv_left: ARRAY, mv_right: ARRAY, delim: STRING (optional)` +**Return type:** `ARRAY of STRING` +### Example ```ppl source=people diff --git a/docs/user/ppl/functions/condition.md b/docs/user/ppl/functions/condition.md index 8d65680fcd..512b5edbbe 100644 --- a/docs/user/ppl/functions/condition.md +++ b/docs/user/ppl/functions/condition.md @@ -1,18 +1,22 @@ # Condition Functions +PPL functions use the search capabilities of the OpenSearch engine. However, these functions don't execute directly within the OpenSearch plugin's memory. Instead, they facilitate the global filtering of query results based on specific conditions, such as a `WHERE` or `HAVING` clause. +The following sections describe the condition PPL functions. ## ISNULL ### Description -Usage: isnull(field) returns TRUE if field is NULL, FALSE otherwise. +Usage: `isnull(field)` returns TRUE if field is NULL, FALSE otherwise. + The `isnull()` function is commonly used: - In `eval` expressions to create conditional fields - With the `if()` function to provide default values - In `where` clauses to filter null records -Argument type: all the supported data types. -Return type: BOOLEAN -Example +**Argument type:** All supported data types. +**Return type:** `BOOLEAN` + +### Example ```ppl source=accounts @@ -79,17 +83,19 @@ fetched rows / total rows = 1/1 ### Description -Usage: isnotnull(field) returns TRUE if field is NOT NULL, FALSE otherwise. +Usage: `isnotnull(field)` returns TRUE if field is NOT NULL, FALSE otherwise. The `isnotnull(field)` function is the opposite of `isnull(field)`. Instead of checking for null values, it checks a specific field and returns `true` if the field contains data, that is, it is not null. + The `isnotnull()` function is commonly used: - In `eval` expressions to create boolean flags - In `where` clauses to filter out null values - With the `if()` function for conditional logic - To validate data presence -Argument type: all the supported data types. -Return type: BOOLEAN -Synonyms: [ISPRESENT](#ispresent) -Example +**Argument type:** All supported data types. +**Return type:** `BOOLEAN` +**Synonyms:** [ISPRESENT](#ispresent) + +### Example ```ppl source=accounts @@ -178,10 +184,12 @@ fetched rows / total rows = 1/1 ### Description -Usage: ifnull(field1, field2) returns field2 if field1 is null. -Argument type: all the supported data types (NOTE : if two parameters have different types, you will fail semantic check). -Return type: any -Example +Usage: `ifnull(field1, field2)` returns field2 if field1 is null. 
+ +**Argument type:** All supported data types (NOTE: if two parameters have different types, you will fail semantic check). +**Return type:** `any` + +### Example ```ppl source=accounts @@ -206,8 +214,8 @@ fetched rows / total rows = 4/4 ### Nested IFNULL Pattern For OpenSearch versions prior to 3.1, COALESCE-like functionality can be achieved using nested IFNULL statements. This pattern is particularly useful in observability use cases where field names may vary across different data sources. -Usage: ifnull(field1, ifnull(field2, ifnull(field3, default_value))) -Example +Usage: `ifnull(field1, ifnull(field2, ifnull(field3, default_value)))` +### Example ```ppl source=accounts @@ -233,10 +241,12 @@ fetched rows / total rows = 4/4 ### Description -Usage: nullif(field1, field2) returns null if two parameters are same, otherwise returns field1. -Argument type: all the supported data types (NOTE : if two parameters have different types, you will fail semantic check). -Return type: any -Example +Usage: `nullif(field1, field2)` returns null if two parameters are same, otherwise returns field1. + +**Argument type:** All supported data types (NOTE: if two parameters have different types, you will fail semantic check). +**Return type:** `any` + +### Example ```ppl source=accounts @@ -262,10 +272,12 @@ fetched rows / total rows = 4/4 ### Description -Usage: if(condition, expr1, expr2) returns expr1 if condition is true, otherwise returns expr2. -Argument type: all the supported data types (NOTE : if expr1 and expr2 are different types, you will fail semantic check). -Return type: any -Example +Usage: `if(condition, expr1, expr2)` returns expr1 if condition is true, otherwise returns expr2. + +**Argument type:** All supported data types (NOTE: if expr1 and expr2 are different types, you will fail semantic check). +**Return type:** `any` + +### Example ```ppl source=accounts @@ -331,16 +343,18 @@ fetched rows / total rows = 4/4 ### Description -Usage: case(condition1, expr1, condition2, expr2, ... conditionN, exprN else default) returns expr1 if condition1 is true, or returns expr2 if condition2 is true, ... if no condition is true, then returns the value of ELSE clause. If the ELSE clause is not defined, returns NULL. -Argument type: all the supported data types (NOTE : there is no comma before "else"). -Return type: any +Usage: `case(condition1, expr1, condition2, expr2, ... conditionN, exprN else default)` returns expr1 if condition1 is true, or returns expr2 if condition2 is true, ... if no condition is true, then returns the value of ELSE clause. If the ELSE clause is not defined, returns NULL. + +**Argument type:** All supported data types (NOTE: there is no comma before "else"). +**Return type:** `any` + ### Limitations When each condition is a field comparison with a numeric literal and each result expression is a string literal, the query will be optimized as [range aggregations](https://docs.opensearch.org/latest/aggregations/bucket/range) if pushdown optimization is enabled. However, this optimization has the following limitations: - Null values will not be grouped into any bucket of a range aggregation and will be ignored - The default ELSE clause will use the string literal `"null"` instead of actual NULL values -Example +### Example ```ppl source=accounts @@ -404,9 +418,10 @@ fetched rows / total rows = 2/2 ### Description -Usage: coalesce(field1, field2, ...) returns the first non-null, non-missing value in the argument list. -Argument type: all the supported data types. 
Supports mixed data types with automatic type coercion. -Return type: determined by the least restrictive common type among all arguments, with fallback to string if no common type can be determined +Usage: `coalesce(field1, field2, ...)` returns the first non-null, non-missing value in the argument list. + +**Argument type:** All supported data types. Supports mixed data types with automatic type coercion. +**Return type:** Determined by the least restrictive common type among all arguments, with fallback to string if no common type can be determined. Behavior: - Returns the first value that is not null and not missing (missing includes non-existent fields) - Empty strings ("") and whitespace strings (" ") are considered valid values @@ -424,7 +439,7 @@ Limitations: - Type coercion may result in unexpected string conversions for incompatible types - Performance may degrade with very large numbers of arguments -Example +### Example ```ppl source=accounts @@ -537,11 +552,13 @@ fetched rows / total rows = 4/4 ### Description -Usage: ispresent(field) returns true if the field exists. -Argument type: all the supported data types. -Return type: BOOLEAN -Synonyms: [ISNOTNULL](#isnotnull) -Example +Usage: `ispresent(field)` returns true if the field exists. + +**Argument type:** All supported data types. +**Return type:** `BOOLEAN` +**Synonyms:** [ISNOTNULL](#isnotnull) + +### Example ```ppl source=accounts @@ -566,10 +583,12 @@ fetched rows / total rows = 3/3 ### Description -Usage: isblank(field) returns true if the field is null, an empty string, or contains only white space. -Argument type: all the supported data types. -Return type: BOOLEAN -Example +Usage: `isblank(field)` returns true if the field is null, an empty string, or contains only white space. + +**Argument type:** All supported data types. +**Return type:** `BOOLEAN` + +### Example ```ppl source=accounts @@ -596,10 +615,12 @@ fetched rows / total rows = 4/4 ### Description -Usage: isempty(field) returns true if the field is null or is an empty string. -Argument type: all the supported data types. -Return type: BOOLEAN -Example +Usage: `isempty(field)` returns true if the field is null or is an empty string. + +**Argument type:** All supported data types. +**Return type:** `BOOLEAN` + +### Example ```ppl source=accounts @@ -626,7 +647,7 @@ fetched rows / total rows = 4/4 ### Description -Usage: earliest(relative_string, field) returns true if the value of field is after the timestamp derived from relative_string relative to the current time. Otherwise, returns false. +Usage: `earliest(relative_string, field)` returns true if the value of field is after the timestamp derived from relative_string relative to the current time. Otherwise, returns false. relative_string: The relative string can be one of the following formats: 1. 
`"now"` or `"now()"`: @@ -648,9 +669,11 @@ The relative string can be one of the following formats: - `-3M+1y@M` → `2026-02-01 00:00:00` Read more details [here](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/functions/ppl-datetime.md#relative_timestamp) -Argument type: relative_string:STRING, field: TIMESTAMP -Return type: BOOLEAN -Example + +**Argument type:** `relative_string`: `STRING`, `field`: `TIMESTAMP` +**Return type:** `BOOLEAN` + +### Example ```ppl source=accounts @@ -692,10 +715,12 @@ fetched rows / total rows = 1/1 ### Description -Usage: latest(relative_string, field) returns true if the value of field is before the timestamp derived from relative_string relative to the current time. Otherwise, returns false. -Argument type: relative_string:STRING, field: TIMESTAMP -Return type: BOOLEAN -Example +Usage: `latest(relative_string, field)` returns true if the value of field is before the timestamp derived from relative_string relative to the current time. Otherwise, returns false. + +**Argument type:** `relative_string`: `STRING`, `field`: `TIMESTAMP` +**Return type:** `BOOLEAN` + +### Example ```ppl source=accounts @@ -737,11 +762,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: regexp_match(string, pattern) returns true if the regular expression pattern finds a match against any substring of the string value, otherwise returns false. +Usage: `regexp_match(string, pattern)` returns true if the regular expression pattern finds a match against any substring of the string value, otherwise returns false. The function uses Java regular expression syntax for the pattern. -Argument type: STRING, STRING -Return type: BOOLEAN -Example + +**Argument type:** `STRING`, `STRING` +**Return type:** `BOOLEAN` + +### Example ``` ppl ignore source=logs | where regexp_match(message, 'ERROR|WARN|FATAL') | fields timestamp, message diff --git a/docs/user/ppl/functions/conversion.md b/docs/user/ppl/functions/conversion.md index 9e3b1d1ed7..99efe16103 100644 --- a/docs/user/ppl/functions/conversion.md +++ b/docs/user/ppl/functions/conversion.md @@ -4,7 +4,7 @@ ### Description -Usage: cast(expr as dateType) cast the expr to dataType. return the value of dataType. The following conversion rules are used: +Usage: `cast(expr as dateType)` cast the expr to dataType. return the value of dataType. The following conversion rules are used: | Src/Target | STRING | NUMBER | BOOLEAN | TIMESTAMP | DATE | TIME | IP | | --- | --- | --- | --- | --- | --- | --- | --- | @@ -19,7 +19,8 @@ Usage: cast(expr as dateType) cast the expr to dataType. return the value of dat Note1: the conversion follow the JDK specification. Note2: IP will be converted to its canonical representation. Canonical representation for IPv6 is described in [RFC 5952](https://datatracker.ietf.org/doc/html/rfc5952). -Cast to string example + +### Example: Cast to string ```ppl source=people @@ -38,7 +39,7 @@ fetched rows / total rows = 1/1 +-------+------+------------+ ``` -Cast to number example +### Example: Cast to number ```ppl source=people @@ -57,7 +58,7 @@ fetched rows / total rows = 1/1 +-------+---------+ ``` -Cast to date example +### Example: Cast to date ```ppl source=people @@ -76,7 +77,7 @@ fetched rows / total rows = 1/1 +------------+----------+---------------------+ ``` -Cast function can be chained +### Example: Cast function can be chained ```ppl source=people @@ -101,14 +102,14 @@ Implicit conversion is automatic casting. 
When a function does not have an exact input types, the engine looks for another signature that can safely work with the values. It picks the option that requires the least stretching of the original types, so you can mix literals and fields without adding `CAST` everywhere.

+
### String to numeric

When a string stands in for a number we simply parse the text:
- The value must be something like `"3.14"` or `"42"`. Anything else causes the query to fail.
-- If a string appears next to numeric arguments, it is treated as a `DOUBLE` so the numeric
-
-  overload of the function can run.
+- If a string appears next to numeric arguments, it is treated as a `DOUBLE` so the numeric overload of the function can run.

-Use string in arithmetic operator example
+### Example: Use string in arithmetic operator

```ppl
source=people
@@ -127,7 +128,7 @@ fetched rows / total rows = 1/1
+--------+----------+------+-------+--------+
```

-Use string in comparison operator example
+### Example: Use string in comparison operator

```ppl
source=people
@@ -151,11 +152,17 @@ fetched rows / total rows = 1/1
### Description

The following usage options are available, depending on the parameter types and the number of parameters.
-Usage with format type: tostring(ANY, [format]): Converts the value in first argument to provided format type string in second argument. If second argument is not provided, then it converts to default string representation.
-Return type: string
-Usage for boolean parameter without format type tostring(boolean): Converts the string to 'TRUE' or 'FALSE'.
-Return type: string
-You can use this function with the eval commands and as part of eval expressions. If first argument can be any valid type , second argument is optional and if provided , it needs to be format name to convert to where first argument contains only numbers. If first argument is boolean, then second argument is not used even if its provided.
+
+Usage with format type: `tostring(ANY, [format])`: Converts the value in the first argument to a string in the format named by the second argument. If the second argument is not provided, the value is converted to its default string representation.
+
+**Return type:** `STRING`
+
+Usage for boolean parameter without format type `tostring(boolean)`: Converts the boolean value to 'TRUE' or 'FALSE'.
+
+**Return type:** `STRING`
+
+You can use this function with the `eval` command and as part of `eval` expressions. The first argument can be any valid type. The second argument is optional; if provided, it must be one of the format names and applies only when the first argument contains only numbers. If the first argument is boolean, the second argument is ignored even if it is provided.
+
Format types:
1. "binary" Converts a number to a binary value.
2. "hex" Converts the number to a hexadecimal value.
@@ -164,9 +171,10 @@ Format types:
5. "duration_millis" Converts the value in milliseconds to the readable time format HH:MM:SS.
The format argument is optional and is only used when the value argument is a number.
The tostring function supports the following formats.
-Basic examples:
+
+### Example: Convert number to binary string
+
You can use this function to convert a number to a string of its binary representation.
-Example

```ppl
source=accounts
@@ -186,8 +194,8 @@ fetched rows / total rows = 1/1
+-----------+------------------+---------+
```

+### Example: Convert number to hex string
+
You can use this function to convert a number to a string of its hex representation. <br>
-Example ```ppl source=accounts @@ -207,8 +216,9 @@ fetched rows / total rows = 1/1 +-----------+-------------+---------+ ``` -The following example formats the column totalSales to display values with commas. -Example +### Example: Format number with commas + +The following example formats the column totalSales to display values with commas. ```ppl source=accounts @@ -228,8 +238,9 @@ fetched rows / total rows = 1/1 +-----------+----------------+---------+ ``` +### Example: Convert seconds to duration format + The following example converts number of seconds to HH:MM:SS format representing hours, minutes and seconds. -Example ```ppl source=accounts @@ -249,8 +260,9 @@ fetched rows / total rows = 1/1 +-----------+----------+ ``` -The following example for converts boolean parameter to string. -Example +### Example: Convert boolean to string + +The following example converts boolean parameter to string. ```ppl source=accounts @@ -274,66 +286,78 @@ fetched rows / total rows = 1/1 ### Description -The following usage options are available, depending on the parameter -types and the number of parameters. - -Usage: tonumber(string, \[base\]) converts the value in first argument. -The second argument describe the base of first argument. If second -argument is not provided, then it converts to base 10 number -representation. - -Return type: Number - -You can use this function with the eval commands and as part of eval -expressions. Base values can be between 2 and 36. The maximum value -supported for base 10 is +(2-2\^-52)·2\^1023 and minimum is --(2-2\^-52)·2\^1023. The maximum for other supported bases is 2\^63-1 -(or 7FFFFFFFFFFFFFFF) and minimum is -2\^63 (or -7FFFFFFFFFFFFFFF). If -the tonumber function cannot parse a field value to a number, the -function returns NULL. You can use this function to convert a string -representation of a binary number to return the corresponding number in -base 10. - -Following example converts a string in binary to the number -representation: - - os> source=people | eval int_value = tonumber('010101',2) | fields int_value | head 1 - fetched rows / total rows = 1/1 - +-----------+ - | int_value | - |-----------| - | 21.0 | - +-----------+ - -Following example converts a string in hex to the number representation: - - os> source=people | eval int_value = tonumber('FA34',16) | fields int_value | head 1 - fetched rows / total rows = 1/1 - +-----------+ - | int_value | - |-----------| - | 64052.0 | - +-----------+ - -Following example converts a string in decimal to the number -representation: - - os> source=people | eval int_value = tonumber('4598') | fields int_value | head 1 - fetched rows / total rows = 1/1 - +-----------+ - | int_value | - |-----------| - | 4598.0 | - +-----------+ - -Following example converts a string in decimal with fraction to the -number representation: - - os> source=people | eval double_value = tonumber('4598.678') | fields double_value | head 1 - fetched rows / total rows = 1/1 - +--------------+ - | double_value | - |--------------| - | 4598.678 | - +--------------+ +Usage: `tonumber(string, [base])` converts the value in first argument. +The second argument describes the base of first argument. If second argument is not provided, then it converts to base 10 number representation. + +**Return type:** `NUMBER` + +You can use this function with the eval commands and as part of eval expressions. Base values can be between 2 and 36. The maximum value supported for base 10 is +(2-2^-52)·2^1023 and minimum is -(2-2^-52)·2^1023. 
diff --git a/docs/user/ppl/functions/cryptographic.md b/docs/user/ppl/functions/cryptographic.md
index 33853cfd64..1ea1ca50f5 100644
--- a/docs/user/ppl/functions/cryptographic.md
+++ b/docs/user/ppl/functions/cryptographic.md
@@ -6,9 +6,11 @@

Version: 3.1.0

Usage: `md5(str)` calculates the MD5 digest and returns the value as a 32-character hex string.
-Argument type: STRING
-Return type: STRING
-Example
+
+**Argument type:** `STRING`
+**Return type:** `STRING`
+
+### Example

```ppl
source=people
@@ -33,9 +35,11 @@ fetched rows / total rows = 1/1

Version: 3.1.0

Usage: `sha1(str)` returns the hex string result of SHA-1.
-Argument type: STRING
-Return type: STRING
-Example
+
+**Argument type:** `STRING`
+**Return type:** `STRING`
+
+### Example

```ppl
source=people
@@ -61,9 +65,11 @@ fetched rows / total rows = 1/1

Version: 3.1.0

Usage: `sha2(str, numBits)` returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, or 512.
-Argument type: STRING, INTEGER
-Return type: STRING
-Example
+
+**Argument type:** `STRING`, `INTEGER`
+**Return type:** `STRING`
+
+### Example

```ppl
source=people
@@ -98,4 +104,4 @@ fetched rows / total rows = 1/1
| 9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043 |
+----------------------------------------------------------------------------------------------------------------------------------+
```
- \ No newline at end of file
+
diff --git a/docs/user/ppl/functions/datetime.md b/docs/user/ppl/functions/datetime.md
index 0cd474b546..9ed105ea91 100644
--- a/docs/user/ppl/functions/datetime.md
+++ b/docs/user/ppl/functions/datetime.md
@@ -8,16 +8,16 @@

### Description

-Usage: adddate(date, INTERVAL expr unit) / adddate(date, days) adds the interval of second argument to date; adddate(date, days) adds the second argument as integer number of days to date.
+Usage: `adddate(date, INTERVAL expr unit)` adds the interval in the second argument to date; `adddate(date, days)` adds the second argument as an integer number of days to date.
If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used.
-Argument type: DATE/TIMESTAMP/TIME, INTERVAL/LONG
+**Argument type:** `DATE/TIMESTAMP/TIME, INTERVAL/LONG`
Return type map:
(DATE/TIMESTAMP/TIME, INTERVAL) -> TIMESTAMP
(DATE, LONG) -> DATE
(TIMESTAMP/TIME, LONG) -> TIMESTAMP
Synonyms: [DATE_ADD](#date_add) when invoked with the INTERVAL form of the second argument.
Antonyms: [SUBDATE](#subdate)
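+
+### Example: Add a number of days to a date
+
+The integer form of the second argument adds whole days. The following query is an illustrative sketch (the `add_days` alias is arbitrary); per the return type map, a DATE plus a LONG yields a DATE, so the expected result is 2020-08-27:
+
+```ppl
+source=people | eval add_days = adddate(date('2020-08-26'), 1) | fields add_days | head 1
+```
+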
-Example
+### Example

```ppl
source=people
@@ -40,13 +40,13 @@ fetched rows / total rows = 1/1

### Description

-Usage: addtime(expr1, expr2) adds expr2 to expr1 and returns the result. If argument is TIME, today's date is used; if argument is DATE, time at midnight is used.
-Argument type: DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME
+Usage: `addtime(expr1, expr2)` adds expr2 to expr1 and returns the result. If an argument is of type TIME, today's date is used; if it is of type DATE, the time at midnight is used.
+**Argument type:** `DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME`
Return type map:
(DATE/TIMESTAMP, DATE/TIMESTAMP/TIME) -> TIMESTAMP
(TIME, DATE/TIMESTAMP/TIME) -> TIME
Antonyms: [SUBTIME](#subtime)
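+
+### Example: Add two time values
+
+Per the return type map, a TIME first argument yields a TIME result. The following query is an illustrative sketch (the `added_time` alias is arbitrary); adding 01:30:00 to 10:00:00 is expected to return 11:30:00:
+
+```ppl
+source=people | eval added_time = addtime(time('10:00:00'), time('01:30:00')) | fields added_time | head 1
+```
+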
-Example
+### Example

```ppl
source=people
@@ -137,11 +137,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: convert_tz(timestamp, from_timezone, to_timezone) constructs a local timestamp converted from the from_timezone to the to_timezone. CONVERT_TZ returns null when any of the three function arguments are invalid, i.e. timestamp is not in the format yyyy-MM-dd HH:mm:ss or the timeszone is not in (+/-)HH:mm. It also is invalid for invalid dates, such as February 30th and invalid timezones, which are ones outside of -13:59 and +14:00.
-Argument type: TIMESTAMP/STRING, STRING, STRING
-Return type: TIMESTAMP
+Usage: `convert_tz(timestamp, from_timezone, to_timezone)` constructs a local timestamp converted from the from_timezone to the to_timezone. CONVERT_TZ returns null when any of the three function arguments is invalid, that is, when the timestamp is not in the format yyyy-MM-dd HH:mm:ss or a timezone is not in (+/-)HH:mm. It also returns null for invalid dates, such as February 30th, and for invalid timezones, which are those outside of -13:59 and +14:00.
+**Argument type:** `TIMESTAMP/STRING, STRING, STRING`
+**Return type:** `TIMESTAMP`
Conversion from +00:00 timezone to +10:00 timezone. Returns the timestamp argument converted from +00:00 to +10:00
-Example
+### Example

```ppl
source=people
@@ -161,7 +161,7 @@ fetched rows / total rows = 1/1
```

The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. Timezones outside of the range, such as +15:00 in this example, will return null.
-Example
+### Example

```ppl
source=people
@@ -181,7 +181,7 @@ fetched rows / total rows = 1/1
```

Conversion from a positive timezone to a negative timezone that goes over date line.
-Example
+### Example

```ppl
source=people
@@ -201,7 +201,7 @@ fetched rows / total rows = 1/1
```

Valid dates are required in convert_tz, invalid dates such as April 31st (not a date in the Gregorian calendar) will result in null.
-Example
+### Example

```ppl
source=people
@@ -221,7 +221,7 @@ fetched rows / total rows = 1/1
```

Valid dates are required in convert_tz, invalid dates such as February 30th (not a date in the Gregorian calendar) will result in null.
-Example
+### Example

```ppl
source=people
@@ -241,7 +241,7 @@ fetched rows / total rows = 1/1
```

February 29th 2008 is a valid date because it is a leap year.
-Example
+### Example

```ppl
source=people
@@ -261,7 +261,7 @@ fetched rows / total rows = 1/1
```

Valid dates are required in convert_tz, invalid dates such as February 29th 2007 (2007 is not a leap year) will result in null.
-Example
+### Example

```ppl
source=people
@@ -281,7 +281,7 @@ fetched rows / total rows = 1/1
```

The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. Timezones outside of the range, such as +14:01 in this example, will return null.
-Example
+### Example

```ppl
source=people
@@ -301,7 +301,7 @@ fetched rows / total rows = 1/1
```

The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. Timezones within the range, such as +14:00 in this example, will return a correctly converted date time object.
-Example
+### Example

```ppl
source=people
@@ -321,7 +321,7 @@ fetched rows / total rows = 1/1
```

The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. Timezones outside of the range, such as -14:00, will result in null.
-Example
+### Example

```ppl
source=people
@@ -341,7 +341,7 @@ fetched rows / total rows = 1/1
```

The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. This timezone is within range so it is valid and will convert the time.
-Example
+### Example

```ppl
source=people
@@ -366,9 +366,9 @@ fetched rows / total rows = 1/1

Returns the current date as a value in 'YYYY-MM-DD' format. CURDATE() returns the current date in UTC at the time the statement is executed.
-Return type: DATE
+**Return type:** `DATE`
Specification: CURDATE() -> DATE
-Example
+### Example

```ppl ignore
source=people
@@ -392,7 +392,7 @@ fetched rows / total rows = 1/1

### Description

`CURRENT_DATE()` is a synonym for [CURDATE()](#curdate).
-Example
+### Example

```ppl ignore
source=people
@@ -416,7 +416,7 @@ fetched rows / total rows = 1/1

### Description

`CURRENT_TIME()` is a synonym for [CURTIME()](#curtime).
-Example
+### Example

```ppl ignore
source=people
@@ -440,7 +440,7 @@ fetched rows / total rows = 1/1

### Description

`CURRENT_TIMESTAMP()` is a synonym for [NOW()](#now).
-Example
+### Example

```ppl ignore
source=people
@@ -465,9 +465,9 @@ fetched rows / total rows = 1/1

Returns the current time as a value in 'hh:mm:ss' format in the UTC time zone. CURTIME() returns the time at which the statement began to execute as [NOW()](#now) does.
-Return type: TIME
+**Return type:** `TIME`
Specification: CURTIME() -> TIME
-Example
+### Example

```ppl ignore
source=people
@@ -490,10 +490,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: date(expr) constructs a date type with the input string expr as a date. If the argument is of date/timestamp, it extracts the date value part from the expression.
-Argument type: STRING/DATE/TIMESTAMP
-Return type: DATE
-Example
+Usage: `date(expr)` constructs a date type with the input string expr as a date. If the argument is of date/timestamp, it extracts the date value part from the expression.
+**Argument type:** `STRING/DATE/TIMESTAMP`
+**Return type:** `DATE`
+### Example

```ppl
source=people
@@ -567,12 +567,12 @@ fetched rows / total rows = 1/1

### Description

-Usage: date_add(date, INTERVAL expr unit) adds the interval expr to date. If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used.
-Argument type: DATE/TIMESTAMP/TIME, INTERVAL -Return type: TIMESTAMP +Usage: `date_add(date, INTERVAL expr unit)` adds the interval expr to date. If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used. +**Argument type:** `DATE/TIMESTAMP/TIME, INTERVAL` +**Return type:** `TIMESTAMP` Synonyms: [ADDDATE](#adddate) Antonyms: [DATE_SUB](#date_sub) -Example +### Example ```ppl source=people @@ -595,7 +595,7 @@ fetched rows / total rows = 1/1 ### Description -Usage: date_format(date, format) formats the date argument using the specifiers in the format argument. +Usage: `date_format(date, format)` formats the date argument using the specifiers in the format argument. If an argument of type TIME is provided, the local date is used. The following table describes the available specifier arguments. @@ -638,9 +638,9 @@ The following table describes the available specifier arguments. | x | x, for any smallcase/uppercase alphabet except [aydmshiHIMYDSEL] | -Argument type: STRING/DATE/TIME/TIMESTAMP, STRING -Return type: STRING -Example +**Argument type:** `STRING/DATE/TIME/TIMESTAMP, STRING` +**Return type:** `STRING` +### Example ```ppl source=people @@ -663,13 +663,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: DATETIME(timestamp)/ DATETIME(date, to_timezone) Converts the datetime to a new timezone -Argument type: timestamp/STRING +Usage: `DATETIME(timestamp)`/ DATETIME(date, to_timezone) Converts the datetime to a new timezone +**Argument type:** `timestamp/STRING` Return type map: (TIMESTAMP, STRING) -> TIMESTAMP (TIMESTAMP) -> TIMESTAMP Converting timestamp with timezone to the second argument timezone. -Example +### Example ```ppl source=people @@ -689,7 +689,7 @@ fetched rows / total rows = 1/1 ``` The valid timezone range for convert_tz is (-13:59, +14:00) inclusive. Timezones outside of the range will result in null. -Example +### Example ```ppl source=people @@ -712,12 +712,12 @@ fetched rows / total rows = 1/1 ### Description -Usage: date_sub(date, INTERVAL expr unit) subtracts the interval expr from date. If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used. -Argument type: DATE/TIMESTAMP/TIME, INTERVAL -Return type: TIMESTAMP +Usage: `date_sub(date, INTERVAL expr unit)` subtracts the interval expr from date. If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used. +**Argument type:** `DATE/TIMESTAMP/TIME, INTERVAL` +**Return type:** `TIMESTAMP` Synonyms: [SUBDATE](#subdate) Antonyms: [DATE_ADD](#date_add) -Example +### Example ```ppl source=people @@ -739,9 +739,9 @@ fetched rows / total rows = 1/1 ## DATEDIFF Usage: Calculates the difference of date parts of given values. If the first argument is time, today's date is used. -Argument type: DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME -Return type: LONG -Example +**Argument type:** `DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME` +**Return type:** `LONG` +### Example ```ppl source=people @@ -764,11 +764,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: day(date) extracts the day of the month for date, in the range 1 to 31. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `day(date)` extracts the day of the month for date, in the range 1 to 31. 
+**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAYOFMONTH](#dayofmonth), [DAY_OF_MONTH](#day_of_month) -Example +### Example ```ppl source=people @@ -791,10 +791,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: dayname(date) returns the name of the weekday for date, including Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday. -Argument type: STRING/DATE/TIMESTAMP -Return type: STRING -Example +Usage: `dayname(date)` returns the name of the weekday for date, including Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `STRING` +### Example ```ppl source=people @@ -817,11 +817,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: dayofmonth(date) extracts the day of the month for date, in the range 1 to 31. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `dayofmonth(date)` extracts the day of the month for date, in the range 1 to 31. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAY](#day), [DAY_OF_MONTH](#day_of_month) -Example +### Example ```ppl source=people @@ -844,11 +844,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: day_of_month(date) extracts the day of the month for date, in the range 1 to 31. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `day_of_month(date)` extracts the day of the month for date, in the range 1 to 31. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAY](#day), [DAYOFMONTH](#dayofmonth) -Example +### Example ```ppl source=people @@ -871,11 +871,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: dayofweek(date) returns the weekday index for date (1 = Sunday, 2 = Monday, ..., 7 = Saturday). -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `dayofweek(date)` returns the weekday index for date (1 = Sunday, 2 = Monday, ..., 7 = Saturday). +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAY_OF_WEEK](#day_of_week) -Example +### Example ```ppl source=people @@ -898,11 +898,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: day_of_week(date) returns the weekday index for date (1 = Sunday, 2 = Monday, ..., 7 = Saturday). -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `day_of_week(date)` returns the weekday index for date (1 = Sunday, 2 = Monday, ..., 7 = Saturday). +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAYOFWEEK](#dayofweek) -Example +### Example ```ppl source=people @@ -926,10 +926,10 @@ fetched rows / total rows = 1/1 ### Description Usage: dayofyear(date) returns the day of the year for date, in the range 1 to 366. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAY_OF_YEAR](#day_of_year) -Example +### Example ```ppl source=people @@ -953,10 +953,10 @@ fetched rows / total rows = 1/1 ### Description Usage: day_of_year(date) returns the day of the year for date, in the range 1 to 366. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [DAYOFYEAR](#dayofyear) -Example +### Example ```ppl source=people @@ -979,9 +979,9 @@ fetched rows / total rows = 1/1 ### Description -Usage: extract(part FROM date) returns a LONG with digits in order according to the given 'part' arguments. 
+Usage: `extract(part FROM date)` returns a LONG with digits in order according to the given 'part' arguments. The specific format of the returned long is determined by the table below. -Argument type: PART, where PART is one of the following tokens in the table below. +**Argument type:** `PART, where PART is one of the following tokens in the table below.` The format specifiers found in this table are the same as those found in the [DATE_FORMAT](#date_format) function. The following table describes the mapping of a 'part' to a particular format. @@ -1009,8 +1009,8 @@ The following table describes the mapping of a 'part' to a particular format. | YEAR_MONTH | %V%m | -Return type: LONG -Example +**Return type:** `LONG` +### Example ```ppl source=people @@ -1033,10 +1033,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: from_days(N) returns the date value given the day number N. -Argument type: INTEGER/LONG -Return type: DATE -Example +Usage: `from_days(N)` returns the date value given the day number N. +**Argument type:** `INTEGER/LONG` +**Return type:** `DATE` +### Example ```ppl source=people @@ -1062,7 +1062,7 @@ fetched rows / total rows = 1/1 Usage: Returns a representation of the argument given as a timestamp or character string value. Perform reverse conversion for [UNIX_TIMESTAMP](#unix_timestamp) function. If second argument is provided, it is used to format the result in the same way as the format string used for the [DATE_FORMAT](#date_format) function. If timestamp is outside of range 1970-01-01 00:00:00 - 3001-01-18 23:59:59.999999 (0 to 32536771199.999999 epoch time), function returns NULL. -Argument type: DOUBLE, STRING +**Argument type:** `DOUBLE, STRING` Return type map: DOUBLE -> TIMESTAMP DOUBLE, STRING -> STRING @@ -1107,7 +1107,7 @@ fetched rows / total rows = 1/1 ### Description Usage: Returns a string value containing string format specifiers based on the input arguments. -Argument type: TYPE, STRING, where TYPE must be one of the following tokens: [DATE, TIME, TIMESTAMP], and +**Argument type:** `TYPE, STRING, where TYPE must be one of the following tokens: [DATE, TIME, TIMESTAMP], and` STRING must be one of the following tokens: ["USA", "JIS", "ISO", "EUR", "INTERNAL"] (" can be replaced by '). Examples @@ -1132,11 +1132,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: hour(time) extracts the hour value for time. Different from the time of day value, the time value has a large range and can be greater than 23, so the return value of hour(time) can be also greater than 23. -Argument type: STRING/TIME/TIMESTAMP -Return type: INTEGER +Usage: `hour(time)` extracts the hour value for time. Different from the time of day value, the time value has a large range and can be greater than 23, so the return value of hour(time) can be also greater than 23. +**Argument type:** `STRING/TIME/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [HOUR_OF_DAY](#hour_of_day) -Example +### Example ```ppl source=people @@ -1159,11 +1159,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: hour_of_day(time) extracts the hour value for time. Different from the time of day value, the time value has a large range and can be greater than 23, so the return value of hour_of_day(time) can be also greater than 23. -Argument type: STRING/TIME/TIMESTAMP -Return type: INTEGER +Usage: `hour_of_day(time)` extracts the hour value for time. 
Different from the time of day value, the time value has a large range and can be greater than 23, so the return value of hour_of_day(time) can also be greater than 23.
+**Argument type:** `STRING/TIME/TIMESTAMP`
+**Return type:** `INTEGER`
Synonyms: [HOUR](#hour)
-Example
+### Example

```ppl
source=people
@@ -1185,9 +1185,9 @@ fetched rows / total rows = 1/1

## LAST_DAY

Usage: Returns the last day of the month as a DATE for a valid argument.
-Argument type: DATE/STRING/TIMESTAMP/TIME
-Return type: DATE
-Example
+**Argument type:** `DATE/STRING/TIMESTAMP/TIME`
+**Return type:** `DATE`
+### Example

```ppl
source=people
@@ -1211,7 +1211,7 @@ fetched rows / total rows = 1/1

### Description

`LOCALTIMESTAMP()` is a synonym for [NOW()](#now).
-Example
+### Example

```ppl ignore
source=people
@@ -1235,7 +1235,7 @@ fetched rows / total rows = 1/1

### Description

`LOCALTIME()` is a synonym for [NOW()](#now).
-Example
+### Example

```ppl ignore
source=people
@@ -1269,9 +1269,9 @@ fetched rows / total rows = 1/1

Limitations:
Specifications:
1. MAKEDATE(DOUBLE, DOUBLE) -> DATE
-Argument type: DOUBLE
-Return type: DATE
-Example
+**Argument type:** `DOUBLE, DOUBLE`
+**Return type:** `DATE`
+### Example

```ppl
source=people
@@ -1303,9 +1303,9 @@ fetched rows / total rows = 1/1

Limitations:
Specifications:
1. MAKETIME(DOUBLE, DOUBLE, DOUBLE) -> TIME
-Argument type: DOUBLE
-Return type: TIME
-Example
+**Argument type:** `DOUBLE, DOUBLE, DOUBLE`
+**Return type:** `TIME`
+### Example

```ppl
source=people
@@ -1328,10 +1328,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: microsecond(expr) returns the microseconds from the time or timestamp expression expr as a number in the range from 0 to 999999.
-Argument type: STRING/TIME/TIMESTAMP
-Return type: INTEGER
-Example
+Usage: `microsecond(expr)` returns the microseconds from the time or timestamp expression expr as a number in the range from 0 to 999999.
+**Argument type:** `STRING/TIME/TIMESTAMP`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -1354,11 +1354,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: minute(time) returns the minute for time, in the range 0 to 59.
-Argument type: STRING/TIME/TIMESTAMP
-Return type: INTEGER
+Usage: `minute(time)` returns the minute for time, in the range 0 to 59.
+**Argument type:** `STRING/TIME/TIMESTAMP`
+**Return type:** `INTEGER`
Synonyms: [MINUTE_OF_HOUR](#minute_of_hour)
-Example
+### Example

```ppl
source=people
@@ -1381,10 +1381,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: minute(time) returns the amount of minutes in the day, in the range of 0 to 1439.
-Argument type: STRING/TIME/TIMESTAMP
-Return type: INTEGER
-Example
+Usage: `minute_of_day(time)` returns the number of minutes in the day, in the range of 0 to 1439.
+**Argument type:** `STRING/TIME/TIMESTAMP`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -1407,11 +1407,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: minute(time) returns the minute for time, in the range 0 to 59.
-Argument type: STRING/TIME/TIMESTAMP
-Return type: INTEGER
+Usage: `minute_of_hour(time)` returns the minute for time, in the range 0 to 59.
+**Argument type:** `STRING/TIME/TIMESTAMP`
+**Return type:** `INTEGER`
Synonyms: [MINUTE](#minute)
-Example
+### Example

```ppl
source=people
@@ -1434,11 +1434,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: month(date) returns the month for date, in the range 1 to 12 for January to December.
-Argument type: STRING/DATE/TIMESTAMP
-Return type: INTEGER
+Usage: `month(date)` returns the month for date, in the range 1 to 12 for January to December.
+**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [MONTH_OF_YEAR](#month_of_year) -Example +### Example ```ppl source=people @@ -1461,11 +1461,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: month_of_year(date) returns the month for date, in the range 1 to 12 for January to December. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER +Usage: `month_of_year(date)` returns the month for date, in the range 1 to 12 for January to December. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [MONTH](#month) -Example +### Example ```ppl source=people @@ -1488,10 +1488,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: monthname(date) returns the full name of the month for date. -Argument type: STRING/DATE/TIMESTAMP -Return type: STRING -Example +Usage: `monthname(date)` returns the full name of the month for date. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `STRING` +### Example ```ppl source=people @@ -1516,9 +1516,9 @@ fetched rows / total rows = 1/1 Returns the current date and time as a value in 'YYYY-MM-DD hh:mm:ss' format. The value is expressed in the UTC time zone. `NOW()` returns a constant time that indicates the time at which the statement began to execute. This differs from the behavior for [SYSDATE()](#sysdate), which returns the exact time at which it executes. -Return type: TIMESTAMP +**Return type:** `TIMESTAMP` Specification: NOW() -> TIMESTAMP -Example +### Example ```ppl ignore source=people @@ -1541,10 +1541,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: period_add(P, N) add N months to period P (in the format YYMM or YYYYMM). Returns a value in the format YYYYMM. -Argument type: INTEGER, INTEGER -Return type: INTEGER -Example +Usage: `period_add(P, N)` add N months to period P (in the format YYMM or YYYYMM). Returns a value in the format YYYYMM. +**Argument type:** `INTEGER, INTEGER` +**Return type:** `INTEGER` +### Example ```ppl source=people @@ -1567,10 +1567,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: period_diff(P1, P2) returns the number of months between periods P1 and P2 given in the format YYMM or YYYYMM. -Argument type: INTEGER, INTEGER -Return type: INTEGER -Example +Usage: `period_diff(P1, P2)` returns the number of months between periods P1 and P2 given in the format YYMM or YYYYMM. +**Argument type:** `INTEGER, INTEGER` +**Return type:** `INTEGER` +### Example ```ppl source=people @@ -1593,10 +1593,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: quarter(date) returns the quarter of the year for date, in the range 1 to 4. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER -Example +Usage: `quarter(date)` returns the quarter of the year for date, in the range 1 to 4. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` +### Example ```ppl source=people @@ -1619,13 +1619,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: sec_to_time(number) returns the time in HH:mm:ssss[.nnnnnn] format. +Usage: `sec_to_time(number)` returns the time in HH:mm:ssss[.nnnnnn] format. Note that the function returns a time between 00:00:00 and 23:59:59. If an input value is too large (greater than 86399), the function will wrap around and begin returning outputs starting from 00:00:00. If an input value is too small (less than 0), the function will wrap around and begin returning outputs counting down from 23:59:59. 
-Argument type: INTEGER, LONG, DOUBLE, FLOAT -Return type: TIME -Example +**Argument type:** `INTEGER, LONG, DOUBLE, FLOAT` +**Return type:** `TIME` +### Example ```ppl source=people @@ -1649,11 +1649,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: second(time) returns the second for time, in the range 0 to 59. -Argument type: STRING/TIME/TIMESTAMP -Return type: INTEGER +Usage: `second(time)` returns the second for time, in the range 0 to 59. +**Argument type:** `STRING/TIME/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [SECOND_OF_MINUTE](#second_of_minute) -Example +### Example ```ppl source=people @@ -1676,11 +1676,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: second_of_minute(time) returns the second for time, in the range 0 to 59. -Argument type: STRING/TIME/TIMESTAMP -Return type: INTEGER +Usage: `second_of_minute(time)` returns the second for time, in the range 0 to 59. +**Argument type:** `STRING/TIME/TIMESTAMP` +**Return type:** `INTEGER` Synonyms: [SECOND](#second) -Example +### Example ```ppl source=people @@ -1704,7 +1704,7 @@ fetched rows / total rows = 1/1 **Version: 3.3.0** ### Description -Usage: strftime(time, format) takes a UNIX timestamp (in seconds) and renders it as a string using the format specified. For numeric inputs, the UNIX time must be in seconds. Values greater than 100000000000 are automatically treated as milliseconds and converted to seconds. +Usage: `strftime(time, format)` takes a UNIX timestamp (in seconds) and renders it as a string using the format specified. For numeric inputs, the UNIX time must be in seconds. Values greater than 100000000000 are automatically treated as milliseconds and converted to seconds. You can use time format variables with the strftime function. This function performs the reverse operation of [UNIX_TIMESTAMP](#unix_timestamp) and is similar to [FROM_UNIXTIME](#from_unixtime) but with POSIX-style format specifiers. - **Available only when Calcite engine is enabled** - All timestamps are interpreted as UTC timezone @@ -1712,8 +1712,8 @@ You can use time format variables with the strftime function. This function perf - String inputs are NOT supported - use `unix_timestamp()` to convert strings first - Functions that return date/time values (like `date()`, `now()`, `timestamp()`) are supported -Argument type: INTEGER/LONG/DOUBLE/TIMESTAMP, STRING -Return type: STRING +**Argument type:** `INTEGER/LONG/DOUBLE/TIMESTAMP, STRING` +**Return type:** `STRING` Format specifiers: The following table describes the available specifier arguments. @@ -1863,13 +1863,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: str_to_date(string, string) is used to extract a TIMESTAMP from the first argument string using the formats specified in the second argument string. +Usage: `str_to_date(string, string)` is used to extract a TIMESTAMP from the first argument string using the formats specified in the second argument string. The input argument must have enough information to be parsed as a DATE, TIMESTAMP, or TIME. Acceptable string format specifiers are the same as those used in the [DATE_FORMAT](#date_format) function. It returns NULL when a statement cannot be parsed due to an invalid pair of arguments, and when 0 is provided for any DATE field. Otherwise, it will return a TIMESTAMP with the parsed values (as well as default values for any field that was not parsed). 
-Argument type: STRING, STRING -Return type: TIMESTAMP -Example +**Argument type:** `STRING, STRING` +**Return type:** `TIMESTAMP` +### Example ```ppl @@ -1897,16 +1897,16 @@ fetched rows / total rows = 1/1 ### Description -Usage: subdate(date, INTERVAL expr unit) / subdate(date, days) subtracts the interval expr from date; subdate(date, days) subtracts the second argument as integer number of days from date. +Usage: `subdate(date, INTERVAL expr unit)` / subdate(date, days) subtracts the interval expr from date; subdate(date, days) subtracts the second argument as integer number of days from date. If first argument is TIME, today's date is used; if first argument is DATE, time at midnight is used. -Argument type: DATE/TIMESTAMP/TIME, INTERVAL/LONG +**Argument type:** `DATE/TIMESTAMP/TIME, INTERVAL/LONG` Return type map: (DATE/TIMESTAMP/TIME, INTERVAL) -> TIMESTAMP (DATE, LONG) -> DATE (TIMESTAMP/TIME, LONG) -> TIMESTAMP Synonyms: [DATE_SUB](#date_sub) when invoked with the INTERVAL form of the second argument. Antonyms: [ADDDATE](#adddate) -Example +### Example ```ppl @@ -1934,13 +1934,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: subtime(expr1, expr2) subtracts expr2 from expr1 and returns the result. If argument is TIME, today's date is used; if argument is DATE, time at midnight is used. -Argument type: DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME +Usage: `subtime(expr1, expr2)` subtracts expr2 from expr1 and returns the result. If argument is TIME, today's date is used; if argument is DATE, time at midnight is used. +**Argument type:** `DATE/TIMESTAMP/TIME, DATE/TIMESTAMP/TIME` Return type map: (DATE/TIMESTAMP, DATE/TIMESTAMP/TIME) -> TIMESTAMP (TIME, DATE/TIMESTAMP/TIME) -> TIME Antonyms: [ADDTIME](#addtime) -Example +### Example ```ppl @@ -2060,9 +2060,9 @@ Returns the current date and time as a value in 'YYYY-MM-DD hh:mm:ss[.nnnnnn]'. SYSDATE() returns the date and time at which it executes in UTC. This differs from the behavior for [NOW()](#now), which returns a constant time that indicates the time at which the statement began to execute. If an argument is given, it specifies a fractional seconds precision from 0 to 6, the return value includes a fractional seconds part of that many digits. Optional argument type: INTEGER -Return type: TIMESTAMP +**Return type:** `TIMESTAMP` Specification: SYSDATE([INTEGER]) -> TIMESTAMP -Example +### Example ```ppl ignore @@ -2090,10 +2090,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: time(expr) constructs a time type with the input string expr as a time. If the argument is of date/time/timestamp, it extracts the time value part from the expression. -Argument type: STRING/DATE/TIME/TIMESTAMP -Return type: TIME -Example +Usage: `time(expr)` constructs a time type with the input string expr as a time. If the argument is of date/time/timestamp, it extracts the time value part from the expression. +**Argument type:** `STRING/DATE/TIME/TIMESTAMP` +**Return type:** `TIME` +### Example ```ppl @@ -2187,7 +2187,7 @@ fetched rows / total rows = 1/1 ### Description -Usage: time_format(time, format) formats the time argument using the specifiers in the format argument. +Usage: `time_format(time, format)` formats the time argument using the specifiers in the format argument. This supports a subset of the time format specifiers available for the [date_format](#date_format) function. Using date format specifiers supported by [date_format](#date_format) will return 0 or null. Acceptable format specifiers are listed in the table below. 
@@ -2209,9 +2209,9 @@ The following table describes the available specifier arguments. | %T | Time, 24-hour (hh:mm:ss) | -Argument type: STRING/DATE/TIME/TIMESTAMP, STRING -Return type: STRING -Example +**Argument type:** `STRING/DATE/TIME/TIMESTAMP, STRING` +**Return type:** `STRING` +### Example ```ppl @@ -2239,10 +2239,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: time_to_sec(time) returns the time argument, converted to seconds. -Argument type: STRING/TIME/TIMESTAMP -Return type: LONG -Example +Usage: `time_to_sec(time)` returns the time argument, converted to seconds. +**Argument type:** `STRING/TIME/TIMESTAMP` +**Return type:** `LONG` +### Example ```ppl @@ -2271,9 +2271,9 @@ fetched rows / total rows = 1/1 ### Description Usage: returns the difference between two time expressions as a time. -Argument type: TIME, TIME -Return type: TIME -Example +**Argument type:** `TIME, TIME` +**Return type:** `TIME` +### Example ```ppl @@ -2301,13 +2301,13 @@ fetched rows / total rows = 1/1 ### Description -Usage: timestamp(expr) constructs a timestamp type with the input string `expr` as an timestamp. If the argument is not a string, it casts `expr` to timestamp type with default timezone UTC. If argument is a time, it applies today's date before cast. +Usage: `timestamp(expr)` constructs a timestamp type with the input string `expr` as an timestamp. If the argument is not a string, it casts `expr` to timestamp type with default timezone UTC. If argument is a time, it applies today's date before cast. With two arguments `timestamp(expr1, expr2)` adds the time expression `expr2` to the date or timestamp expression `expr1` and returns the result as a timestamp value. -Argument type: STRING/DATE/TIME/TIMESTAMP +**Argument type:** `STRING/DATE/TIME/TIMESTAMP` Return type map: (STRING/DATE/TIME/TIMESTAMP) -> TIMESTAMP (STRING/DATE/TIME/TIMESTAMP, STRING/DATE/TIME/TIMESTAMP) -> TIMESTAMP -Example +### Example ```ppl @@ -2338,7 +2338,7 @@ fetched rows / total rows = 1/1 Usage: Returns a TIMESTAMP value based on a passed in DATE/TIME/TIMESTAMP/STRING argument and an INTERVAL and INTEGER argument which determine the amount of time to be added. If the third argument is a STRING, it must be formatted as a valid TIMESTAMP. If only a TIME is provided, a TIMESTAMP is still returned with the DATE portion filled in using the current date. If the third argument is a DATE, it will be automatically converted to a TIMESTAMP. -Argument type: INTERVAL, INTEGER, DATE/TIME/TIMESTAMP/STRING +**Argument type:** `INTERVAL, INTEGER, DATE/TIME/TIMESTAMP/STRING` INTERVAL must be one of the following tokens: [MICROSECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR] Examples @@ -2369,11 +2369,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: TIMESTAMPDIFF(interval, start, end) returns the difference between the start and end date/times in interval units. +Usage: `TIMESTAMPDIFF(interval, start, end)` returns the difference between the start and end date/times in interval units. If a TIME is provided as an argument, it will be converted to a TIMESTAMP with the DATE portion filled in using the current date. Arguments will be automatically converted to a TIME/TIMESTAMP when appropriate. Any argument that is a STRING must be formatted as a valid TIMESTAMP. 
-Argument type: INTERVAL, DATE/TIME/TIMESTAMP/STRING, DATE/TIME/TIMESTAMP/STRING +**Argument type:** `INTERVAL, DATE/TIME/TIMESTAMP/STRING, DATE/TIME/TIMESTAMP/STRING` INTERVAL must be one of the following tokens: [MICROSECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR] Examples @@ -2404,10 +2404,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: to_days(date) returns the day number (the number of days since year 0) of the given date. Returns NULL if date is invalid. -Argument type: STRING/DATE/TIMESTAMP -Return type: LONG -Example +Usage: `to_days(date)` returns the day number (the number of days since year 0) of the given date. Returns NULL if date is invalid. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `LONG` +### Example ```ppl @@ -2435,11 +2435,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: to_seconds(date) returns the number of seconds since the year 0 of the given value. Returns NULL if value is invalid. +Usage: `to_seconds(date)` returns the number of seconds since the year 0 of the given value. Returns NULL if value is invalid. An argument of a LONG type can be used. It must be formatted as YMMDD, YYMMDD, YYYMMDD or YYYYMMDD. Note that a LONG type argument cannot have leading 0s as it will be parsed using an octal numbering system. -Argument type: STRING/LONG/DATE/TIME/TIMESTAMP -Return type: LONG -Example +**Argument type:** `STRING/LONG/DATE/TIME/TIMESTAMP` +**Return type:** `LONG` +### Example ```ppl @@ -2472,9 +2472,9 @@ Usage: Converts given argument to Unix time (seconds since Epoch - very beginnin The date argument may be a DATE, or TIMESTAMP string, or a number in YYMMDD, YYMMDDhhmmss, YYYYMMDD, or YYYYMMDDhhmmss format. If the argument includes a time part, it may optionally include a fractional seconds part. If argument is in invalid format or outside of range 1970-01-01 00:00:00 - 3001-01-18 23:59:59.999999 (0 to 32536771199.999999 epoch time), function returns NULL. You can use [FROM_UNIXTIME](#from_unixtime) to do reverse conversion. -Argument type: \/DOUBLE/DATE/TIMESTAMP -Return type: DOUBLE -Example +**Argument type:** `\/DOUBLE/DATE/TIMESTAMP` +**Return type:** `DOUBLE` +### Example ```ppl @@ -2503,9 +2503,9 @@ fetched rows / total rows = 1/1 ### Description Returns the current UTC date as a value in 'YYYY-MM-DD'. -Return type: DATE +**Return type:** `DATE` Specification: UTC_DATE() -> DATE -Example +### Example ```ppl ignore @@ -2534,9 +2534,9 @@ fetched rows / total rows = 1/1 ### Description Returns the current UTC time as a value in 'hh:mm:ss'. -Return type: TIME +**Return type:** `TIME` Specification: UTC_TIME() -> TIME -Example +### Example ```ppl ignore @@ -2565,9 +2565,9 @@ fetched rows / total rows = 1/1 ### Description Returns the current UTC timestamp as a value in 'YYYY-MM-DD hh:mm:ss'. -Return type: TIMESTAMP +**Return type:** `TIMESTAMP` Specification: UTC_TIMESTAMP() -> TIMESTAMP -Example +### Example ```ppl ignore @@ -2595,7 +2595,7 @@ fetched rows / total rows = 1/1 ### Description -Usage: week(date[, mode]) returns the week number for date. If the mode argument is omitted, the default mode 0 is used. +Usage: `week(date[, mode])` returns the week number for date. If the mode argument is omitted, the default mode 0 is used. The following table describes how the mode argument works. @@ -2611,10 +2611,10 @@ The following table describes how the mode argument works. 
| 7 | Monday | 1-53 | with a Monday in this year | -Argument type: DATE/TIMESTAMP/STRING -Return type: INTEGER +**Argument type:** `DATE/TIMESTAMP/STRING` +**Return type:** `INTEGER` Synonyms: [WEEK_OF_YEAR](#week_of_year) -Example +### Example ```ppl @@ -2642,11 +2642,11 @@ fetched rows / total rows = 1/1 ### Description -Usage: weekday(date) returns the weekday index for date (0 = Monday, 1 = Tuesday, ..., 6 = Sunday). +Usage: `weekday(date)` returns the weekday index for date (0 = Monday, 1 = Tuesday, ..., 6 = Sunday). It is similar to the [dayofweek](#dayofweek) function, but returns different indexes for each day. -Argument type: STRING/DATE/TIME/TIMESTAMP -Return type: INTEGER -Example +**Argument type:** `STRING/DATE/TIME/TIMESTAMP` +**Return type:** `INTEGER` +### Example ```ppl @@ -2675,7 +2675,7 @@ fetched rows / total rows = 1/1 ### Description -Usage: week_of_year(date[, mode]) returns the week number for date. If the mode argument is omitted, the default mode 0 is used. +Usage: `week_of_year(date[, mode])` returns the week number for date. If the mode argument is omitted, the default mode 0 is used. The following table describes how the mode argument works. @@ -2691,10 +2691,10 @@ The following table describes how the mode argument works. | 7 | Monday | 1-53 | with a Monday in this year | -Argument type: DATE/TIMESTAMP/STRING -Return type: INTEGER +**Argument type:** `DATE/TIMESTAMP/STRING` +**Return type:** `INTEGER` Synonyms: [WEEK](#week) -Example +### Example ```ppl @@ -2722,10 +2722,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: year(date) returns the year for date, in the range 1000 to 9999, or 0 for the “zero” date. -Argument type: STRING/DATE/TIMESTAMP -Return type: INTEGER -Example +Usage: `year(date)` returns the year for date, in the range 1000 to 9999, or 0 for the “zero” date. +**Argument type:** `STRING/DATE/TIMESTAMP` +**Return type:** `INTEGER` +### Example ```ppl @@ -2753,10 +2753,10 @@ fetched rows / total rows = 1/1 ### Description -Usage: yearweek(date[, mode]) returns the year and week for date as an integer. It accepts and optional mode arguments aligned with those available for the [WEEK](#week) function. -Argument type: STRING/DATE/TIME/TIMESTAMP -Return type: INTEGER -Example +Usage: `yearweek(date[, mode])` returns the year and week for date as an integer. It accepts and optional mode arguments aligned with those available for the [WEEK](#week) function. +**Argument type:** `STRING/DATE/TIME/TIMESTAMP` +**Return type:** `INTEGER` +### Example ```ppl diff --git a/docs/user/ppl/functions/ip.md b/docs/user/ppl/functions/ip.md index 673a0a8d25..c21816baea 100644 --- a/docs/user/ppl/functions/ip.md +++ b/docs/user/ppl/functions/ip.md @@ -5,9 +5,11 @@ ### Description Usage: `cidrmatch(ip, cidr)` checks if `ip` is within the specified `cidr` range. -Argument type: STRING/IP, STRING -Return type: BOOLEAN -Example + +**Argument type:** `STRING`/`IP`, `STRING` +**Return type:** `BOOLEAN` + +### Example ```ppl source=weblogs @@ -37,9 +39,11 @@ Note: ### Description Usage: `geoip(dataSourceName, ipAddress[, options])` to lookup location information from given IP addresses via OpenSearch GeoSpatial plugin API. 
-Argument type: STRING, STRING/IP, STRING -Return type: OBJECT -Example: + +**Argument type:** `STRING`, `STRING`/`IP`, `STRING` +**Return type:** `OBJECT` + +### Example: ```ppl ignore source=weblogs @@ -58,4 +62,4 @@ fetched rows / total rows = 1/1 Note: - `dataSourceName` must be an established dataSource on OpenSearch GeoSpatial plugin, detail of configuration can be found: https://opensearch.org/docs/latest/ingest-pipelines/processors/ip2geo/ - `ip` can be an IPv4 or an IPv6 address - - `options` is an optional String of comma separated fields to output: the selection of fields is subject to dataSourceProvider's schema. For example, the list of fields in the provided `geolite2-city` dataset includes: "country_iso_code", "country_name", "continent_name", "region_iso_code", "region_name", "city_name", "time_zone", "location" \ No newline at end of file + - `options` is an optional String of comma separated fields to output: the selection of fields is subject to dataSourceProvider's schema. For example, the list of fields in the provided `geolite2-city` dataset includes: "country_iso_code", "country_name", "continent_name", "region_iso_code", "region_name", "city_name", "time_zone", "location" diff --git a/docs/user/ppl/functions/json.md b/docs/user/ppl/functions/json.md index 8d0b29883a..e9bd8cf8ac 100644 --- a/docs/user/ppl/functions/json.md +++ b/docs/user/ppl/functions/json.md @@ -23,9 +23,9 @@ Notes: ### Description Usage: `json(value)` Evaluates whether a string can be parsed as a json-encoded string. Returns the value if valid, null otherwise. -Argument type: STRING -Return type: STRING -Example +**Argument type:** `STRING` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -53,10 +53,10 @@ fetched rows / total rows = 4/4 ### Description Version: 3.1.0 -Limitation: Only works when plugins.calcite.enabled=true +Limitation: Only works when `plugins.calcite.enabled=true` Usage: `json_valid(value)` Evaluates whether a string uses valid JSON syntax. Returns TRUE if valid, FALSE if invalid. NULL input returns NULL. -Argument type: STRING -Return type: BOOLEAN +**Argument type:** `STRING ` +**Return type:** `BOOLEAN ` Example ```ppl @@ -82,9 +82,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_object(key1, value1, key2, value2...)` create a json object string with key value pairs. The key must be string. -Argument type: key1: STRING, value1: ANY, key2: STRING, value2: ANY ... -Return type: STRING -Example +**Argument type:** `key1: STRING, value1: ANY, key2: STRING, value2: ANY ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -109,9 +109,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_array(element1, element2, ...)` create a json array string with elements. -Argument type: element1: ANY, element2: ANY ... -Return type: STRING -Example +**Argument type:** `element1: ANY, element2: ANY ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -136,9 +136,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_array_length(value)` parse the string to json array and return size,, null is returned in case of any other valid JSON string, null or an invalid JSON. 
-Argument type: value: A JSON STRING -Return type: INTEGER -Example +**Argument type:** `value: A JSON STRING` +**Return type:** `INTEGER` +### Example ```ppl source=json_test @@ -181,9 +181,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_extract(json_string, path1, path2, ...)` Extracts values using the specified JSON paths. If only one path is provided, it returns a single value. If multiple paths are provided, it returns a JSON Array in the order of the paths. If one path cannot find value, return null as the result for this path. The path use "{}" to represent index for array, "{}" means "{*}". -Argument type: json_string: STRING, path1: STRING, path2: STRING ... -Return type: STRING -Example +**Argument type:** `json_string: STRING, path1: STRING, path2: STRING ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -226,9 +226,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_delete(json_string, path1, path2, ...)` Delete values using the specified JSON paths. Return the json string after deleting. If one path cannot find value, do nothing. -Argument type: json_string: STRING, path1: STRING, path2: STRING ... -Return type: STRING -Example +**Argument type:** `json_string: STRING, path1: STRING, path2: STRING ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -289,9 +289,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_set(json_string, path1, value1, path2, value2...)` Set values to corresponding paths using the specified JSON paths. If one path's parent node is not a json object, skip the path. Return the json string after setting. -Argument type: json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ... -Return type: STRING -Example +**Argument type:** `json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -334,9 +334,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_append(json_string, path1, value1, path2, value2...)` Append values to corresponding paths using the specified JSON paths. If one path's target node is not an array, skip the path. Return the json string after setting. -Argument type: json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ... -Return type: STRING -Example +**Argument type:** `json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -397,9 +397,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_extend(json_string, path1, value1, path2, value2...)` Extend values to corresponding paths using the specified JSON paths. If one path's target node is not an array, skip the path. The function will try to parse the value as an array. If it can be parsed, extend it to the target array. Otherwise, regard the value a single one. Return the json string after setting. -Argument type: json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ... -Return type: STRING -Example +**Argument type:** `json_string: STRING, path1: STRING, value1: ANY, path2: STRING, value2: ANY ...` +**Return type:** `STRING` +### Example ```ppl source=json_test @@ -460,9 +460,9 @@ fetched rows / total rows = 1/1 ### Description Usage: `json_keys(json_string)` Return the key list of the Json object as a Json array. Otherwise, return null. 
-Argument type: json_string: A JSON STRING
-Return type: STRING
-Example
+**Argument type:** `json_string: A JSON STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=json_test
diff --git a/docs/user/ppl/functions/math.md b/docs/user/ppl/functions/math.md
index 6b2fe319df..834e3523fd 100644
--- a/docs/user/ppl/functions/math.md
+++ b/docs/user/ppl/functions/math.md
@@ -4,10 +4,10 @@

### Description

-Usage: abs(x) calculates the abs x.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: INTEGER/LONG/FLOAT/DOUBLE
-Example
+Usage: `abs(x)` calculates the absolute value of x.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `INTEGER/LONG/FLOAT/DOUBLE`
+### Example

```ppl
source=people
@@ -30,11 +30,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: add(x, y) calculates x plus y.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider number between x and y
+Usage: `add(x, y)` calculates x plus y.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider number between x and y`
Synonyms: Addition Symbol (+)
-Example
+### Example

```ppl
source=people
@@ -57,11 +57,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: subtract(x, y) calculates x minus y.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider number between x and y
+Usage: `subtract(x, y)` calculates x minus y.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider number between x and y`
Synonyms: Subtraction Symbol (-)
-Example
+### Example

```ppl
source=people
@@ -84,11 +84,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: multiply(x, y) calculates the multiplication of x and y.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider number between x and y. If y equals to 0, then returns NULL.
+Usage: `multiply(x, y)` calculates the product of x and y.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider number between x and y`
Synonyms: Multiplication Symbol (\*)
-Example
+### Example

```ppl
source=people
@@ -111,11 +111,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: divide(x, y) calculates x divided by y.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider number between x and y
+Usage: `divide(x, y)` calculates x divided by y.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider number between x and y. If y equals 0, then returns NULL.`
Synonyms: Division Symbol (/)
-Example
+### Example

```ppl
source=people
@@ -138,11 +138,11 @@ fetched rows / total rows = 1/1

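+### Example: Division by zero returns NULL
+
+As the return type note states, a zero divisor yields NULL rather than an error. The following query is an illustrative sketch (the `div_result` alias is arbitrary):
+
+```ppl
+source=people | eval div_result = divide(10, 0) | fields div_result | head 1
+```
+
+Expected output:
+
+```text
+fetched rows / total rows = 1/1
++------------+
+| div_result |
+|------------|
+| null       |
++------------+
+```
+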
### Description

-Usage: sum(x, y, ...) calculates the sum of all provided arguments. This function accepts a variable number of arguments.
+Usage: `sum(x, y, ...)` calculates the sum of all provided arguments. This function accepts a variable number of arguments.
Note: This function is only available in the eval command context and is rewritten to arithmetic addition while query parsing.
-Argument type: Variable number of INTEGER/LONG/FLOAT/DOUBLE arguments
-Return type: Wider number type among all arguments
-Example
+**Argument type:** `Variable number of INTEGER/LONG/FLOAT/DOUBLE arguments`
+**Return type:** `Wider number type among all arguments`
+### Example

```ppl
source=accounts
@@ -188,11 +188,11 @@ fetched rows / total rows = 4/4

### Description

-Usage: avg(x, y, ...) calculates the average (arithmetic mean) of all provided arguments. This function accepts a variable number of arguments.
+Usage: `avg(x, y, ...)` calculates the average (arithmetic mean) of all provided arguments. This function accepts a variable number of arguments.
Note: This function is only available in the eval command context and is rewritten to arithmetic expression (sum / count) at query parsing time.
-Argument type: Variable number of INTEGER/LONG/FLOAT/DOUBLE arguments
-Return type: DOUBLE
-Example
+**Argument type:** `Variable number of INTEGER/LONG/FLOAT/DOUBLE arguments`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=accounts
@@ -238,10 +238,10 @@ fetched rows / total rows = 4/4

### Description

-Usage: acos(x) calculates the arc cosine of x. Returns NULL if x is not in the range -1 to 1.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `acos(x)` calculates the arc cosine of x. Returns NULL if x is not in the range -1 to 1.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -264,10 +264,10 @@ fetched rows / total rows = 1/1

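+### Example: Out-of-range argument returns NULL
+
+Because acos is only defined for arguments in the range -1 to 1, any value outside that range returns NULL, as noted above. The following query is an illustrative sketch (the `invalid_acos` alias is arbitrary):
+
+```ppl
+source=people | eval invalid_acos = acos(2) | fields invalid_acos | head 1
+```
+
+Expected output:
+
+```text
+fetched rows / total rows = 1/1
++--------------+
+| invalid_acos |
+|--------------|
+| null         |
++--------------+
+```
+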
+**Argument type:** `x: STRING, a: INTEGER, b: INTEGER`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -416,10 +416,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: cos(x) calculates the cosine of x, where x is given in radians.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `cos(x)` calculates the cosine of x, where x is given in radians.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -442,10 +442,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: cosh(x) calculates the hyperbolic cosine of x, defined as (((e^x) + (e^(-x))) / 2).
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `cosh(x)` calculates the hyperbolic cosine of x, defined as (((e^x) + (e^(-x))) / 2).
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -468,10 +468,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: cot(x) calculates the cotangent of x. Returns out-of-range error if x equals to 0.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `cot(x)` calculates the cotangent of x. Returns an out-of-range error if x equals 0.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -495,9 +495,9 @@ fetched rows / total rows = 1/1

### Description
Usage: Calculates a cyclic redundancy check value and returns a 32-bit unsigned value.
-Argument type: STRING
-Return type: LONG
-Example
+**Argument type:** `STRING`
+**Return type:** `LONG`
+### Example

```ppl
source=people
@@ -520,10 +520,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: degrees(x) converts x from radians to degrees.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `degrees(x)` converts x from radians to degrees.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -546,9 +546,9 @@ fetched rows / total rows = 1/1

### Description

-Usage: E() returns the Euler's number
-Return type: DOUBLE
-Example
+Usage: `E()` returns Euler's number.
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -571,10 +571,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: exp(x) return e raised to the power of x.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `exp(x)` returns e raised to the power of x.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -598,9 +598,9 @@ fetched rows / total rows = 1/1

### Description
Usage: expm1(NUMBER T) returns the exponential of T, minus 1.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -623,11 +623,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: FLOOR(T) takes the floor of value T.
+Usage: `FLOOR(T)` takes the floor of value T.
 Limitation: FLOOR only works as expected when IEEE 754 double type displays decimal when stored.
-Argument type: a: INTEGER/LONG/FLOAT/DOUBLE
-Return type: same type with input
-Example
+**Argument type:** `a: INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `same type as the input`
+### Example

```ppl
source=people
@@ -684,10 +684,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: ln(x) return the the natural logarithm of x. 
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `ln(x)` returns the natural logarithm of x.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -711,10 +711,10 @@ fetched rows / total rows = 1/1

### Description
Specifications:
-Usage: log(x) returns the natural logarithm of x that is the base e logarithm of the x. log(B, x) is equivalent to log(x)/log(B).
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `log(x)` returns the natural logarithm of x, that is, the base-e logarithm of x. `log(B, x)` is equivalent to log(x)/log(B).
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -739,9 +739,9 @@ fetched rows / total rows = 1/1

Specifications:

Usage: log2(x) is equivalent to log(x)/log(2).
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -766,9 +766,9 @@ fetched rows / total rows = 1/1

Specifications:

Usage: log10(x) is equivalent to log(x)/log(10).
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -791,10 +791,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: MOD(n, m) calculates the remainder of the number n divided by m.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider type between types of n and m if m is nonzero value. If m equals to 0, then returns NULL.
-Example
+Usage: `MOD(n, m)` calculates the remainder of the number n divided by m.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider type between types of n and m if m is a nonzero value. If m equals 0, returns NULL.`
+### Example

```ppl
source=people
@@ -817,10 +817,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: MODULUS(n, m) calculates the remainder of the number n divided by m.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: Wider type between types of n and m if m is nonzero value. If m equals to 0, then returns NULL.
-Example
+Usage: `MODULUS(n, m)` calculates the remainder of the number n divided by m.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `Wider type between types of n and m if m is a nonzero value. If m equals 0, returns NULL.`
+### Example

```ppl
source=people
@@ -843,9 +843,9 @@ fetched rows / total rows = 1/1

### Description

-Usage: PI() returns the constant pi
-Return type: DOUBLE
-Example
+Usage: `PI()` returns the constant pi.
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -868,11 +868,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: POW(x, y) calculates the value of x raised to the power of y. Bad inputs return NULL result.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
+Usage: `POW(x, y)` calculates the value of x raised to the power of y. Bad inputs return a NULL result.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
 Synonyms: [POWER](#power)
-Example
+### Example

```ppl
source=people
@@ -895,11 +895,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: POWER(x, y) calculates the value of x raised to the power of y. 
Bad inputs return NULL result.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
+Usage: `POWER(x, y)` calculates the value of x raised to the power of y. Bad inputs return a NULL result.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE, INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
 Synonyms: [POW](#pow)
-Example
+### Example

```ppl
source=people
@@ -922,10 +922,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: radians(x) converts x from degrees to radians.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `radians(x)` converts x from degrees to radians.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -948,10 +948,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: RAND()/RAND(N) returns a random floating-point value in the range 0 <= value < 1.0. If integer N is specified, the seed is initialized prior to execution. One implication of this behavior is with identical argument N, rand(N) returns the same value each time, and thus produces a repeatable sequence of column values.
-Argument type: INTEGER
-Return type: FLOAT
-Example
+Usage: `RAND()`/`RAND(N)` returns a random floating-point value in the range 0 <= value < 1.0. If integer N is specified, the seed is initialized prior to execution. One implication of this behavior is that with an identical argument N, `rand(N)` returns the same value each time, and thus produces a repeatable sequence of column values.
+**Argument type:** `INTEGER`
+**Return type:** `FLOAT`
+### Example

```ppl
source=people
@@ -974,12 +974,12 @@ fetched rows / total rows = 1/1

### Description

-Usage: ROUND(x, d) rounds the argument x to d decimal places, d defaults to 0 if not specified
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
+Usage: `ROUND(x, d)` rounds the argument x to d decimal places; d defaults to 0 if not specified.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
 Return type map:
 (INTEGER/LONG [,INTEGER]) -> LONG
 (FLOAT/DOUBLE [,INTEGER]) -> LONG
-Example
+### Example

```ppl
source=people
@@ -1003,9 +1003,9 @@ fetched rows / total rows = 1/1

### Description
Usage: Returns the sign of the argument as -1, 0, or 1, depending on whether the number is negative, zero, or positive
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: same type with input
-Example
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `same type as the input`
+### Example

```ppl
source=people
@@ -1029,10 +1029,10 @@ fetched rows / total rows = 1/1

### Description
Usage: Returns the sign of the argument as -1, 0, or 1, depending on whether the number is negative, zero, or positive
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: INTEGER
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `INTEGER`
 Synonyms: `SIGN`
-Example
+### Example

```ppl
source=people
@@ -1055,10 +1055,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: sin(x) calculates the sine of x, where x is given in radians.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `sin(x)` calculates the sine of x, where x is given in radians.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -1081,10 +1081,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: sinh(x) calculates the hyperbolic sine of x, defined as (((e^x) - (e^(-x))) / 2). 
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `sinh(x)` calculates the hyperbolic sine of x, defined as (((e^x) - (e^(-x))) / 2).
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
@@ -1108,11 +1108,11 @@ fetched rows / total rows = 1/1

### Description
Usage: Calculates the square root of a non-negative number
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
 Return type map:
 (Non-negative) INTEGER/LONG/FLOAT/DOUBLE -> DOUBLE
 (Negative) INTEGER/LONG/FLOAT/DOUBLE -> NULL
-Example
+### Example

```ppl
source=people
@@ -1136,10 +1136,10 @@ fetched rows / total rows = 1/1

### Description
Usage: Calculates the cube root of a number
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
 Return type DOUBLE:
 INTEGER/LONG/FLOAT/DOUBLE -> DOUBLE
-Example
+### Example

```ppl ignore
source=location
@@ -1163,10 +1163,10 @@ fetched rows / total rows = 2/2

### Description

-Usage: rint(NUMBER T) returns T rounded to the closest whole integer number.
-Argument type: INTEGER/LONG/FLOAT/DOUBLE
-Return type: DOUBLE
-Example
+Usage: `rint(NUMBER T)` returns T rounded to the nearest whole integer.
+**Argument type:** `INTEGER/LONG/FLOAT/DOUBLE`
+**Return type:** `DOUBLE`
+### Example

```ppl
source=people
diff --git a/docs/user/ppl/functions/statistical.md b/docs/user/ppl/functions/statistical.md
index b109856691..7f87e11ca5 100644
--- a/docs/user/ppl/functions/statistical.md
+++ b/docs/user/ppl/functions/statistical.md
@@ -4,11 +4,14 @@

### Description

-Usage: max(x, y, ...) returns the maximum value from all provided arguments. Strings are treated as greater than numbers, so if provided both strings and numbers, it will return the maximum string value (lexicographically ordered)
+Usage: `max(x, y, ...)` returns the maximum value from all provided arguments. Strings are treated as greater than numbers, so if both strings and numbers are provided, it returns the maximum string value (lexicographically ordered).
+
 Note: This function is only available in the eval command context.
-Argument type: Variable number of INTEGER/LONG/FLOAT/DOUBLE/STRING arguments
-Return type: Type of the selected argument
-Example
+
+**Argument type:** Variable number of `INTEGER`/`LONG`/`FLOAT`/`DOUBLE`/`STRING` arguments
+**Return type:** Type of the selected argument
+
+### Example

```ppl
source=accounts
@@ -74,11 +77,14 @@ fetched rows / total rows = 4/4

### Description

-Usage: min(x, y, ...) returns the minimum value from all provided arguments. Strings are treated as greater than numbers, so if provided both strings and numbers, it will return the minimum numeric value.
+Usage: `min(x, y, ...)` returns the minimum value from all provided arguments. Strings are treated as greater than numbers, so if both strings and numbers are provided, it returns the minimum numeric value.
+
 Note: This function is only available in the eval command context. 
-Argument type: Variable number of INTEGER/LONG/FLOAT/DOUBLE/STRING arguments
-Return type: Type of the selected argument
-Example
+
+**Argument type:** Variable number of `INTEGER`/`LONG`/`FLOAT`/`DOUBLE`/`STRING` arguments
+**Return type:** Type of the selected argument
+
+### Example

```ppl
source=accounts
@@ -139,4 +145,4 @@ fetched rows / total rows = 4/4
 | 33  | Dale      | 33     |
 +-----+-----------+--------+
```
- \ No newline at end of file
+
diff --git a/docs/user/ppl/functions/string.md b/docs/user/ppl/functions/string.md
index 04a3485c49..c1c64d21da 100644
--- a/docs/user/ppl/functions/string.md
+++ b/docs/user/ppl/functions/string.md
@@ -4,10 +4,10 @@

### Description

-Usage: CONCAT(str1, str2, ...., str_9) adds up to 9 strings together.
-Argument type: STRING, STRING, ...., STRING
-Return type: STRING
-Example
+Usage: `CONCAT(str1, str2, ..., str_9)` concatenates up to 9 strings together.
+**Argument type:** `STRING, STRING, ..., STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -30,10 +30,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: CONCAT_WS(sep, str1, str2) returns str1 concatenated with str2 using sep as a separator between them.
-Argument type: STRING, STRING, STRING
-Return type: STRING
-Example
+Usage: `CONCAT_WS(sep, str1, str2)` returns str1 concatenated with str2 using sep as a separator between them.
+**Argument type:** `STRING, STRING, STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -59,10 +59,10 @@ fetched rows / total rows = 1/1

Specifications:
1. LENGTH(STRING) -> INTEGER

-Usage: length(str) returns length of string measured in bytes.
-Argument type: STRING
-Return type: INTEGER
-Example
+Usage: `length(str)` returns the length of the string measured in bytes.
+**Argument type:** `STRING`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -85,7 +85,7 @@ fetched rows / total rows = 1/1

### Description

-Usage: like(string, PATTERN[, case_sensitive]) return true if the string match the PATTERN. `case_sensitive` is optional. When set to `true`, PATTERN is **case-sensitive**. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`.
+Usage: `like(string, PATTERN[, case_sensitive])` returns true if the string matches PATTERN. `case_sensitive` is optional. When set to `true`, PATTERN is **case-sensitive**. **Default:** Determined by `plugins.ppl.syntax.legacy.preferred`.

* When `plugins.ppl.syntax.legacy.preferred=true`, `case_sensitive` defaults to `false`
* When `plugins.ppl.syntax.legacy.preferred=false`, `case_sensitive` defaults to `true`
@@ -93,9 +93,9 @@ There are two wildcards often used in conjunction with the LIKE operator:
* `%` - The percent sign represents zero, one, or multiple characters
* `_` - The underscore represents a single character

-Argument type: STRING, STRING [, BOOLEAN]
-Return type: INTEGER
-Example
+**Argument type:** `STRING, STRING [, BOOLEAN]`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -119,14 +119,14 @@ Limitation: The pushdown of the LIKE function to a DSL wildcard query is support

### Description

-Usage: ilike(string, PATTERN) return true if the string match the PATTERN, PATTERN is **case-insensitive**.
+Usage: `ilike(string, PATTERN)` returns true if the string matches PATTERN; PATTERN is **case-insensitive**. 
There are two wildcards often used in conjunction with the ILIKE operator:
* `%` - The percent sign represents zero, one, or multiple characters
* `_` - The underscore represents a single character

-Argument type: STRING, STRING
-Return type: INTEGER
-Example
+**Argument type:** `STRING, STRING`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -150,10 +150,10 @@ Limitation: The pushdown of the ILIKE function to a DSL wildcard query is suppor

### Description

-Usage: locate(substr, str[, start]) returns the position of the first occurrence of substring substr in string str, starting searching from position start. If start is not specified, it defaults to 1 (the beginning of the string). Returns 0 if substr is not found. If any argument is NULL, the function returns NULL.
-Argument type: STRING, STRING[, INTEGER]
-Return type: INTEGER
-Example
+Usage: `locate(substr, str[, start])` returns the position of the first occurrence of substring substr in string str, starting the search from position start. If start is not specified, it defaults to 1 (the beginning of the string). Returns 0 if substr is not found. If any argument is NULL, the function returns NULL.
+**Argument type:** `STRING, STRING[, INTEGER]`
+**Return type:** `INTEGER`
+### Example

```ppl
source=people
@@ -176,10 +176,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: lower(string) converts the string to lowercase.
-Argument type: STRING
-Return type: STRING
-Example
+Usage: `lower(string)` converts the string to lowercase.
+**Argument type:** `STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -202,10 +202,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: ltrim(str) trims leading space characters from the string.
-Argument type: STRING
-Return type: STRING
-Example
+Usage: `ltrim(str)` trims leading space characters from the string.
+**Argument type:** `STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -229,10 +229,10 @@ fetched rows / total rows = 1/1

### Description
Usage: The syntax POSITION(substr IN str) returns the position of the first occurrence of substring substr in string str. Returns 0 if substr is not in str. Returns NULL if any argument is NULL.
-Argument type: STRING, STRING
+**Argument type:** `STRING, STRING`
 Return type INTEGER
 (STRING IN STRING) -> INTEGER
-Example
+### Example

```ppl
source=people
@@ -255,10 +255,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: replace(str, pattern, replacement) returns a string with all occurrences of the pattern replaced by the replacement string in str. If any argument is NULL, the function returns NULL.
+Usage: `replace(str, pattern, replacement)` returns a string with all occurrences of the pattern replaced by the replacement string in str. If any argument is NULL, the function returns NULL.
 **Regular Expression Support**: The pattern argument supports Java regex syntax, including:
-Argument type: STRING, STRING (regex pattern), STRING (replacement)
-Return type: STRING
+**Argument type:** `STRING, STRING (regex pattern), STRING (replacement)`
+**Return type:** `STRING`
 **Important - Regex Special Characters**: The pattern is interpreted as a regular expression. Characters like `.`, `*`, `+`, `[`, `]`, `(`, `)`, `{`, `}`, `^`, `$`, `|`, `?`, and `\` have special meaning in regex. 
To match them literally, escape with backslashes:
* To match `example.com`: use `'example\\.com'` (escape the dots)
* To match `value*`: use `'value\\*'` (escape the asterisk)
@@ -368,10 +368,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: REVERSE(str) returns reversed string of the string supplied as an argument.
-Argument type: STRING
-Return type: STRING
-Example
+Usage: `REVERSE(str)` returns the reverse of the string supplied as an argument.
+**Argument type:** `STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -394,10 +394,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: right(str, len) returns the rightmost len characters from the string str, or NULL if any argument is NULL.
-Argument type: STRING, INTEGER
-Return type: STRING
-Example
+Usage: `right(str, len)` returns the rightmost len characters from the string str, or NULL if any argument is NULL.
+**Argument type:** `STRING, INTEGER`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -420,10 +420,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: rtrim(str) trims trailing space characters from the string.
-Argument type: STRING
-Return type: STRING
-Example
+Usage: `rtrim(str)` trims trailing space characters from the string.
+**Argument type:** `STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -446,11 +446,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: substring(str, start) or substring(str, start, length) returns substring using start and length. With no length, entire string from start is returned.
-Argument type: STRING, INTEGER, INTEGER
-Return type: STRING
+Usage: `substring(str, start)` or `substring(str, start, length)` returns a substring of str using start and length. With no length, the entire string from start is returned.
+**Argument type:** `STRING, INTEGER, INTEGER`
+**Return type:** `STRING`
 Synonyms: SUBSTR
-Example
+### Example

```ppl
source=people
@@ -474,8 +474,8 @@ fetched rows / total rows = 1/1

### Description
Argument Type: STRING
-Return type: STRING
-Example
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -498,10 +498,10 @@ fetched rows / total rows = 1/1

### Description

-Usage: upper(string) converts the string to uppercase.
-Argument type: STRING
-Return type: STRING
-Example
+Usage: `upper(string)` converts the string to uppercase.
+**Argument type:** `STRING`
+**Return type:** `STRING`
+### Example

```ppl
source=people
@@ -524,11 +524,11 @@ fetched rows / total rows = 1/1

### Description

-Usage: regexp_replace(str, pattern, replacement) replace all substrings of the string value that match pattern with replacement and returns modified string value.
-Argument type: STRING, STRING, STRING
-Return type: STRING
+Usage: `regexp_replace(str, pattern, replacement)` replaces all substrings of the string value that match pattern with replacement and returns the modified string value.
+**Argument type:** `STRING, STRING, STRING`
+**Return type:** `STRING`
 Synonyms: [REPLACE](#replace)
-Example
+### Example

```ppl
source=people
diff --git a/docs/user/ppl/functions/system.md b/docs/user/ppl/functions/system.md
index 4eb2aeb811..4d394d2dd7 100644
--- a/docs/user/ppl/functions/system.md
+++ b/docs/user/ppl/functions/system.md
@@ -4,11 +4,12 @@

### Description

-Usage: typeof(expr) function returns name of the data type of the value that is passed to it. This can be helpful for troubleshooting or dynamically constructing SQL queries. 
-Argument type: ANY
-Return type: STRING
+Usage: The `typeof(expr)` function returns the name of the data type of the value that is passed to it. This can be helpful for troubleshooting or for dynamically constructing queries.

-Example
+**Argument type:** `ANY`
+**Return type:** `STRING`
+
+### Example

```ppl
source=people
@@ -26,4 +27,4 @@ fetched rows / total rows = 1/1
 | DATE         | INT         | TIMESTAMP     | STRUCT         |
 +--------------+-------------+---------------+----------------+
```
- \ No newline at end of file
+
diff --git a/scripts/docs_exporter/README.md b/scripts/docs_exporter/README.md
new file mode 100644
index 0000000000..1260dee3eb
--- /dev/null
+++ b/scripts/docs_exporter/README.md
@@ -0,0 +1,85 @@
+# PPL Documentation Exporter
+
+Exports PPL documentation to the OpenSearch documentation website. Auto-injects Jekyll front-matter, converts SQL CLI tables to markdown, fixes relative links, and adds copy buttons.
+
+## Directory Structure
+
+```
+sql/
+├── docs/user/ppl/          <-- SOURCE
+├── scripts/docs_exporter/  <-- THIS TOOL
+
+documentation-website/      <-- MUST BE SIBLING OF sql/
+└── _sql-and-ppl/ppl/       <-- DESTINATION
+```
+
+## SOP: Exporting to Documentation Website
+
+### 1. Clone documentation-website to the same root as the `sql` repo (first time only)
+
+```bash
+cd /path/to/sql/../
+git clone https://github.com/opensearch-project/documentation-website.git
+```
+
+### 2. Rebase documentation-website to latest
+
+```bash
+cd documentation-website
+git fetch origin
+git rebase origin/main
+```
+
+### 3. Run the export
+
+As of Dec 17 2025, the migration to auto-export for documentation-website is ongoing.
+Currently, only select directories (e.g. `docs/user/ppl/cmd`) are exported to documentation-website.
+
+#### How to export specific directories only
+```bash
+# Export only cmd/
+./export_to_docs_website.py --only-dirs cmd

+# Export cmd/ and functions/
+./export_to_docs_website.py --only-dirs cmd,functions
+```
+
+#### How to export all directories
+```bash
+cd sql/scripts/docs_exporter
+./export_to_docs_website.py
+```
+
+### 4. Review and commit changes
+
+```bash
+cd documentation-website
+git diff
+git add -A
+git commit -m "Update PPL documentation"
+```
+
+### 5. Open Pull Request in documentation-website repo
+Example: https://github.com/opensearch-project/documentation-website/pull/11688
+
+## Options
+
+| Option | Description |
+|--------|-------------|
+| `-y, --yes` | Auto-overwrite existing files without prompting |
+| `--only-dirs` | Comma-separated list of directories to export (e.g., `cmd`, `cmd,functions`) |
+
+## What the exporter does
+
+- Injects Jekyll front-matter (title, parent, nav_order, etc.) 
+- Converts SQL CLI table output to markdown tables +- Removes empty tables and their surrounding whitespace +- Escapes angle brackets and asterisks in table cells for Jekyll compatibility +- Converts `docs.opensearch.org` links to Jekyll site variables +- Fixes relative links to use `{{site.url}}{{site.baseurl}}` +- Handles anchor normalization (removes dots) +- Converts `ppl` code fences to `sql` +- Converts `bash ignore` code blocks to `json` with curl copy buttons +- Adds copy buttons to code blocks +- Converts markdown emphasis (**Note**, **Warning**, **Important**) to Jekyll attribute syntax +- Rolls up third-level directories to avoid Jekyll rendering limitations diff --git a/scripts/docs_exporter/export_to_docs_website.py b/scripts/docs_exporter/export_to_docs_website.py index 0ba63aa5c5..51c708cf3f 100755 --- a/scripts/docs_exporter/export_to_docs_website.py +++ b/scripts/docs_exporter/export_to_docs_website.py @@ -24,14 +24,15 @@ import re import os +import argparse from collections import defaultdict from pathlib import Path from typing import Optional # Base path for links in the documentation website -DOCS_BASE_PATH = "sql-and-ppl/ppl-reference" -DOCS_BASE_TITLE = "OpenSearch PPL Reference Manual" +DOCS_PARENT_BASE_PATH = "sql-and-ppl/ppl" +DOCS_PARENT_BASE_TITLE = "PPL" # Directory name to heading mappings (as they appear on website) DIR_NAMES_TO_HEADINGS_MAP = { @@ -44,12 +45,105 @@ "reference": "Reference", } +# Custom redirect_from lists for specific files (relative path from source root) +# Required for backward compatibility from old website links. Injected in Jekyll front-matter. +CUSTOM_REDIRECTS = { + "cmd/index.md": [ + "/search-plugins/sql/ppl/functions/", + "/observability-plugin/ppl/commands/", + "/search-plugins/ppl/commands/", + "/search-plugins/ppl/functions/", + "/sql-and-ppl/ppl/functions/", + ], +} + +# Directory name mappings for export (source_dir -> target_dir) +DIR_PATH_MAPPINGS = { + "cmd": "commands", +} + +# Command title overrides (filename -> custom title) +CMD_TITLE_OVERRIDES = { + "showdatasources": "show datasources", + "syntax": "PPL syntax", +} + def get_heading_for_dir(dir_name: str) -> str: """Get heading for directory name, using mapped value or fallback to title-case.""" return DIR_NAMES_TO_HEADINGS_MAP.get(dir_name, dir_name.replace("-", " ").title()) +def map_directory_path(rel_path: Path) -> Path: + """Map directory paths from source to target naming conventions.""" + parts = list(rel_path.parts) + + # Apply directory mappings + for i, part in enumerate(parts[:-1]): # Don't map the filename itself + if part in DIR_PATH_MAPPINGS: + parts[i] = DIR_PATH_MAPPINGS[part] + + return Path(*parts) + + +def convert_sql_table_to_markdown(table_text: str) -> str: + """Convert SQL CLI table format to markdown table.""" + lines = table_text.strip().split('\n') + result = [] + header_done = False + data_row_count = 0 + + for line in lines: + # Skip border lines (+---+---+), separator lines (|---+---|), and fetched rows line + if re.match(r'^\+[-+]+\+$', line.strip()) or re.match(r'^\|[-+|]+\|$', line.strip()): + continue + if re.match(r'^fetched rows\s*/\s*total rows\s*=', line.strip()): + continue + # Data/header row + if line.strip().startswith('|') and line.strip().endswith('|'): + cells = [c.strip() for c in line.strip().strip('|').split('|')] + # Escape angle brackets for Jekyll in converted tables (results tables) + cells = [c.replace('<', '\\<').replace('>', '\\>').replace('*', '\\*') for c in cells] + result.append('| ' + ' | '.join(cells) + ' 
|') + if not header_done: + result.append('|' + '|'.join([' --- ' for _ in cells]) + '|') + header_done = True + else: + data_row_count += 1 + + # Return empty string if table has no data rows (only header) + if data_row_count == 0: + return '' + + return '\n'.join(result) + + +def convert_tables_in_code_blocks(content: str) -> str: + """Find and convert SQL CLI tables in code blocks to markdown tables.""" + def replace_table(match): + block_content = match.group(1) + # Check if this looks like a SQL CLI table + if re.search(r'^\+[-+]+\+$', block_content, re.MULTILINE): + return convert_sql_table_to_markdown(block_content) + return match.group(0) + + # First, remove empty tables with their trailing blank line + def replace_empty_table(match): + block_content = match.group(1) + # Check if table is empty + if re.search(r'^\+[-+]+\+$', block_content, re.MULTILINE): + converted_table = convert_sql_table_to_markdown(block_content) + if converted_table == '': + return '' # Remove entire match (table + blank line) + return match.group(0) # Keep original if not empty table + + # Remove empty tables and their surrounding blank lines + content = re.sub(r'\n```[^\n]*\n(.*?)```\n', replace_empty_table, content, flags=re.DOTALL) + + # Then convert remaining tables normally + return re.sub(r'```[^\n]*\n(.*?)```', replace_table, content, flags=re.DOTALL) + + def extract_title(content: str) -> Optional[str]: """Extract title from first H1 heading or return None.""" match = re.search(r'^#\s+(.+)$', content, re.MULTILINE) @@ -62,25 +156,23 @@ def generate_frontmatter( grand_parent: Optional[str] = None, nav_order: int = 1, has_children: bool = False, - redirect_from: Optional[str] = None, + redirect_from: Optional[list] = None, ) -> str: """Generate Jekyll front-matter.""" - def escape_yaml_string(s: str) -> str: - """Escape string for YAML double quotes.""" - return s.replace('\\', '\\\\').replace('"', '\\"') - fm = ["---", "layout: default"] if title: - fm.append(f'title: "{escape_yaml_string(title)}"') + fm.append(f"title: {title}") if parent: - fm.append(f'parent: "{escape_yaml_string(parent)}"') + fm.append(f"parent: {parent}") if grand_parent: - fm.append(f'grand_parent: "{escape_yaml_string(grand_parent)}"') + fm.append(f"grand_parent: {grand_parent}") fm.append(f"nav_order: {nav_order}") if has_children: fm.append("has_children: true") if redirect_from: - fm.append(f'redirect_from: ["{escape_yaml_string(redirect_from)}"]') + fm.append("redirect_from:") + for r in redirect_from: + fm.append(f" - {r}") fm.append("---\n") return "\n".join(fm) @@ -108,8 +200,8 @@ def fix_link(match, current_file_path=None): if link.startswith("http"): return match.group(0) - # Remove .md extension - link = link.replace(".md", "") + # Remove .md and .rst extensions + link = link.replace(".md", "").replace(".rst", "") # Resolve path based on link type if ( @@ -134,27 +226,38 @@ def fix_link(match, current_file_path=None): # Clean up malformed paths resolved_path = re.sub(r"[,\s]+", "-", resolved_path.strip()) - # Normalize anchor for Jekyll (remove dots and dashes) + # Normalize anchor for Jekyll (remove dots) if anchor: - anchor = re.sub(r"[.-]", "", anchor.lower()) + anchor = re.sub(r"[.]", "", anchor.lower()) # Add trailing slash for directories (but not with anchors) if resolved_path and not resolved_path.endswith((".html", ".htm")) and not anchor: resolved_path = resolved_path.rstrip("/") + "/" - return f"]({{{{site.url}}}}{{{{site.baseurl}}}}/{DOCS_BASE_PATH}/{resolved_path}{anchor})" + return 
f"]({{{{site.url}}}}{{{{site.baseurl}}}}/{DOCS_PARENT_BASE_PATH}/{resolved_path}{anchor})" def process_content(content: str, current_file_path=None) -> str: """Process markdown content with PPL->SQL conversion, copy buttons, and link fixes.""" + # Convert SQL CLI tables in code blocks to markdown tables + content = convert_tables_in_code_blocks(content) + # Convert PPL code fences to SQL content = re.sub(r'^```ppl\b.*$', '```sql', content, flags=re.MULTILINE) + # Convert bash ignore blocks to JSON with copy-curl buttons + content = re.sub(r'^```bash ignore\b.*?\n(.*?)^```$', + r'```json\n\1```\n{% include copy-curl.html %}', + content, flags=re.MULTILINE | re.DOTALL) + # Add copy buttons after code fences - content = re.sub(r'^```(bash|sh|sql)\b.*?\n(.*?)^```$', - r'```\1\n\2```\n{% include copy.html %}', + content = re.sub(r'^```(bash|sh|sql)\b.*?\n(.*?)^```$', + r'```\1\n\2```\n{% include copy.html %}', content, flags=re.MULTILINE | re.DOTALL) + # Convert syntax code fences to SQL (for syntax definitions, no copy buttons) + content = re.sub(r'^```syntax\b.*$', '```sql', content, flags=re.MULTILINE) + # Convert relative links with current file context def fix_link_with_context(match): return fix_link(match, current_file_path) @@ -163,10 +266,36 @@ def fix_link_with_context(match): r"\]\((?!https?://)(.*?)(\.md)?(#[^\)]*)?\)", fix_link_with_context, content ) + # Convert docs.opensearch.org links to site variables + def fix_opensearch_link(match): + path = match.group(1) + return f"]({{{{site.url}}}}{{{{site.baseurl}}}}{path})" + + content = re.sub( + r"\]\(https://docs\.opensearch\.org/[^/]+(.*?)\)", fix_opensearch_link, content + ) + + for source_dir, target_dir in DIR_PATH_MAPPINGS.items(): + content = content.replace(f'/{source_dir}/', f'/{target_dir}/') + + # Handle specific admin/settings link + content = content.replace('](../../admin/settings.rst)', ']({{site.url}}{{site.baseurl}}/sql-and-ppl/settings/)') + content = content.replace('](../../admin/settings.md)', ']({{site.url}}{{site.baseurl}}/sql-and-ppl/settings/)') + + # Convert markdown blockquotes to Jekyll attribute syntax + content = re.sub(r'^> \*\*Note\*\*:?\s*(.*?)$', r'\1\n{: .note}', content, flags=re.MULTILINE) + content = re.sub(r'^> \*\*Warning\*\*:?\s*(.*?)$', r'\1\n{: .warning}', content, flags=re.MULTILINE) + content = re.sub(r'^> \*\*Important\*\*:?\s*(.*?)$', r'\1\n{: .important}', content, flags=re.MULTILINE) + return content -def export_docs(source_dir: Path, target_dir: Path) -> None: +def export_docs( + source_dir: Path, + target_dir: Path, + auto_yes: bool = False, + only_dirs: Optional[set] = None, +) -> None: """Export PPL docs to documentation website.""" if not source_dir.exists(): print(f"Source directory {source_dir} not found") @@ -174,14 +303,33 @@ def export_docs(source_dir: Path, target_dir: Path) -> None: # Check if target directory exists and has files if target_dir.exists() and any(target_dir.glob('**/*.md')): - response = input(f"Target directory {target_dir} contains files. Overwrite? (y/n): ") - if response.lower() != 'y': - print("Export cancelled") - return + if auto_yes: + print( + f"Target directory {target_dir} contains files. Auto-overwriting (--yes flag)." + ) + else: + response = input( + f"Target directory {target_dir} contains files. Overwrite? 
(y/n): "
+            )
+            if response.lower() != "y":
+                print("Export cancelled")
+                return

     # Get all markdown files sorted alphabetically
     md_files = sorted(source_dir.glob("**/*.md"))

+    # Filter to only specified directories if provided
+    if only_dirs:
+        md_files = [
+            f for f in md_files if f.relative_to(source_dir).parts[0] in only_dirs
+        ]
+
     # Group files by directory for local nav_order
     files_by_dir = defaultdict(list)
@@ -200,13 +348,18 @@ def export_docs(source_dir: Path, target_dir: Path) -> None:
     # Sort files within each directory alphabetically for proper nav_order
     for dir_name in files_by_dir:
         files_by_dir[dir_name].sort(key=lambda f: f.name)
+        if dir_name == "cmd" and any(f.name == "syntax.md" for f in files_by_dir[dir_name]):
+            files_by_dir[dir_name].sort(key=lambda f: (f.name != "syntax.md", f.name))

     for _, files in files_by_dir.items():
         for i, md_file in enumerate(files, 1):
             rel_path = md_file.relative_to(source_dir)
+            rel_path_str = str(rel_path)
+
+            # Check for custom redirects
+            redirect_from = CUSTOM_REDIRECTS.get(rel_path_str, None)

             # Roll up third-level files to second level to avoid rendering limitations
-            redirect_from = None
             if len(rel_path.parts) >= 3:
                 # Move from admin/connectors/file.md to admin/connectors_file.md
                 parent_dir = rel_path.parts[0]  # e.g., "admin"
@@ -216,8 +369,8 @@ def export_docs(source_dir: Path, target_dir: Path) -> None:
                 target_file = target_dir / parent_dir / new_filename

                 # Generate redirect_from for the original path
-                original_path = f"/{DOCS_BASE_PATH}/{rel_path.with_suffix('')}/"
-                redirect_from = original_path
+                original_path = f"/{DOCS_PARENT_BASE_PATH}/{rel_path.with_suffix('')}/"
+                redirect_from = (redirect_from or []) + [original_path]

                 print(
                     f"\033[93mWARNING: Rolling up {rel_path} to {parent_dir}/{new_filename} due to rendering limitations\033[0m"
@@ -226,7 +379,8 @@ def export_docs(source_dir: Path, target_dir: Path) -> None:
                 # Update rel_path for parent/grand_parent logic
                 rel_path = Path(parent_dir) / new_filename
             else:
-                target_file = target_dir / rel_path
+                mapped_path = map_directory_path(rel_path)
+                target_file = target_dir / mapped_path

             # Determine parent and grand_parent based on directory structure
             if rel_path.parent == Path("."):
@@ -236,27 +390,37 @@ def export_docs(source_dir: Path, target_dir: Path) -> None:
             elif len(rel_path.parts) == 2:
                 # Second level files (including rolled-up files)
                 parent = get_heading_for_dir(rel_path.parent.name)
-                grand_parent = DOCS_BASE_TITLE
+                grand_parent = DOCS_PARENT_BASE_TITLE
             else:
                 # This shouldn't happen after roll-up, but keeping for safety
                 parent = get_heading_for_dir(rel_path.parent.name)
                 grand_parent = get_heading_for_dir(rel_path.parts[-3])
-                grand_parent = DIR_NAMES_TO_HEADINGS_MAP.get(
-                    grand_parent_name, grand_parent_name.replace("-", " ").title()
-                )

-            # Check if this is the root index.md and has children
-            is_root_index = rel_path.name == "index.md" and rel_path.parent == Path(".")
-            has_children = (
-                is_root_index
-                or (md_file.parent / md_file.stem).is_dir()
-                and any((md_file.parent / md_file.stem).glob("*/*.md"))
-            )
+            # Check if this is an index.md (root or directory) - these have children
+            is_index = rel_path.name == "index.md"
+            has_children = is_index
+
+            # For directory index files, parent should be one level up
+            if is_index and rel_path.parent != Path("."):
+                parent = DOCS_PARENT_BASE_TITLE
+                grand_parent = None

-            
title = ( - extract_title(md_file.read_text(encoding="utf-8")) - or md_file.stem.replace("-", " ").title() - ) + # Determine title - use directory name for index files, filename for cmd files + if is_index: + # For index files, use the directory heading as title + title = get_heading_for_dir(rel_path.parent.name) if rel_path.parent != Path(".") else DOCS_PARENT_BASE_TITLE + elif len(rel_path.parts) >= 2 and rel_path.parts[0] == "cmd": + # For command files, check for custom title override first + if md_file.stem in CMD_TITLE_OVERRIDES: + title = CMD_TITLE_OVERRIDES[md_file.stem] + else: + # Use filename as ground truth + title = md_file.stem.replace("-", " ") + else: + title = ( + extract_title(md_file.read_text(encoding="utf-8")) + or md_file.stem.replace("-", " ").title() + ) frontmatter = generate_frontmatter( title, parent, grand_parent, i, has_children, redirect_from ) @@ -278,6 +442,12 @@ def export_docs(source_dir: Path, target_dir: Path) -> None: # Skip third-level directories since files are rolled up if len(dir_path.parts) > 1: continue + # Skip directories not in only_dirs filter + if only_dirs and dir_path.parts[0] not in only_dirs: + continue + # Skip if source index.md exists (it will be exported with the other files) + if (source_dir / dir_path / "index.md").exists(): + continue target_index = target_dir / dir_path / "index.md" title = get_heading_for_dir(dir_path.name) @@ -285,7 +455,7 @@ def export_docs(source_dir: Path, target_dir: Path) -> None: # Determine parent for directory index based on depth if len(dir_path.parts) == 1: # Second-level directory (e.g., admin/) - parent is root title - parent = DOCS_BASE_TITLE + parent = DOCS_PARENT_BASE_TITLE else: # This shouldn't happen after filtering, but keeping for safety parent = get_heading_for_dir(dir_path.parent.name) @@ -296,7 +466,26 @@ def export_docs(source_dir: Path, target_dir: Path) -> None: if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Export PPL docs to documentation website" + ) + parser.add_argument( + "-y", + "--yes", + action="store_true", + help="Automatically overwrite existing files without prompting", + ) + parser.add_argument( + "--only-dirs", + type=str, + help="Comma-separated list of directories to export (e.g., 'cmd' or 'cmd,functions')", + ) + args = parser.parse_args() + script_dir = Path(__file__).parent source_dir_ppl = script_dir / "../../docs/user/ppl" - target_dir_ppl = script_dir / f"../../../documentation-website/_{DOCS_BASE_PATH}" - export_docs(source_dir_ppl, target_dir_ppl) + target_dir_ppl = ( + script_dir / f"../../../documentation-website/_{DOCS_PARENT_BASE_PATH}" + ) + only_dirs = set(args.only_dirs.split(",")) if args.only_dirs else None + export_docs(source_dir_ppl, target_dir_ppl, args.yes, only_dirs)
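Reviewer note: a minimal sketch of what the new table conversion produces, assuming `export_to_docs_website.py` is importable (i.e., run from `scripts/docs_exporter`). The module and function names come from the diff above; the sample table contents are illustrative only.

```python
# Minimal sketch: feed a SQL CLI-style result table through the new converter.
# Assumes the working directory is scripts/docs_exporter so the import resolves;
# the table below is made up for illustration, not taken from the real docs.
from export_to_docs_website import convert_sql_table_to_markdown

cli_table = """\
+-------+-----------+
| age   | firstname |
+-------+-----------+
| 32    | Amber     |
+-------+-----------+
fetched rows / total rows = 1/1"""

print(convert_sql_table_to_markdown(cli_table))
# Border lines, separator lines, and the "fetched rows" footer are dropped;
# the first pipe-delimited row becomes the header. Expected output:
# | age | firstname |
# | --- | --- |
# | 32 | Amber |
```

A table with a header but no data rows comes back as an empty string, which is what lets `convert_tables_in_code_blocks` strip empty result tables (and their surrounding blank lines) from the exported pages.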