[RFC] PPL `convert` Command

## Problem Statement

PPL currently lacks a unified command for converting field values between different data types and formats. Users frequently need to convert data for analysis, calculations, and reporting purposes, but must rely on chaining multiple evaluation functions or writing complex expressions. Common conversion needs include:

- Converting string representations to numeric values
- Transforming timestamp formats for readability or calculations
- Converting duration strings to seconds for mathematical operations
- Normalizing memory values to consistent units
- Extracting numeric values from mixed alphanumeric strings

This creates verbose queries and increases the learning curve for users who need to remember multiple function names and their specific syntax.

## Current State

Users must currently use evaluation functions within the `eval` command to perform conversions:

```sql
source=logs | eval balance_num = tonumber(balance) | eval time_readable = strftime(_time, "%H:%M:%S")
```

Limitations of the current approach:
- Requires knowledge of specific function names scattered across documentation
- No unified interface for common conversion operations
- Difficult to apply conversions to multiple fields efficiently
- No support for specialized conversions like duration-to-seconds or memory unit normalization
- Verbose syntax for simple operations

## Long-Term Goals

- Provide a unified `convert` command that consolidates common data type conversion operations
- Simplify PPL queries by reducing the need for multiple eval statements
- Support bulk field conversions with wildcard patterns
- Enable in-place field conversion or creation of new fields via AS clause
- Maintain high performance with large datasets through optimized conversion implementations
- Ensure the solution is extensible for future conversion function additions
- Align with SQL plugin architecture and PPL command patterns
- Support 90% confidence that this addresses user conversion needs based on common log analysis patterns

## Proposal

Implement a `convert` command in PPL with the following conversion functions:

1. **Numeric Conversions**
   - `auto()` - Automatically converts fields to numbers using best-fit heuristics
   - `num()` - Converts to numbers, removing non-convertible values
   - `rmcomma()` - Removes commas from numeric strings
   - `rmunit()` - Extracts leading numbers and removes trailing text

2. **Time Conversions**
   - `ctime()` - Converts UNIX timestamps to human-readable format
   - `mktime()` - Converts human-readable time strings to epoch time
   - `mstime()` - Converts MM:SS.SSS format to seconds
   - `dur2sec()` - Converts duration format [D+]HH:MM:SS to seconds

3. **Memory Conversions**
   - `memk()` - Converts memory values (with k/m/g suffixes) to kilobytes

4. **Field Control**
   - `none()` - Excludes specific fields from wildcard conversions

## Syntax

```
... | convert [timeformat=<string>] <convert-function>(<field>) [AS <field>] [<convert-function>(<field>) [AS <field>]]...
```

## Parameters
- `timeformat`: (Optional) Output format for time conversions (used by `ctime` and `mktime`)
- `<convert-function>`: One of the conversion functions listed above
- `<field>`: Field name to convert, supports wildcards (*)
- `AS <field>`: (Optional) Create new field with converted value, preserving original

## Approach

### Phase 1: Grammar and Parser Updates

1. **Update OpenSearchPPLParser.g4**
   - Add `convertCommand` rule to `commands` section
   - Define `convertFunction` rule with all conversion function types
   - Add `convertClause` for handling multiple conversions in one command
   - Support timeformat parameter and AS clause

2. **Update OpenSearchPPLLexer.g4**
   - Add keywords: CONVERT, AUTO, CTIME, MKTIME, MSTIME, DUR2SEC, MEMK, RMCOMMA, RMUNIT, TIMEFORMAT

### Phase 2: AST Building

3. **Create AstBuilder visitor method**
   - Implement `visitConvertCommand()` in AstBuilder.java
   - Parse conversion function types and field expressions
   - Handle wildcard field patterns
   - Extract timeformat parameter and AS aliases
   - Build `Convert` UnresolvedPlan node with list of `ConvertFunction` objects

### Phase 3: Calcite UDF Implementation

4. __Create BaseConversionUDF abstract class__

   - Extend `ImplementorUDF` to integrate with Calcite
   - Provide single constructor accepting conversion method name
   - Use `ConversionImplementor` (NotNullImplementor) for code generation
   - Implement `getReturnTypeInference()` returning nullable DOUBLE type
   - Define `toDoubleOrNull()` helper for consistent null handling

5. __Implement individual UDF classes__

   - Create `AutoConvertFunction` extending BaseConversionUDF
   - Create `NumConvertFunction` extending BaseConversionUDF
   - Create `RmcommaConvertFunction` extending BaseConversionUDF
   - Create `RmunitConvertFunction` extending BaseConversionUDF
   - Each class: simple constructor calling super with method name (e.g., "autoConvert")
   - Register each function in Calcite's function repository

### Phase 4: Conversion Logic Implementation

6. __Create ConversionUtils utility class__

   - Implement static methods for each conversion function:
     - `autoConvert()` - Intelligent conversion with comma/unit handling
     - `numConvert()` - Number extraction with comma removal
     - `rmcommaConvert()` - Pure comma removal and parsing
     - `rmunitConvert()` - Leading number extraction
   - Each method signature: `public static Object methodName(Object value)`
   - Return `Double` on success, `null` on failure (consistent with eval+cast)

7. __Conversion logic details__
   - __auto()__: Uses ConversionStrategy.COMPREHENSIVE with fallback chain
   - __num()__: Uses ConversionStrategy.STANDARD with numeric focus
   - __rmcomma()__: ConversionStrategy.COMMA_ONLY - regex comma removal
   - __rmunit()__: ConversionStrategy.UNIT_ONLY - extract leading numbers

   - All use private helper methods:
     - `tryParseDouble()` - Safe Double.parseDouble with null handling
     - `isPotentiallyConvertible()` - Quick validation before conversion
     - `convertStandard()`, `convertComprehensive()`, etc. - Strategy implementations

### Phase 5: Integration and Execution

8. __Calcite integration flow__
   - Convert command parsed → AST `Convert` node created
   - Calcite planner resolves UDF function calls
   - `BaseConversionUDF.ConversionImplementor` generates execution code
   - Generated code calls `ConversionUtils.methodName()`
   - Result passed through `toDoubleOrNull()` for type safety
   - Calcite handles expression evaluation and null propagation

9. __No custom operators needed__
   - Leverage Calcite's expression evaluation engine
   - Benefits from Calcite's optimization passes
   - Integrates seamlessly with other Calcite features
   - Standard UDF pattern familiar to developers

### Phase 6: Testing and Documentation

10. __Unit tests__ (`ConversionUtilsTest.java`)
    - Test each conversion function with various inputs
    - Verify null handling for non-convertible values
    - Test edge cases (empty strings, special characters, etc.)

11. __Integration tests__
    - Test command parsing and execution
    - Test with AS clause for new field creation
    - Test wildcard patterns and multiple conversions

12. __Documentation__ (`docs/user/ppl/cmd/convert.md`)
    - Document each conversion function with examples
    - Explain null behavior on conversion failure
    - Provide usage patterns and best practices

## Examples

### Example 1: Convert all fields to numeric values
```sql
source=accounts | convert auto(*)
```
Input:
| account_number | balance    | age    |
|----------------|------------|--------|
| 1              | "39,225"   | "32"   |
| 2              | "5,686"    | "28"   |

Output:
| account_number | balance | age |
|----------------|---------|-----|
| 1              | 39225   | 32  |
| 2              | 5686    | 28  |

*Note: auto(*) automatically converts string values to numbers where possible. "39,225" becomes 39225 by removing commas, "32" becomes 32.*

### Example 2: Convert specific fields while excluding others
```sql
source=accounts | convert auto(*) none(account_id)
```
Input:
| account_number | balance    | age    | account_id |
|----------------|------------|--------|------------|
| 1              | "39,225"   | "32"   | "ACC001"   |
| 2              | "5,686"    | "28"   | "ACC002"   |

Output:
| account_number | balance | age | account_id |
|----------------|---------|-----|------------|
| 1              | 39225   | 32  | "ACC001"   |
| 2              | 5686    | 28  | "ACC002"   |

*Note: auto(*) converts balance and age (removing special characters), but none(account_id) excludes account_id from conversion, preserving it as a string.*

### Example 3: Convert duration to seconds
```sql
source=sessions | convert dur2sec(session_time) dur2sec(login_duration)
```
Input:
| user_id | session_time | login_duration |
|---------|--------------|----------------|
| user1   | "00:15:30"   | "00:05:45"     |
| user2   | "01:20:15"   | "00:10:30"     |

Output:
| user_id | session_time | login_duration |
|---------|--------------|----------------|
| user1   | 930          | 345            |
| user2   | 4815         | 630            |

*Note: dur2sec() converts HH:MM:SS format to total seconds. 00:15:30 = (0×3600) + (15×60) + 30 = 930 seconds, 00:05:45 = 345 seconds.*

### Example 4: Convert with extended duration format (days + time)
```sql
source=logs | convert dur2sec(delay)
```
Input:
| request_id | delay          |
|------------|----------------|
| req1       | "1+02:10:15"   |
| req2       | "00:05:30"     |
| req3       | "3+00:02:45"   |

Output:
| request_id | delay  |
|------------|--------|
| req1       | 94215  |
| req2       | 330    |
| req3       | 259365 |

*Note: dur2sec() handles extended format D+HH:MM:SS where D+ represents days. 1+02:10:15 = (1×86400) + (2×3600) + (10×60) + 15 = 94215 seconds.*

### Example 5: Remove units from field values
```sql
source=metrics | convert rmunit(duration)
```
Input:
| metric_id | duration       |
|-----------|----------------|
| m1        | "212 sec"      |
| m2        | "450 minutes"  |
| m3        | "30 hours"     |

Output:
| metric_id | duration |
|-----------|----------|
| m1        | 212      |
| m2        | 450      |
| m3        | 30       |

*Note: rmunit() extracts numbers from the beginning of values and removes trailing text. "212 sec" becomes 212, "450 minutes" becomes 450.*

### Example 6: Convert memory values to kilobytes
```sql
source=processes | convert memk(memory_usage)
```
Input:
| process_id | memory_usage |
|------------|--------------|
| p1         | "512m"       |
| p2         | "1g"         |
| p3         | "256k"       |

Output:
| process_id | memory_usage |
|------------|--------------|
| p1         | 524288       |
| p2         | 1048576      |
| p3         | 256          |

*Note: memk() converts memory units to kilobytes. "512m" = 512×1024 = 524288 KB, "1g" = 1×1024×1024 = 1048576 KB, "256k" = 256 KB.*

### Example 7: Convert UNIX timestamp to readable format
```sql
source=events | convert timeformat="%H:%M:%S" ctime(unix_time) AS readable_time
```
Input:
| event_id | unix_time  |
|----------|------------|
| e1       | 1522196414 |
| e2       | 1522196465 |
| e3       | 1522196823 |

Output:
| event_id | unix_time  | readable_time |
|----------|------------|---------------|
| e1       | 1522196414 | "00:20:14"    |
| e2       | 1522196465 | "00:21:05"    |
| e3       | 1522196823 | "00:27:03"    |

*Note: ctime() converts UNIX timestamps to human-readable format. timeformat="%H:%M:%S" shows only hours:minutes:seconds. Original unix_time field is preserved with AS clause.*

### Example 8: Convert MM:SS.SSS to seconds
```sql
source=responses | convert mstime(response_time)
```
Input:
| request_id | response_time |
|------------|---------------|
| r1         | "02:15.500"   |
| r2         | "01:30.250"   |
| r3         | "03:45.750"   |

Output:
| request_id | response_time |
|------------|---------------|
| r1         | 135.5         |
| r2         | 90.25         |
| r3         | 225.75        |

*Note: mstime() converts MM:SS.SSS to total seconds. "02:15.500" = (2×60) + 15.5 = 135.5 seconds.*

### Example 9: Remove commas from numeric values
```sql
source=accounts | convert rmcomma(balance)
```
Input:
| account_id | balance      |
|------------|--------------|
| acc1       | "39,225.50"  |
| acc2       | "5,686.75"   |
| acc3       | "32,838.25"  |

Output:
| account_id | balance   |
|------------|-----------|
| acc1       | 39225.50  |
| acc2       | 5686.75   |
| acc3       | 32838.25  |

*Note: rmcomma() removes all commas from values. "39,225.50" becomes 39225.50 for proper numeric calculations.*

### Example 10: Multiple conversions in one command
```sql
source=data | convert rmcomma(balance) auto(age) memk(memory) rmunit(duration)
```
Input:
| id | balance    | age    | memory | duration       |
|----|------------|--------|--------|----------------|
| 1  | "39,225"   | "32"   | "512m" | "45 minutes"   |
| 2  | "5,686"    | "28"   | "1g"   | "30 seconds"   |

Output:
| id | balance | age | memory  | duration |
|----|---------|-----|---------|----------|
| 1  | 39225   | 32  | 524288  | 45       |
| 2  | 5686    | 28  | 1048576 | 30       |

*Note: Multiple convert functions can be used in one command. rmcomma() removes commas, auto() converts to numbers, memk() converts to KB, rmunit() removes text.*

### Example 11: Create new fields with AS clause
```sql
source=logs | convert dur2sec(time_elapsed) AS time_seconds | stats sum(time_seconds) by user_id
```
Input:
| user_id | time_elapsed |
|---------|--------------|
| user1   | "01:30:45"   |
| user2   | "02:15:30"   |
| user3   | "00:45:15"   |

After convert:
| user_id | time_elapsed | time_seconds |
|---------|--------------|--------------|
| user1   | "01:30:45"   | 5445         |
| user2   | "02:15:30"   | 8130         |
| user3   | "00:45:15"   | 2715         |

*Note: dur2sec() converts HH:MM:SS to seconds, then the AS clause creates a new field time_seconds while preserving the original time_elapsed field. The converted values can then be used in aggregations like stats sum().*

## Alternative

Temporary workarounds users can employ:
1. Continue using `eval` with `tonumber()`, `tostring()`, and other existing functions
2. Pre-process data during ingestion to normalize formats
3. Create custom UDFs for specific conversion needs

However, these alternatives:
- Don't provide the unified interface that reduces cognitive load
- Require more verbose queries
- Don't cover all conversion scenarios (e.g., dur2sec, memk)
- Don't support efficient bulk conversions with wildcards

## Implementation Discussion

1. **Phased Implementation**: Should we implement all 10 functions in the initial release, or prioritize based on usage patterns?
   - Recommendation: Start with numeric conversions (auto, num, rmcomma, rmunit) and time conversions (ctime, dur2sec) as these cover 80% of use cases

2. **Error Handling**: How should the command handle conversion failures?
   - Option A: Silently leave unconvertible values unchanged
   - Option B: Set failed conversions to null (matches eval behavior) ✓
   - Option C: Provide warning logs for troubleshooting
   - Recommendation: Option B for consistency with existing PPL behavior

3. **Performance Optimization**: For wildcard conversions, should we:
   - Analyze all field values to determine convertibility before applying?
   - Apply conversion attempts and roll back if too many failures occur?
   - Recommendation: Use sampling approach - analyze first N rows to determine field convertibility

4. **Timeformat Default**: Should we use locale-aware defaults or fixed format?
   - Recommendation: Use ISO 8601 format (%Y-%m-%d %H:%M:%S) as default for international compatibility

5. **Extensibility**: Should we design an interface for adding custom conversion functions?
   - Recommendation: No interface needed ( BaseConversionUDF abstract class is sufficient)

6. **Wildcard Performance**: How to optimize wildcard matching with large field counts?
   - Recommendation: Build field pattern cache at planning time, apply conversions in streaming fashion

7. **AS Clause Scoping**: When using AS clause, should original field remain accessible downstream?
   - Recommendation: Yes, both original and converted fields should be available unless explicitly dropped

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] PPL `convert` Command #5001

Problem Statement

Current State

Long-Term Goals

Proposal

Syntax

Parameters

Approach

Phase 1: Grammar and Parser Updates

Phase 2: AST Building

Phase 3: Calcite UDF Implementation

Phase 4: Conversion Logic Implementation

Phase 5: Integration and Execution

Phase 6: Testing and Documentation

Examples

Example 1: Convert all fields to numeric values

Example 2: Convert specific fields while excluding others

Example 3: Convert duration to seconds

Example 4: Convert with extended duration format (days + time)

Example 5: Remove units from field values

Example 6: Convert memory values to kilobytes

Example 7: Convert UNIX timestamp to readable format

Example 8: Convert MM:SS.SSS to seconds

Example 9: Remove commas from numeric values

Example 10: Multiple conversions in one command

Example 11: Create new fields with AS clause

Alternative

Implementation Discussion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

event_id	unix_time	readable_time
e1	1522196414	"00:20:14"
e2	1522196465	"00:21:05"
e3	1522196823	"00:27:03"

[RFC] PPL convert Command #5001

Description

Problem Statement

Current State

Long-Term Goals

Proposal

Syntax

Parameters

Approach

Phase 1: Grammar and Parser Updates

Phase 2: AST Building

Phase 3: Calcite UDF Implementation

Phase 4: Conversion Logic Implementation

Phase 5: Integration and Execution

Phase 6: Testing and Documentation

Examples

Example 1: Convert all fields to numeric values

Example 2: Convert specific fields while excluding others

Example 3: Convert duration to seconds

Example 4: Convert with extended duration format (days + time)

Example 5: Remove units from field values

Example 6: Convert memory values to kilobytes

Example 7: Convert UNIX timestamp to readable format

Example 8: Convert MM:SS.SSS to seconds

Example 9: Remove commas from numeric values

Example 10: Multiple conversions in one command

Example 11: Create new fields with AS clause

Alternative

Implementation Discussion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[RFC] PPL `convert` Command #5001