Problem Statement
PPL currently lacks a unified command for converting field values between different data types and formats. Users frequently need to convert data for analysis, calculations, and reporting purposes, but must rely on chaining multiple evaluation functions or writing complex expressions. Common conversion needs include:
- Converting string representations to numeric values
- Transforming timestamp formats for readability or calculations
- Converting duration strings to seconds for mathematical operations
- Normalizing memory values to consistent units
- Extracting numeric values from mixed alphanumeric strings
This creates verbose queries and increases the learning curve for users who need to remember multiple function names and their specific syntax.
Current State
Users must currently use evaluation functions within the eval command to perform conversions:
source=logs | eval balance_num = tonumber(balance) | eval time_readable = strftime(_time, "%H:%M:%S")
Limitations of the current approach:
- Requires knowledge of specific function names scattered across documentation
- No unified interface for common conversion operations
- Difficult to apply conversions to multiple fields efficiently
- No support for specialized conversions like duration-to-seconds or memory unit normalization
- Verbose syntax for simple operations
Long-Term Goals
- Provide a unified
convert command that consolidates common data type conversion operations
- Simplify PPL queries by reducing the need for multiple eval statements
- Support bulk field conversions with wildcard patterns
- Enable in-place field conversion or creation of new fields via AS clause
- Maintain high performance with large datasets through optimized conversion implementations
- Ensure the solution is extensible for future conversion function additions
- Align with SQL plugin architecture and PPL command patterns
- Support 90% confidence that this addresses user conversion needs based on common log analysis patterns
Proposal
Implement a convert command in PPL with the following conversion functions:
-
Numeric Conversions
auto() - Automatically converts fields to numbers using best-fit heuristics
num() - Converts to numbers, removing non-convertible values
rmcomma() - Removes commas from numeric strings
rmunit() - Extracts leading numbers and removes trailing text
-
Time Conversions
ctime() - Converts UNIX timestamps to human-readable format
mktime() - Converts human-readable time strings to epoch time
mstime() - Converts MM:SS.SSS format to seconds
dur2sec() - Converts duration format [D+]HH:MM:SS to seconds
-
Memory Conversions
memk() - Converts memory values (with k/m/g suffixes) to kilobytes
-
Field Control
none() - Excludes specific fields from wildcard conversions
Syntax
... | convert [timeformat=<string>] <convert-function>(<field>) [AS <field>] [<convert-function>(<field>) [AS <field>]]...
Parameters
timeformat: (Optional) Output format for time conversions (used by ctime and mktime)
<convert-function>: One of the conversion functions listed above
<field>: Field name to convert, supports wildcards (*)
AS <field>: (Optional) Create new field with converted value, preserving original
Approach
Phase 1: Grammar and Parser Updates
-
Update OpenSearchPPLParser.g4
- Add
convertCommand rule to commands section
- Define
convertFunction rule with all conversion function types
- Add
convertClause for handling multiple conversions in one command
- Support timeformat parameter and AS clause
-
Update OpenSearchPPLLexer.g4
- Add keywords: CONVERT, AUTO, CTIME, MKTIME, MSTIME, DUR2SEC, MEMK, RMCOMMA, RMUNIT, TIMEFORMAT
Phase 2: AST Building
- Create AstBuilder visitor method
- Implement
visitConvertCommand() in AstBuilder.java
- Parse conversion function types and field expressions
- Handle wildcard field patterns
- Extract timeformat parameter and AS aliases
- Build
Convert UnresolvedPlan node with list of ConvertFunction objects
Phase 3: Calcite UDF Implementation
-
Create BaseConversionUDF abstract class
- Extend
ImplementorUDF to integrate with Calcite
- Provide single constructor accepting conversion method name
- Use
ConversionImplementor (NotNullImplementor) for code generation
- Implement
getReturnTypeInference() returning nullable DOUBLE type
- Define
toDoubleOrNull() helper for consistent null handling
-
Implement individual UDF classes
- Create
AutoConvertFunction extending BaseConversionUDF
- Create
NumConvertFunction extending BaseConversionUDF
- Create
RmcommaConvertFunction extending BaseConversionUDF
- Create
RmunitConvertFunction extending BaseConversionUDF
- Each class: simple constructor calling super with method name (e.g., "autoConvert")
- Register each function in Calcite's function repository
Phase 4: Conversion Logic Implementation
-
Create ConversionUtils utility class
- Implement static methods for each conversion function:
autoConvert() - Intelligent conversion with comma/unit handling
numConvert() - Number extraction with comma removal
rmcommaConvert() - Pure comma removal and parsing
rmunitConvert() - Leading number extraction
- Each method signature:
public static Object methodName(Object value)
- Return
Double on success, null on failure (consistent with eval+cast)
-
Conversion logic details
-
auto(): Uses ConversionStrategy.COMPREHENSIVE with fallback chain
-
num(): Uses ConversionStrategy.STANDARD with numeric focus
-
rmcomma(): ConversionStrategy.COMMA_ONLY - regex comma removal
-
rmunit(): ConversionStrategy.UNIT_ONLY - extract leading numbers
-
All use private helper methods:
tryParseDouble() - Safe Double.parseDouble with null handling
isPotentiallyConvertible() - Quick validation before conversion
convertStandard(), convertComprehensive(), etc. - Strategy implementations
Phase 5: Integration and Execution
-
Calcite integration flow
- Convert command parsed → AST
Convert node created
- Calcite planner resolves UDF function calls
BaseConversionUDF.ConversionImplementor generates execution code
- Generated code calls
ConversionUtils.methodName()
- Result passed through
toDoubleOrNull() for type safety
- Calcite handles expression evaluation and null propagation
-
No custom operators needed
- Leverage Calcite's expression evaluation engine
- Benefits from Calcite's optimization passes
- Integrates seamlessly with other Calcite features
- Standard UDF pattern familiar to developers
Phase 6: Testing and Documentation
-
Unit tests (ConversionUtilsTest.java)
- Test each conversion function with various inputs
- Verify null handling for non-convertible values
- Test edge cases (empty strings, special characters, etc.)
-
Integration tests
- Test command parsing and execution
- Test with AS clause for new field creation
- Test wildcard patterns and multiple conversions
-
Documentation (docs/user/ppl/cmd/convert.md)
- Document each conversion function with examples
- Explain null behavior on conversion failure
- Provide usage patterns and best practices
Examples
Example 1: Convert all fields to numeric values
source=accounts | convert auto(*)
Input:
| account_number |
balance |
age |
| 1 |
"39,225" |
"32" |
| 2 |
"5,686" |
"28" |
Output:
| account_number |
balance |
age |
| 1 |
39225 |
32 |
| 2 |
5686 |
28 |
Note: auto() automatically converts string values to numbers where possible. "39,225" becomes 39225 by removing commas, "32" becomes 32.*
Example 2: Convert specific fields while excluding others
source=accounts | convert auto(*) none(account_id)
Input:
| account_number |
balance |
age |
account_id |
| 1 |
"39,225" |
"32" |
"ACC001" |
| 2 |
"5,686" |
"28" |
"ACC002" |
Output:
| account_number |
balance |
age |
account_id |
| 1 |
39225 |
32 |
"ACC001" |
| 2 |
5686 |
28 |
"ACC002" |
Note: auto() converts balance and age (removing special characters), but none(account_id) excludes account_id from conversion, preserving it as a string.*
Example 3: Convert duration to seconds
source=sessions | convert dur2sec(session_time) dur2sec(login_duration)
Input:
| user_id |
session_time |
login_duration |
| user1 |
"00:15:30" |
"00:05:45" |
| user2 |
"01:20:15" |
"00:10:30" |
Output:
| user_id |
session_time |
login_duration |
| user1 |
930 |
345 |
| user2 |
4815 |
630 |
Note: dur2sec() converts HH:MM:SS format to total seconds. 00:15:30 = (0×3600) + (15×60) + 30 = 930 seconds, 00:05:45 = 345 seconds.
Example 4: Convert with extended duration format (days + time)
source=logs | convert dur2sec(delay)
Input:
| request_id |
delay |
| req1 |
"1+02:10:15" |
| req2 |
"00:05:30" |
| req3 |
"3+00:02:45" |
Output:
| request_id |
delay |
| req1 |
94215 |
| req2 |
330 |
| req3 |
259365 |
Note: dur2sec() handles extended format D+HH:MM:SS where D+ represents days. 1+02:10:15 = (1×86400) + (2×3600) + (10×60) + 15 = 94215 seconds.
Example 5: Remove units from field values
source=metrics | convert rmunit(duration)
Input:
| metric_id |
duration |
| m1 |
"212 sec" |
| m2 |
"450 minutes" |
| m3 |
"30 hours" |
Output:
| metric_id |
duration |
| m1 |
212 |
| m2 |
450 |
| m3 |
30 |
Note: rmunit() extracts numbers from the beginning of values and removes trailing text. "212 sec" becomes 212, "450 minutes" becomes 450.
Example 6: Convert memory values to kilobytes
source=processes | convert memk(memory_usage)
Input:
| process_id |
memory_usage |
| p1 |
"512m" |
| p2 |
"1g" |
| p3 |
"256k" |
Output:
| process_id |
memory_usage |
| p1 |
524288 |
| p2 |
1048576 |
| p3 |
256 |
Note: memk() converts memory units to kilobytes. "512m" = 512×1024 = 524288 KB, "1g" = 1×1024×1024 = 1048576 KB, "256k" = 256 KB.
Example 7: Convert UNIX timestamp to readable format
source=events | convert timeformat="%H:%M:%S" ctime(unix_time) AS readable_time
Input:
| event_id |
unix_time |
| e1 |
1522196414 |
| e2 |
1522196465 |
| e3 |
1522196823 |
Output:
| event_id |
unix_time |
readable_time |
| e1 |
1522196414 |
"00:20:14" |
| e2 |
1522196465 |
"00:21:05" |
| e3 |
1522196823 |
"00:27:03" |
Note: ctime() converts UNIX timestamps to human-readable format. timeformat="%H:%M:%S" shows only hours:minutes:seconds. Original unix_time field is preserved with AS clause.
Example 8: Convert MM:SS.SSS to seconds
source=responses | convert mstime(response_time)
Input:
| request_id |
response_time |
| r1 |
"02:15.500" |
| r2 |
"01:30.250" |
| r3 |
"03:45.750" |
Output:
| request_id |
response_time |
| r1 |
135.5 |
| r2 |
90.25 |
| r3 |
225.75 |
Note: mstime() converts MM:SS.SSS to total seconds. "02:15.500" = (2×60) + 15.5 = 135.5 seconds.
Example 9: Remove commas from numeric values
source=accounts | convert rmcomma(balance)
Input:
| account_id |
balance |
| acc1 |
"39,225.50" |
| acc2 |
"5,686.75" |
| acc3 |
"32,838.25" |
Output:
| account_id |
balance |
| acc1 |
39225.50 |
| acc2 |
5686.75 |
| acc3 |
32838.25 |
Note: rmcomma() removes all commas from values. "39,225.50" becomes 39225.50 for proper numeric calculations.
Example 10: Multiple conversions in one command
source=data | convert rmcomma(balance) auto(age) memk(memory) rmunit(duration)
Input:
| id |
balance |
age |
memory |
duration |
| 1 |
"39,225" |
"32" |
"512m" |
"45 minutes" |
| 2 |
"5,686" |
"28" |
"1g" |
"30 seconds" |
Output:
| id |
balance |
age |
memory |
duration |
| 1 |
39225 |
32 |
524288 |
45 |
| 2 |
5686 |
28 |
1048576 |
30 |
Note: Multiple convert functions can be used in one command. rmcomma() removes commas, auto() converts to numbers, memk() converts to KB, rmunit() removes text.
Example 11: Create new fields with AS clause
source=logs | convert dur2sec(time_elapsed) AS time_seconds | stats sum(time_seconds) by user_id
Input:
| user_id |
time_elapsed |
| user1 |
"01:30:45" |
| user2 |
"02:15:30" |
| user3 |
"00:45:15" |
After convert:
| user_id |
time_elapsed |
time_seconds |
| user1 |
"01:30:45" |
5445 |
| user2 |
"02:15:30" |
8130 |
| user3 |
"00:45:15" |
2715 |
Note: dur2sec() converts HH:MM:SS to seconds, then the AS clause creates a new field time_seconds while preserving the original time_elapsed field. The converted values can then be used in aggregations like stats sum().
Alternative
Temporary workarounds users can employ:
- Continue using
eval with tonumber(), tostring(), and other existing functions
- Pre-process data during ingestion to normalize formats
- Create custom UDFs for specific conversion needs
However, these alternatives:
- Don't provide the unified interface that reduces cognitive load
- Require more verbose queries
- Don't cover all conversion scenarios (e.g., dur2sec, memk)
- Don't support efficient bulk conversions with wildcards
Implementation Discussion
-
Phased Implementation: Should we implement all 10 functions in the initial release, or prioritize based on usage patterns?
- Recommendation: Start with numeric conversions (auto, num, rmcomma, rmunit) and time conversions (ctime, dur2sec) as these cover 80% of use cases
-
Error Handling: How should the command handle conversion failures?
- Option A: Silently leave unconvertible values unchanged
- Option B: Set failed conversions to null (matches eval behavior) ✓
- Option C: Provide warning logs for troubleshooting
- Recommendation: Option B for consistency with existing PPL behavior
-
Performance Optimization: For wildcard conversions, should we:
- Analyze all field values to determine convertibility before applying?
- Apply conversion attempts and roll back if too many failures occur?
- Recommendation: Use sampling approach - analyze first N rows to determine field convertibility
-
Timeformat Default: Should we use locale-aware defaults or fixed format?
- Recommendation: Use ISO 8601 format (%Y-%m-%d %H:%M:%S) as default for international compatibility
-
Extensibility: Should we design an interface for adding custom conversion functions?
- Recommendation: No interface needed ( BaseConversionUDF abstract class is sufficient)
-
Wildcard Performance: How to optimize wildcard matching with large field counts?
- Recommendation: Build field pattern cache at planning time, apply conversions in streaming fashion
-
AS Clause Scoping: When using AS clause, should original field remain accessible downstream?
- Recommendation: Yes, both original and converted fields should be available unless explicitly dropped
Problem Statement
PPL currently lacks a unified command for converting field values between different data types and formats. Users frequently need to convert data for analysis, calculations, and reporting purposes, but must rely on chaining multiple evaluation functions or writing complex expressions. Common conversion needs include:
This creates verbose queries and increases the learning curve for users who need to remember multiple function names and their specific syntax.
Current State
Users must currently use evaluation functions within the
evalcommand to perform conversions:Limitations of the current approach:
Long-Term Goals
convertcommand that consolidates common data type conversion operationsProposal
Implement a
convertcommand in PPL with the following conversion functions:Numeric Conversions
auto()- Automatically converts fields to numbers using best-fit heuristicsnum()- Converts to numbers, removing non-convertible valuesrmcomma()- Removes commas from numeric stringsrmunit()- Extracts leading numbers and removes trailing textTime Conversions
ctime()- Converts UNIX timestamps to human-readable formatmktime()- Converts human-readable time strings to epoch timemstime()- Converts MM:SS.SSS format to secondsdur2sec()- Converts duration format [D+]HH:MM:SS to secondsMemory Conversions
memk()- Converts memory values (with k/m/g suffixes) to kilobytesField Control
none()- Excludes specific fields from wildcard conversionsSyntax
Parameters
timeformat: (Optional) Output format for time conversions (used byctimeandmktime)<convert-function>: One of the conversion functions listed above<field>: Field name to convert, supports wildcards (*)AS <field>: (Optional) Create new field with converted value, preserving originalApproach
Phase 1: Grammar and Parser Updates
Update OpenSearchPPLParser.g4
convertCommandrule tocommandssectionconvertFunctionrule with all conversion function typesconvertClausefor handling multiple conversions in one commandUpdate OpenSearchPPLLexer.g4
Phase 2: AST Building
visitConvertCommand()in AstBuilder.javaConvertUnresolvedPlan node with list ofConvertFunctionobjectsPhase 3: Calcite UDF Implementation
Create BaseConversionUDF abstract class
ImplementorUDFto integrate with CalciteConversionImplementor(NotNullImplementor) for code generationgetReturnTypeInference()returning nullable DOUBLE typetoDoubleOrNull()helper for consistent null handlingImplement individual UDF classes
AutoConvertFunctionextending BaseConversionUDFNumConvertFunctionextending BaseConversionUDFRmcommaConvertFunctionextending BaseConversionUDFRmunitConvertFunctionextending BaseConversionUDFPhase 4: Conversion Logic Implementation
Create ConversionUtils utility class
autoConvert()- Intelligent conversion with comma/unit handlingnumConvert()- Number extraction with comma removalrmcommaConvert()- Pure comma removal and parsingrmunitConvert()- Leading number extractionpublic static Object methodName(Object value)Doubleon success,nullon failure (consistent with eval+cast)Conversion logic details
auto(): Uses ConversionStrategy.COMPREHENSIVE with fallback chain
num(): Uses ConversionStrategy.STANDARD with numeric focus
rmcomma(): ConversionStrategy.COMMA_ONLY - regex comma removal
rmunit(): ConversionStrategy.UNIT_ONLY - extract leading numbers
All use private helper methods:
tryParseDouble()- Safe Double.parseDouble with null handlingisPotentiallyConvertible()- Quick validation before conversionconvertStandard(),convertComprehensive(), etc. - Strategy implementationsPhase 5: Integration and Execution
Calcite integration flow
Convertnode createdBaseConversionUDF.ConversionImplementorgenerates execution codeConversionUtils.methodName()toDoubleOrNull()for type safetyNo custom operators needed
Phase 6: Testing and Documentation
Unit tests (
ConversionUtilsTest.java)Integration tests
Documentation (
docs/user/ppl/cmd/convert.md)Examples
Example 1: Convert all fields to numeric values
Input:
Output:
Note: auto() automatically converts string values to numbers where possible. "39,225" becomes 39225 by removing commas, "32" becomes 32.*
Example 2: Convert specific fields while excluding others
Input:
Output:
Note: auto() converts balance and age (removing special characters), but none(account_id) excludes account_id from conversion, preserving it as a string.*
Example 3: Convert duration to seconds
Input:
Output:
Note: dur2sec() converts HH:MM:SS format to total seconds. 00:15:30 = (0×3600) + (15×60) + 30 = 930 seconds, 00:05:45 = 345 seconds.
Example 4: Convert with extended duration format (days + time)
Input:
Output:
Note: dur2sec() handles extended format D+HH:MM:SS where D+ represents days. 1+02:10:15 = (1×86400) + (2×3600) + (10×60) + 15 = 94215 seconds.
Example 5: Remove units from field values
Input:
Output:
Note: rmunit() extracts numbers from the beginning of values and removes trailing text. "212 sec" becomes 212, "450 minutes" becomes 450.
Example 6: Convert memory values to kilobytes
Input:
Output:
Note: memk() converts memory units to kilobytes. "512m" = 512×1024 = 524288 KB, "1g" = 1×1024×1024 = 1048576 KB, "256k" = 256 KB.
Example 7: Convert UNIX timestamp to readable format
Input:
Output:
Note: ctime() converts UNIX timestamps to human-readable format. timeformat="%H:%M:%S" shows only hours:minutes:seconds. Original unix_time field is preserved with AS clause.
Example 8: Convert MM:SS.SSS to seconds
Input:
Output:
Note: mstime() converts MM:SS.SSS to total seconds. "02:15.500" = (2×60) + 15.5 = 135.5 seconds.
Example 9: Remove commas from numeric values
Input:
Output:
Note: rmcomma() removes all commas from values. "39,225.50" becomes 39225.50 for proper numeric calculations.
Example 10: Multiple conversions in one command
Input:
Output:
Note: Multiple convert functions can be used in one command. rmcomma() removes commas, auto() converts to numbers, memk() converts to KB, rmunit() removes text.
Example 11: Create new fields with AS clause
Input:
After convert:
Note: dur2sec() converts HH:MM:SS to seconds, then the AS clause creates a new field time_seconds while preserving the original time_elapsed field. The converted values can then be used in aggregations like stats sum().
Alternative
Temporary workarounds users can employ:
evalwithtonumber(),tostring(), and other existing functionsHowever, these alternatives:
Implementation Discussion
Phased Implementation: Should we implement all 10 functions in the initial release, or prioritize based on usage patterns?
Error Handling: How should the command handle conversion failures?
Performance Optimization: For wildcard conversions, should we:
Timeformat Default: Should we use locale-aware defaults or fixed format?
Extensibility: Should we design an interface for adding custom conversion functions?
Wildcard Performance: How to optimize wildcard matching with large field counts?
AS Clause Scoping: When using AS clause, should original field remain accessible downstream?