Skip to content

Clp integration s3 cleanup #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: main
Choose a base branch
from
Open

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Mar 17, 2025

Summary by CodeRabbit

  • New Features

    • Introduces a new optional CLP connector for advanced data sourcing and querying capabilities.
    • Enhances search functionality with a new library for efficient query execution.
    • Provides configuration settings allowing users to enable or disable the connector as needed.
  • Chores

    • Improved build configuration and dependency management to support the new features.
    • Added new CMake files for managing dependencies and integrating libraries.

Copy link

coderabbitai bot commented Mar 17, 2025

Walkthrough

This pull request introduces several new components and updates across the repository. An exception is added in the .gitignore, and a new submodule is defined in .gitmodules. Multiple CMake files are added to integrate external dependencies such as ANTLR4, msgpack-cxx, and spdlog. The main CMake configuration now includes an option to enable the CLP connector with its associated dependency management. The Makefile is updated to no longer treat warnings as errors. A comprehensive set of new source and header files is provided for the CLP connector and its search library, which includes new classes for data sourcing, query execution, and schema handling.

Changes

Files Change Summary
.gitignore, .gitmodules Added an exception in .gitignore to un-ignore third-party/antlr/antlr-4.13.1-complete.jar; added new submodule entry for velox/connectors/clp/search_lib/clp_src in .gitmodules.
CMake/ExternalAntlr4Cpp.cmake, CMake/FindANTLR.cmake, CMake/resolve_dependency_modules/msgpack-cxx.cmake, CMake/resolve_dependency_modules/spdlog.cmake New CMake modules introduced for integration of external dependencies (ANTLR4 C++ runtime, msgpack-cxx, spdlog) and to define custom macros and dependency resolution logic.
CMakeLists.txt, velox/connectors/CMakeLists.txt Updated to add the option VELOX_ENABLE_CLP_CONNECTOR with conditional logic to enable/disable the connector and resolve related dependencies (including spdlog, msgpack-cxx, absl); updated BOOST_INCLUDE_LIBRARIES to include url.
Makefile Modified the TREAT_WARNINGS_AS_ERRORS variable from a conditional assignment (?= 1) to a fixed assignment (= 0), changing the treatment of warnings during build.
velox/connectors/clp/... New files added for the CLP connector, including headers and source files for ClpColumnHandle, ClpConfig, ClpConnector (and its factory), ClpConnectorSplit, ClpDataSource, and ClpTableHandle, establishing the connector’s core functionality.
velox/connectors/clp/search_lib/... New search library integrated via files such as CMakeLists.txt, Cursor.cpp/.h, OrderedProjection.cpp/.h, and QueryRunner.cpp/.h to manage query execution and data retrieval for the CLP connector.
velox/connectors/clp/search_lib/clp_src New subproject commit reference added as a submodule, linking a specific state of the clp source code.

Sequence Diagram(s)

sequenceDiagram
    participant U as User/System
    participant CF as ClpConnectorFactory
    participant C as ClpConnector
    participant DS as ClpDataSource

    U->>CF: newConnector(id, config, executor)
    CF->>C: Instantiate ClpConnector with ClpConfig
    C->>U: Return new ClpConnector instance
    U->>C: Invoke createDataSource(...)
    C->>DS: Create and return ClpDataSource instance
Loading
sequenceDiagram
    participant DS as ClpDataSource
    participant Cur as Cursor
    participant QR as QueryRunner

    DS->>Cur: addSplit(split)
    Cur->>Cur: load_archive()
    Cur->>Cur: execute_query(query, output_columns)
    Cur->>QR: Pass query execution results
    QR->>Cur: Return data/results
    Cur->>DS: Provide fetched rows
    DS->>DS: next() returns RowVector with data
Loading

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true under in the settings.
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (42)
CMake/FindANTLR.cmake (3)

18-18: Address the TODO regarding ANTLR cleanup.

You have a TODO comment here suggesting cleanup of the ANTLR CMake files. Would you like assistance to refactor or remove outdated parts? Taking action on this can help maintain clarity and reduce technical debt.


37-54: Consider adding a fallback for ANTLR version detection failures.

When the execute_process(...) invocation fails to detect the ANTLR version, you currently terminate with an error. It can be beneficial to provide a fallback or more descriptive error handling to guide users in correcting their environment setup.


55-161: Ensure the generated ANTLR outputs don’t clutter source directories.

The macro’s logic to generate source files in antlr4cpp_generated_src under CMAKE_CURRENT_BINARY_DIR is typically good practice. Still, be sure your .gitignore excludes or manages generated files. This helps avoid accidental staging of these files and keeps the source tree clean.

velox/connectors/clp/ClpTableHandle.cpp (2)

4-6: Consider whether a pass-through implementation is necessary.

The toString() method is currently just forwarding to the parent class implementation. If no specialized behavior is needed, consider whether this method needs to be explicitly implemented at all.


8-10: Consider whether a pass-through implementation is necessary.

Similarly, the serialize() method is just calling the parent class implementation. If no specialized behavior is required, this method may be unnecessary.

velox/connectors/clp/ClpColumnHandle.h (2)

27-29: Consider whether const member variables are appropriate for all fields.

Using const member variables ensures immutability, which is good for thread safety. However, it also means these values cannot be modified after construction. Ensure this restriction aligns with the intended usage of the class.


6-30: Consider adding serialization/deserialization support.

For connector components, serialization and deserialization capabilities are often needed. Consider adding methods like serialize() and a static deserialize() method if this class needs to be transferred over the network or persisted.

velox/connectors/clp/CMakeLists.txt (1)

3-4: Ensure SOURCE_PATH_SIZE definition is properly scoped

The SOURCE_PATH_SIZE definition is added globally using add_definitions(). Consider using target_compile_definitions() to scope this definition to only the targets that need it, rather than affecting the entire build.

-string(LENGTH "${CMAKE_SOURCE_DIR}/" SOURCE_PATH_SIZE)
-add_definitions("-DSOURCE_PATH_SIZE=${SOURCE_PATH_SIZE}")
+string(LENGTH "${CMAKE_SOURCE_DIR}/" SOURCE_PATH_SIZE)
+target_compile_definitions(velox_clp_connector PRIVATE 
+  "SOURCE_PATH_SIZE=${SOURCE_PATH_SIZE}")
velox/connectors/clp/search_lib/CMakeLists.txt (1)

14-152: Consider organizing source files into logical groups with comments.

The source file list is quite extensive and would benefit from organization into logical groups with comments explaining each group's purpose (e.g., readers, writers, parsers, etc.). This would enhance maintainability and make it easier for developers to understand the codebase structure.

CMakeLists.txt (3)

126-126: CLP connector enabled by default.

The CLP connector is enabled by default (VELOX_ENABLE_CLP_CONNECTOR is set to ON), which might increase build times and dependencies for users who don't need this connector. Consider setting it to OFF by default to follow the pattern of other optional connectors like S3, GCS, and ABFS.

-option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." ON)
+option(VELOX_ENABLE_CLP_CONNECTOR "Build CLP connector." OFF)

489-501: Inconsistent code formatting in conditional blocks.

The conditional blocks for CLP connector dependencies have inconsistent spacing. There's a space after "if" but before the opening parenthesis, which differs from the coding style used in the rest of the file.

-if (${VELOX_ENABLE_CLP_CONNECTOR})
+if(${VELOX_ENABLE_CLP_CONNECTOR})

Similar fix should be applied to the second conditional block as well.


489-496: Consider adding explicit version requirements for all external dependencies.

While spdlog has a version requirement (1.12.0), other dependencies like msgpack-cxx don't have explicit version requirements. This could lead to compatibility issues if different versions are used.

  velox_set_source(msgpack-cxx)
-  velox_resolve_dependency(msgpack-cxx)
+  velox_resolve_dependency(msgpack-cxx 4.1.1)
velox/connectors/clp/ClpTableHandle.h (2)

8-14: Inconsistent parameter handling in constructor.

The constructor takes tableName as a const reference but stores a copy, while query is taken as a shared_ptr and moved. Consider consistent parameter handling, especially if you're taking ownership of both parameters.

  ClpTableHandle(
      std::string connectorId,
-      const std::string& tableName,
+      std::string tableName,
      std::shared_ptr<std::string> query)
      : ConnectorTableHandle(connectorId),
-        tableName_(tableName),
+        tableName_(std::move(tableName)),
        query_(std::move(query)) {}

16-22: Unnecessary nodiscard attribute on accessor methods.

Using the [[nodiscard]] attribute on simple getters is unnecessary as they don't produce results that should be used. This attribute is more appropriate for operations that return a value that should not be ignored.

-  [[nodiscard]] const std::string& tableName() const {
+  const std::string& tableName() const {
    return tableName_;
  }

-  [[nodiscard]] const std::shared_ptr<std::string>& query() const {
+  const std::shared_ptr<std::string>& query() const {
    return query_;
  }
CMake/ExternalAntlr4Cpp.cmake (1)

135-138: Consider enabling C++ standard configuration.

The file has commented-out options for setting the C++ standard for ANTLR4. Consider uncommenting one of these lines to ensure compatibility with the project's C++ standard (C++17).

            -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
            -DWITH_STATIC_CRT:BOOL=${ANTLR4_WITH_STATIC_CRT}
            -DDISABLE_WARNINGS:BOOL=ON
            # -DCMAKE_CXX_STANDARD:STRING=17 # if desired, compile the runtime with a different C++ standard
-            # -DCMAKE_CXX_STANDARD:STRING=${CMAKE_CXX_STANDARD} # alternatively, compile the runtime with the same C++ standard as the outer project
+            -DCMAKE_CXX_STANDARD:STRING=${CMAKE_CXX_STANDARD} # compile the runtime with the same C++ standard as the outer project
            INSTALL_COMMAND ""
            EXCLUDE_FROM_ALL 1)
velox/connectors/clp/ClpConnector.cpp (3)

5-11: Ensure executor pointer validity.
The constructor sets executor_ but does not verify its validity. Consider adding a check or defensive code in case executor is null.


13-26: Consider error handling for invalid column handles.
While this method currently relies on ClpDataSource for validation, a pre-check here could produce more targeted error messages for the caller.


28-34: Implement or track unimplemented data sink functionality.
createDataSink is currently unimplemented. Ensure you have a plan or placeholder test to track this missing functionality.

Do you want to create a stub that returns a descriptive error to alert calling code?

velox/connectors/clp/ClpDataSource.cpp (2)

15-92: Use consistent logging instead of std::cout.
Line 128 prints with std::cout, which may not align with Velox logging conventions. Consider replacing with VELOX_INFO, spdlog, or another consistent logging mechanism.


94-112: Validate split details further.
The logic branches on inputSource_ but does not safeguard archiveDir_ or archiveId. If they are empty or invalid, the cursor might not initialize properly.

velox/connectors/clp/search_lib/OrderedProjection.cpp (2)

32-61: Confirm column type coverage.
The switch statement handles the known ColumnType cases. If new types are introduced later, ensure they are also supported in this method to prevent runtime exceptions.


70-128: Improve diagnostic details for partial matches.
When a column partially matches a path and then fails later tokens, there is no detailed log or user-facing error. Consider logging or throwing a specialized error to aid debugging in complex schemas.

velox/connectors/clp/search_lib/Cursor.cpp (4)

16-18: Discourage usage of using namespace in the implementation file
While convenient, this can introduce name collisions or confusion about symbol origins in larger codebases. Consider using fully qualified names to improve maintainability and clarity.


75-75: Address the TODO comment
Line 75 references a TODO: fix it. Leaving it unresolved might create confusion for future maintainers. Would you like assistance in opening an issue to track this item or proposing an immediate fix?


153-174: Consider adding dedicated unit tests for query preprocessing
The parsing and logical transformations here are non-trivial. Adding targeted tests will enable early detection of syntax or logical errors and ensure robustness.


265-334: Refactor duplicated fetch_next methods
Both overloads share similar logic for cycling through archives and schemas. Consider introducing a common helper function or reorganizing the flow to reduce code duplication and simplify maintenance.

Also applies to: 336-405

velox/connectors/clp/ClpConfig.h (1)

1-1: Apply required formatting
The pipeline indicates formatting issues at line 1. Kindly run the project’s standard formatting tool (e.g., clang-format) to resolve these warnings.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/search_lib/OrderedProjection.h (2)

1-1: Run code formatting on file
The pipeline has flagged format issues. Running the consistent formatting tool will keep code style uniform and resolve any spacing or alignment discrepancies.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.


11-19: Improve error messages in OperationFailed
Providing clearer error messages (e.g., including a descriptive string) facilitates easier troubleshooting. Consider augmenting the exception with contextual details about the operation that failed.

velox/connectors/clp/ClpDataSource.h (3)

31-40: Clarify concurrency behaviour for split addition.

The addSplit method is declared, but the concurrency model is unclear. If multiple threads may invoke addSplit, consider thread-safety measures (e.g. locking a mutex) to avoid race conditions.


42-53: Consider atomic or synchronized counters.

getCompletedBytes() and getCompletedRows() return totals from private members that could be updated concurrently if this class is used in a multi-threaded environment. Use atomic operations or appropriate locking for reliable concurrency support.


55-71: Use of Boost.Filesystem instead of std::filesystem.

C++17 and later provide <filesystem> in the standard library. Depending on your minimum compiler requirements, consider switching to std::filesystem to reduce external dependencies.

velox/connectors/clp/search_lib/Cursor.h (3)

1-16: Harmonize ingestion of standard and library headers.

Ensure that your inclusion of headers like <optional>, <vector>, and <string> is consistent across the codebase. This helps maintain clarity and consistency for future maintainers.


19-28: Extend error codes for thorough troubleshooting.

The ErrorCode enum might benefit from additional entries to capture different granular errors (e.g. network failures) if your system can encounter them. This would facilitate more refined error handling in the future.


30-35: Document the Field structure.

Consider including brief documentation for what each field member represents and how to interpret ColumnType. This helps clarify usage for new contributors or future references.

velox/connectors/clp/ClpConnector.h (3)

1-5: Apply code formatting to align with project standards.

The pipeline has flagged format issues in this file. Please run the recommended code style tool or apply the provided patch to resolve these warnings.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.


14-18: Provide rationale for excluding dynamic filters.

canAddDynamicFilter() always returns false. If dynamic filters are valuable in the future, consider adding references or comments either explaining the limitation or pointing to JIRA tasks for potential implementation.


35-40: Ensure the data sink is implemented or intentionally left out.

createDataSink is declared but may not be fully implemented if CLP is read-only. If so, consider clarifying this with either a comment or a VELOX_NYI.

Would you like me to add a placeholder for createDataSink with a VELOX_NYI or a comment indicating that data insertion is not supported?

velox/connectors/clp/search_lib/QueryRunner.h (1)

78-80: Empty init method implementation.
It appears that this init method is intentionally left empty. If it is a placeholder, consider adding a TODO comment or removing it if it is no longer needed.

velox/connectors/clp/search_lib/QueryRunner.cpp (3)

108-127: Suspicious debugging code at line 118.
Inside the fetch_next method with VectorPtrs, the statement int a = 1; does nothing. This might be leftover debug code. Consider removing it to avoid confusion.

-        int a = 1;

278-352: Use false == ret instead of !ret for consistency with coding guidelines.
On line 344, the code uses !ret. The guidelines recommend preferring false == ret rather than !ret. Updating this would maintain style consistency.

-    ret = cur->is_inverted() ? !ret : ret;
+    ret = cur->is_inverted() ? (false == ret) : ret;

354-412: Use const reference to avoid copying map entries.
Within the loop on line 377 (for (auto entry : m_datestring_readers)), you are copying the map entries rather than referencing them. Using a const reference might improve efficiency and clarity.

-for (auto entry : m_datestring_readers) {
+for (auto const& entry : m_datestring_readers) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2229ed and 4b74119.

⛔ Files ignored due to path filters (1)
  • third-party/antlr/antlr-4.13.1-complete.jar is excluded by !**/*.jar
📒 Files selected for processing (27)
  • .gitignore (1 hunks)
  • .gitmodules (1 hunks)
  • CMake/ExternalAntlr4Cpp.cmake (1 hunks)
  • CMake/FindANTLR.cmake (1 hunks)
  • CMake/resolve_dependency_modules/msgpack-cxx.cmake (1 hunks)
  • CMake/resolve_dependency_modules/spdlog.cmake (1 hunks)
  • CMakeLists.txt (4 hunks)
  • Makefile (1 hunks)
  • velox/connectors/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/ClpColumnHandle.h (1 hunks)
  • velox/connectors/clp/ClpConfig.h (1 hunks)
  • velox/connectors/clp/ClpConnector.cpp (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpConnectorSplit.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/ClpDataSource.h (1 hunks)
  • velox/connectors/clp/ClpTableHandle.cpp (1 hunks)
  • velox/connectors/clp/ClpTableHandle.h (1 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/Cursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/Cursor.h (1 hunks)
  • velox/connectors/clp/search_lib/OrderedProjection.cpp (1 hunks)
  • velox/connectors/clp/search_lib/OrderedProjection.h (1 hunks)
  • velox/connectors/clp/search_lib/QueryRunner.cpp (1 hunks)
  • velox/connectors/clp/search_lib/QueryRunner.h (1 hunks)
  • velox/connectors/clp/search_lib/clp_src (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/ClpTableHandle.cpp
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/OrderedProjection.cpp
  • velox/connectors/clp/search_lib/Cursor.cpp
  • velox/connectors/clp/ClpConnector.cpp
  • velox/connectors/clp/search_lib/QueryRunner.cpp
🪛 GitHub Actions: Run Checks
velox/connectors/CMakeLists.txt

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/CMakeLists.txt

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

CMake/resolve_dependency_modules/spdlog.cmake

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

CMake/resolve_dependency_modules/msgpack-cxx.cmake

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/search_lib/CMakeLists.txt

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

CMake/ExternalAntlr4Cpp.cmake

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

CMakeLists.txt

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/ClpDataSource.cpp

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

CMake/FindANTLR.cmake

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/search_lib/OrderedProjection.h

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/ClpConfig.h

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/ClpDataSource.h

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/ClpConnector.h

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

🪛 Gitleaks (8.21.2)
velox/connectors/clp/ClpConfig.h

17-17: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🔇 Additional comments (26)
velox/connectors/clp/search_lib/clp_src (1)

1-1: Subproject Commit Reference Check

This subproject commit with hash 8fce672d4ec3a45010e3afb0f0f5b0e44d98fe04 has been added as expected. Please ensure that this commit hash accurately reflects the state of the subproject you intend to integrate with the CLP connector, and double-check that all necessary submodule updates have been applied.

.gitmodules (1)

1-3: Please ensure submodule commit pinning for predictability.

Adding a new submodule referencing an external repository is fine. However, you might consider pinning the submodule to a specific commit or tag instead of tracking a moving branch. This prevents unexpected changes in upstream code and ensures reproducible builds.

Makefile (1)

22-22: Confirm the rationale for disabling warnings as errors.

Changing TREAT_WARNINGS_AS_ERRORS to 0 suppresses potential build-time awareness of code issues. Be sure this is an intentional decision. Otherwise, you risk overlooking valid compiler warnings that may degrade code health or lead to regressions.

CMake/FindANTLR.cmake (1)

21-23: Validate the fixed OpenJDK installation path on macOS.

By force-setting JAVA_HOME to /usr/local/opt/openjdk@11/, you might encounter compatibility issues on developer machines that use a different installation path. Consider searching for the correct Java path or falling back to a default if OpenJDK 11 is not found in this exact directory.

.gitignore (1)

149-149: LGTM: The addition of the ANTLR JAR exception is appropriate.

This change allows Git to track the ANTLR JAR file while still ignoring all other JAR files, which is necessary for the CLP connector implementation and dependencies.

velox/connectors/CMakeLists.txt (1)

20-22: LGTM: The conditional inclusion of the CLP connector follows the pattern used for other connectors.

This change enables the build system to conditionally include the CLP connector based on the VELOX_ENABLE_CLP_CONNECTOR variable, which is consistent with how other connectors are integrated in this file.

velox/connectors/clp/ClpColumnHandle.h (2)

8-12: LGTM: The constructor is well-designed and initializes all member variables.

The constructor properly takes all necessary parameters and initializes the member variables in the initialization list, which is good practice.


14-24: LGTM: The accessor methods are correctly implemented.

The get methods are properly defined as const methods returning const references where appropriate.

velox/connectors/clp/ClpConnectorSplit.h (1)

1-25: Well-structured split implementation with appropriate class design

This new class follows good C++ practices with a clear constructor, proper use of const member variables, and appropriate base class initialization. The toString() method provides useful diagnostic output.

I appreciate the use of [[nodiscard]] which ensures callers don't accidentally ignore return values, and the consistent naming conventions across the class structure.

CMake/resolve_dependency_modules/spdlog.cmake (1)

1-34:

❓ Verification inconclusive

CMake dependency module correctly handles spdlog integration

The spdlog dependency is properly configured with version pinning, SHA256 checksum verification, and correct URL construction. The use of FetchContent with OVERRIDE_FIND_PACKAGE is appropriate for ensuring consistent builds.

I like that you've enabled SPDLOG_FMT_EXTERNAL to avoid potential conflicts with other fmt library versions.


🏁 Script executed:

#!/bin/bash
# Verify if the specified spdlog version is still current and appropriate
curl -s https://api.github.com/repos/gabime/spdlog/releases/latest | grep -o '"tag_name": "v[0-9.]*"' | cut -d '"' -f 4

Length of output: 129


Action Required: Confirm Intentional spdlog Version Pinning

The dependency module is correctly configured with version pinning, SHA256 checksum verification, and proper URL construction. However, our verification shows that the latest release of spdlog is v1.15.1, whereas the module currently pins to v1.12.0. Please confirm whether using v1.12.0 is intentional for stability or compatibility reasons.

  • Verify that the chosen version (v1.12.0) is deliberately retained despite the more recent upstream release.
  • Ensure that any rationale for not updating to v1.15.1 is documented if necessary.
🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/CMakeLists.txt (2)

6-11: Clean library definition with proper source files

The library is correctly defined as an OBJECT library with appropriate source files.


13-23: Complete and well-structured dependency list

The dependencies are properly specified with a clear PRIVATE designation. The mix of internal (clp-s-search, velox_connector) and external (simdjson, msgpack-cxx, spdlog, etc.) dependencies looks appropriate for this connector.

CMake/resolve_dependency_modules/msgpack-cxx.cmake (1)

1-33:

✅ Verification successful

Well-configured msgpack-cxx dependency module

The msgpack-cxx dependency is properly configured with version pinning, SHA256 checksum verification, and correct URL construction. The use of FetchContent with OVERRIDE_FIND_PACKAGE ensures consistent builds.


🏁 Script executed:

#!/bin/bash
# Verify if the specified msgpack-cxx version is still current
curl -s https://api.github.com/repos/msgpack/msgpack-c/releases/latest | grep -o '"tag_name": "[^"]*"' | cut -d '"' -f 4

Length of output: 132


Dependency Module Configuration Approved

The msgpack-cxx dependency module is correctly configured:

  • The version pinning (cpp-7.0.0) is confirmed to be current.
  • The SHA256 checksum verification and URL construction are implemented properly.
  • The use of FetchContent with OVERRIDE_FIND_PACKAGE ensures consistent builds.

No further modifications are required.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

velox/connectors/clp/search_lib/CMakeLists.txt (1)

164-164:

❓ Verification inconclusive

C++20 standard differs from project default.

The project uses C++17 as the standard (as set in the main CMakeLists.txt), but this library is configured to use C++20. Consider documenting the reason for this deviation or aligning with the project standard if there are no specific C++20 features required.


🏁 Script executed:

#!/bin/bash
# Check for C++20 specific features in the search library code
rg -l "\[\[likely\]\]|\[\[unlikely\]\]|requires|concept|constinit|consteval|using enum|co_await|co_yield|co_return|std::span|std::format" ${CLP_SRC_DIR}/clp_s/

Length of output: 195


Action Required: Verify C++20 Usage or Document Necessity

The build configuration in velox/connectors/clp/search_lib/CMakeLists.txt sets the compile standard to C++20 via

target_compile_features(clp-s-search PRIVATE cxx_std_20)

while the overall project defaults to C++17. The initial verification attempt using a regex search failed due to an incorrect directory path, so we couldn’t confirm whether any C++20-specific language features (such as [[likely]], [[unlikely]], concepts, or coroutines) are actually in use.

  • Next Steps:
    • Re-run the search within velox/connectors/clp/search_lib to verify if any C++20 features are present. For example:
      #!/bin/bash
      rg -l "\[\[likely\]\]|\[\[unlikely\]\]|requires|concept|constinit|consteval|using enum|co_await|co_yield|co_return|std::span|std::format" velox/connectors/clp/search_lib/
    • If no C++20 features are found, consider either aligning this library’s standard with the project default (C++17) or document clearly why C++20 is necessary.
    • If there are C++20 features being used, document their usage and rationale in the code or project documentation.

Please review and verify this change to ensure that the library’s use of C++20 is both necessary and well-documented.


Please run this script to verify whether any C++20-specific features are present in the search library code. This will help us decide whether to align the library’s standard with the project default (C++17) or document the need for C++20.


🏁 Script executed:

#!/bin/bash
# Re-run the search for C++20-specific features within the correct directory.
rg -l "\[\[likely\]\]|\[\[unlikely\]\]|requires|concept|constinit|consteval|using enum|co_await|co_yield|co_return|std::span|std::format" velox/connectors/clp/search_lib/

Length of output: 294


Ensure C++20 Standard Usage Is Justified

The file velox/connectors/clp/search_lib/CMakeLists.txt configures the clp-s-search target to compile with C++20, which deviates from the project’s default of C++17. The initial automated search did not return evidence of C++20-specific features in this library; however, this outcome may be inconclusive.

  • Action Items:
    • Please run the above script to verify if any C++20 features (e.g. [[likely]], concepts, coroutine keywords, etc.) are present in the library.
    • If the search remains empty, consider either aligning the library with the project default (C++17) or explicitly document the requirement for C++20 if specific features are needed.
    • Conduct a manual review if necessary to confirm that no overlooked C++20-specific code is in use.

🏁 Script executed:

#!/bin/bash
# First, verify that the directory exists.
if [ -d "velox/connectors/clp/search_lib" ]; then
  echo "Directory exists: velox/connectors/clp/search_lib"
  # Now search for C++20-specific features within this directory.
  rg -l "\[\[likely\]\]|\[\[unlikely\]\]|requires|concept|constinit|consteval|using enum|co_await|co_yield|co_return|std::span|std::format" "velox/connectors/clp/search_lib/"
else
  echo "Directory 'velox/connectors/clp/search_lib' not found."
fi

Length of output: 448


C++ Standard Discrepancy: Verify and Document Necessity

  • In velox/connectors/clp/search_lib/CMakeLists.txt, the target clp-s-search is configured to compile with C++20 via:

    target_compile_features(clp-s-search PRIVATE cxx_std_20)
  • However, the overall project uses C++17 by default.

  • A search for C++20-specific features in the velox/connectors/clp/search_lib/ directory did not return any results.

  • If no C++20 features (e.g. [[likely]], concepts, coroutines, etc.) are being used, either align this library with the project standard (C++17) or document the rationale for using C++20.

Please review and confirm the intended standard for this library.

velox/connectors/clp/ClpTableHandle.h (1)

24-26:

✅ Verification successful

Method declarations without implementation.

The toString() and serialize() methods are declared but their implementations are not visible in this file. Ensure that these are properly implemented in a corresponding .cpp file.


🏁 Script executed:

#!/bin/bash
# Check for implementations of toString and serialize methods
echo "Searching for ClpTableHandle::toString implementation:"
rg -A 3 "ClpTableHandle::toString" --type cpp velox/connectors/clp/

echo "Searching for ClpTableHandle::serialize implementation:"
rg -A 3 "ClpTableHandle::serialize" --type cpp velox/connectors/clp/

Length of output: 933


Revised Review Comment – Implementation Verified

We've verified that the implementations of the declared methods are present in the corresponding .cpp file. Specifically:

  • ClpTableHandle::toString() is implemented in velox/connectors/clp/ClpTableHandle.cpp
  • ClpTableHandle::serialize() is implemented in the same file

There are no further corrections needed based on the current state of the code.

velox/connectors/clp/ClpConnector.cpp (1)

36-36: Connector factory registration looks good.
No issues here.

velox/connectors/clp/ClpDataSource.cpp (1)

114-136: Check for data misalignment.
Vectors are created assuming the cursor returns data in the same order and size. If the cursor returns fewer columns or different ordering, the logic could break silently. Consider validating the vector layout against the fields returned from cursor_.

Are you sure that field order always matches the fields_ vector? If not, consider verifying via a test or an assertion.

velox/connectors/clp/search_lib/OrderedProjection.cpp (2)

13-30: Prevent potential double insertion.
The check in lines 21-25 avoids duplicates, which is good. Ensure that ColumnDescriptor::operator== aligns with your intended semantics (e.g., pointer equality vs. field-by-field comparisons).

Could you confirm that the equality operator checks more than just the object pointer?


63-68: Method design is clear.
The loop over m_selected_columns effectively resolves each column. No concerns here.

velox/connectors/clp/ClpConfig.h (1)

17-17: False positive potential from static analysis
The string assigned to kS3Bucket appears to be a configuration key rather than an API credential. You may safely disregard the warning or add a rule to suppress this false positive.

🧰 Tools
🪛 Gitleaks (8.21.2)

17-17: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

velox/connectors/clp/ClpDataSource.h (2)

1-11: Check the portability of POSIX headers.

The includes <unistd.h> and <sys/wait.h> are not portable across all platforms (e.g. Windows). If cross-platform builds are required, consider conditionally including these headers or using a more portable abstraction.

Would you like me to search for any usage of POSIX-specific functions within the codebase to confirm whether a platform-agnostic approach is necessary?

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.


20-30: Validate your pointer arguments.

Ensure that clpConfig, tableHandle, and columnHandles are valid references to prevent potential null pointer dereferences and runtime errors.

Consider verifying their usage in the codebase with a script scan or test coverage to confirm.

velox/connectors/clp/search_lib/Cursor.h (1)

44-81: Validate input query parameters.

Within execute_query, ensure that the received query string and output columns undergo robust validation to handle user-supplied or potentially malformed queries.

velox/connectors/clp/ClpConnector.h (2)

7-13: Confirm executor ownership.

Storing a raw pointer folly::Executor* executor suggests that the connector does not manage the executor lifecycle. Ensure that the executor remains valid for as long as this object can be used.


23-30: Validate column handles in createDataSource.

Make sure that invalid or missing column handles do not lead to undefined behaviour in dependent classes such as ClpDataSource.

velox/connectors/clp/search_lib/QueryRunner.h (1)

87-89: filter method's always-false return might be a placeholder.
This override currently always returns false, which could render the filter logic non-operational. Verify whether this is intended or a placeholder that needs further implementation.

@@ -0,0 +1,34 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Address the formatting issue flagged by pipeline

The pipeline check indicates there are formatting issues in this file. Please apply the provided formatting patch to resolve this warning.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

@@ -0,0 +1,26 @@
add_subdirectory(search_lib)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix formatting issues flagged by the pipeline

Please apply the formatting fix provided by the pipeline to resolve the formatting issues.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.


velox_include_directories(velox_clp_connector PRIVATE)

target_compile_features(velox_clp_connector PRIVATE cxx_std_20)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

C++20 standard is specified as required

Requiring C++20 standard features is appropriate, but ensure this is consistent with the rest of the Velox project's requirements.


🏁 Script executed:

#!/bin/bash
# Check if C++20 is used consistently across the project
grep -r "cxx_std_" --include="CMakeLists.txt" . | grep -v "cxx_std_20" | head -n 10

Length of output: 279


C++ Standard Inconsistency: Align Required or Justify Divergence

The velox_clp_connector now requires C++20 (line 26 in velox/connectors/clp/CMakeLists.txt), but other modules—evidenced by occurrences in velox/experimental/breeze/CMakeLists.txt and velox/experimental/breeze/test/CMakeLists.txt—are still using C++17. Please either update these modules to use C++20 to maintain consistency across the Velox project or provide a clear rationale for this divergence.

  • File: velox/connectors/clp/CMakeLists.txt (line 26; using cxx_std_20)
  • Files in velox/experimental/breeze continue with cxx_std_17

@@ -0,0 +1,33 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix formatting issues flagged by pipeline

Please apply the formatting fixes provided by the pipeline to resolve the formatting issues.

🧰 Tools
🪛 GitHub Actions: Run Checks

[warning] 1-1: Found format issues. Please apply fix using the provided patch.

${CLP_SRC_DIR}/clp_s/BufferViewReader.hpp
${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
${CLP_SRC_DIR}/clp_s/ColumnReader.hpp
${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Duplicate source file entry detected.

The source file ${CLP_SRC_DIR}/clp_s/ColumnReader.cpp is listed twice in the CLP_SRC_FILES variable (lines 23 and 25), which may cause build issues or warnings.

-    ${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
set(CLP_SRC_FILES
${CLP_SRC_DIR}/clp_s/ColumnReader.cpp # original entry on line 23
- ${CLP_SRC_DIR}/clp_s/ColumnReader.cpp # duplicate entry removed from line 25
# ... any other source files ...
)

Comment on lines +20 to +22
[[nodiscard]] const std::shared_ptr<std::string>& query() const {
return query_;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Consider using a more appropriate type for query representation.

Using a shared_ptr<std::string> for the query seems unusual. If the query is optional, consider using std::optional<std::string>. If ownership sharing is necessary, document the reason.

Also applies to: 30-30


🏁 Script executed:

#!/bin/bash
# Check how the query is used in related files
echo "Searching for query_ usage in ClpTableHandle:"
rg -A 5 "query_" --type cpp velox/connectors/clp/

echo "Searching for query() method calls:"
rg -A 5 "query\(\)" --type cpp velox/connectors/clp/

Length of output: 19091


Action Required: Query Representation in ClpTableHandle

  • In velox/connectors/clp/ClpTableHandle.h (lines 20–22), the query() method returns a std::shared_ptr<std::string> used to represent an optional query value.
  • Usage in velox/connectors/clp/ClpDataSource.cpp (and related files) checks whether the pointer is non-null before accessing its contents, which suggests that the intent is to model an optional value.
  • If the query is indeed optional, please consider replacing std::shared_ptr<std::string> with std::optional<std::string> to more clearly convey this intent. Alternatively, if shared ownership is necessary, add documentation to explain why a shared pointer was chosen over other alternatives.

Also applies to the similar snippet at line 30.

Comment on lines +29 to +33
if(NOT DEFINED ANTLR4_TAG)
# Set to branch name to keep library updated at the cost of needing to rebuild after 'clean'
# Set to commit hash to keep the build stable and does not need to rebuild after 'clean'
set(ANTLR4_TAG master)
endif()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use of master branch as default tag could lead to non-reproducible builds.

Using the master branch as the default tag for the ANTLR4 repository could lead to non-reproducible builds as the branch can change over time. Consider using a specific version tag or commit hash for better build reproducibility.

if(NOT DEFINED ANTLR4_TAG)
    # Set to branch name to keep library updated at the cost of needing to rebuild after 'clean'
    # Set to commit hash to keep the build stable and does not need to rebuild after 'clean'
-    set(ANTLR4_TAG master)
+    set(ANTLR4_TAG 4.13.1)
endif()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if(NOT DEFINED ANTLR4_TAG)
# Set to branch name to keep library updated at the cost of needing to rebuild after 'clean'
# Set to commit hash to keep the build stable and does not need to rebuild after 'clean'
set(ANTLR4_TAG master)
endif()
if(NOT DEFINED ANTLR4_TAG)
# Set to branch name to keep library updated at the cost of needing to rebuild after 'clean'
# Set to commit hash to keep the build stable and does not need to rebuild after 'clean'
set(ANTLR4_TAG 4.13.1)
endif()

Comment on lines +186 to +263
ErrorCode Cursor::execute_query(
std::string& query,
std::vector<Field>& output_columns) {
m_output_columns = output_columns;
m_query = query;
m_completed_archive_cycles = false;

m_error_code = preprocess_query();
if (m_error_code != ErrorCode::Success) {
return m_error_code;
}

m_end_archive_index = m_current_archive_index;
while (false == m_completed_archive_cycles) {
m_error_code = load_archive();
if (ErrorCode::InternalError == m_error_code) {
return m_error_code;
}

if (ErrorCode::Success != m_error_code) {
move_to_next_archive();
continue;
}

m_query_runner = std::make_shared<QueryRunner>(
m_expr,
m_schema_match,
m_ignore_case,
m_schema_map,
m_schema_tree,
m_projection,
m_var_dict,
m_log_dict);

// clear the stage from last run
m_query_runner->populate_string_queries();

// probably have another class for query evaluation and filter
while (false == m_completed_schema_cycles) {
m_current_schema_id = m_matched_schemas[m_current_schema_index];

m_query_runner->set_schema(m_current_schema_id);

m_query_runner->populate_searched_wildcard_columns();

m_expression_value = m_query_runner->constant_propagate();

if (m_expression_value != EvaluatedValue::False) {
m_query_runner->add_wildcard_columns_to_searched_columns();
if (m_archive_read_stage < ArchiveReadStage::TablesInitialized) {
m_archive_reader.open_packed_streams();
m_archive_read_stage = ArchiveReadStage::TablesInitialized;
}

auto& reader = m_archive_reader.read_schema_table(
m_current_schema_id, false, false);
reader.initialize_filter_with_column_map(m_query_runner.get());
m_error_code = ErrorCode::Success;
m_current_schema_table_loaded = true;
break;
}

m_current_schema_index =
(m_current_schema_index + 1) % m_matched_schemas.size();
m_completed_schema_cycles = m_current_schema_index == m_end_schema_index;
m_error_code = ErrorCode::DictionaryNotFound;
m_current_schema_table_loaded = false;
}

if (m_expression_value != EvaluatedValue::False) {
break;
}

move_to_next_archive();
}

return m_error_code;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Reduce the complexity of execute_query
This method covers numerous steps (parsing, loading, iteration, schema matching). Splitting it into smaller private functions can enhance readability and maintainability.

Comment on lines 82 to 141
size_t num_rows,
std::vector<facebook::velox::VectorPtr>& column_vectors);

size_t fetch_next(size_t num_rows, std::vector<ColumnData>& column_vectors);

private:
/**
* Moves to the next archive in the list of archives.
*/
void move_to_next_archive();

/**
* Preprocesses the query to.
*/
ErrorCode preprocess_query();

/**
* Loads the archive at the current index.
*/
ErrorCode load_archive();

ErrorCode m_error_code;
bool m_ignore_case;

std::string m_archive_path;
clp_s::InputSource m_input_source;
std::vector<std::string> m_archive_ids;
size_t m_current_archive_index;
size_t m_end_archive_index;
bool m_completed_archive_cycles;
std::vector<int32_t> m_matched_schemas;
size_t m_current_schema_index;
size_t m_end_schema_index;
bool m_completed_schema_cycles;
int32_t m_current_schema_id;

ArchiveReadStage m_archive_read_stage;

std::shared_ptr<clp_s::search::Expression> m_expr;
std::shared_ptr<OrderedProjection> m_projection;
std::string m_query;
std::vector<Field> m_output_columns;

bool m_current_schema_table_loaded;

std::shared_ptr<clp_s::search::SchemaMatch> m_schema_match;

std::shared_ptr<QueryRunner> m_query_runner;

clp_s::EvaluatedValue m_expression_value;

clp_s::ArchiveReader m_archive_reader;
clp_s::SchemaReader* m_schema_reader;
std::shared_ptr<clp_s::SchemaTree> m_schema_tree;
std::shared_ptr<clp_s::ReaderUtils::SchemaMap> m_schema_map;
std::shared_ptr<clp_s::VariableDictionaryReader> m_var_dict;
std::shared_ptr<clp_s::LogTypeDictionaryReader> m_log_dict;
std::shared_ptr<clp_s::LogTypeDictionaryReader> m_array_dict;
std::shared_ptr<clp_s::TimestampDictionaryReader> m_timestamp_dict;
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Assess concurrency for archival operations.

fetch_next, move_to_next_archive, and load_archive might be called on multiple threads if your design implies parallel reading. Confirm that you are properly synchronizing access to shared state such as m_current_archive_index.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
velox/connectors/clp/search_lib/Cursor.h (2)

50-54: Constructor parameter matches member variable name

The parameter m_ignore_case has the same name as the member variable, which could cause confusion. Consider renaming the parameter to avoid shadowing the member variable.

-  explicit Cursor(
-      clp_s::InputSource input_source,
-      std::vector<std::string> archive_paths,
-      bool m_ignore_case);
+  explicit Cursor(
+      clp_s::InputSource input_source,
+      std::vector<std::string> archive_paths,
+      bool ignore_case);

90-101: Incomplete documentation for preprocess_query()

The documentation comment for this method is incomplete and doesn't describe what the method does, even though it's a crucial step in query processing.

-  /**
-   * Preprocesses the query to.
-   */
+  /**
+   * Preprocesses the query to standardize and validate it.
+   * @return The error code indicating success or failure.
+   */
velox/connectors/clp/search_lib/Cursor.cpp (4)

50-50: Remove unresolved TODO comment

The TODO comment doesn't specify what needs to be fixed, making it difficult for other developers to understand the intention.

Either provide a more descriptive comment explaining what needs to be fixed or remove the TODO if the issue has been resolved.


73-74: Improve readability of assignment within conditional

Using assignment within a conditional statement can be error-prone and less readable, as it's easy to confuse with an equality check.

-  if (m_expr = m_schema_match->run(m_expr);
-      std::dynamic_pointer_cast<EmptyExpr>(m_expr)) {
+  m_expr = m_schema_match->run(m_expr);
+  if (std::dynamic_pointer_cast<EmptyExpr>(m_expr)) {

137-138: Similar assignment within conditional issue in multiple places

Similar to the previous comment, there are other instances of assignments within conditional statements that could be made clearer.

For example, for lines 137-138:

-  if (m_expr = standardize_pass.run(m_expr);
-      std::dynamic_pointer_cast<EmptyExpr>(m_expr)) {
+  m_expr = standardize_pass.run(m_expr);
+  if (std::dynamic_pointer_cast<EmptyExpr>(m_expr)) {

Apply similar changes to the other instances at lines 144-145 and 151-152.

Also applies to: 144-145, 151-152


239-379: Reduce code duplication between the two fetch_next methods

There's significant code duplication between the two fetch_next methods. Consider refactoring common logic into a private helper method.

Create a private template method that handles the common logic, and have both public methods call it with their specific parameters:

template <typename T>
size_t fetch_next_impl(size_t num_rows, std::vector<T>& column_vectors) {
    // Common implementation
    // ...
}

Then modify the public methods to use this implementation.

velox/connectors/clp/ClpConnector.cpp (1)

26-32: Provide an implementation or a clear TODO for createDataSink.
This method is declared but not implemented. Adding a TODO clarifies upcoming work and avoids confusion for other contributors.

Do you need help drafting an initial implementation or opening a tracking issue?

velox/connectors/clp/ClpConnector.h (2)

14-18: Check for potential configuration nullability.
Though std::shared_ptr typically guards against null usage, consider adding explicit safety checks to prevent runtime errors in edge cases.


35-39: Method scope is declared but remains unimplemented.
Consider adding a VELOX_NYI or a documented TODO to ensure clarity in future development.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b74119 and 16932ec.

📒 Files selected for processing (6)
  • velox/connectors/clp/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/ClpConnector.cpp (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/search_lib/Cursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/Cursor.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • velox/connectors/clp/CMakeLists.txt
  • velox/connectors/clp/ClpDataSource.cpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/ClpConnector.cpp
  • velox/connectors/clp/search_lib/Cursor.cpp
🧬 Code Definitions (2)
velox/connectors/clp/ClpConnector.cpp (1)
velox/connectors/clp/ClpConnector.h (5) (5)
  • ClpConnector (10:12)
  • id (55:61)
  • id (55:59)
  • outputType (23:29)
  • inputType (35:39)
velox/connectors/clp/search_lib/Cursor.cpp (1)
velox/connectors/clp/search_lib/Cursor.h (4) (4)
  • Cursor (58:61)
  • query (70:72)
  • num_rows (80:82)
  • num_rows (84:84)
🔇 Additional comments (11)
velox/connectors/clp/search_lib/Cursor.h (1)

86-101: Assess concurrency for archive operations

move_to_next_archive, load_archive, and other methods might be called on multiple threads if your design implies parallel reading. Confirm that you are properly synchronizing access to shared state such as m_current_archive_index.

velox/connectors/clp/search_lib/Cursor.cpp (1)

160-237: Reduce the complexity of execute_query

This method covers numerous steps (parsing, loading, iteration, schema matching). Splitting it into smaller private functions can enhance readability and maintainability.

velox/connectors/clp/ClpConnector.cpp (2)

5-9: Consider validating the incoming config pointer.
Even though it's likely non-null, adding a small check or assertion can help prevent accidental null dereferences.


11-24: Implementation appears correct and straightforward.
The creation of ClpDataSource with all relevant parameters is well-organized, and returning a unique pointer is well aligned with ownership semantics.

velox/connectors/clp/ClpConnector.h (7)

10-12: Constructor usage is consistent with the base class.
Accepting both id and config ensures flexibility in connector instantiation.


19-21: canAddDynamicFilter returning false is acceptable.
If the connector doesn't support dynamic filters, confirming this explicitly avoids unexpected usage elsewhere.


23-30: Method signature matches the corresponding implementation in the .cpp file.
Everything appears consistent for creating data sources.


31-33: supportsSplitPreload returning false is appropriate if unavailable.
Clearly communicates that the connector does not optimize with a split preload mechanism.


42-42: Good practice to store the specialized config object.
Retaining config_ as an internal member fosters a cohesive design, as it centralizes connector-specific configuration.


49-53: ClpConnectorFactory constructor properly initializes the connector name and pattern.
Invoking TimestampPattern::init() upon factory construction is a suitable place to guarantee readiness.


55-61: newConnector method is concise and matches the factory pattern.
This approach neatly instantiates the ClpConnector, avoiding complex logic in the factory.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (8)
velox/connectors/clp/search_lib/OrderedProjection.cpp (2)

13-30: Validate exception messages for clarity.
The duplicate column and unresolved descriptor checks are well-handled. However, including a descriptive error message in OperationFailed would make troubleshooting easier.


32-61: Add a default case for future-proofing.
The switch statement does not handle a default scenario. Consider adding a default branch (even if empty) to safeguard against unhandled ColumnType values that may be introduced later.

+    default:
+      // Handle or ignore new column types gracefully
+      break;
velox/connectors/clp/search_lib/QueryRunner.h (2)

78-82: Document the empty method body.
The second init overload is empty. If this method is intentionally a stub, consider adding a brief comment to clarify its status or a TODO tag to indicate future implementation plans.


88-90: Implement or rename the placeholder filter method.
Currently, filter always returns false. If the method is expected to perform real filtering, consider implementing it or marking it as a temporary placeholder.

velox/connectors/clp/search_lib/QueryRunner.cpp (4)

15-16: Replace macro with a static helper function.
Using a macro for eval(op, a, b) can reduce clarity and hinder debugging. A small, inlined helper function or lambda would be more robust and type-safe.

-#define eval(op, a, b) (((op) == FilterOperation::EQ) ? ((a) == (b)) : ((a) != (b)))
+inline bool eval(FilterOperation op, auto a, auto b) {
+  if (op == FilterOperation::EQ) {
+    return a == b;
+  }
+  return a != b;
+}

279-353: Complex expression evaluation logic.
Manually tracking evaluation state with a stack is valid but can be hard to maintain. Consider a more traditional recursive evaluation to improve readability and reduce the risk of logical errors.


644-713: Consider partial matching or substring search.
The current array filter logic checks for an exact string match in each element. If partial or flexible matching becomes a requirement, updating this logic to handle more advanced patterns may be advisable.


1270-1285: Enhance error handling for date parsing.
If parsing fails or a non-integer operand is inadvertently passed, highlight the issue clearly. Currently, as_int can silently fail, returning false without an exception.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16932ec and 0d3b636.

📒 Files selected for processing (4)
  • velox/connectors/clp/search_lib/Cursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/OrderedProjection.cpp (1 hunks)
  • velox/connectors/clp/search_lib/QueryRunner.cpp (1 hunks)
  • velox/connectors/clp/search_lib/QueryRunner.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • velox/connectors/clp/search_lib/Cursor.cpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/OrderedProjection.cpp
  • velox/connectors/clp/search_lib/QueryRunner.cpp
🧬 Code Definitions (1)
velox/connectors/clp/search_lib/QueryRunner.h (1)
velox/connectors/clp/search_lib/QueryRunner.cpp (6) (6)
  • populate_string_queries (947-1024)
  • populate_string_queries (947-948)
  • populate_searched_wildcard_columns (1026-1057)
  • populate_searched_wildcard_columns (1026-1027)
  • constant_propagate (1066-1268)
  • constant_propagate (1066-1067)
🔇 Additional comments (8)
velox/connectors/clp/search_lib/OrderedProjection.cpp (2)

63-68: Short and clear method implementation.
The loop cleanly resolves each column descriptor. No issues detected.


70-129: Confirm schemaTree validity before resolution.
If tree might be null or improperly initialized, adding a check could prevent runtime errors. Otherwise, this resolution logic is well-structured and consistent with your code style.

velox/connectors/clp/search_lib/QueryRunner.h (2)

106-117: Good use of separate method for constant propagation.
The design keeps expression simplification distinct. This separation improves readability.


130-135: Check vector size assumptions.
When calling fetch_next, ensure column_vectors.size() aligns with m_projection. If usage deviates, validating or documenting assumptions would prevent runtime issues.

velox/connectors/clp/search_lib/QueryRunner.cpp (4)

78-82: Ensure date reader assignments are unique.
Your comment implies that a DateStringColumnReader for a specific column ID should not repeat. Verify no unintended conflicts can arise if schemas evolve or if multiple columns share the same node ID.


108-127: Validate message capacity in output vectors.
Ensure column_vectors[0] (and subsequent columns) have sufficient space to accommodate num_rows_fetched + 1 entries, especially when projecting numerous rows in a single fetch.


760-843: Handle nested wildcard arrays with caution.
This method recursively checks arrays and objects. If deeper nesting or new data structures are introduced, ensure the recursion depth is handled gracefully to avoid stack overflows.


913-945: Efficient boolean filtering.
Returning true as soon as a match is found avoids unnecessary reads or comparisons. This short-circuit approach is optimal for performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant