Clp integration s3 #1

wraymo · 2025-03-05T22:27:40Z

Summary by CodeRabbit

New Features
- Introduced an optional connector that provides enhanced querying and search capabilities for improved data access.
- Extended IP address functionality to now include IP prefix handling, offering better support for network operations.
Build & Dependency Enhancements
- Upgraded build configurations and dependency management, ensuring smoother installation and greater flexibility across environments.
- Streamlined prerequisite installations for a more reliable setup process.
Refinements
- Optimised internal processes for IP casting and function registration, contributing to improved performance and maintainability.

anlowee · 2025-03-10T14:55:11Z

velox/connectors/clp/search_lib/Cursor.cpp

+    // clear the stage from last run
+    m_query_runner->populate_string_queries();
+
+    // probably have another class for query evaluation and filter


Is this a TODO?

anlowee · 2025-03-10T15:01:46Z

velox/connectors/clp/search_lib/Cursor.cpp

+      m_current_schema_index =
+          (m_current_schema_index + 1) % m_matched_schemas.size();


Same question: why is it a cycle?
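For context, the modulo increment above, together with the `m_end_schema_index` comparison in the `fetch_next` code quoted later in this review, behaves like a wrap-around traversal: start at some index, advance with wrap-around, and stop once the index returns to where it began. The thread does not say why the cursor is written this way, so the following is only a minimal sketch of that pattern; `visit_in_cycle`, `start_index` and `end_index` are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: visits every schema exactly once, starting at
// start_index and wrapping around to the front of the list.
std::vector<int32_t> visit_in_cycle(
    const std::vector<int32_t>& matched_schemas, std::size_t start_index) {
  std::vector<int32_t> visited;
  if (matched_schemas.empty()) {
    return visited;
  }
  assert(start_index < matched_schemas.size());

  std::size_t current_index = start_index;
  const std::size_t end_index = start_index;  // the cycle closes where it began
  bool completed_cycle = false;
  while (false == completed_cycle) {
    visited.push_back(matched_schemas[current_index]);
    current_index = (current_index + 1) % matched_schemas.size();
    completed_cycle = (current_index == end_index);
  }
  return visited;
}
```

If the traversal always starts at index 0, a plain bounds check (`index < size`) would express the same thing without the modulo.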

anlowee · 2025-03-10T15:07:25Z

velox/connectors/clp/search_lib/Cursor.cpp

+      m_current_schema_table_loaded = false;
+    }
+
+    if (m_expression_value != EvaluatedValue::False) {


Can we replace all "!= EvaluatedValue::False" with "== EvaluatedValue::True"? I think that would be clearer.

Actually, it should be "EvaluatedValue::True == m_expression_value"; as I remember it, we write the constant before the variable in conditions.

There's also an Unknown case, so the two conditions are not equivalent.
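To make the Unknown point concrete, here is a minimal sketch. The enum below is a hypothetical stand-in that uses only the three values named in this thread; the real definition may differ. The two helper functions are illustrative names, not code from the PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the three-valued result discussed above.
enum class EvaluatedValue { True, False, Unknown };

// Current form: keep the schema unless it is proven not to match.
constexpr bool passes_not_false(EvaluatedValue value) {
  return EvaluatedValue::False != value;
}

// Suggested form: keep the schema only when it is proven to match.
constexpr bool passes_is_true(EvaluatedValue value) {
  return EvaluatedValue::True == value;
}

// The two forms agree on True and False but differ on Unknown.
void demonstrate_difference() {
  assert(passes_not_false(EvaluatedValue::Unknown));
  assert(false == passes_is_true(EvaluatedValue::Unknown));
}
```

With the suggested rewrite, a schema whose expression evaluates to Unknown would be skipped instead of scanned, which changes behaviour rather than only style.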

anlowee · 2025-03-10T15:18:30Z

velox/connectors/clp/search_lib/Cursor.cpp

+          m_error_code = ErrorCode::DictionaryNotFound;
+          continue;


Why is this "continue" rather than "break" when we run into the error? I find this inner while loop a bit confusing; I feel we have mixed two things in the same iteration. If we just want to skip the schemas that were not found, maybe we can filter them out before the inner while loop starts.
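The pre-filtering idea could look roughly like the sketch below. It assumes the per-schema check can be expressed as a predicate; `filter_matchable_schemas` and `may_match` are hypothetical names. In the quoted code the check depends on `set_schema` and `constant_propagate` being called on the query runner, so a real version would need that setup inside the predicate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical helper: returns only the schema ids that may still match,
// so the fetch loop never has to skip a schema mid-iteration.
std::vector<int32_t> filter_matchable_schemas(
    const std::vector<int32_t>& matched_schemas,
    const std::function<bool(int32_t)>& may_match) {
  assert(static_cast<bool>(may_match));  // the predicate must be callable

  std::vector<int32_t> kept;
  std::copy_if(
      matched_schemas.begin(),
      matched_schemas.end(),
      std::back_inserter(kept),
      may_match);
  return kept;
}
```

The trade-off is that every schema is evaluated up front, even if the caller stops fetching early.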

anlowee · 2025-03-10T15:26:50Z

velox/connectors/clp/search_lib/Cursor.cpp

+    while (false == m_completed_archive_cycles) {
+      m_error_code = load_archive();
+
+      if (ErrorCode::Success == m_error_code) {
+        m_query_runner = std::make_shared<QueryRunner>(
+            m_expr,
+            m_schema_match,
+            m_ignore_case,
+            m_schema_map,
+            m_schema_tree,
+            m_projection,
+            m_var_dict,
+            m_log_dict);
+        m_query_runner->populate_string_queries();
+        break;
+      }
+      move_to_next_archive();
+    }
+  }
+  return 0;


But here we are already inside the while (false == m_completed_archive_cycles) loop, right? It seems we could get rid of the outer one, because when this loop ends, m_completed_archive_cycles is true and the outer loop ends as well. So does the outer while loop only ever iterate once?

No, it could iterate multiple times.

anlowee · 2025-03-10T15:36:40Z

velox/connectors/clp/search_lib/Cursor.cpp

+size_t Cursor::fetch_next(
+    size_t num_rows,
+    std::vector<ColumnData>& column_vectors) {
+  if (m_error_code != ErrorCode::Success) {
+    return 0;
+  }
+
+  while (false == m_completed_archive_cycles) {
+    while (false == m_completed_schema_cycles) {
+      // whether the schema table is loaded
+      if (false == m_current_schema_table_loaded) {
+        m_current_schema_id = m_matched_schemas[m_current_schema_index];
+        m_query_runner->set_schema(m_current_schema_id);
+        m_query_runner->populate_searched_wildcard_columns();
+        m_expression_value = m_query_runner->constant_propagate();
+
+        if (m_expression_value != EvaluatedValue::False) {
+          m_query_runner->add_wildcard_columns_to_searched_columns();
+
+          if (m_archive_read_stage < ArchiveReadStage::TablesInitialized) {
+            m_archive_reader.open_packed_streams();
+            m_archive_read_stage = ArchiveReadStage::TablesInitialized;
+          }
+          auto& reader = m_archive_reader.read_schema_table(
+              m_current_schema_id, false, false);
+          reader.initialize_filter_with_column_map(m_query_runner.get());
+          m_error_code = ErrorCode::Success;
+          m_current_schema_table_loaded = true;
+        } else {
+          m_current_schema_index =
+              (m_current_schema_index + 1) % m_matched_schemas.size();
+          m_error_code = ErrorCode::DictionaryNotFound;
+          continue;
+        }
+      }
+
+      if (auto num_rows_fetched =
+              m_query_runner->fetch_next(num_rows, column_vectors);
+          num_rows_fetched > 0) {
+        return num_rows_fetched;
+      }
+
+      m_current_schema_index =
+          (m_current_schema_index + 1) % m_matched_schemas.size();
+      m_completed_schema_cycles = m_current_schema_index == m_end_schema_index;
+      m_current_schema_table_loaded = false;
+    }
+
+    move_to_next_archive();
+    while (false == m_completed_archive_cycles) {
+      m_error_code = load_archive();
+
+      if (ErrorCode::Success == m_error_code) {
+        m_query_runner = std::make_shared<QueryRunner>(
+            m_expr,
+            m_schema_match,
+            m_ignore_case,
+            m_schema_map,
+            m_schema_tree,
+            m_projection,
+            m_var_dict,
+            m_log_dict);
+        m_query_runner->populate_string_queries();
+        break;
+      }
+      move_to_next_archive();
+    }
+  }
+  return 0;
+}


This feels like major code duplication. Any chance we can improve it?
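One way to remove the duplication would be to pull the "load the next archive that opens successfully and build a query runner for it" loop into a single helper that both call sites use. The sketch below uses a simplified, hypothetical stand-in for the cursor (`ArchiveCursor`, `loadable`, `runners_created` and `load_next_valid_archive` are all invented for illustration); it only models the control flow, not the real archive reader or `QueryRunner`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the cursor's archive state.
struct ArchiveCursor {
  std::vector<bool> loadable;  // whether loading archive i would succeed
  std::size_t current{0};
  bool completed{false};
  int runners_created{0};

  explicit ArchiveCursor(std::vector<bool> archives)
      : loadable(std::move(archives)), completed(loadable.empty()) {}

  bool load_archive() const {
    assert(current < loadable.size());
    return loadable[current];
  }

  void move_to_next_archive() {
    ++current;
    completed = current >= loadable.size();
  }

  // Shared helper: starting at the current archive, load the first one that
  // succeeds and prepare a runner for it. Returns false if none is left.
  bool load_next_valid_archive() {
    while (false == completed) {
      if (load_archive()) {
        // Stands in for constructing QueryRunner and calling
        // populate_string_queries() in the quoted code.
        ++runners_created;
        return true;
      }
      move_to_next_archive();
    }
    return false;
  }
};
```

With such a helper, the tail of `fetch_next` would reduce to `move_to_next_archive()` followed by one call, and the other call site could invoke the same function.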

anlowee · 2025-03-10T15:37:11Z

velox/connectors/clp/search_lib/Cursor.h

+      size_t num_rows,
+      std::vector<facebook::velox::VectorPtr>& column_vectors);
+
+  size_t fetch_next(size_t num_rows, std::vector<ColumnData>& column_vectors);


I guess this method could be private? It seems nothing else calls it directly.

anlowee · 2025-03-10T15:51:50Z

velox/connectors/clp/search_lib/OrderedProjection.cpp

+   * The main reason is that here we don't want to allow projection to travel
+   * inside unstructured objects -- it may be possible to support such a thing
+   * in the future, but it poses some extra challenges (e.g. deciding what to do
+   * when projecting repeated elements in a structure).


Mark it as TODO?

anlowee · 2025-03-10T16:22:41Z

velox/connectors/clp/search_lib/OrderedProjection.cpp

+  std::vector<int32_t> local_matching_node_list;
+  auto cur_node_id = tree->get_object_subtree_node_id();
+  auto it = column->descriptor_begin();
+  while (it != column->descriptor_end()) {
+    bool matched_any{false};
+    auto cur_it = it++;
+    bool last_token = it == column->descriptor_end();
+    auto const& cur_node = tree->get_node(cur_node_id);
+    for (int32_t child_node_id : cur_node.get_children_ids()) {
+      auto const& child_node = tree->get_node(child_node_id);
+
+      // Intermediate nodes must be objects
+      if (false == last_token && child_node.get_type() != NodeType::Object) {
+        continue;
+      }
+
+      if (child_node.get_key_name() != cur_it->get_token()) {
+        continue;
+      }
+
+      matched_any = true;
+      if (last_token &&
+          column->matches_type(node_to_literal_type(child_node.get_type()))) {
+        m_matching_nodes.insert(child_node_id);
+        local_matching_node_list.push_back(child_node_id);
+      } else if (false == last_token) {
+        cur_node_id = child_node_id;
+        break;
+      }
+    }
+
+    if (false == matched_any) {
+      break;
+    }
+  }


I don't feel very confident reviewing this function.
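As a reading aid, the loop above appears to walk the schema tree one descriptor token at a time: intermediate tokens must match an object child, and the last token collects every child with a matching key. The sketch below reproduces that control flow on a simplified, hypothetical tree (`Node`, `resolve_path` and the plain `tokens` vector are invented here). It omits the `matches_type` check on the final token, so it illustrates the traversal only and is not the PR's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class NodeType { Object, Integer, String };

// Hypothetical, simplified tree node; a node's id is its index in the vector.
struct Node {
  std::string key;
  NodeType type;
  std::vector<int32_t> children;
};

// Returns the ids of all nodes reached by following tokens from root_id.
std::vector<int32_t> resolve_path(
    const std::vector<Node>& tree,
    int32_t root_id,
    const std::vector<std::string>& tokens) {
  std::vector<int32_t> matches;
  int32_t cur_node_id = root_id;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    assert(cur_node_id >= 0 &&
           static_cast<std::size_t>(cur_node_id) < tree.size());
    const bool last_token = (i + 1 == tokens.size());
    bool matched_any = false;
    for (int32_t child_id : tree[cur_node_id].children) {
      const Node& child = tree[child_id];
      // Intermediate nodes must be objects.
      if (false == last_token && NodeType::Object != child.type) {
        continue;
      }
      if (child.key != tokens[i]) {
        continue;
      }
      matched_any = true;
      if (last_token) {
        matches.push_back(child_id);  // several leaves can share one key
      } else {
        cur_node_id = child_id;  // descend into the first matching object
        break;
      }
    }
    if (false == matched_any) {
      break;  // the path does not exist in this tree
    }
  }
  return matches;
}
```

For a tree in which object `a` has an integer child `b` and a string child `b`, resolving the tokens `a`, `b` returns both leaf ids, which mirrors how the original collects every matching node for the last token.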

anlowee · 2025-03-10T16:26:49Z

velox/connectors/clp/search_lib/QueryRunner.h

+   * Set the schema to filter
+   * @param schema
+   */
+  void set_schema(int32_t schema) {


Maybe rename it to init_schema? It does much more than just set "m_schema".

anlowee · 2025-03-10T17:15:43Z

velox/connectors/clp/search_lib/QueryRunner.h

+   * @param vector_index
+   * @param column_vectors
+   */
+  void get_message(


How about renaming it to "populate_message"? It's odd for a getter function to return void.

Could we also briefly introduce the term "message" here? From the clp-s perspective it is essentially a piece of data, and we convert each "message" into one "row" of results, right?

anlowee · 2025-03-10T17:41:18Z