Skip to content

Conversation

@BiteTheDDDDt
Copy link
Contributor

pick from #57737 #57860 #58124

HappenLee and others added 3 commits November 21, 2025 10:34
### What problem does this PR solve?

Remove unless time var in block:
```
    int64_t _decompress_time_ns = 0;
    int64_t _decompressed_bytes = 0;

    mutable int64_t _compress_time_ns = 0;
```
This pull request refactors how columns are accessed by name in the
`Block` class and related code, removing the internal name-to-index map
and switching to linear search. It also removes temporary column
handling and updates code throughout the backend to use the new column
access pattern. Additionally, it improves error handling and updates
some operator logic for schema scans.

* Removed the internal `index_by_name` map from the `Block` class,
replacing all column name lookups with linear search via the new
`get_position_by_name` method. This affects methods like `get_by_name`,
`try_get_by_name`, `has`, and related insert/erase logic.
[[1]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L85-R87)
[[2]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L164-L193)
[[3]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L204-L227)
[[4]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L241-L247)
[[5]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L259-L288)
[[6]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L310-R268)
[[7]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L774)
[[8]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L795-L803)
[[9]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L825-L835)

* Removed the ability to erase columns by name and the logic for
maintaining the name-to-index map, simplifying column management.
[[1]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L259-L288)
[[2]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L241-L247)
[[3]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L795-L803)

* Removed the definition and usage of temporary block column prefixes
and the method `erase_tmp_columns`, as well as related cleanup logic in
scan operators.
[[1]](diffhunk://#diff-d569a6c78638e9c539783057051e89bcb135cc254c763e3bc5e3a71e334df883L27-L28)
[[2]](diffhunk://#diff-ac38473f151ccd99db4ed80dfbfa176f4f957ae2f5335ce8fbb4f5dbb0e932e9L1332-L1338)
[[3]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L795-L803)

* Updated all code that previously accessed columns by name using the
old map to use the new position-based access, including changes in
partial update logic, push broker reader, and schema scan operator.
[[1]](diffhunk://#diff-23fa0193d626ba712c4186c66bcd1809c7e55bfc04ea10f5a91c691ed3e04727L944-R947)
[[2]](diffhunk://#diff-2ed235dda16244dccd76626375b4512b6ade1724933269c40a2953c29dd95c61L314-R314)
[[3]](diffhunk://#diff-2ed235dda16244dccd76626375b4512b6ade1724933269c40a2953c29dd95c61L387-R388)
[[4]](diffhunk://#diff-2ed235dda16244dccd76626375b4512b6ade1724933269c40a2953c29dd95c61L423-L443)
[[5]](diffhunk://#diff-aa97e616efca2638565e6a58aa9295252714265fcc1ed904d503425ef4930ce7L481-R484)
[[6]](diffhunk://#diff-aa97e616efca2638565e6a58aa9295252714265fcc1ed904d503425ef4930ce7L494-R494)
[[7]](diffhunk://#diff-991f87a3acb91e58d14e3b30520de5dd07ecade8d4ecf885c02bf436873b4262R182)
[[8]](diffhunk://#diff-991f87a3acb91e58d14e3b30520de5dd07ecade8d4ecf885c02bf436873b4262L253-R257)

* Added a slot-to-column index mapping (`_slot_offsets`) to
`SchemaScanOperatorX` for more efficient column mapping during scans.
[[1]](diffhunk://#diff-82ea988d6a5179d6698c2844ad8f06167ebdb362225b07b35e9a5ba0a7bb173aR22-R23)
[[2]](diffhunk://#diff-82ea988d6a5179d6698c2844ad8f06167ebdb362225b07b35e9a5ba0a7bb173aR90-R92)
[[3]](diffhunk://#diff-991f87a3acb91e58d14e3b30520de5dd07ecade8d4ecf885c02bf436873b4262R182)
[[4]](diffhunk://#diff-991f87a3acb91e58d14e3b30520de5dd07ecade8d4ecf885c02bf436873b4262L253-R257)

* Improved error handling for missing columns, notably in partial update
logic, by returning internal errors with block structure dumps when
columns are not found.

---

This refactor simplifies the `Block` class, improves code consistency,
and enhances error handling across the backend.
This pull request refactors how columns are accessed by name in several
vectorized execution components, replacing the inefficient O(N)
`get_by_name` method with a map-based lookup via a new
`get_name_to_pos_map` function. This change improves performance and
error handling when working with blocks of columns, especially in ORC
and Parquet readers. The updates also remove the old `get_by_name`
methods and update all relevant call sites to use the new approach.

* Added `get_name_to_pos_map` to the `Block` class, enabling efficient
mapping from column names to positions for fast lookup.
* Removed the O(N) `get_by_name` methods from the `Block` class,
enforcing use of position-based access via the map.
[[1]](diffhunk://#diff-5a2c9e19a27153df9fce7277a09325589cd009441c63da55761c286627417fb3L147-L151)
[[2]](diffhunk://#diff-76dba768c36c76cce660f7fe39514a98b509030a8976aabc9ac47d58bc923976L244-L261)

* Updated all column accesses in `OrcReader` (`vorc_reader.cpp`) to use
the name-to-position map, improving performance and adding error
handling for missing columns.
[[1]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R1286-R1289)
[[2]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R1314-R1324)
[[3]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425L1338-R1348)
[[4]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R1993-R1996)
[[5]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2062-R2072)
[[6]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425L2088-R2094)
[[7]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2205-R2225)
[[8]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2239-R2247)
[[9]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2273-R2276)
[[10]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425L2310-R2334)
[[11]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2648-R2655)

* Updated `RowGroupReader` in `vparquet_group_reader.cpp` to use the
name-to-position map for column access, with improved error handling for
missing columns.

* Added checks for missing columns using the map's `contains` method
before accessing columns, returning internal errors if columns are not
found.
[[1]](diffhunk://#diff-2ed235dda16244dccd76626375b4512b6ade1724933269c40a2953c29dd95c61L436-R441)
[[2]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R1314-R1324)
[[3]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2062-R2072)
[[4]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2239-R2247)
[[5]](diffhunk://#diff-d09145594c823444cca71879eb6515950211b548d7cbef65f0caf5c1d88f296fR396-R421)

* Removed unused includes and updated struct initialization style for
clarity and consistency.
[[1]](diffhunk://#diff-d09145594c823444cca71879eb6515950211b548d7cbef65f0caf5c1d88f296fR28)
[[2]](diffhunk://#diff-97945196187497c82dd245460b397955be0ebb9caeb75267a72c2bff2d545425R2205-R2225)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [x] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [x] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@Thearas
Copy link
Contributor

Thearas commented Nov 21, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 23.91% (66/276) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.78% (18148/34381)
Line Coverage 38.13% (165281/433518)
Region Coverage 33.15% (128383/387325)
Branch Coverage 33.94% (55350/163080)

@yiguolei yiguolei merged commit df2a963 into apache:branch-4.0 Nov 21, 2025
24 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants