Antalya: Cache the list objects operation on object storage using a TTL + prefix matching cache implementation #743

Open · wants to merge 29 commits into base: antalya

Conversation

arthurpassos
Collaborator

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Cache for ListObjects calls on object storage.

Documentation entry for user-facing changes

@arthurpassos arthurpassos changed the title from "draft immpl" to "Cache the list objects operation on object storage using a TTL + prefix matching cache implementation" on Apr 17, 2025
@arthurpassos
Collaborator Author

arthur :) SELECT date, count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_hive_partitioning=1, use_object_storage_list_objects_cache=0;

SELECT
    date,
    count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_hive_partitioning = 1, use_object_storage_list_objects_cache = 0

Query id: 29d096ab-0297-43a3-8844-b83b6a7856fb

    ┌─date───────┬─count()─┐
 1. │ 2025-01-01 │  292213 │
 2. │ 2025-01-02 │  402440 │
 3. │ 2025-01-03 │  409341 │
 4. │ 2025-01-04 │  432302 │
 5. │ 2025-01-05 │  433954 │
 6. │ 2025-01-06 │  366260 │
 7. │ 2025-01-07 │  352121 │
 8. │ 2025-01-08 │  399976 │
 9. │ 2025-01-09 │  534013 │
10. │ 2025-01-10 │  408769 │
11. │ 2025-01-11 │  361190 │
12. │ 2025-01-12 │  380525 │
13. │ 2025-01-13 │  408248 │
14. │ 2025-01-14 │  352684 │
15. │ 2025-01-15 │  354014 │
16. │ 2025-01-16 │  375439 │
17. │ 2025-01-17 │  425661 │
18. │ 2025-01-18 │  360666 │
19. │ 2025-01-19 │  388509 │
20. │ 2025-01-20 │  350291 │
21. │ 2025-01-21 │  324412 │
22. │ 2025-01-22 │  432369 │
23. │ 2025-01-23 │  326010 │
24. │ 2025-01-24 │  369243 │
25. │ 2025-01-25 │  338988 │
26. │ 2025-01-26 │  309651 │
27. │ 2025-01-27 │  332102 │
28. │ 2025-01-28 │  305953 │
29. │ 2025-01-29 │  355332 │
30. │ 2025-01-30 │  335134 │
31. │ 2025-01-31 │  328684 │
    └────────────┴─────────┘

31 rows in set. Elapsed: 4.080 sec. Processed 11.55 million rows, 0.00 B (2.83 million rows/s., 0.00 B/s.)
Peak memory usage: 3.17 MiB.
arthur :) SELECT date, count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_hive_partitioning=1, use_object_storage_list_objects_cache=1;

SELECT
    date,
    count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE (date >= '2025-01-01') AND (date <= '2025-01-31')
GROUP BY date
ORDER BY date ASC
SETTINGS use_hive_partitioning = 1, use_object_storage_list_objects_cache = 1

Query id: 4afafbfb-eb93-4b96-8c4f-4a94f723805c

    ┌─date───────┬─count()─┐
 1. │ 2025-01-01 │  292213 │
 2. │ 2025-01-02 │  402440 │
 3. │ 2025-01-03 │  409341 │
 4. │ 2025-01-04 │  432302 │
 5. │ 2025-01-05 │  433954 │
 6. │ 2025-01-06 │  366260 │
 7. │ 2025-01-07 │  352121 │
 8. │ 2025-01-08 │  399976 │
 9. │ 2025-01-09 │  534013 │
10. │ 2025-01-10 │  408769 │
11. │ 2025-01-11 │  361190 │
12. │ 2025-01-12 │  380525 │
13. │ 2025-01-13 │  408248 │
14. │ 2025-01-14 │  352684 │
15. │ 2025-01-15 │  354014 │
16. │ 2025-01-16 │  375439 │
17. │ 2025-01-17 │  425661 │
18. │ 2025-01-18 │  360666 │
19. │ 2025-01-19 │  388509 │
20. │ 2025-01-20 │  350291 │
21. │ 2025-01-21 │  324412 │
22. │ 2025-01-22 │  432369 │
23. │ 2025-01-23 │  326010 │
24. │ 2025-01-24 │  369243 │
25. │ 2025-01-25 │  338988 │
26. │ 2025-01-26 │  309651 │
27. │ 2025-01-27 │  332102 │
28. │ 2025-01-28 │  305953 │
29. │ 2025-01-29 │  355332 │
30. │ 2025-01-30 │  335134 │
31. │ 2025-01-31 │  328684 │
    └────────────┴─────────┘

31 rows in set. Elapsed: 0.040 sec. Processed 11.55 million rows, 0.00 B (287.50 million rows/s., 0.00 B/s.)
Peak memory usage: 844.09 KiB.

arthur :) 

@arthurpassos
Collaborator Author

laptop@arthur:~/work/altinity/list_objects_cache$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "SELECT date, count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_hive_partitioning=1, use_object_storage_list_objects_cache=0;"

Queries executed: 10.

localhost:9000, queries: 10, QPS: 0.389, RPS: 4487684.448, MiB/s: 0.000, result RPS: 12.049, result MiB/s: 0.000.

0%		2.363 sec.	
10%		2.379 sec.	
20%		2.382 sec.	
30%		2.391 sec.	
40%		2.402 sec.	
50%		2.410 sec.	
60%		2.410 sec.	
70%		2.451 sec.	
80%		2.458 sec.	
90%		3.159 sec.	
95%		3.212 sec.	
99%		3.212 sec.	
99.9%		3.212 sec.	
99.99%		3.212 sec.	

laptop@arthur:~/work/altinity/list_objects_cache$ ./cmake-build-release/programs/clickhouse benchmark -i 10 --cumulative -q "SELECT date, count()
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/*/*.parquet', NOSIGN)
WHERE date between '2025-01-01' and '2025-01-31'
GROUP BY date ORDER BY date
SETTINGS use_hive_partitioning=1, use_object_storage_list_objects_cache=1;"
Loaded 1 queries.

Queries executed: 10.

localhost:9000, queries: 10, QPS: 33.280, RPS: 384262109.406, MiB/s: 0.000, result RPS: 1031.666, result MiB/s: 0.028.

0%		0.015 sec.	
10%		0.015 sec.	
20%		0.015 sec.	
30%		0.016 sec.	
40%		0.017 sec.	
50%		0.017 sec.	
60%		0.017 sec.	
70%		0.018 sec.	
80%		0.018 sec.	
90%		0.018 sec.	
95%		0.018 sec.	
99%		0.018 sec.	
99.9%		0.018 sec.	
99.99%		0.018 sec.	

:)

{
    if (const auto it = cache.find(key); it != cache.end())
    {
        if (IsStaleFunction()(it->first))
Collaborator Author

This case is interesting: we find an exact match, but it has expired. Should we try to find a prefix match, or simply refresh the entry?

Collaborator

Well, there can be a more up-to-date prefix entry, so why not try to reuse it?

Collaborator Author

The only reason not to is that this entry would cease to exist: it would never be cached again, and serving it would degrade to a linear prefix search forever.

Actually, not forever: if the more up-to-date prefix entry gets evicted and this query is run again, the entry would reappear.

But I think you are right.
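A minimal sketch of the behavior agreed on above; the names cache, isStale, and findBestPrefixMatch are illustrative, not the PR's actual identifiers. On an expired exact match, the entry is evicted and the lookup falls through to the prefix search instead of returning stale data:

// Sketch only; cache, isStale, and findBestPrefixMatch are hypothetical names.
if (const auto it = cache.find(key); it != cache.end())
{
    if (!isStale(it->first))
        return it->second;   // fresh exact match: serve it
    cache.erase(it);         // expired: evict rather than serve stale data
}
// Fall through: a fresher entry cached under a shorter prefix may still apply.
return findBestPrefixMatch(key);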

{
    throw Exception(
        ErrorCodes::BAD_ARGUMENTS,
        "Using glob iterator with path without globs is not allowed (used path: {})",
Collaborator

Shouldn't this be LOGICAL_ERROR?
This looks like a code branch that cannot be reached normally (the user does not manually select which iterator to use).

Collaborator Author

I agree, it should probably be LOGICAL_ERROR. But:

This is mostly a copy and paste from the existing GlobIterator.

I might refactor this to avoid duplication; for now, this is just a draft implementation.

Even if I do refactor it, I would opt for keeping parity with the existing code and upstream. This will make reviews and merges with upstream easier.

@svb-alt svb-alt added the antalya-25.2.2 Planned for 25.2.2 release label Apr 18, 2025
@@ -6108,6 +6108,9 @@ Limit for hosts used for request in object storage cluster table functions - azu
Possible values:
- Positive integer.
- 0 — All hosts in cluster.
)", EXPERIMENTAL) \
DECLARE(Bool, use_object_storage_list_objects_cache, true, R"(
Member

Please add it to src/Core/SettingsChangesHistory.cpp.
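For reference, entries in that file take roughly this shape; the version tag and description below are illustrative placeholders, not taken from this PR:

// Sketch of an entry in src/Core/SettingsChangesHistory.cpp
// (the "25.2" version string and the wording are hypothetical):
{"25.2", {{"use_object_storage_list_objects_cache", false, true,
    "New setting to cache ListObjects results on object storage."}}},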

cache.setMaxCount(count);
}

void ObjectStorageListObjectsCache::setTTL(std::size_t ttl_)
Member

Is it in seconds, milliseconds, minutes, or hours?

Collaborator Author

In seconds; I'll rename the argument to make that clear.
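One way to make the unit unambiguous at every call site is to accept std::chrono::seconds instead of a bare integer; a sketch, not the PR's actual signature:

#include <chrono>

// Sketch only: taking std::chrono::seconds makes the unit explicit for callers
// (the class shape and member here are illustrative).
class ListObjectsCacheSketch
{
public:
    void setTTL(std::chrono::seconds ttl_) { ttl = ttl_; }
private:
    std::chrono::seconds ttl{300};
};

// Usage: cache.setTTL(std::chrono::seconds{600});  // unambiguously ten minutes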

@@ -435,6 +436,16 @@ BlockIO InterpreterSystemQuery::execute()
break;
#else
throw Exception(ErrorCodes::SUPPORT_IS_DISABLED, "The server was compiled without the support for Parquet");
#endif
}
case Type::DROP_OBJECT_STORAGE_LIST_OBJECTS_CACHE:
Member

@Enmk Enmk Apr 18, 2025

Does caching work only on Parquet files, or on any S3 ListObjects request?

Collaborator Author

Ah, copy-and-paste issue. It should be any :D

Collaborator Author

Done

Comment on lines 128 to 130
throw Exception(
    ErrorCodes::BAD_ARGUMENTS,
    "Using glob iterator with path without globs is not allowed (used path: {})",
Member

This is a minor nitpick, but maybe throw early, i.e. effectively do something like this:

if (!configuration->isPathWithGlobs())
{
    throw Exception(...);
}
// rest of the function as it was, but not inside indented block.

Collaborator Author

Yeah, agreed. Like I said, this is both 1) a copy and paste and 2) WIP.

I'll look into it more closely once the core features and testing have been implemented.

@arthurpassos
Collaborator Author

It's kind of problematic to test this with stateless or integration tests. A single glob query can perform multiple list-objects calls, which affects the ProfileEvents counters. Moreover, some of these list calls do not iterate through the entire listing and hence never insert into the cache.

Relying on hard-coded numbers based on current behavior is fragile; the best I could do in a stateless test would be some sort of event_counter > 0 assertion.

Or test the cache alone using unit tests.
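A unit-test sketch of that idea, assuming a hypothetical interface with set, get, and setTTL; the actual class in this PR may expose different methods, and a real test would inject a fake clock rather than sleep:

#include <gtest/gtest.h>
#include <chrono>
#include <thread>

// Hypothetical interface: set(prefix, listing) stores a listing,
// get(prefix) returns std::optional of the best cached listing.
TEST(ObjectStorageListObjectsCache, PrefixMatchAndTTL)
{
    ObjectStorageListObjectsCache cache;
    cache.setTTL(std::chrono::seconds{1});
    cache.set("v1.0/btc/transactions/", {"a.parquet", "b.parquet"});

    // A longer key should be served by the shorter cached prefix.
    EXPECT_TRUE(cache.get("v1.0/btc/transactions/date=2025-01-01/").has_value());

    // Once the TTL elapses, the entry must be treated as stale.
    std::this_thread::sleep_for(std::chrono::seconds{2});
    EXPECT_FALSE(cache.get("v1.0/btc/transactions/").has_value());
}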

@zvonand
Collaborator

zvonand commented Apr 19, 2025

Or test the cache alone using unit tests.

As long as we are storing the cache in the system anyway, maybe we could make it available as some kind of system table, e.g. system.s3_prefix_cache? This could also be greatly helpful for debugging.

@arthurpassos
Collaborator Author

As long as we are storing the cache in the system anyway,

I didn't understand what you mean by this

maybe we could make it available as some kind of system table, e.g. system.s3_prefix_cache

Yeah, we need to design one that also covers #586; I'm just not sure I'll include it in this PR.

@zvonand
Collaborator

zvonand commented Apr 19, 2025

I meant that "we have a cache, why not make an interface to view it"

@Enmk Enmk changed the title from "Cache the list objects operation on object storage using a TTL + prefix matching cache implementation" to "Antalya: Cache the list objects operation on object storage using a TTL + prefix matching cache implementation" on Apr 21, 2025
@arthurpassos
Collaborator Author

Reverted my last two commits due to the performance degradation #743 (comment)


std::vector<Key> to_remove;

for (auto it = cache.begin(); it != cache.end(); ++it)


I don't like this loop.
In my opinion, something like this would be better (in pseudocode):

while (!key.prefix.empty())
{
  if (auto res = cache.getWithKey(key))
  {  // should not be more than one passed key, if key '/foo/bar' exists, key '/foo' can't be in cache
    if (IsStaleFunction(res))
    {
      BasePolicy::remove(res);
      return std::nullopt;
    }
    else
      return res;
  }
  key.prefix.pop_back();
}

Collaborator Author

Can you explain why it is better?

And why do you assume the following?

{ // should not be more than one passed key, if key '/foo/bar' exists, key '/foo' can't be in cache

Collaborator Author

Well, assuming this version you suggested works, the time complexity drops to O(key_path_size), which should generally be better than O(N).

It won't find "the best match", though.

Collaborator Author

Just implemented it; can you please have a look?

Btw, thanks for the suggestion, it's a great one.
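For readers following along, a self-contained sketch of the suggested lookup; the types and names here are illustrative, not the PR's (the actual cache wraps an LRU policy). The prefix is shortened one character at a time, so the cost is bounded by the key length rather than by the number of cached entries:

#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Entry
{
    std::vector<std::string> objects;  // cached listing for this prefix
    Clock::time_point expires_at;      // entry is stale past this point
};

// O(prefix length) lookup: try the full prefix, then progressively shorter ones.
std::optional<std::vector<std::string>> lookup(
    std::unordered_map<std::string, Entry> & cache, std::string prefix)
{
    while (!prefix.empty())
    {
        if (auto it = cache.find(prefix); it != cache.end())
        {
            if (Clock::now() >= it->second.expires_at)
            {
                cache.erase(it);        // stale: evict and give up, as suggested
                return std::nullopt;
            }
            return it->second.objects;  // longest fresh prefix entry wins
        }
        prefix.pop_back();              // shorten by one character and retry
    }
    return std::nullopt;
}

Filtering the returned listing down to the originally requested prefix is left out for brevity.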

@ianton-ru ianton-ru left a comment

Wrote some comments.

@ianton-ru ianton-ru left a comment

LGTM
