
Emit telemetry events for hit and miss #76

Merged

Conversation

@aerosol (Contributor) commented Mar 26, 2024

Tentative first pass on #75

Since get might be effectively called twice by get_or_store, we need a way to suppress emitting telemetry events on plain get.

I'll add documentation upon implementation approval.
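A minimal sketch of the suppression idea, assuming a hypothetical `telemetry: false` option (the option name and the `send/2`-based emission stub are assumptions for illustration, not ConCache's actual API):

```elixir
# Sketch only: a plain map stands in for the cache table, and send/2
# stands in for :telemetry.execute/3, so this runs without dependencies.
defmodule GetSketch do
  def get(store, key, opts \\ []) do
    result = Map.fetch(store, key)

    # get_or_store would call this with telemetry: false so the same
    # lookup is not reported twice (once by get, once by get_or_store).
    if Keyword.get(opts, :telemetry, true) do
      send(self(), {:telemetry, if(result == :error, do: :miss, else: :hit)})
    end

    case result do
      {:ok, value} -> value
      :error -> nil
    end
  end
end
```

With this shape, `GetSketch.get(%{a: 1}, :a)` returns `1` and reports a hit, while passing `telemetry: false` returns the value silently.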

@sasa1977 (Owner) left a comment

Good job! From what I can tell, this isn't covering fetch.

I think it's worth refactoring internals a bit, in the following way:

  1. Remove all get functions from Operations.
  2. Support telemetry in Operations.fetch just like you did with get.
  3. Implement get & co in ConCache on top of fetch.

As an optional extra, I think it would be worth removing fetch_or_store from Operations too, i.e. move this logic to ConCache. That way Operations exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the ConCache module.

Another extra is to add fetch to ConCache (not sure why I didn't do it in the first place). With that done, all get operations could be written on top of ConCache functions, i.e. they wouldn't have to interact directly with Operations.

WDYT?
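The proposed layering can be sketched as follows, with a plain map standing in for the ETS table (function shapes are illustrative, not the actual ConCache internals; the map version returns the updated store because maps, unlike ETS, are immutable):

```elixir
defmodule LayeringSketch do
  # The single low-level read that everything else is built on.
  def fetch(store, key), do: Map.fetch(store, key)

  # get/2 is just fetch/2 with the result converted to value | nil.
  def get(store, key) do
    case fetch(store, key) do
      {:ok, value} -> value
      :error -> nil
    end
  end

  # fetch_or_store/3 also sits on top of fetch/2, so telemetry added
  # to fetch/2 covers all higher-level read paths.
  def fetch_or_store(store, key, fun) do
    case fetch(store, key) do
      {:ok, value} ->
        {store, value}

      :error ->
        value = fun.()
        {Map.put(store, key, value), value}
    end
  end
end
```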

Comment on lines 23 to 24
def get(cache, key, opts \\ []) do
  case fetch(cache, key) do
@sasa1977 (Owner):

Get is implemented on top of fetch, so I think you'd need to do this in fetch; otherwise, ConCache.fetch_or_store won't work.

Comment on lines 475 to 486
hit = ConCache.Operations.telemetry_hit()
miss = ConCache.Operations.telemetry_miss()

:telemetry.attach_many(
  "test",
  [
    hit,
    miss
  ],
  &TestTelemetryHandler.handle_event/4,
  nil
)
@sasa1977 (Owner):

I'd put this in the setup block to reduce the noise from tests.
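A sketch of what moving the attachment into a setup block could look like. To keep the example runnable without the :telemetry dependency, a plain message-sending function stands in for the attached handler; in the real test, the `:telemetry.attach_many/4` call (paired with `:telemetry.detach/1` in `on_exit/1`) would live in the same place:

```elixir
ExUnit.start(autorun: false)

defmodule SetupSketchTest do
  use ExUnit.Case

  # Shared wiring lives in setup, so individual tests stay focused on
  # the behavior they assert rather than on handler bookkeeping.
  setup do
    test_pid = self()
    handler = fn event -> send(test_pid, {:telemetry_event, event}) end
    {:ok, handler: handler}
  end

  test "events arrive through the shared handler", %{handler: handler} do
    handler.([:con_cache, :hit])
    assert_receive {:telemetry_event, [:con_cache, :hit]}
  end
end
```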

end

test "get/2 emits telemetry events" do
assert {:ok, cache} = ConCache.start_link(ttl_check_interval: false)
@sasa1977 (Owner):

Drop the assert here, since we're not testing start_link.

Comment on lines 515 to 516
refute_receive _
ConCache.get_or_store(cache, :key, fn -> :value end)
@sasa1977 (Owner):

Separate with a blank line.

@aerosol (Contributor, Author) commented Mar 26, 2024

Thanks for the review, all sounds sensible. I've pushed some changes already, but some are still pending - I'll ping you when done.

@aerosol (Contributor, Author) commented Mar 26, 2024

I think I'm having trouble understanding the separation of concerns you want me to achieve.

Remove all get functions from Operations.
Support telemetry in Operations.fetch just like you did with get.
Implement get & co in ConCache on top of fetch.

This I think I've done, but "implement get & co in ConCache" is a bit unclear.
Whilst I understand that you can now do:

  # defmodule ConCache
  @spec get(t, key) :: value
  def get(cache_id, key) do
    case fetch(cache_id, key) do
      :error -> nil
      {:ok, value} -> value
    end
  end

Now, by "& co" I assume you mean get_or_store? Does that also mean we should get rid of Operations.get_or_store so that Operations exposes only isolated_get_or_store? That way, the telemetry event emission would have to be done in ConCache - or am I completely missing the point? Sorry, it's been a long day 😵‍💫

@sasa1977 (Owner):
I assume you mean get_or_store?

Yeah, I was thinking about all versions of get (including the dirty ones). If we have fetch (& co) in place, then all versions of get could use those functions, and just convert the result to value | nil, right?

@aerosol (Contributor, Author) commented Mar 27, 2024

Sorry, totally confused. What do we keep in Operations then?

If get_or_store is implemented in ConCache, using the local fetch (which delegates to Operations), and dirty_get_or_store is also implemented in ConCache (the actual logic this time), what's left in Operations? Just value/1 and isolated/3? What about checking ETS type validity?

That way Operations exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the ConCache module.

Basically, this is the boundary I'm having a hard time recognizing, I guess.

@aerosol (Contributor, Author) commented Mar 27, 2024

Another concern with all this is, removing fetch_or_store from Operations will be effectively a breaking change. Is that fine?

@sasa1977 (Owner):
What do we keep in Operations then?

Now that you ask, and looking at the code, I see little value in having this module :-D

I wasn't asking you to do such a merge, but I was proposing a part of that change, which in hindsight was needless and caused this confusion and PR bloat.

So let's instead avoid these refactorings and just focus on getting telemetry in. I can unify these two modules once this is merged.

So I propose reverting the last two commits, and then just make sure that telemetry is emitted for the fetch operations.

OK?

Another concern with all this is, removing fetch_or_store from Operations will be effectively a breaking change. Is that fine?

Operations is not a part of the ConCache public API. It's explicitly marked with @moduledoc false, which makes it internal.

@aerosol (Contributor, Author) commented Apr 3, 2024

So I propose reverting the last two commits, and then just make sure that telemetry is emitted for the fetch operations.

@sasa1977 done, sorry for the delay - life got in my way. Please hold off with merging until I confirm that this branch works as expected on my end.

BTW, not entirely related, but just wanted to mention it: there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes for a cache with the following settings:

ttl_check_interval: :timer.seconds(1),
global_ttl: :timer.minutes(30)

[screenshot: cache memory growth chart]

The app nodes are identical and the load balancing was verified to distribute resource access evenly, so it must be something in ConCache internals.

The growth continued indefinitely - one node had allocated around 68 GiB of memory, while the other two stayed in the 3 GiB ballpark. Unfortunately, I did not manage to inspect any specific process info before the node was simply restarted. I'll provide more information once/if this happens again.

@aerosol (Contributor, Author) commented Apr 3, 2024

Please hold off with merging until I confirm that this branch works as expected on my end.

I guess what's not really ideal with this approach is, the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with cache_id.

Another slight inconvenience is that currently it's the Operations module that defines the telemetry event shape - via the Operations.telemetry_hit() and Operations.telemetry_miss() functions. As you've pointed out, @moduledoc false makes that API effectively private. I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without manually duplicating the event lists and without worrying about breaking changes later on.

I recall Oban changing the telemetry event shape in a minor release at some point. That's what I'd like to avoid here from the client perspective - the telemetry is, in fact, an integration contract.

WDYT?
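A sketch of the proposed public module; the module name follows the suggestion above, but the event names are assumptions for illustration, not the names ConCache actually shipped:

```elixir
defmodule TelemetrySketch do
  @moduledoc """
  Public, documented home for the telemetry event names, so clients can
  attach handlers without hardcoding the event lists. Stands in for the
  proposed ConCache.Telemetry; event names here are hypothetical.
  """

  @hit [:con_cache, :hit]
  @miss [:con_cache, :miss]

  def hit, do: @hit
  def miss, do: @miss

  # Convenience for passing straight to :telemetry.attach_many/4.
  def events, do: [@hit, @miss]
end
```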

@sasa1977 (Owner) commented Apr 3, 2024

I guess what's not really ideal with this approach is, the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with cache_id.

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
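The propagation could look roughly like this; `emit/3` is a hypothetical stand-in for the real `:telemetry.execute/3` call, and the metadata keys are assumptions:

```elixir
defmodule MetadataSketch do
  # cache_id is threaded down from the ConCache entry point so the
  # low-level operation can include it alongside the cache struct.
  def emit(event, cache, cache_id) do
    metadata = %{cache: cache, cache_id: cache_id}
    # In the real code: :telemetry.execute(event, %{count: 1}, metadata)
    {event, %{count: 1}, metadata}
  end
end
```

A metrics handler can then group hit/miss counts by `metadata.cache_id` instead of digging through the `%ConCache{}` struct.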

I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without manually duplicating the event lists and without worrying about breaking changes later on.

You can do that as a part of this PR, to keep it focused. But my goal immediately after is to merge everything inside ConCache. There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes

This is worrying. A few possible explanations that come to mind:

  1. Normal behavior - TTL doesn't always guard against cache growth. It only ensures that items are purged after the last write (or read, if you turn on touch-on-read). If you frequently extend an item's TTL, the cache may grow indefinitely.
  2. Owner overload - expiry is done by the owner process (https://github.com/sasa1977/con_cache/blob/1.0.0/lib/con_cache/owner.ex). If the cache size is large, the owner might not be able to keep up. How many items do you have in the cache?
  3. Some actual bug in the expiry logic. This is always a possibility, but it wouldn't be my first bet, given that the library has been used in production, and it hasn't changed much in the past 10 years.

1 or 2 would be my first suspects. Just glancing at the code, I see one possible issue: the owner process performs isolated (i.e. locked) deletes for every item that needs to be expired. In a large cache this might be too slow, causing the owner to lag increasingly.

@aerosol (Contributor, Author) commented Apr 4, 2024

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.

Let me know if you approve the attempt at 3ccb5d1

There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

Fair enough, thanks! 👍

How many items do you have in the cache?

[screenshot: cache item count chart]

On a normal day it peaks close to 600K predictably - keys are {int(), int()} tuples, values are structs like:

%Plausible.ClickhouseSessionV2{
  __meta__: #Ecto.Schema.Metadata<:built, "sessions_v2">,
  hostname: "redacted.example.com",
  site_id: 0,
  user_id: 11111111111111,
  session_id: 11111111111111,
  start: ~N[2024-01-04 06:46:20],
  duration: 605,
  is_bounce: false,
  entry_page: "/",
  exit_page: "/",
  pageviews: 2,
  events: 2,
  sign: 1,
  "entry_meta.key": [],
  "entry_meta.value": [],
  utm_medium: nil,
  utm_source: nil,
  utm_campaign: nil,
  utm_content: nil,
  utm_term: nil,
  referrer: nil,
  referrer_source: nil,
  country_code: "XX",
  subdivision1_code: "XX-10",
  subdivision2_code: nil,
  city_geoname_id: 1111111,
  screen_size: "Desktop",
  operating_system: "Mac",
  operating_system_version: "10.15",
  browser: "Firefox",
  browser_version: "124.0",
  timestamp: ~N[2024-01-04 06:56:25],
  transferred_from: nil
}

On that unlucky node, over several days it peaked at 200M, despite the traffic being normal. I might be misinterpreting it, but I don't see any evictions - the growth is constant. And we do have quieter periods over the course of the day, so I'd assume the owner, even if overloaded, should be able to perform at least some deletions?

[screenshot: memory usage per node]

Anyway, I'll share more detailed info if/when this happens again.

@sasa1977 (Owner) commented Apr 4, 2024

I might be misinterpreting it, but I don't see any evictions - the growth is constant.

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see #73.

@sasa1977 (Owner) commented Apr 4, 2024

Let me know if you approve the attempt at 3ccb5d1

Looks good to me. Do you plan to make other changes or you're done?

@aerosol (Contributor, Author) commented Apr 4, 2024

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see #73.

I'm using plain values with global_ttl and the other app nodes were expiring just as expected (all are identical).

@aerosol (Contributor, Author) commented Apr 4, 2024

Looks good to me. Do you plan to make other changes or you're done?

I'm done. Made a test integration branch on my end and all is well. Thanks for being so responsive on this.

@sasa1977 merged commit 104362d into sasa1977:master Apr 8, 2024
@aerosol (Contributor, Author) commented Apr 16, 2024

Hey, thanks for merging. Any chance for a hexpm release?

@sasa1977 (Owner):
published 1.1.0

@sasa1977 mentioned this pull request Apr 16, 2024
aerosol added a commit to plausible/analytics that referenced this pull request Apr 17, 2024
Now that sasa1977/con_cache#76
is released, we don't have to use low-level operations
to emit hit/miss events.

This PR also wraps cache processes with
a function returning appropriate child specs lists.

Ideally each cache will have its own supervisor/child specs
going forward. This is an intermediate step in that direction.
aerosol added a commit to plausible/analytics that referenced this pull request Apr 29, 2024
* Rely on con_cache telemetry

Now that sasa1977/con_cache#76
is released, we don't have to use low-level operations
to emit hit/miss events.

This PR also wraps cache processes with
a function returning appropriate child specs lists.

Ideally each cache will have its own supervisor/child specs
going forward. This is an intermediate step in that direction.

* Update lib/plausible/application.ex

Co-authored-by: Adrian Gruntkowski <[email protected]>

* Declare caches without warmers with plain child specs

---------

Co-authored-by: Adrian Gruntkowski <[email protected]>
ruslandoga pushed a commit to plausible/analytics that referenced this pull request May 1, 2024