
Emit telemetry events for hit and miss #76

Merged

Conversation

@aerosol (Contributor) commented Mar 26, 2024

Tentative first pass on #75

Since get might be effectively called twice by get_or_store, we need a way to suppress emitting telemetry events on plain get.

I'll add documentation upon implementation approval.
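A minimal sketch of the suppression idea, assuming a hypothetical `telemetry: false` option (the option name and the `send/2`-based emission stub are assumptions for illustration, not ConCache's actual API):

```elixir
# Sketch only: a plain map stands in for the cache table, and send/2
# stands in for :telemetry.execute/3, so this runs without dependencies.
defmodule GetSketch do
  def get(store, key, opts \\ []) do
    result = Map.fetch(store, key)

    # get_or_store would call this with telemetry: false so the same
    # lookup is not reported twice (once by get, once by get_or_store).
    if Keyword.get(opts, :telemetry, true) do
      send(self(), {:telemetry, if(result == :error, do: :miss, else: :hit)})
    end

    case result do
      {:ok, value} -> value
      :error -> nil
    end
  end
end
```

With this shape, `GetSketch.get(%{a: 1}, :a)` returns `1` and reports a hit, while passing `telemetry: false` returns the value silently.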

@sasa1977 (Owner) left a comment

Good job! From what I can tell, this isn't covering fetch.

I think it's worth refactoring internals a bit, in the following way:

  1. Remove all get functions from Operations.
  2. Support telemetry in Operations.fetch just like you did with get.
  3. Implement get & co in ConCache on top of fetch.

As an optional extra, I think it would be worth removing fetch_or_store from Operations too, i.e. move this logic to ConCache. That way Operations exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the ConCache module.

Another extra is to add fetch to ConCache (not sure why I didn't do it in the first place). With that done, all get operations could be written on top of ConCache functions, i.e. they wouldn't have to interact directly with Operations.

WDYT?
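The proposed layering can be sketched as follows, with a plain map standing in for the ETS table (function shapes are illustrative, not the actual ConCache internals; the map version returns the updated store because maps, unlike ETS, are immutable):

```elixir
defmodule LayeringSketch do
  # The single low-level read that everything else is built on.
  def fetch(store, key), do: Map.fetch(store, key)

  # get/2 is just fetch/2 with the result converted to value | nil.
  def get(store, key) do
    case fetch(store, key) do
      {:ok, value} -> value
      :error -> nil
    end
  end

  # fetch_or_store/3 also sits on top of fetch/2, so telemetry added
  # to fetch/2 covers all higher-level read paths.
  def fetch_or_store(store, key, fun) do
    case fetch(store, key) do
      {:ok, value} ->
        {store, value}

      :error ->
        value = fun.()
        {Map.put(store, key, value), value}
    end
  end
end
```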

Comment on lines 23 to 24
def get(cache, key, opts \\ []) do
  case fetch(cache, key) do
@sasa1977 (Owner):

Get is implemented on top of fetch, so I think you'd need to do this in fetch; otherwise, ConCache.fetch_or_store won't work.

Comment on lines 475 to 486
hit = ConCache.Operations.telemetry_hit()
miss = ConCache.Operations.telemetry_miss()

:telemetry.attach_many(
  "test",
  [
    hit,
    miss
  ],
  &TestTelemetryHandler.handle_event/4,
  nil
)
@sasa1977 (Owner):

I'd put this in the setup block to reduce the noise from tests.
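A sketch of what moving the attachment into a setup block could look like. To keep the example runnable without the :telemetry dependency, a plain message-sending function stands in for the attached handler; in the real test, the `:telemetry.attach_many/4` call (paired with `:telemetry.detach/1` in `on_exit/1`) would live in the same place:

```elixir
ExUnit.start(autorun: false)

defmodule SetupSketchTest do
  use ExUnit.Case

  # Shared wiring lives in setup, so individual tests stay focused on
  # the behavior they assert rather than on handler bookkeeping.
  setup do
    test_pid = self()
    handler = fn event -> send(test_pid, {:telemetry_event, event}) end
    {:ok, handler: handler}
  end

  test "events arrive through the shared handler", %{handler: handler} do
    handler.([:con_cache, :hit])
    assert_receive {:telemetry_event, [:con_cache, :hit]}
  end
end
```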

end

test "get/2 emits telemetry events" do
assert {:ok, cache} = ConCache.start_link(ttl_check_interval: false)
@sasa1977 (Owner):

Drop the assert here, since we're not testing start_link.

Comment on lines 515 to 516
refute_receive _
ConCache.get_or_store(cache, :key, fn -> :value end)
@sasa1977 (Owner):

Separate with a blank line.

@aerosol (Contributor, Author) commented Mar 26, 2024

Thanks for the review, all sounds sensible. I've pushed some changes already, but some are still pending - I'll ping you when done.

@aerosol (Contributor, Author) commented Mar 26, 2024

I think I'm having trouble understanding the separation of concerns you want me to achieve.

Remove all get functions from Operations.
Support telemetry in Operations.fetch just like you did with get.
Implement get & co in ConCache on top of fetch.

This I think I've done, but "implement get & co in ConCache" is a bit unclear.
Whilst I understand that you can now do:

  # defmodule ConCache
  @spec get(t, key) :: value
  def get(cache_id, key) do
    case fetch(cache_id, key) do
      :error -> nil
      {:ok, value} -> value
    end
  end

Now, by "& co" I assume you mean get_or_store? Does that also mean we should get rid of Operations.get_or_store so that Operations exposes only isolated_get_or_store? That way, the telemetry event emission would have to be done in ConCache - or am I completely missing the point? Sorry, it's been a long day 😵‍💫

@sasa1977 (Owner):
I assume you mean get_or_store?

Yeah, I was thinking about all versions of get (including the dirty ones). If we have fetch (& co) in place, then all versions of get could use those functions, and just convert the result to value | nil, right?

@aerosol (Contributor, Author) commented Mar 27, 2024

Sorry, totally confused. What do we keep in Operations then?

If get_or_store is implemented in ConCache, using the local fetch (which delegates to Operations), and dirty_get_or_store is also implemented in ConCache (the actual logic this time), what's left in Operations? Just value/1 and isolated/3? What about checking ETS type validity?

That way Operations exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the ConCache module.

Basically, this is the boundary I'm having a hard time recognizing, I guess.

@aerosol (Contributor, Author) commented Mar 27, 2024

Another concern with all this is, removing fetch_or_store from Operations will be effectively a breaking change. Is that fine?

@sasa1977 (Owner):
What do we keep in Operations then?

Now that you ask, and looking at the code, I see little value in having this module :-D

I wasn't asking you to do such a merge, but I was proposing a part of that change, which in hindsight was needless and caused this confusion and PR bloat.

So let's instead avoid these refactorings and just focus on getting telemetry in. I can unify these two modules once this is merged.

So I propose reverting the last two commits, and then just make sure that telemetry is emitted for the fetch operations.

OK?

Another concern with all this is, removing fetch_or_store from Operations will be effectively a breaking change. Is that fine?

Operations is not a part of the ConCache public API. It's explicitly marked with @moduledoc false, which makes it internal.

@aerosol (Contributor, Author) commented Apr 3, 2024

So I propose reverting the last two commits, and then just make sure that telemetry is emitted for the fetch operations.

@sasa1977 done, sorry for the delay - life got in my way. Please hold off with merging until I confirm that this branch works as expected on my end.

BTW, not entirely related, but just wanted to mention it: there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes for a cache with the following settings:

ttl_check_interval: :timer.seconds(1),
global_ttl: :timer.minutes(30)

[screenshot: cache memory growth chart]

The app nodes are identical and the load balancing was verified to distribute resource access evenly, so it must be something in ConCache internals.

The growth continued indefinitely - one node had allocated around 68 GiB of memory, while the other two stayed in the 3 GiB ballpark. Unfortunately, I did not manage to inspect any specific process info before the node was simply restarted. I'll provide more information once/if this happens again.

@aerosol (Contributor, Author) commented Apr 3, 2024

Please hold off with merging until I confirm that this branch works as expected on my end.

I guess what's not really ideal with this approach is, the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with cache_id.

Another slight inconvenience is that currently it's the Operations module that defines the telemetry event shape - via the Operations.telemetry_hit() and Operations.telemetry_miss() functions. As you've pointed out, @moduledoc false makes that API effectively private. I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without manually duplicating the event lists and without worrying about breaking changes later on.

I recall Oban changing the telemetry event shape in a minor release at some point. That's what I'd like to avoid here from the client perspective - the telemetry is, in fact, an integration contract.

WDYT?
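A sketch of the proposed public module; the module name follows the suggestion above, but the event names are assumptions for illustration, not the names ConCache actually shipped:

```elixir
defmodule TelemetrySketch do
  @moduledoc """
  Public, documented home for the telemetry event names, so clients can
  attach handlers without hardcoding the event lists. Stands in for the
  proposed ConCache.Telemetry; event names here are hypothetical.
  """

  @hit [:con_cache, :hit]
  @miss [:con_cache, :miss]

  def hit, do: @hit
  def miss, do: @miss

  # Convenience for passing straight to :telemetry.attach_many/4.
  def events, do: [@hit, @miss]
end
```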

@sasa1977 (Owner) commented Apr 3, 2024

I guess what's not really ideal with this approach is, the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with cache_id.

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
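The propagation could look roughly like this; `emit/3` is a hypothetical stand-in for the real `:telemetry.execute/3` call, and the metadata keys are assumptions:

```elixir
defmodule MetadataSketch do
  # cache_id is threaded down from the ConCache entry point so the
  # low-level operation can include it alongside the cache struct.
  def emit(event, cache, cache_id) do
    metadata = %{cache: cache, cache_id: cache_id}
    # In the real code: :telemetry.execute(event, %{count: 1}, metadata)
    {event, %{count: 1}, metadata}
  end
end
```

A metrics handler can then group hit/miss counts by `metadata.cache_id` instead of digging through the `%ConCache{}` struct.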

I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without manually duplicating the event lists and without worrying about breaking changes later on.

You can do that as a part of this PR, to keep it focused. But my goal immediately after is to merge everything inside ConCache. There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes

This is worrying. A few possible explanations that come to mind:

  1. Normal behavior - TTL doesn't always guard against cache growth. It only ensures that items are purged after the last write (or read, if you turn on touch-on-read). If you frequently extend an item's TTL, the cache may grow indefinitely.
  2. Owner overload - expiry is done by the owner process (https://github.com/sasa1977/con_cache/blob/1.0.0/lib/con_cache/owner.ex). If the cache size is large, the owner might not be able to keep up. How many items do you have in the cache?
  3. Some actual bug in the expiry logic. This is always a possibility, but it wouldn't be my first bet, given that the library has been used in production, and it hasn't changed much in the past 10 years.

1 or 2 would be my first suspects. Just glancing at the code, I see one possible issue: the owner process performs isolated (i.e. locked) deletes for every item that needs to be expired. In a large cache this might be too slow, causing the owner to lag increasingly.

@aerosol (Contributor, Author) commented Apr 4, 2024

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.

Let me know if you approve the attempt at 3ccb5d1

There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

Fair enough, thanks! 👍

How many items do you have in the cache?

[screenshot: cache item count chart]

On a normal day it peaks close to 600K predictably - keys are {int(), int()} tuples, values are structs like:

%Plausible.ClickhouseSessionV2{
  __meta__: #Ecto.Schema.Metadata<:built, "sessions_v2">,
  hostname: "redacted.example.com",
  site_id: 0,
  user_id: 11111111111111,
  session_id: 11111111111111,
  start: ~N[2024-01-04 06:46:20],
  duration: 605,
  is_bounce: false,
  entry_page: "/",
  exit_page: "/",
  pageviews: 2,
  events: 2,
  sign: 1,
  "entry_meta.key": [],
  "entry_meta.value": [],
  utm_medium: nil,
  utm_source: nil,
  utm_campaign: nil,
  utm_content: nil,
  utm_term: nil,
  referrer: nil,
  referrer_source: nil,
  country_code: "XX",
  subdivision1_code: "XX-10",
  subdivision2_code: nil,
  city_geoname_id: 1111111,
  screen_size: "Desktop",
  operating_system: "Mac",
  operating_system_version: "10.15",
  browser: "Firefox",
  browser_version: "124.0",
  timestamp: ~N[2024-01-04 06:56:25],
  transferred_from: nil
}

On that unlucky node, over several days it peaked at 200M, despite the traffic being normal. I might be misinterpreting it, but I don't see any evictions - the growth is constant. And we do have quieter periods over the course of the day, so I'd assume the owner, even if overloaded, should be able to perform at least some deletions?

[screenshot: memory usage per node]

Anyway, I'll share more detailed info if/when this happens again.

@sasa1977 (Owner) commented Apr 4, 2024

I might be misinterpreting it, but I don't see any evictions - the growth is constant.

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see #73.

@sasa1977 (Owner) commented Apr 4, 2024

Let me know if you approve the attempt at 3ccb5d1

Looks good to me. Do you plan to make other changes or you're done?

@aerosol (Contributor, Author) commented Apr 4, 2024

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see #73.

I'm using plain values with global_ttl and the other app nodes were expiring just as expected (all are identical).

@aerosol (Contributor, Author) commented Apr 4, 2024

Looks good to me. Do you plan to make other changes or you're done?

I'm done. Made a test integration branch on my end and all is well. Thanks for being so responsive on this.

@sasa1977 merged commit 104362d into sasa1977:master Apr 8, 2024
@aerosol (Contributor, Author) commented Apr 16, 2024

Hey, thanks for merging. Any chance for a hexpm release?

@sasa1977 (Owner):
published 1.1.0

@sasa1977 mentioned this pull request Apr 16, 2024
aerosol added a commit to plausible/analytics that referenced this pull request Apr 17, 2024
Now that sasa1977/con_cache#76
is released, we don't have to use low-level operations
to emit hit/miss events.

This PR also wraps cache processes with
a function returning appropriate child specs lists.

Ideally each cache will have its own supervisor/child specs
going forward. This is an intermediate step in that direction.
aerosol added a commit to plausible/analytics that referenced this pull request Apr 29, 2024
* Rely on con_cache telemetry

Now that sasa1977/con_cache#76
is released, we don't have to use low-level operations
to emit hit/miss events.

This PR also wraps cache processes with
a function returning appropriate child specs lists.

Ideally each cache will have its own supervisor/child specs
going forward. This is an intermediate step in that direction.

* Update lib/plausible/application.ex

Co-authored-by: Adrian Gruntkowski <[email protected]>

* Declare caches without warmers with plain child specs

---------

Co-authored-by: Adrian Gruntkowski <[email protected]>
ruslandoga pushed a commit to plausible/analytics that referenced this pull request May 1, 2024