Emit telemetry events for hit and miss #76
Good job! From what I can tell, this isn't covering fetch.
I think it's worth refactoring internals a bit, in the following way:
- Remove all `get` functions from `Operations`.
- Support telemetry in `Operations.fetch` just like you did with `get`.
- Implement `get` & co in `ConCache` on top of fetch.
As an optional extra, I think it would be worth removing `fetch_or_store` from `Operations` too, i.e. move this logic to `ConCache`. That way `Operations` exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the `ConCache` module.
Another extra is to add `fetch` to `ConCache` (not sure why I didn't do it in the first place). With that done, all `get` operations could be written on top of `ConCache` functions, i.e. they wouldn't have to interact directly with `Operations`.
WDYT?
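A minimal sketch of the layering proposed above: the low-level `fetch` is the only place that emits hit/miss events, and `get` is written on top of it so it inherits the telemetry. The module `SketchCache`, its event names, and the plain ETS table are assumptions made for illustration; they are not ConCache's actual internals or event shape. Only `:telemetry.execute/3` and the `:ets` calls are real APIs.

```elixir
# Sketch only: SketchCache and its event names are hypothetical, not ConCache's API.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule SketchCache do
  @hit [:sketch_cache, :stats, :hit]
  @miss [:sketch_cache, :stats, :miss]

  # Creates an unnamed public ETS table and returns its identifier.
  def new(name), do: :ets.new(name, [:set, :public])

  def put(table, key, value) do
    :ets.insert(table, {key, value})
    :ok
  end

  # Low-level lookup: the single place where hit/miss events are emitted.
  def fetch(table, key) do
    case :ets.lookup(table, key) do
      [{^key, value}] ->
        :telemetry.execute(@hit, %{}, %{key: key})
        {:ok, value}

      [] ->
        :telemetry.execute(@miss, %{}, %{key: key})
        :error
    end
  end

  # Higher-level helper built on fetch/2, so it needs no telemetry code of its own.
  def get(table, key) do
    case fetch(table, key) do
      {:ok, value} -> value
      :error -> nil
    end
  end
end
```

With this shape, any other helper written on top of `fetch/2` reports hits and misses without duplicating the emission logic.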
lib/con_cache/operations.ex
Outdated
def get(cache, key, opts \\ []) do
  case fetch(cache, key) do
Get is implemented on top of fetch, so I think you'd need to do this in `fetch`. Because otherwise, `ConCache.fetch_or_store` won't work.
test/con_cache_test.exs
Outdated
hit = ConCache.Operations.telemetry_hit()
miss = ConCache.Operations.telemetry_miss()

:telemetry.attach_many(
  "test",
  [
    hit,
    miss
  ],
  &TestTelemetryHandler.handle_event/4,
  nil
)
I'd put this in the `setup` block to reduce the noise from tests.
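One way the suggestion could look, as a hedged sketch: the handler is attached once in `setup` and detached in `on_exit`, and it forwards every event to the test process so tests can use `assert_receive`. The event names, the module name, and the `forward_event/4` helper are hypothetical; `:telemetry.attach_many/4`, `:telemetry.detach/1`, and the ExUnit callbacks are real APIs.

```elixir
# Sketch only: event names and the forwarding handler are assumptions.
Mix.install([{:telemetry, "~> 1.0"}])
ExUnit.start()

defmodule TelemetrySetupSketchTest do
  use ExUnit.Case

  # Hypothetical event names standing in for the cache's hit/miss events.
  @events [[:sketch, :hit], [:sketch, :miss]]

  # Handler: forwards each event to the pid passed as handler config.
  def forward_event(event, measurements, metadata, test_pid) do
    send(test_pid, {:telemetry, event, measurements, metadata})
  end

  setup context do
    # A per-test handler id avoids clashes between tests.
    handler_id = "handler-#{context.test}"
    :ok = :telemetry.attach_many(handler_id, @events, &__MODULE__.forward_event/4, self())
    on_exit(fn -> :telemetry.detach(handler_id) end)
    :ok
  end

  test "a hit event reaches the test process" do
    :telemetry.execute([:sketch, :hit], %{}, %{cache_id: :demo})
    assert_receive {:telemetry, [:sketch, :hit], %{}, %{cache_id: :demo}}
  end
end
```

Individual tests then contain only the cache calls and the assertions on received messages.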
test/con_cache_test.exs
Outdated
end

test "get/2 emits telemetry events" do
  assert {:ok, cache} = ConCache.start_link(ttl_check_interval: false)
Drop the `assert` here, since we're not testing `start_link`.
test/con_cache_test.exs
Outdated
refute_receive _
ConCache.get_or_store(cache, :key, fn -> :value end)
Separate with a blank line.
Thanks for the review, all sounds sensible. Pushed some changes already but some are still pending - I'll ping you when done.
I think I'm having trouble understanding the separation of concerns you want me to achieve.
This I think I've done, but "implement get & co in ConCache" is a bit unclear.
# defmodule ConCache
@spec get(t, key) :: value
def get(cache_id, key) do
  case fetch(cache_id, key) do
    :error -> nil
    {:ok, value} -> value
  end
end
now by "& co" I assume you mean
Yeah, I was thinking about all versions of get (including the dirty ones). If we have |
Sorry, totally confused. What do we keep in If
Basically this is the boundary I'm having hard time to recognize I guess. |
Another concern with all this is, removing |
Now that you ask, and looking at the code, I see little value in having this module :-D I wasn't asking you to do such merge, but I was proposing a part of that change, which in hindsight was needless and caused this confusion and PR bloat. So let's instead avoid these refactorings and just focus on getting telemetry in. I can unify these two modules once this is merged. So I propose reverting the last two commits, and then just make sure that telemetry is emitted for the fetch operations. OK?
@sasa1977 done, sorry for the delay - life got in my way. Please hold off on merging until I confirm that this branch works as expected on my end.
BTW, not entirely related, but just wanted to mention it: there must be some bug in global TTL eviction. We've observed uncontrolled size growth on one of the application nodes for a cache with the following settings:
ttl_check_interval: :timer.seconds(1),
global_ttl: :timer.minutes(30)
The app nodes are identical and we verified that load balancing distributes resource access evenly, so it must be something in ConCache internals. The growth continued indefinitely - one node had allocated around 68GiB of memory, while the other two stayed in the 3GiB ballpark. Unfortunately, I did not manage to inspect any specific process info before the node was simply restarted. I'll provide more information once/if this happens again.
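For reference, a sketch of how a cache with the settings quoted above might be started under a supervisor and how its size could be sampled while investigating growth. The cache name `:sketch_cache` is hypothetical, and the dependency version is an assumption based on the release mentioned later in this thread. `ConCache.ets/1` and `:ets.info/2` are used here only to read the item count and the memory (in words) of the underlying table.

```elixir
# Sketch only: the cache name is hypothetical; the TTL options mirror the report above.
Mix.install([{:con_cache, "~> 1.1"}])

children = [
  {ConCache,
   [
     name: :sketch_cache,
     ttl_check_interval: :timer.seconds(1),
     global_ttl: :timer.minutes(30)
   ]}
]

{:ok, _supervisor} = Supervisor.start_link(children, strategy: :one_for_one)

# Sample the size of the underlying ETS table; logging this periodically
# would show whether expired items are actually being evicted.
table = ConCache.ets(:sketch_cache)
IO.inspect(:ets.info(table, :size), label: "items")
IO.inspect(:ets.info(table, :memory), label: "memory (words)")
```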
I guess what's not really ideal with this approach is, the telemetry arrives with Another slight inconvenience is, currently it's the I recall Oban changing the telemetry event shape under a minor release at some point. That's what I'd like to avoid from the client perspective here - the telemetry is in fact an integration contract. WDYT? |
Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
You can do that as a part of this PR, to keep it focused. But my goal immediately after is to merge everything inside
This is worrying. A few possible explanations that come to mind:
1 or 2 would be my first suspects. Just glancing at the code, I see one possible issue: the owner process performs isolated (i.e. locked) deletes for every item that needs to be expired. In a large cache this might be too slow, causing the owner to lag increasingly. |
Let me know if you approve the attempt at 3ccb5d1
Fair enough, thanks! 👍
On a normal day it peaks close to 600K predictably - keys are
On that unlucky node over several days, it peaked at 200M, despite the traffic being as usual. I might be misinterpreting it, but I don't see any evictions - the growth is constant. And we do have quieter periods over the course of the day - so I assume the Owner, even if overloaded, should be able to perform some deletions at least? Anyway, I'll share more detailed info if/when this happens again.
Just to rule out the obvious cause, are you inserting plain values, or are you using
Looks good to me. Do you plan to make other changes or you're done?
I'm using plain values with
I'm done. Made a test integration branch on my end and all is well. Thanks for being so responsive on this.
Hey, thanks for merging. Any chance for a hexpm release?
published 1.1.0
Now that sasa1977/con_cache#76 is released, we don't have to use low-level operations to emit hit/miss events. This PR also wraps cache processes with a function returning appropriate child specs lists. Ideally each cache will have its own supervisor/child specs going forward. This is an intermediate step in that direction.
* Rely on con_cache telemetry Now that sasa1977/con_cache#76 is released, we don't have to use low-level operations to emit hit/miss events. This PR also wraps cache processes with a function returning appropriate child specs lists. Ideally each cache will have its own supervisor/child specs going forward. This is an intermediate step in that direction. * Update lib/plausible/application.ex Co-authored-by: Adrian Gruntkowski <[email protected]> * Declare caches without warmers with plain child specs --------- Co-authored-by: Adrian Gruntkowski <[email protected]>
Tentative first pass on #75
Since `get` might effectively be called twice by `get_or_store`, we need a way to suppress emitting telemetry events on plain `get`.
I'll add documentation upon implementation approval.
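To illustrate the double-counting concern, here is a rough sketch in which `get_or_store` looks a key up twice and silences the second lookup through an option. The module `SuppressSketch`, the `:telemetry` option name, and the event names are hypothetical and do not describe what the PR ended up implementing; the sketch also skips the locking a real cache would do and treats a stored `nil` as a miss.

```elixir
# Sketch only: module, option name, and event names are assumptions.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule SuppressSketch do
  @hit [:suppress_sketch, :hit]
  @miss [:suppress_sketch, :miss]

  def new, do: :ets.new(:suppress_sketch, [:set, :public])

  # Plain lookup; pass `telemetry: false` to skip emitting events.
  def get(table, key, opts \\ []) do
    emit? = Keyword.get(opts, :telemetry, true)

    case :ets.lookup(table, key) do
      [{^key, value}] ->
        if emit?, do: :telemetry.execute(@hit, %{}, %{key: key})
        value

      [] ->
        if emit?, do: :telemetry.execute(@miss, %{}, %{key: key})
        nil
    end
  end

  def get_or_store(table, key, fun) do
    case get(table, key) do
      nil ->
        # A real cache would acquire a lock here and check again;
        # the second check is silenced so the miss is counted once.
        case get(table, key, telemetry: false) do
          nil ->
            value = fun.()
            :ets.insert(table, {key, value})
            value

          value ->
            value
        end

      value ->
        value
    end
  end
end
```

In this sketch a first `get_or_store` call for a new key emits exactly one miss event, and later calls for the same key emit one hit each.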