-
Notifications
You must be signed in to change notification settings - Fork 335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add extra logging for makeTimeoutPromise #2056
Conversation
This log adds context on what the output gate was locked on.
// This really only protects against total hangs. Lowering the timeout drastically is risky, | ||
// since low timeouts can spuriously fire when under heavy CPU load, failing requests that | ||
// would otherwise succeed. | ||
auto timeout = 30 * kj::SECONDS; | ||
co_await timerChannel.afterLimitTimeout(timeout); | ||
KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I avoid this extra log by putting the operationInfo into the exception without exposing it to the user? Is there some logic in the jsg errors that strips out information before the jsg.Error substring or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of anything like that, FWIW.
We can either make make sure the operationInfo
strings are ok to expose to users, or keep this log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I know about is
workerd/src/workerd/jsg/exception.c++
Line 11 in 7bb684f
kj::StringPtr stripRemoteExceptionPrefix(kj::StringPtr internalMessage) { |
flushPromise = flushPromise.catch_([this](kj::Exception&& e) { | ||
return gate.lockWhile(kj::Promise<void>(kj::mv(e))); | ||
flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) { | ||
return gate.lockWhile(kj::Promise<void>(kj::mv(e)), kj::mv(opInfo)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't necessarily tell us which operation was stuck, just which operation was the first one to be scheduled for the current flush (since additional operations could also be added and later included in this flush as well.
This change still may help, we just need to be cognizant when interpreting the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If additional ops get scheduled later, wouldn't they need their own lockWhile? I'd think that each op needs its own lockwhile, otherwise I don't really understand how we ensure that we're waiting on every operation that got scheduled, rather than just on the first and releasing the output gate when the first op is completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that that the else branch of this if statement is what would put in place a lockWhile for subsequent ops while a flush is already scheduled: https://github.com/cloudflare/workerd/pull/2056/files#diff-52b552cc9d6477516b49e970aa23de236c7a1f7327b09c243c336445cbb2e18eR2171
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ActorCache naturally batches writes together. One possible (and common!) order of events:
- User calls put(), which calls ensureFlushScheduled(), which does the
flushPromise = lastFlush.addBranch().attach(...).then([]() { flushImpl() })
call above, scheduling a flush for later. We thengate.lockWhile()
thatflushPromise
(at least in the branch a couple linesbelow, I guess I put this comment on the wrong line) - That flush doesn't happen yet, either because
lastFlush()
hasn't finished yet or because the user's code keeps synchronously running. - The user does more operations (setAlarm(), delete(), txn(), or whatever). Each time
ensureFlushScheduled
is called on their behalf, nothing is done becauseflushScheduled
andflushScheduledWithOutputGate
are already true. - The
lastFlush
promise finally resolves, soflushImpl
is called. flushImpl
grabs all dirty keys / alarms for writing and flushes them together.- The flush suceeds, releasing the output gate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the !flushScheduledWithOutputGate probably prevents that from functioning how I'm describing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That else
branch applies when a flush has been scheduled, but the flush was scheduled without the output gate and the current operation wants to use the output gate.
Additional operations may attempt to ensureFlushScheduled
later, but they might skip all the branches because a flush is already scheduled with the output gate.
Edit: Annnnnd I refresh the page and there's a bunch of new comments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be better to have some kj::Maybe<kj::Exception>
that we populate inside ActorCache::flushImplUsingTxn()
(and perhaps the other functions we call when we go to flush different dirty entries) if any of our individual RPCs to storage throw, then when we're in the catch branch of flushImpl()
here we can KJ_LOG
the exception before giving the user the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see, I think what would be most useful to log here would be
- what flush we're waiting on (since it seems like multiple flushes can be enqueued at once)
- what's in the flush we're waiting on
I'll see if it's feasible to log that info instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MellowYarker that sounds interesting too. I'll have a look at putting up a PR for that. Not sure what catch branch you're referring to though, don't immediately see a catch there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it's the errorhandler of that .then()
@@ -2156,19 +2156,19 @@ void ActorCache::ensureFlushScheduled(const WriteOptions& options) { | |||
if (options.allowUnconfirmed) { | |||
// Don't apply output gate. But, if an exception is thrown, we still want to break the gate, | |||
// so arrange for that. | |||
flushPromise = flushPromise.catch_([this](kj::Exception&& e) { | |||
return gate.lockWhile(kj::Promise<void>(kj::mv(e))); | |||
flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's harmless, but AFAIK there's no need to mv
a kj::LiteralStringConst
. Copying it should be free.
@@ -107,7 +107,7 @@ kj::Maybe<kj::Promise<void>> ActorSqlite::ExplicitTxn::commit() { | |||
// We committed the root transaction, so it's time to signal any replication layer and lock | |||
// the output gate in the meantime. | |||
actorSqlite.commitTasks.add( | |||
actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback())); | |||
actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback(), "explicit txn commit"_kjc)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but it'd be nice if the two in this file made clear they were for sqlite txn/commit
// This really only protects against total hangs. Lowering the timeout drastically is risky, | ||
// since low timeouts can spuriously fire when under heavy CPU load, failing requests that | ||
// would otherwise succeed. | ||
auto timeout = 30 * kj::SECONDS; | ||
co_await timerChannel.afterLimitTimeout(timeout); | ||
KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of anything like that, FWIW.
We can either make make sure the operationInfo
strings are ok to expose to users, or keep this log.
@@ -702,7 +702,7 @@ class ActorCache final: public ActorCacheInterface { | |||
kj::Promise<kj::Maybe<Value>> getImpl(kj::Own<Entry> entry, ReadOptions options); | |||
|
|||
// Ensure that we will flush dirty entries soon. | |||
void ensureFlushScheduled(const WriteOptions& options); | |||
void ensureFlushScheduled(const WriteOptions& options, kj::LiteralStringConst operationInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh... LiteralStringConst
wasn't actually meant to be used this way (it's meant purely as a helper for constructing ConstString
). But I guess it ends up being a convenient way to force people to pass a string literal, whereas accepting StringPtr
could allow them to pass a string whose lifetime isn't sufficiently long.
I guess that seems fine then. Just hadn't thought of it that way previously.
Now I kind of wish it was just called LiteralString
instead of LiteralStringConst
though.
Going to close b/c I understand this to be of limited usefulness because of : #2056 (comment) |
This log adds context on what the output gate was locked on.