
add extra logging for makeTimeoutPromise #2056

Closed · wants to merge 1 commit

Conversation

Frederik-Baetens (Contributor)

This log adds context on what the output gate was locked on.
// This really only protects against total hangs. Lowering the timeout drastically is risky,
// since low timeouts can spuriously fire when under heavy CPU load, failing requests that
// would otherwise succeed.
auto timeout = 30 * kj::SECONDS;
co_await timerChannel.afterLimitTimeout(timeout);
KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo);
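For orientation, here is a self-contained sketch of the pattern this diff extends, using a plain kj::Timer rather than workerd's TimerChannel; the function name and error type are illustrative, not the actual workerd API:

#include <kj/async.h>
#include <kj/debug.h>
#include <kj/string.h>
#include <kj/timer.h>

// Race a gated operation against a generous timeout. On timeout, log which
// operation the output gate was locked on, then fail the wait.
kj::Promise<void> lockWhileWithLogging(kj::Timer& timer, kj::Promise<void> gated,
                                       kj::LiteralStringConst operationInfo) {
  auto timeout = timer.afterDelay(30 * kj::SECONDS)
      .then([operationInfo]() -> kj::Promise<void> {
    KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo);
    return KJ_EXCEPTION(OVERLOADED, "output gate lock timed out");
  });
  return gated.exclusiveJoin(kj::mv(timeout));
}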
Frederik-Baetens (Contributor Author)

Can I avoid this extra log by putting the operationInfo into the exception without exposing it to the user? Is there some logic in the jsg errors that strips out information before the jsg.Error substring or something like that?

Member

I'm not aware of anything like that, FWIW.

We can either make sure the operationInfo strings are OK to expose to users, or keep this log.

Contributor

The only thing I know about is

  kj::StringPtr stripRemoteExceptionPrefix(kj::StringPtr internalMessage)

which only strips off the "remote exception" prefix.
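Its behavior is presumably roughly this (a sketch reconstructed from that description; the exact prefix string is an assumption, not verified against the actual implementation):

kj::StringPtr stripRemoteExceptionPrefix(kj::StringPtr internalMessage) {
  const kj::StringPtr prefix = "remote exception: ";
  // Exceptions that crossed an RPC boundary arrive with this prefix prepended
  // (possibly multiple times); peel it off and change nothing else.
  while (internalMessage.startsWith(prefix)) {
    internalMessage = internalMessage.slice(prefix.size());
  }
  return internalMessage;
}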

-  flushPromise = flushPromise.catch_([this](kj::Exception&& e) {
-    return gate.lockWhile(kj::Promise<void>(kj::mv(e)));
+  flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) {
+    return gate.lockWhile(kj::Promise<void>(kj::mv(e)), kj::mv(opInfo));
Member

This won't necessarily tell us which operation was stuck, just which operation was the first one to be scheduled for the current flush (since additional operations could also be added and later included in this flush as well).

This change may still help; we just need to be cognizant of that when interpreting the logs.

Frederik-Baetens (Contributor Author)

If additional ops get scheduled later, wouldn't they need their own lockWhile? I'd think each op needs its own lockWhile; otherwise, I don't really understand how we ensure that we're waiting on every operation that got scheduled, rather than just waiting on the first one and releasing the output gate when it completes.

Frederik-Baetens (Contributor Author)

I thought that the else branch of this if statement is what would put in place a lockWhile for subsequent ops while a flush is already scheduled: https://github.com/cloudflare/workerd/pull/2056/files#diff-52b552cc9d6477516b49e970aa23de236c7a1f7327b09c243c336445cbb2e18eR2171

Member

ActorCache naturally batches writes together. One possible (and common!) order of events (see the toy sketch after this list):

  1. User calls put(), which calls ensureFlushScheduled(), which does the flushPromise = lastFlush.addBranch().attach(...).then([]() { flushImpl() }) call above, scheduling a flush for later. We then gate.lockWhile() that flushPromise (at least in the branch a couple of lines below; I guess I put this comment on the wrong line).
  2. That flush doesn't happen yet, either because lastFlush hasn't finished yet or because the user's code keeps synchronously running.
  3. The user does more operations (setAlarm(), delete(), txn(), or whatever). Each time ensureFlushScheduled is called on their behalf, nothing is done because flushScheduled and flushScheduledWithOutputGate are already true.
  4. The lastFlush promise finally resolves, so flushImpl is called.
  5. flushImpl grabs all dirty keys/alarms for writing and flushes them together.
  6. The flush succeeds, releasing the output gate.
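A toy illustration of the consequence for this PR, assuming the flag behavior described in step 3 (this is not workerd code; everything except the flag name is made up): once a gated flush is scheduled, later operations return early, so only the first operation's info is ever recorded.

#include <kj/debug.h>
#include <kj/string.h>

using namespace kj;  // for the _kjc literal suffix

struct ToyCache {
  bool flushScheduledWithOutputGate = false;
  kj::StringPtr gateLockInfo;

  void ensureFlushScheduled(kj::LiteralStringConst operationInfo) {
    if (flushScheduledWithOutputGate) {
      return;  // step 3 above: a flush is already pending; this op's info is dropped
    }
    flushScheduledWithOutputGate = true;
    gateLockInfo = operationInfo;  // only the op that schedules the flush lands here
  }
};

int main() {
  ToyCache cache;
  cache.ensureFlushScheduled("put"_kjc);       // schedules the flush, info kept
  cache.ensureFlushScheduled("setAlarm"_kjc);  // batched into it, info dropped
  KJ_DBG(cache.gateLockInfo);                  // prints "put"
  return 0;
}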

Frederik-Baetens (Contributor Author)

Hmm, the !flushScheduledWithOutputGate probably prevents that from functioning how I'm describing.

MellowYarker (Contributor), Apr 24, 2024

That else branch applies when a flush has been scheduled, but the flush was scheduled without the output gate and the current operation wants to use the output gate.

Additional operations may attempt to ensureFlushScheduled later, but they might skip all the branches because a flush is already scheduled with the output gate.

Edit: Annnnnd I refresh the page and there's a bunch of new comments :)

MellowYarker (Contributor)

Maybe it would be better to have some kj::Maybe<kj::Exception> that we populate inside ActorCache::flushImplUsingTxn() (and perhaps the other functions we call when we go to flush different dirty entries) if any of our individual RPCs to storage throw; then, when we're in the catch branch of flushImpl() here, we can KJ_LOG the exception before giving the user the error?
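A rough sketch of that suggestion (the member name, the rpcPromise variable, and the exact wiring are assumptions, not actual workerd code):

// Hypothetical new member on ActorCache:
//   kj::Maybe<kj::Exception> lastFlushException;

// Inside flushImplUsingTxn(), capture any storage RPC failure before rethrowing:
rpcPromise = rpcPromise.catch_([this](kj::Exception&& e) -> kj::Promise<void> {
  lastFlushException = kj::cp(e);
  return kj::Promise<void>(kj::mv(e));
});

// Then in flushImpl()'s error handler, log it before handing the user the
// generic storage error:
KJ_IF_SOME(e, lastFlushException) {
  KJ_LOG(ERROR, "storage RPC failed during flush", e);
}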

Frederik-Baetens (Contributor Author)

Hmm, I see. I think what would be most useful to log here would be:

  • what flush we're waiting on (since it seems like multiple flushes can be enqueued at once)
  • what's in the flush we're waiting on

I'll see if it's feasible to log that info instead.

Frederik-Baetens (Contributor Author), Apr 24, 2024

@MellowYarker that sounds interesting too. I'll have a look at putting up a PR for that. Not sure what catch branch you're referring to, though; I don't immediately see a catch there?

Frederik-Baetens (Contributor Author)

Ah, it's the error handler of that .then().
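(For readers less familiar with KJ promises: the "catch branch" is the optional second argument to .then(), an error handler alongside the success callback. Schematically, not the actual workerd line:)

flushPromise = flushPromise.then([]() {
  // success path: the flush completed
}, [](kj::Exception&& e) {
  // error handler ("catch branch"): runs if the flush failed
  KJ_LOG(ERROR, "flush failed", e);
  kj::throwFatalException(kj::mv(e));
});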

@@ -2156,19 +2156,19 @@ void ActorCache::ensureFlushScheduled(const WriteOptions& options) {
   if (options.allowUnconfirmed) {
     // Don't apply output gate. But, if an exception is thrown, we still want to break the gate,
     // so arrange for that.
-    flushPromise = flushPromise.catch_([this](kj::Exception&& e) {
-      return gate.lockWhile(kj::Promise<void>(kj::mv(e)));
+    flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) {
Member

It's harmless, but AFAIK there's no need to mv a kj::LiteralStringConst. Copying it should be free.
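That is, the capture could presumably just copy (same code as the diff above, with the moves of opInfo dropped):

flushPromise = flushPromise.catch_([this, operationInfo](kj::Exception&& e) {
  return gate.lockWhile(kj::Promise<void>(kj::mv(e)), operationInfo);
});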

@@ -107,7 +107,7 @@ kj::Maybe<kj::Promise<void>> ActorSqlite::ExplicitTxn::commit() {
   // We committed the root transaction, so it's time to signal any replication layer and lock
   // the output gate in the meantime.
   actorSqlite.commitTasks.add(
-      actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback()));
+      actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback(), "explicit txn commit"_kjc));
Member

Nit, but it'd be nice if the two strings in this file made it clear they were for a sqlite txn/commit.


@@ -702,7 +702,7 @@ class ActorCache final: public ActorCacheInterface {
   kj::Promise<kj::Maybe<Value>> getImpl(kj::Own<Entry> entry, ReadOptions options);

   // Ensure that we will flush dirty entries soon.
-  void ensureFlushScheduled(const WriteOptions& options);
+  void ensureFlushScheduled(const WriteOptions& options, kj::LiteralStringConst operationInfo);
Member

Huh... LiteralStringConst wasn't actually meant to be used this way (it's meant purely as a helper for constructing ConstString). But I guess it ends up being a convenient way to force people to pass a string literal, whereas accepting StringPtr could allow them to pass a string whose lifetime isn't sufficiently long.

I guess that seems fine then. Just hadn't thought of it that way previously.

Now I kind of wish it was just called LiteralString instead of LiteralStringConst though.
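To illustrate the lifetime point (the values here are made up; only the literal form compiles against the new signature):

kj::String dynamic = kj::str("op #", 42);
// ensureFlushScheduled(options, dynamic);  // if the parameter were StringPtr, this
//                                          // would compile but could dangle later
ensureFlushScheduled(options, "explicit txn commit"_kjc);  // literal: static lifetime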

Frederik-Baetens (Contributor Author)
Going to close, because I understand this to be of limited usefulness due to #2056 (comment).
