
add extra logging for makeTimeoutPromise #2056

Closed · wants to merge 1 commit

Conversation

Frederik-Baetens (Contributor)

This log adds context on what the output gate was locked on.
// This really only protects against total hangs. Lowering the timeout drastically is risky,
// since low timeouts can spuriously fire when under heavy CPU load, failing requests that
// would otherwise succeed.
auto timeout = 30 * kj::SECONDS;
co_await timerChannel.afterLimitTimeout(timeout);
KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo);
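For orientation, here is a self-contained sketch of the pattern this diff extends, using a plain kj::Timer rather than workerd's TimerChannel; the function name and error type are illustrative, not the actual workerd API:

#include <kj/async.h>
#include <kj/debug.h>
#include <kj/string.h>
#include <kj/timer.h>

// Race a gated operation against a generous timeout. On timeout, log which
// operation the output gate was locked on, then fail the wait.
kj::Promise<void> lockWhileWithLogging(kj::Timer& timer, kj::Promise<void> gated,
                                       kj::LiteralStringConst operationInfo) {
  auto timeout = timer.afterDelay(30 * kj::SECONDS)
      .then([operationInfo]() -> kj::Promise<void> {
    KJ_LOG(ERROR, "Durable Object storage operation exceeded timeout", operationInfo);
    return KJ_EXCEPTION(OVERLOADED, "output gate lock timed out");
  });
  return gated.exclusiveJoin(kj::mv(timeout));
}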
Frederik-Baetens (Contributor Author)

Can I avoid this extra log by putting the operationInfo into the exception without exposing it to the user? Is there some logic in the jsg errors that strips out information before the jsg.Error substring or something like that?

Member

I'm not aware of anything like that, FWIW.

We can either make sure the operationInfo strings are OK to expose to users, or keep this log.

Contributor

The only thing I know about is

  kj::StringPtr stripRemoteExceptionPrefix(kj::StringPtr internalMessage)

which only strips off the "remote exception" prefix.
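Its behavior is presumably roughly this (a sketch reconstructed from that description; the exact prefix string is an assumption, not verified against the actual implementation):

kj::StringPtr stripRemoteExceptionPrefix(kj::StringPtr internalMessage) {
  const kj::StringPtr prefix = "remote exception: ";
  // Exceptions that crossed an RPC boundary arrive with this prefix prepended
  // (possibly multiple times); peel it off and change nothing else.
  while (internalMessage.startsWith(prefix)) {
    internalMessage = internalMessage.slice(prefix.size());
  }
  return internalMessage;
}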

-  flushPromise = flushPromise.catch_([this](kj::Exception&& e) {
-    return gate.lockWhile(kj::Promise<void>(kj::mv(e)));
+  flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) {
+    return gate.lockWhile(kj::Promise<void>(kj::mv(e)), kj::mv(opInfo));
Member

This won't necessarily tell us which operation was stuck, just which operation was the first one to be scheduled for the current flush (since additional operations could also be added and later included in this flush as well).

This change may still help; we just need to be cognizant of that when interpreting the logs.

Frederik-Baetens (Contributor Author)

If additional ops get scheduled later, wouldn't they need their own lockWhile? I'd think each op needs its own lockWhile; otherwise, I don't really understand how we ensure that we're waiting on every operation that got scheduled, rather than just waiting on the first one and releasing the output gate when it completes.

Frederik-Baetens (Contributor Author)

I thought that the else branch of this if statement is what would put in place a lockWhile for subsequent ops while a flush is already scheduled: https://github.com/cloudflare/workerd/pull/2056/files#diff-52b552cc9d6477516b49e970aa23de236c7a1f7327b09c243c336445cbb2e18eR2171

Member

ActorCache naturally batches writes together. One possible (and common!) order of events (see the toy sketch after this list):

  1. User calls put(), which calls ensureFlushScheduled(), which does the flushPromise = lastFlush.addBranch().attach(...).then([]() { flushImpl() }) call above, scheduling a flush for later. We then gate.lockWhile() that flushPromise (at least in the branch a couple of lines below; I guess I put this comment on the wrong line).
  2. That flush doesn't happen yet, either because lastFlush hasn't finished yet or because the user's code keeps synchronously running.
  3. The user does more operations (setAlarm(), delete(), txn(), or whatever). Each time ensureFlushScheduled is called on their behalf, nothing is done because flushScheduled and flushScheduledWithOutputGate are already true.
  4. The lastFlush promise finally resolves, so flushImpl is called.
  5. flushImpl grabs all dirty keys/alarms for writing and flushes them together.
  6. The flush succeeds, releasing the output gate.
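A toy illustration of the consequence for this PR, assuming the flag behavior described in step 3 (this is not workerd code; everything except the flag name is made up): once a gated flush is scheduled, later operations return early, so only the first operation's info is ever recorded.

#include <kj/debug.h>
#include <kj/string.h>

using namespace kj;  // for the _kjc literal suffix

struct ToyCache {
  bool flushScheduledWithOutputGate = false;
  kj::StringPtr gateLockInfo;

  void ensureFlushScheduled(kj::LiteralStringConst operationInfo) {
    if (flushScheduledWithOutputGate) {
      return;  // step 3 above: a flush is already pending; this op's info is dropped
    }
    flushScheduledWithOutputGate = true;
    gateLockInfo = operationInfo;  // only the op that schedules the flush lands here
  }
};

int main() {
  ToyCache cache;
  cache.ensureFlushScheduled("put"_kjc);       // schedules the flush, info kept
  cache.ensureFlushScheduled("setAlarm"_kjc);  // batched into it, info dropped
  KJ_DBG(cache.gateLockInfo);                  // prints "put"
  return 0;
}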

Frederik-Baetens (Contributor Author)

Hmm, the !flushScheduledWithOutputGate probably prevents that from functioning how I'm describing.

MellowYarker (Contributor), Apr 24, 2024

That else branch applies when a flush has been scheduled, but the flush was scheduled without the output gate and the current operation wants to use the output gate.

Additional operations may attempt to ensureFlushScheduled later, but they might skip all the branches because a flush is already scheduled with the output gate.

Edit: Annnnnd I refresh the page and there's a bunch of new comments :)

MellowYarker (Contributor)

Maybe it would be better to have some kj::Maybe<kj::Exception> that we populate inside ActorCache::flushImplUsingTxn() (and perhaps the other functions we call when we go to flush different dirty entries) if any of our individual RPCs to storage throw; then, when we're in the catch branch of flushImpl() here, we can KJ_LOG the exception before giving the user the error?
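A rough sketch of that suggestion (the member name, the rpcPromise variable, and the exact wiring are assumptions, not actual workerd code):

// Hypothetical new member on ActorCache:
//   kj::Maybe<kj::Exception> lastFlushException;

// Inside flushImplUsingTxn(), capture any storage RPC failure before rethrowing:
rpcPromise = rpcPromise.catch_([this](kj::Exception&& e) -> kj::Promise<void> {
  lastFlushException = kj::cp(e);
  return kj::Promise<void>(kj::mv(e));
});

// Then in flushImpl()'s error handler, log it before handing the user the
// generic storage error:
KJ_IF_SOME(e, lastFlushException) {
  KJ_LOG(ERROR, "storage RPC failed during flush", e);
}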

Frederik-Baetens (Contributor Author)

Hmm, I see. I think what would be most useful to log here would be:

  • what flush we're waiting on (since it seems like multiple flushes can be enqueued at once)
  • what's in the flush we're waiting on

I'll see if it's feasible to log that info instead.

Frederik-Baetens (Contributor Author), Apr 24, 2024

@MellowYarker that sounds interesting too. I'll have a look at putting up a PR for that. Not sure what catch branch you're referring to, though; I don't immediately see a catch there?

Frederik-Baetens (Contributor Author)

Ah, it's the error handler of that .then().
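(For readers less familiar with KJ promises: the "catch branch" is the optional second argument to .then(), an error handler alongside the success callback. Schematically, not the actual workerd line:)

flushPromise = flushPromise.then([]() {
  // success path: the flush completed
}, [](kj::Exception&& e) {
  // error handler ("catch branch"): runs if the flush failed
  KJ_LOG(ERROR, "flush failed", e);
  kj::throwFatalException(kj::mv(e));
});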

@@ -2156,19 +2156,19 @@ void ActorCache::ensureFlushScheduled(const WriteOptions& options) {
   if (options.allowUnconfirmed) {
     // Don't apply output gate. But, if an exception is thrown, we still want to break the gate,
     // so arrange for that.
-    flushPromise = flushPromise.catch_([this](kj::Exception&& e) {
-      return gate.lockWhile(kj::Promise<void>(kj::mv(e)));
+    flushPromise = flushPromise.catch_([this, opInfo = kj::mv(operationInfo)](kj::Exception&& e) {
Member

It's harmless, but AFAIK there's no need to mv a kj::LiteralStringConst. Copying it should be free.
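That is, the capture could presumably just copy (same code as the diff above, with the moves of opInfo dropped):

flushPromise = flushPromise.catch_([this, operationInfo](kj::Exception&& e) {
  return gate.lockWhile(kj::Promise<void>(kj::mv(e)), operationInfo);
});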

@@ -107,7 +107,7 @@ kj::Maybe<kj::Promise<void>> ActorSqlite::ExplicitTxn::commit() {
   // We committed the root transaction, so it's time to signal any replication layer and lock
   // the output gate in the meantime.
   actorSqlite.commitTasks.add(
-      actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback()));
+      actorSqlite.outputGate.lockWhile(actorSqlite.commitCallback(), "explicit txn commit"_kjc));
Member

Nit, but it'd be nice if the two strings in this file made it clear they were for a sqlite txn/commit.


@@ -702,7 +702,7 @@ class ActorCache final: public ActorCacheInterface {
   kj::Promise<kj::Maybe<Value>> getImpl(kj::Own<Entry> entry, ReadOptions options);

   // Ensure that we will flush dirty entries soon.
-  void ensureFlushScheduled(const WriteOptions& options);
+  void ensureFlushScheduled(const WriteOptions& options, kj::LiteralStringConst operationInfo);
Member

Huh... LiteralStringConst wasn't actually meant to be used this way (it's meant purely as a helper for constructing ConstString). But I guess it ends up being a convenient way to force people to pass a string literal, whereas accepting StringPtr could allow them to pass a string whose lifetime isn't sufficiently long.

I guess that seems fine then. Just hadn't thought of it that way previously.

Now I kind of wish it was just called LiteralString instead of LiteralStringConst though.
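To illustrate the lifetime point (the values here are made up; only the literal form compiles against the new signature):

kj::String dynamic = kj::str("op #", 42);
// ensureFlushScheduled(options, dynamic);  // if the parameter were StringPtr, this
//                                          // would compile but could dangle later
ensureFlushScheduled(options, "explicit txn commit"_kjc);  // literal: static lifetime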

Frederik-Baetens (Contributor Author)
Going to close, because I understand this to be of limited usefulness due to #2056 (comment).
