Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 BUG: reasoning about the TTFB outliers #7570

Open
mogita opened this issue Dec 17, 2024 · 0 comments
Open

🐛 BUG: reasoning about the TTFB outliers #7570

mogita opened this issue Dec 17, 2024 · 0 comments
Labels
bug Something that isn't working

Comments

@mogita
Copy link

mogita commented Dec 17, 2024

Which Cloudflare product(s) does this pertain to?

Workers Runtime

What version(s) of the tool(s) are you using?

3.91.0 [wrangler], 5.7.2 [typescript], 4.20241202.0 [@cloudflare/workers-types]

What version of Node are you using?

v20.18.0

What operating system and version are you using?

macOS 15.2

Describe the Bug

My workers have the following setup:

  • 2 workers as http server, listening by different URL path on the same hostname
  • 1 worker as the tail worker for both of the http servers, that handles logging and metrics reporting for each request

Currently I'm on a trial of the enterprise plan and discovered, through logpush, that around ~0.0003% requests having a high EdgeTimeToFirstByteMs ranged from 500ms~7s. The logs also indicated these outliers had a CacheCacheStatus: hit status.

The 2 main workers are using node:diagnostics_channel to pass information to the tail worker, seemingly related to the TTFB issue so I'll post the relevant part of the code here. More details and full code can be shared once this is picked up by the Cloudflare support team.

Observed behavior

  • Among all the requests, an avg of 0.0003% requests having TTFB over 500ms, up to 7s.
  • According to Cloudflare technical support's initial findings, the high TTFB cases were considered to be caused by making subrequests from our workers to a third-party metrics vendor, reaching the IAD node.

Expected behavior

  • All requests should have a consistent TTFB.
  • By our code review, there's no subrequest made by our workers code to the third-party metrics vendor except for the tail worker. There should not be such subrequests observed.

Steps to reproduce

A minimal working subset of your worker code:

// inside the fetch() function of index.ts
const originStart = Date.now()
const cache = caches.default
const cached = await cache.match(request.url)
if (cached) {
    const respHeaders = new Headers(cached.headers)
    respHeaders.set('Access-Control-Allow-Origin', '*')
    respHeaders.set('Vary', 'Origin')

    const resp = new Response(cached.body, {
        headers: respHeaders,
        status: cached.status,
    })

    sentinel.event({
        event: eventNameRedacted,
        request,
        response: resp,
        payload: {
            ...reqCtx,
            duration: Date.now() - originStart
        },
    })
    sentinel.event({
        event: eventNameRedacted,
        request,
        response: resp,
        payload: reqCtx
    })
    return resp
}
// inside the sentinel.ts
// datadog imported for calling a util method for generating the tags which doesn't make any HTTP request
import { Channel, channel } from 'node:diagnostics_channel'

import { SentinelEventType, SentinelMessage } from './types'

import { Datadog } from '../../shared/datadog'
import { DatadogPayload } from '../../shared/datadog.d'

export class Sentinel {
    private readonly createdAt: Date
    private readonly serviceName: string
    private readonly channel: Channel

    constructor(serviceName: string) {
        this.createdAt = new Date()
        this.serviceName = serviceName
        this.channel = channel(serviceName)
    }

    event(message: { event: SentinelEventType;request ? : Request;response ? : Response;payload: DatadogPayload }): void {
        if (!Boolean(message.payload.duration)) {
            message.payload.duration = Date.now() - this.createdAt.getTime()
        }
        this.channel.publish({
                event: message.event,
                duration: message.payload.duration,
                log: {
                    request: {
                      // tag fields redacted
                    },
                    response: {
                      // tag fields redacted
                    },
                },
                metric: {
                      // tag fields redacted
                },
            }
            as SentinelMessage)
    }
}

A minimal working subset of your wrangler.toml:

# for the http server

#:schema node_modules/wrangler/config-schema.json
main = "src/index.ts"
compatibility_date = "2024-08-06"
compatibility_flags = ["nodejs_compat"]
account_id = "redacted"

name = "redacted"
r2_buckets  = [
  { binding = "BUCKET", preview_bucket_name = "redacted", bucket_name = "redacted" }
]
tail_consumers = [{service = "tail worker name redacted"}]

[vars]
# redacted

[[durable_objects.bindings]]
name = "redacted"
class_name = "redacted"

[[migrations]]
tag = "v1"
new_classes = ["redacted"]
# for the tail workers

#:schema node_modules/wrangler/config-schema.json
main = "src/index.ts"
compatibility_date = "2024-08-06"
compatibility_flags = ["nodejs_compat"]
account_id = "redacted"

name = "redacted"

[vars]
# redacted

[observability]
enabled = true

Commands used to start your local dev server, including custom env and cli args:

Not relevant, issue observed on edge.

Steps to be performed in the browser, curl commands, or a test we can run that reliably fails (at least a percent of the time):

This is a part of the enterprise support. Details are confidential.

@mogita mogita added the bug Something that isn't working label Dec 17, 2024
@github-project-automation github-project-automation bot moved this to Untriaged in workers-sdk Dec 17, 2024
@andyjessop andyjessop removed this from workers-sdk Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working
Projects
None yet
Development

No branches or pull requests

1 participant