Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to disable tracing via env var, and connection times out after 30 seconds rather than default #7599

Open
lukesturgis opened this issue Sep 10, 2024 · 1 comment

Comments

@lukesturgis
Copy link

lukesturgis commented Sep 10, 2024

Hello, I'm having a couple of issues with the Java trace API.

  1. When setting the environment variable DD_TRACE_ENABLED to false, the tracing library still attempts to initiate a connection to an agent endpoint. In the event it cannot connect, this will take up to the full timeout to exit, whereas I'd expect it to disable all tracing.
user@host:~$ export DD_TRACE_ENABLED=false
user@host:~$ time <long java start command> -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=localhost java/lib/hellojava.jar
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
[dd.trace 2024-09-10 20:18:39:649 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
rumentation.class=datadog.trace.instrumentation.akka.concurrent.AkkaForkJoinTaskInstrumentation
[dd.trace 2024-09-10 20:18:39:652 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
rumentation.class=datadog.trace.instrumentation.akka.concurrent.AkkaMailboxInstrumentation
[dd.trace 2024-09-10 20:18:39:686 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
rumentation.class=datadog.trace.instrumentation.googlepubsub.PublisherInstrumentation
[dd.trace 2024-09-10 20:18:39:716 +0000] [main] DEBUG datadog.trace.agent.tooling.AgentInstaller - Adding filtered classes - inst
rumentation.class=datadog.trace.instrumentation.java.concurrent.JavaForkJoinTaskInstrumentation

...

[dd.trace 2024-09-10 20:19:23:055 +0000] [main] DEBUG datadog.trace.agent.tooling.TracerInstaller - Tracing is disabled, not installing GlobalTracer.

...

[dd.trace 2024-09-10 20:19:23:389 +0000] [main] EXCLUDE_TELEMETRY datadog.communication.ddagent.DDAgentFeaturesDiscovery - Error querying info at http://localhost:8126/


real    0m1.994s
user    0m2.931s
sys     0m0.271s

  1. After disabling the agent running on localhost, when I set the agent.host to localhost the connection fails within less than 3 seconds and exits:
user@host:~$ time <long java start command> -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=localhost java/lib/hellojava.jar
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
[dd.trace 2024-09-10 20:14:32:022 +0000] [dd-telemetry] WARN datadog.telemetry.TelemetryRouter - Got FAILURE sending telemetry request to http://localhost:8126/telemetry/proxy/api/v2/apmtelemetry.

real    0m2.638s
user    0m4.009s
sys     0m0.334s

However, if I set it to a host that does not exist over the ethernet device, it will attempt to connect and take over 30 seconds to fail:

user@host:~$ time <long java start command> -pdd -Ddd.trace.health.metrics.enabled=true -Ddd.agent.host=this-host-doesnt-exist.com java/lib/hellojava.jar
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
[dd.trace 2024-09-06 19:33:57:476 +0000] [dd-task-scheduler] INFO datadog.trace.agent.core.StatusLogger - DATADOG TRACER CONFIGURATION {"version":"1.35.0~97065ed022","os_name":"Linux","os_version":"6.1.85-ts1-amd64","architecture":"amd64","lang":"jvm","lang_version":"21.0.2-tsjgss","jvm_vendor":"N/A","jvm_version":"21.0.2-tsjgss","java_class_version":"65.0","http_nonProxyHosts":"null","http_proxyHost":"null","enabled":true,"service":"hellojava","agent_url":"http://this-host-doesnt-exist.com:8126","agent_error":true,"debug":false,"trace_propagation_style_extract":["datadog","tracecontext"],"trace_propagation_style_inject":["datadog","tracecontext"],"analytics_enabled":false,"priority_sampling_enabled":true,"logs_correlation_enabled":true,"profiling_enabled":false,"remote_config_enabled":true,"debugger_enabled":false,"debugger_exception_enabled":false,"appsec_enabled":"ENABLED_INACTIVE","telemetry_enabled":true,"telemetry_dependency_collection_enabled":true,"telemetry_log_collection_enabled":false,"dd_version":"","health_checks_enabled":true,"configuration_file":"no config file present","runtime_id":"1f517376-f108-4012-a0ef-ea1940f110f3","logging_settings":{"levelInBrackets":false,"dateTimeFormat":"'[dd.trace 'yyyy-MM-dd HH:mm:ss:SSS Z']'","logFile":"System.err","configurationFile":"simplelogger.properties","showShortLogName":false,"showDateTime":true,"showLogName":true,"showThreadName":true,"defaultLogLevel":"INFO","warnLevelString":"WARN","embedException":false},"cws_enabled":false,"cws_tls_refresh":5000,"datadog_profiler_enabled":false,"datadog_profiler_safe":true,"datadog_profiler_enabled_overridden":false,"data_streams_enabled":false}
[dd.trace 2024-09-06 19:34:01:688 +0000] [main] WARN datadog.telemetry.TelemetrySystem - Telemetry thread join was not completed

real    0m38.012s
user    0m5.508s
sys     0m0.353s

I did a network capture that shows the localhost sent back an RST immediately:

21   1.761860    127.0.0.1 → 127.0.0.1    TCP 74 40270 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=65495 SACK_PERM=1 TSval=2013823740 TSecr=0 WS=2048
22   1.761878    127.0.0.1 → 127.0.0.1    TCP 54 8126 → 40270 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0

But to the agent over the network, it went into a TCP retransmission loop (the host I am attempting to connect to lives behind several layers of switches and firewalls, and does not send back an RST):

251  16.514437 x.x.x.x → y.y.y.y   TCP 74 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238250723 TSecr=0 WS=2048
266  17.550506 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238251760 TSecr=0 WS=2048
396  19.598486 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238253808 TSecr=0 WS=2048
428  23.630480 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 44472 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238257840 TSecr=0 WS=2048
634  26.525267 x.x.x.x → y.y.y.y   TCP 74 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238260734 TSecr=0 WS=2048
645  27.534484 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238261744 TSecr=0 WS=2048
707  29.582488 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238263792 TSecr=0 WS=2048
742  33.614484 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 40424 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238267824 TSecr=0 WS=2048
766  36.527430 x.x.x.x → y.y.y.y   TCP 74 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238270736 TSecr=0 WS=2048
786  37.582481 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238271792 TSecr=0 WS=2048
815  39.630481 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238273840 TSecr=0 WS=2048
855  43.662508 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 33904 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238277872 TSecr=0 WS=2048
890  46.899895 x.x.x.x → y.y.y.y   TCP 74 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238281109 TSecr=0 WS=2048
893  47.086992 x.x.x.x → y.y.y.y   TCP 74 41120 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238281296 TSecr=0 WS=2048
894  47.359935 x.x.x.x → y.y.y.y   UDP 305 59395 → 8125 Len=263
904  47.950505 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238282160 TSecr=0 WS=2048
926  48.766598 x.x.x.x → y.y.y.y   UDP 247 59395 → 8125 Len=205
927  48.766633 x.x.x.x → y.y.y.y   UDP 249 59395 → 8125 Len=207
928  48.767289 x.x.x.x → y.y.y.y   UDP 268 59395 → 8125 Len=226
929  48.767567 x.x.x.x → y.y.y.y   UDP 249 59395 → 8125 Len=207
930  48.768114 x.x.x.x → y.y.y.y   UDP 259 59395 → 8125 Len=217
931  48.768306 x.x.x.x → y.y.y.y   UDP 268 59395 → 8125 Len=226
932  48.768883 x.x.x.x → y.y.y.y   UDP 267 59395 → 8125 Len=225
933  48.769061 x.x.x.x → y.y.y.y   UDP 266 59395 → 8125 Len=224
934  48.769228 x.x.x.x → y.y.y.y   UDP 264 59395 → 8125 Len=222
935  48.769381 x.x.x.x → y.y.y.y   UDP 263 59395 → 8125 Len=221
936  48.769545 x.x.x.x → y.y.y.y   UDP 243 59395 → 8125 Len=201
937  48.769713 x.x.x.x → y.y.y.y   UDP 267 59395 → 8125 Len=225
938  48.769914 x.x.x.x → y.y.y.y   UDP 245 59395 → 8125 Len=203
939  48.770073 x.x.x.x → y.y.y.y   UDP 253 59395 → 8125 Len=211
940  48.770220 x.x.x.x → y.y.y.y   UDP 241 59395 → 8125 Len=199
941  48.770366 x.x.x.x → y.y.y.y   UDP 249 59395 → 8125 Len=207
942  48.770537 x.x.x.x → y.y.y.y   UDP 257 59395 → 8125 Len=215
943  48.770692 x.x.x.x → y.y.y.y   UDP 251 59395 → 8125 Len=209
944  48.770858 x.x.x.x → y.y.y.y   UDP 248 59395 → 8125 Len=206
945  48.771013 x.x.x.x → y.y.y.y   UDP 267 59395 → 8125 Len=225
946  48.771365 x.x.x.x → y.y.y.y   UDP 257 59395 → 8125 Len=215
947  48.771559 x.x.x.x → y.y.y.y   UDP 274 59395 → 8125 Len=232
948  48.771735 x.x.x.x → y.y.y.y   UDP 272 59395 → 8125 Len=230
958  49.998482 x.x.x.x → y.y.y.y   TCP 74 [TCP Retransmission] 41110 → 8126 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1 TSval=238284208 TSecr=0 WS=2048

The DEFAULT_AGENT_TIMEOUT looks to be 10 seconds, so I'm not sure why this is taking 30 seconds to complete, other than the fact that the TCP retransmissions are adding to the delay.

Can you please advise:

  1. Why the tracer still attempts to initiate a connection to an agent if the environment variable is set to disable it
  2. If we are setting our agent timeout incorrectly

Please let me know if you need any more information. Thank you!

@mcculls
Copy link
Contributor

mcculls commented Dec 20, 2024

Hi @lukesturgis - this is fixed by #8120 which will be in v1.45.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants