Http4s Client leaks with ZIO 2.x #640

Closed
ollyw opened this issue Jan 4, 2023 · 3 comments

ollyw commented Jan 4, 2023

Background

Since migrating some projects to ZIO 2.0.x there have been a few issues with the managed heap running out of space. Some improvements in recent ZIO 2.0.x versions have helped (the migration away from the fiber root nursery), but there still seems to be at least one underlying issue. The most recent issue was triggered by a Cats Effect upgrade from 3.3.x to 3.4.x. When trying to reproduce it, a small part of the app's behaviour was simulated in this reproducer. It is unclear whether this is the root cause in the app, but this reproducer is certainly not behaving as expected.

Reproducer

See https://github.com/ollyw/zio-2-fiber-leak-reproducer

The app simply polls an endpoint using Http4s Client + Blaze + ZIO. Over time the number of suspended fibers grows and grows. The essence of the app is basically just

   (for {
      client <- ZIO.service[Client[Task]]
      _ <- client.expect[String]("https://zio.dev/")
    } yield ()).repeat(Schedule.fixed(1.seconds))

The app logs the number of suspended fibers so the growth can be seen. It can also dump the fibers if you uncomment the relevant code.
The leak can also be verified by running tools such as VisualVM and inspecting the heap after the app has been running for a while. If you want to speed up the leaking, change the schedule and point to a local HTTP endpoint instead of the hardcoded URL (https://zio.dev).
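
For reference, here is a minimal sketch of how the suspended-fiber count can be logged on a schedule, assuming ZIO 2's Fiber.roots and Fiber.Status APIs; the reproducer's actual logging code may differ:

import zio._

val logSuspendedFibers: ZIO[Any, java.io.IOException, Unit] =
  (for {
    roots    <- Fiber.roots
    statuses <- ZIO.foreach(roots)(_.status)
    // Count only the root fibers currently parked in an async suspension
    suspended = statuses.count {
      case _: Fiber.Status.Suspended => true
      case _                         => false
    }
    _ <- Console.printLine(s"Suspended fibers: $suspended of ${roots.size} roots")
  } yield ()).repeat(Schedule.fixed(5.seconds)).unit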

Results

Example trace of the leaking fibers

"zio-fiber-176" (32s 780ms) 
	Status: Suspended((Interruption, CooperativeYielding, FiberRoots), zio.interop.ZioAsync.async_(ZioAsync.scala:59))
	at .executeRequest(Http1Connection.scala:217:0)
	at zio.interop.package.signalOnNoExternalInterrupt(package.scala:206)
	at zio.interop.ZioConcurrent.start.trace(cats.scala:248)


"zio-fiber-39" (37s 777ms) 
	Status: Suspended((Interruption, CooperativeYielding, FiberRoots), zio.interop.ZioAsync.async_(ZioAsync.scala:59))
	at .executeRequest(Http1Connection.scala:217:0)
	at zio.interop.package.signalOnNoExternalInterrupt(package.scala:206)
	at zio.interop.ZioConcurrent.start.trace(cats.scala:248)
	
... repeated 100s of times	

Do different versions of ZIO affect it?

  • 2.0.0 leaks a little, then seems to stop
  • 2.0.1 runs out of heap space really quickly
  • 2.0.2 runs out of heap space really quickly
  • 2.0.3 runs out of heap space really quickly
  • 2.0.4 runs out of heap space really quickly
  • 2.0.5 leaks, but heap grows slowly
  • 1.0.15 heap grows but stabilises

Does the latest version of Cats-interop fix it?

Version 23.0.0.0 was released a few weeks ago with some improvements for cats-effect "lawfulness". However, this version and the previous version 3.3.0 both show the same behaviour. The fiber trace points to a different line number in ZioAsync though.

Further diagnosis

The fiber dumps seem to indicate that the leaking fiber is started as a timeout action.

  // Http4s Code from https://github.com/http4s/blaze/blob/main/blaze-client/src/main/scala/org/http4s/blaze/client/Http1Connection.scala#L221-L251
    
  val idleTimeoutF: F[TimeoutException] = idleTimeoutStage match {
    case Some(stage) => F.async_[TimeoutException](stage.setTimeout) // THIS IS THE START OF THE LEAKING FIBER TRACE
    case None => F.never[TimeoutException]
  }

  idleTimeoutF.start.flatMap { timeoutFiber =>
    // the request timeout, the response header timeout, and the idle timeout
    val mergedTimeouts = cancellation.race(timeoutFiber.joinWithNever).map(_.merge)
    F.bracketCase(
      writeRequest.start
    )(writeFiber =>
      ....
    }.race(mergedTimeouts) // THIS IS THE SUSPECTED RACE CONDITION THAT IS NOT CANCELLING THE TIMEOUT FIBER
      .flatMap {
        case Left(r) => F.pure(r)
        case Right(t) => F.raiseError(t)
      }

It could be that this issue is related to the existing issue #616
and the follow-on "Support the "onCancel associates over uncancelable boundary" law (ZIO 2/CE 3)" #617. However, it might also be something different. Perhaps Http4s is misusing the effects somehow?
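
For context, the lawful cats-effect behaviour the suspicion hinges on is that the loser of race is cancelled, so an effect created only to act as a timeout should not stay suspended once the request side wins. A hypothetical, minimal sketch using plain cats-effect IO (not http4s code):

import cats.effect.{IO, IOApp}

object RaceCancellationSketch extends IOApp.Simple {
  val timeout: IO[String] = IO.never          // stand-in for F.async_(stage.setTimeout)
  val request: IO[String] = IO.pure("200 OK") // stand-in for writing the request and reading the response

  val run: IO[Unit] =
    request.race(timeout).flatMap {
      case Left(r)  => IO.println(s"request won: $r") // the losing timeout side should be cancelled here
      case Right(t) => IO.println(s"timed out: $t")
    }
}

If that cancellation of the loser does not propagate through the interop layer to the started timeout fiber, it stays suspended forever, which would match the traces above.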

ollyw commented Jan 4, 2023

Further to this, with RuntimeMetrics enabled, the gap between started and completed fibers continues to grow:

e.g. after running for a while

zio_fiber_started = 4298701
zio_fiber_successes = 1003621
zio_fiber_failures = 3070406

Which means that 224,674 fibers (4,298,701 started − 1,003,621 successes − 3,070,406 failures) have not completed, even though the app should only have a few fibers running at once.
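
For reference, a minimal sketch of how these metrics might be enabled, assuming ZIO 2's Runtime.enableRuntimeMetrics bootstrap layer is what exposes zio_fiber_started / zio_fiber_successes / zio_fiber_failures (the actual wiring in the reproducer may differ):

import zio._

object MetricsEnabledApp extends ZIOAppDefault {
  // Turn on the RuntimeMetrics flag so fiber lifecycle counters are recorded
  override val bootstrap: ZLayer[ZIOAppArgs, Any, Any] =
    Runtime.enableRuntimeMetrics

  override def run: ZIO[Any, Any, Any] =
    ZIO.unit // the polling loop from the reproducer would go here
}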

ollyw commented Jan 10, 2023

I have an update after @petoalbert and I have been investigating. It turns out that there are (at least) three bugs in play here:

  1. Firstly, the WeakConcurrentBag was causing lots of FiberRuntime instances to be referenced and therefore not collected. This was the largest source of leaks and obscured the following ones.
  2. Once retesting with that fix, it becomes clear that cats-interop 3.3.0 is leaking FiberId$Runtime instances with the example above. The objects are small, so the leak takes time to manifest. Upgrading to 23.0.0.0 makes this leak disappear, thanks to the modifications within ZioAsync. A suggestion is to make it clearer in the release notes for 23.0.0.0, and perhaps announce on Discord, that 23.0.0.0 fixes leaks in cats-interop and is a recommended update.
  3. Once those two bugs were out of the way, the standard Http4s/Blaze client reproducer doesn't leak 🎉 HOWEVER our application, which uses Http4s/Blaze Server, continues to work on Cats Effect 3.3.x and breaks on Cats Effect 3.4.x. After some additional debugging it turns out that the CE Dispatcher is holding references to _fiberRefs, which in turn seems to be holding lots of tuples containing an Integer and a FiberId$Runtime.

Below is a screenshot of our app leaking. Blaze Server is running all the time, holding onto a Dispatcher that is indirectly holding the FiberRefs.
Fiber Refs leak during perf test 2023-01-09 16-31-45

After some attempts, the following simplified app shows similar behaviour:

package zio

import cats.effect._
import cats.effect.std.Dispatcher
import zio.interop.catz._

object FiberLeakViaDispatcherExample extends ZIOAppDefault {

  override def runtime: Runtime[Any] = Runtime(ZEnvironment.empty, FiberRefs.empty, RuntimeFlags(
    RuntimeFlag.Interruption,
    RuntimeFlag.CooperativeYielding, // Removed FiberRoots to reduce fibers still referenced in the heap dump
  ))

  override def run: ZIO[ZIOAppArgs with Scope, Any, Any] = {
    ZIO.runtime[Any].flatMap { implicit runtime =>
        Dispatcher.parallel[Task].use(dispatcher =>
          for {
            _ <- ZIO.replicateZIODiscard(1000000)(ZIO.fromFuture(_ => dispatcher.unsafeToFuture(Temporal[Task].sleep(100.micro.asScala))))
            _ <- Console.printLine("Finished")
            _ <- ZIO.never
            _ = println(dispatcher) // Keep dispatcher from being GCd. Disposing of the dispatcher will lead to GC'ing leak which is pinned in a Dispatcher latch
          } yield ())
    }
  }
}

Full project (with dependencies) at https://github.com/ollyw/zio-2-fiber-leak-reproducer/blob/main/src/main/scala/FiberLeakViaDispatcherExample.scala

Here is a screenshot in the debugger of the latch holding onto the many values in _fiberRefs indirectly
Fiber Refs leak during perf test 2023-01-09 16-31-45

The suspicion is that when FiberRefs.forkAs copies the fiber refs for a child fiber, the copy is not eliminated later, and therefore the parent holds a growing set of child fiber stacks. That is just a hypothesis and I'll investigate further. However, I thought I would provide an update, as it would be really great if somebody from the core team could take a look and help diagnose (and eventually fix!) the issue.

ollyw commented Jan 12, 2023

A final update here. It appears the issue specific to the Cats Effect upgrade was caused by one or more issues in the ZIO core library, not cats-interop. After some supervision-related PRs were merged, all seems to work as of 2.0.5+49-a3909794-SNAPSHOT:

zio/zio#7676
zio/zio#7686
zio/zio#7683

Thanks for fixing @adamgfraser

ollyw closed this as completed Jan 12, 2023