POC Remote Caching #2777
Conversation
How do you deal with absolute paths? I thought we need to adapt
From what I can tell, PathRef already ignores the enclosing path; it only cares about relative paths within the target path. That's not to say this already works across different folders though. There are a bunch of absolute paths embedded in the generated sources or bytecode that make CodeSig invalidate things. Should be fixable
Yeah, we already made
@lefou I've got this working end-to-end(-ish): I can build in one folder, go to another folder, and have everything be downloaded from the remote cache. Still pretty rough, but it would be great if you could take a look. Seems like a pretty small change overall, but it raises some questions that are probably worth discussing:
I have never used bazel or bazel-remote, but am interested in this topic as well. How does the cache handle older results? Is it only storing the latest result, or all results until some space/count criterion is met? Does that mean git bisecting may always find cached results in the best case? I think a sub-goal of getting remote caching to work is making the
Without having inspected that change yet, I think we should always encode paths as "relative" to some "base", and the first "base" I can imagine is the local workspace. Another is the (relocatable)
I'm not really sure this is the right thing. This seems like an acceptable trade-off to avoid recompilation in Mill's build scripts, but for shared user code this is probably wrong. It may result in wrong log messages in production code, for example. Instead, we should strive for some alternative to
Agreed, the latter may reduce size and avoid leakage of temporary results. Persistent targets are probably a special case, as we can't reason about what is needed by a persistent target to fulfill its requirement to be transparent.
Here, the returned result contains
I think this should be part of a local Mill configuration. We already have the plan to introduce a Mill config file (#1226). If we apply best practices for Linux tools, we could have some system-wide / user-specific / project-specific override concept, so the user can keep this configuration in a separate safe place.
It's up to the cache server implementation, but yes, git bisect, and in general checking out old branches, should hopefully hit the cache. bazel-remote lets you configure the cache size, after which it evicts things from the cache. It can also be configured to have cloud storage backends, e.g. S3, which can themselves be configured via various eviction policies. Mill doesn't need to know about any of that, and will just re-build anything that it cannot find in the remote cache. We can leave it up to the backend to decide exactly how long things are remote-cached for.
@lihaoyi Hey there. We make Buildless; I saw your comment in the discussion. We'd love to help test this on the server implementation side. Our gRPC services for Bazel should be up and running soon. Would it be helpful to build locally and give it a shot, or is it too early? We spend a lot of time on build caching, so we may be able to provide some resources. For instance, we published the Bazel remote APIs as a Buf module, which might let you depend on the generated protos through a Maven dependency if you want to. This is generated directly from the protos and can be updated from that repo, which has now been handed over to the Buf and Bazel teams to maintain. On the headers and auth side, certainly adding regular headers would cover most basic remote caching auth needs (we support HTTP basic, Anyway, this looks awesome. I'm a lay follower of Mill but this gets me more interested 😄
This PR allows Mill builds in different folders or on different computers to share their `out/` folder data, by means of a third-party remote-cache service referenced via `--remote-cache-url`.

This means that you "never" need to build something twice across an entire fleet of machines. A target compiled on one machine can have its output downloaded and re-used on another, assuming the inputs are unchanged (detected via our normal `inputsHash` invalidation logic).

This also means you never have to re-run tests unnecessarily (using `testCached`, which is a `Target` and thus cached), which means if you re-run a job due to a flaky test, most of the other tests should be automatically skipped (since their result is in the cache) and only the specific flaky tests would need to re-run.

Limitations
Remote Caching assumes that all inputs are tracked. The remote cache cannot detect scenarios where un-tracked inputs cause a target with the same `inputsHash` to result in different outcomes, e.g. by calling different versions of a CLI tool. I included a `--remote-cache-salt` flag for a user to explicitly pass in whatever they want as an additional cache key, e.g. they could give developers on Mac-OSX and CI machines on Linux different salts to ensure they do not share cache entries.

Remote caching has known security considerations; anyone with push access to the remote cache can send arbitrary binaries to be executed on the other machines pulling from it. All machines sharing a remote cache have to be within the same trust boundary. Other topologies include only pushing to the cache from trusted machines (e.g. CI running on master) but allowing pulling from untrusted machines (e.g. developer laptops).
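As a rough illustration of the `--remote-cache-salt` idea described above, the salt could simply be folded into whatever key identifies a target's result in the remote cache. The helper below is a hypothetical sketch, not code from this PR:

```scala
import java.security.MessageDigest

// Hypothetical sketch: combine the target name, its inputsHash, and the
// user-supplied --remote-cache-salt into one remote cache key, so machines
// configured with different salts never share cache entries.
def remoteCacheKey(taskName: String, inputsHash: Int, salt: Option[String]): String = {
  val material = s"$taskName:$inputsHash:${salt.getOrElse("")}"
  MessageDigest.getInstance("SHA-256")
    .digest(material.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
}

// e.g. give Mac-OSX developers and Linux CI agents disjoint cache entries:
// remoteCacheKey("foo.compile", inputsHash = 12345, salt = Some("macos"))
// remoteCacheKey("foo.compile", inputsHash = 12345, salt = Some("linux-ci"))
```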
Not everything benefits from remote caching; some targets are faster to compute locally than to fetch over the network, and it's impossible to statically determine which ones. `--remote-cache-filter` allows the user to select which targets they want to cache. There are probably other ways we could try and tune things, e.g. setting minimum durations or max output sizes for cache uploads.

None of these limitations are unique to Mill's implementation: every build tool with remote caching suffers from them, including Bazel. But it is worth calling out for anyone who wishes to deploy such a system.
Implementation
We use the bazel-remote-execution protocol for compatibility with bazel remote caches. This has become a de-facto standard, with multiple build tools supporting it as clients (Bazel, Buck, Pants, ...) and multiple backend implementations supporting it as servers (bazel-remote, Buildbarn, EngFlow, ...).
The Bazel remote execution protocol is extremely detailed, and does not completely match up with Mill's data model. For this PR we integrate with it relatively shallowly: on write, we PUT a single `ActionResult` to the Action Cache endpoint `/ac/...`, which references a single output file which is a `.tar.gz` of the `foo.{dest,json,log}` folders we PUT to the Content Addressable Store endpoint `/cas/...`. On read, we do the reverse: grab the `/ac/` data, use it to grab the `/cas/` blob, and unpack it.

This two-step process is necessary because the bazel remote cache API disallows "inline" requests that upload file contents to the `/ac/` metadata store. The Bazel-Remote implementation I tested against does not seem to complain, but it's better to follow the spec anyway in case other servers are not so forgiving.

The fact that we bundle all target data into one big `.tar.gz` blob does mean the cache is at a target-level granularity. The protocol allows finer-grained stuff (e.g. sharing individual files), but we can leave support for that for future work.

We limit the uploads in the `.dest` folder to only things referenced by `PathRef`s. I use a `DynamicVariable` to instrument the JSON serialization of `PathRef`s so I can gather up all the `PathRef`s in a task's return value for this purpose. This is a similar approach as we discussed w.r.t. `PathRef.validatedPaths`.
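A simplified sketch of that `DynamicVariable` approach (illustrative only; the actual instrumentation in this PR may differ):

```scala
import scala.util.DynamicVariable
import scala.collection.mutable

// Sketch: while a task's return value is being serialized to JSON, record
// every path referenced by a PathRef, so we know which files under the
// task's .dest folder need to be included in the uploaded .tar.gz.
object PathRefCollector {
  private val current = new DynamicVariable[Option[mutable.Buffer[os.Path]]](None)

  // Called from inside the PathRef ReadWriter whenever a path is serialized
  def record(path: os.Path): Unit = current.value.foreach(_ += path)

  // Run `serialize` with collection enabled and return both its result and
  // the paths observed during serialization
  def collecting[T](serialize: => T): (T, Seq[os.Path]) = {
    val buffer = mutable.Buffer.empty[os.Path]
    val result = current.withValue(Some(buffer))(serialize)
    (result, buffer.toSeq)
  }
}
```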
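Stepping back to the `/ac/` + `/cas/` flow described earlier, the write path could look roughly like the sketch below. It assumes bazel-remote's plain HTTP interface and the `requests`/`os` libraries already used in Mill's codebase; it is not the PR's actual implementation, which goes through the protobuf `ActionResult` types.

```scala
import java.security.MessageDigest

// Sketch of the two-step write against a bazel-remote style HTTP cache:
// 1. PUT the .tar.gz of foo.{dest,json,log} into the Content Addressable
//    Store, keyed by the SHA-256 of the blob itself
// 2. PUT a small action-cache entry, keyed by the target's action key,
//    referencing that blob. The real protocol stores a protobuf ActionResult
//    here; the plain digest written below is only a stand-in.
def sha256Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString

def uploadTargetResult(cacheUrl: String, actionKey: String, tarGz: os.Path): Unit = {
  val blobBytes = os.read.bytes(tarGz)
  val blobDigest = sha256Hex(blobBytes)
  requests.put(s"$cacheUrl/cas/$blobDigest", data = blobBytes)
  requests.put(s"$cacheUrl/ac/$actionKey", data = blobDigest.getBytes("UTF-8"))
}

// Read path (the reverse): GET /ac/<actionKey> to find the blob digest,
// GET /cas/<digest> to fetch the .tar.gz, then unpack it into out/.
```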
It took some fiddling to make input hashes consistent across different folders:

- `JsonFormatters.pathReadWrite` was adjusted to serialize relative paths whenever the path is within `os.pwd` (sketched below)
- The `MillBuildRootModule` script-wrapper code generator was adjusted to generate paths relative to `os.pwd`
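A sketch of what the relative-path serialization could look like (simplified; not necessarily how `JsonFormatters.pathReadWrite` is implemented in this PR):

```scala
import upickle.default.{ReadWriter, readwriter}

object RelativePathFormatters {
  // Sketch: serialize paths under the workspace root (os.pwd) as relative
  // strings, so the JSON output, and hence the input hash derived from it,
  // is identical across different checkout folders. Paths outside the
  // workspace stay absolute.
  implicit val pathReadWrite: ReadWriter[os.Path] =
    readwriter[String].bimap[os.Path](
      path =>
        if (path.startsWith(os.pwd)) path.relativeTo(os.pwd).toString
        else path.toString,
      str =>
        os.FilePath(str) match {
          case p: os.Path    => p          // already absolute
          case s: os.SubPath => os.pwd / s // resolve against the workspace
          case r: os.RelPath => os.pwd / r
        }
    )
}
```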
`os.pwd` is somewhat arbitrary here. It may make things a bit annoying for testing: we can only change the `os.pwd` via subprocesses in integration tests and not in unit tests. But the alternative of passing an implicit `RepoRoot` everywhere in our codebase would make things more annoying for everyone else, so this may be the best option.

The remote cache will likely need a lot more configuration in future: certificates, authentication, HTTP proxy config, etc. This stuff cannot live in any `build.sc`, even a meta-build, since it is necessary to evaluate the `build.sc`'s tasks, unless we accept that the meta-build is never going to be remote-cached. The alternative is to put it in some JSON/YAML file somewhere.

We need some way to annotate things like `resolvedIvyDeps` to ensure they are not remote cached, so they can be downloaded anew on each machine.

I pre-built and published the Java protobuf stubs from https://github.com/bazelbuild/remote-apis via a `bazelRemoteApis` target in Mill's own build file, versioned separately from the rest of Mill, similar to what we do for the Zinc compiler bridges. This shouldn't change very often, and when it does it should generally be backwards compatible, so we shouldn't need to include it as a formal part of the Mill build graph.

I broke the caching-related logic (both remote and local) out of `GroupEvaluator.scala` so it's easier to navigate around that part of the codebase.

Testing
Tested manually with https://github.com/buchgr/bazel-remote
This is a proof of concept that demonstrates the ability for multiple different checkouts of the same repo to share the remote cache. After running the commands above, we can look at `1-simple-scala-copy/out/mill-profile.json` to see that `compile` was cached, despite being run on a "clean" repository. Removing the `--remote-cache-filter` and re-doing the above steps after removing the `out/` folders demonstrates that everything is cached except `mill.scalalib.ZincWorkerModule.worker` and `run`, which is to be expected.

Not merge-ready, but it demonstrates the approach and can probably be cleaned up and fleshed out if we decide to move forward with it.

TODO
`PathRef`s that replace the content-hashing with comparing `mtime` timestamps. These are used for large binary files downloaded externally, where the file changes almost never (so a timestamp is good enough to ~never invalidate them) and is often a large binary blob (so hashing it every time is wasteful and expensive). `mtime`s won't work with remote caching because each machine will download the files anew and get different download times.
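For context, the difference between a content-based signature and an `mtime`-based one might look roughly like this (illustrative only, not the actual `PathRef` signature code):

```scala
import java.security.MessageDigest

// Illustrative only: two ways a PathRef-like signature could be derived.
// The mtime-based variant is cheap for huge, rarely-changing files, but its
// value depends on when the file landed on disk, so two machines that each
// download the same file compute different signatures, which makes it
// unusable as a shared remote cache key.
def contentSig(p: os.Path): String =
  MessageDigest.getInstance("SHA-256")
    .digest(os.read.bytes(p))
    .map("%02x".format(_))
    .mkString

def quickSig(p: os.Path): Long =
  os.mtime(p) ^ os.size(p) // differs per machine/download, even for identical content
```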