cache population daemon #85
```diff
-DEV_TOKEN_PROJECT_1 ?= v2.public.eyJzdWIiOiIxIiwiaWF0IjoiMjAyMS0xMC0xNVQxMToyMDowMC4wMzVaIn2MQ14RfIGpoEycCuvRu9J3CZp6PppUXf5l5w8uKKydN3C31z6f6GgOEPNcnwODqBnX7Pjarpz4i2uzWEqLgQYD
+DEV_TOKEN_ADMIN ?= v2.public.eyJzdWIiOiJhZG1pbiJ9yt40HNkcyOUtDeFa_WPS6vi0WiE4zWngDGJLh17TuYvssTudCbOdQEkVDRD-mSNTXLgSRDXUkO-AaEr4ZLO4BQ
+DEV_TOKEN_PROJECT_1 ?= v2.public.eyJzdWIiOiIxIn2jV7FOdEXafKDtAnVyDgI4fmIbqU7C1iuhKiL0lDnG1Z5-j6_ObNDd75sZvLZ159-X98_mP4qvwzui0w8pjt8F
+DEV_SHARED_READER_TOKEN ?= v2.public.eyJzdWIiOiJzaGFyZWQtcmVhZGVyIn1CxWdB02s9el0Wt7qReARZ-7JtIb4Zj3D4Oiji1yXHqj0orkpbcVlswVUiekECJC16d1NrHwD2FWSwRORZn8gK
```
had to re-sign the tokens cause I couldn't find the private key, included a new keypair in the repo now 👍
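For reference, here's roughly how a `v2.public` PASETO like the ones above can be minted. This is a hedged sketch assuming the `github.com/o1egl/paseto` library and an ed25519 keypair; the actual key format and signing code in the repo may differ:

```go
package main

import (
	"crypto/ed25519"
	"fmt"

	"github.com/o1egl/paseto"
)

func main() {
	// In the PR the keypair is checked into the repo; here we just generate one.
	pub, priv, err := ed25519.GenerateKey(nil) // nil falls back to crypto/rand
	if err != nil {
		panic(err)
	}

	// v2.public tokens are signed (not encrypted); the subject matches the
	// token's role, e.g. "1", "admin", or "shared-reader".
	signed, err := paseto.NewV2().Sign(priv, paseto.JSONToken{Subject: "shared-reader"}, nil)
	if err != nil {
		panic(err)
	}

	fmt.Println(signed) // the server verifies this with the public key
	_ = pub
}
```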
internal/pb/cache.proto (outdated):

```proto
option go_package = "github.com/gadget-inc/dateilager/pkg/pb";

service Cache {
```
This GRPC service is not what we'll use in production -- it's just for testing, but it lets us ask the daemon to do its thing in tests and in development, very convenient.
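For example, a test could drive the daemon with something like this sketch. Only `pb.NewCacheClient` and the `PopulateDiskCache` RPC names come from the PR's diffs; the request's `Path` field and the daemon address are assumptions:

```go
package daemon_test

import (
	"context"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/gadget-inc/dateilager/pkg/pb"
)

func TestPopulateViaDaemon(t *testing.T) {
	// Connect to a locally running cached instance (port matches the CLI flag below).
	conn, err := grpc.Dial("localhost:5053", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()

	// Ask the daemon to do its thing: populate a disk cache at a test path.
	// Path is an assumed field name for illustration.
	client := pb.NewCacheClient(conn)
	_, err = client.PopulateDiskCache(context.Background(), &pb.PopulateDiskCacheRequest{Path: t.TempDir()})
	if err != nil {
		t.Fatal(err)
	}
}
```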
In Prod we're going to implement a CSI proto and that's what will be called directly by K8S?
Yup!
```diff
@@ -34,6 +34,9 @@ func NewTestCtx(t *testing.T, role auth.Role, projects ...int64) TestCtx {
 		Project: project,
 	})

+	log := zaptest.NewLogger(t)
+	zap.ReplaceGlobals(log)
```
This makes a zap logger that will log in tests for debugging, but because it's a `zaptest.Logger`, it doesn't log unless you run `go test -v`. Me likey.
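The pattern in miniature, as a self-contained sketch (`zaptest.NewLogger` and `zap.ReplaceGlobals` are the real zap APIs from the diff):

```go
package example_test

import (
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zaptest"
)

func TestSomething(t *testing.T) {
	// zaptest.NewLogger writes through t.Log, so output only shows with `go test -v`.
	log := zaptest.NewLogger(t)

	// ReplaceGlobals swaps zap.L()/zap.S() and returns a func that restores them.
	defer zap.ReplaceGlobals(log)()

	zap.L().Info("visible only under go test -v")
}
```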
```diff
@@ -948,7 +966,7 @@ func (f *Fs) GetCache(req *pb.GetCacheRequest, stream pb.Fs_GetCacheServer) error {
 	ctx := stream.Context()
 	trace.SpanFromContext(ctx)

-	_, err := requireProjectAuth(ctx)
+	err := requireSharedReaderAuth(ctx)
```
I didn't want `cached` to have an admin token or any specific project's token, as it's the only DL component that actually runs on k8s nodes next to sandboxed workloads. So, I added a new token type that can read the cache and that's it.
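A minimal sketch of the idea, assuming a hypothetical `GetAuth` accessor for the parsed token; DL's real role names and context plumbing will differ:

```go
package auth

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type Role int

const (
	RoleAdmin Role = iota
	RoleProject
	RoleSharedReader // can read the shared cache, nothing else
)

type Auth struct{ Role Role }

type ctxKey struct{}

// GetAuth is a hypothetical accessor that pulls the parsed token out of the
// request context.
func GetAuth(ctx context.Context) Auth {
	a, _ := ctx.Value(ctxKey{}).(Auth)
	return a
}

func requireSharedReaderAuth(ctx context.Context) error {
	a := GetAuth(ctx)
	if a.Role == RoleSharedReader || a.Role == RoleAdmin {
		return nil
	}
	return status.Error(codes.PermissionDenied, "shared reader token required")
}
```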
pkg/cli/cachedaemon.go (outdated):

```go
flags.StringVar(&headlessHost, "headless-host", "", "Alternative headless hostname to use for round robin connections")
flags.UintVar(&timeout, "timeout", 0, "GRPC client timeout (ms)")
flags.IntVar(&port, "port", 5053, "cache API port")
flags.StringVar(&stagingPath, "staging-path", "", "path for staging downloaded caches")
```
This is where `cached` will put the downloaded cache, and it has to be on the same device as the paths it later links the cache into.
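The same-device requirement is because hardlinks can't cross filesystems: `os.Link` fails with `EXDEV` if the staging path and the link target live on different devices. A quick illustrative probe (paths are made up):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	staging := "/var/lib/cached/staging" // hypothetical --staging-path value
	target := "/gadget/cache"            // where the cache gets linked into place

	src := filepath.Join(staging, "probe")
	if err := os.WriteFile(src, []byte("x"), 0o644); err != nil {
		panic(err)
	}
	defer os.Remove(src)

	err := os.Link(src, filepath.Join(target, "probe"))
	if errors.Is(err, syscall.EXDEV) {
		fmt.Println("staging path is on a different device than the cache path; hardlinks won't work")
	}
}
```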
Makefile (outdated):

```diff
 build-local-container:
-	docker build -t dl-local:latest .
+	docker build --load -t dl-local:dev .
```
kubernetes will always pull an image with the `latest` tag, which we don't want to do, cause it can't be pulled from anywhere. we want it to just sit in the local image cache, so we name it something else
Makefile (outdated):

```diff
-	cd js && mkdir -p ./src/pb && npx protoc --experimental_allow_proto3_optional --ts_out ./src/pb --ts_opt long_type_bigint,ts_nocheck,eslint_disable,add_pb_suffix --proto_path ../internal/pb/ ../$^
+	cd js && mkdir -p ./src/pb && for proto in $^; do \
+		npx protoc --experimental_allow_proto3_optional --ts_out ./src/pb --ts_opt long_type_bigint,ts_nocheck,eslint_disable,add_pb_suffix --proto_path ../internal/pb/ ../$$proto; \
+	done
```
multiple protofiles broke `$^`
```diff
@@ -72,6 +73,7 @@
 	flake.packages.postgresql
 	flake.packages.dev
 	flake.packages.clean
+	flake.packages.glibcLocales
```
This makes the flake work ok on linux
```go
}

func (c *Cached) PopulateDiskCache(ctx context.Context, req *pb.PopulateDiskCacheRequest) (*pb.PopulateDiskCacheResponse, error) {
	err := requireAdminAuth(ctx)
```
Shouldn't this require shared reader?
Because this actually modifies the filesystem at an arbitrary path using the permissions of the daemon, and because we don't actually want to use this for much other than testing locally, I figured it was better to make it restrictive and accept admin only. I think if we ever have a non-CSI use case for this we can lower it? Or I can just delete this whole piece too.
We use this pattern for dev only routes (especially for things like resetting a DB) and it's probably a good call here: https://github.com/gadget-inc/dateilager/blob/main/pkg/api/fs.go#L783-L785
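Something like this, sketched against the names in the diff; the environment check is hypothetical, standing in for however the linked dev-only-route pattern decides it's not in prod:

```go
func (c *Cached) PopulateDiskCache(ctx context.Context, req *pb.PopulateDiskCacheRequest) (*pb.PopulateDiskCacheResponse, error) {
	// Admin-only, since this writes to an arbitrary path with the daemon's permissions.
	if err := requireAdminAuth(ctx); err != nil {
		return nil, err
	}

	// Hypothetical dev-only guard, mirroring the pattern linked above.
	if os.Getenv("DL_ENV") == "production" {
		return nil, status.Error(codes.Unimplemented, "PopulateDiskCache is disabled in production")
	}

	// ... populate the disk cache and return the version that was written ...
	return &pb.PopulateDiskCacheResponse{}, nil
}
```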
```go
func (c *Cached) Prepare(ctx context.Context) error {
	start := time.Now()
	newDir := path.Join(c.StagingPath, randomString())
	version, count, err := c.Client.GetCache(ctx, newDir)
```
In theory the caching system can support being updated (pulling in a new cache version alongside an existing one) and we could do that to a node after it's been booted up.
But considering the speed of change of caches so far, I think a one shot is fine and new cache versions will appear as nodes cycle in and out.
In the future we likely want to expose some kind of GRPC endpoint to force every node to update its cache to the latest version.
Yep definitely -- the `Prepare` function is built to do that: it stages the cache in a new subdir over time, so it will support the nice pattern of preparing a new one while still serving the old one. I think that's an easy improvement, but I tried to be disciplined and just get something working for now.
So they are actually safe to build within the same directory. The objects don't overlap (and if they do, DL should be smart enough to not write a hash that already exists), and then once all of the objects are written, the version file is updated to contain a new row with the new version. So you could fold multiple caches into one directory and it's race safe. That's why `availableCacheVersions` is a list.
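A sketch of that shape, with an assumed file name and one-version-per-row format (the real layout in DL may differ):

```go
package cache

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// availableCacheVersions lists every version recorded in the directory's
// versions file; each row is appended only after that version's objects are
// fully written, which is what makes concurrent builders race safe.
func availableCacheVersions(dir string) ([]int64, error) {
	data, err := os.ReadFile(filepath.Join(dir, "versions"))
	if err != nil {
		return nil, err
	}

	var versions []int64
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		if line == "" {
			continue
		}
		v, err := strconv.ParseInt(line, 10, 64)
		if err != nil {
			return nil, err
		}
		versions = append(versions, v)
	}
	return versions, nil
}

// addCacheVersion appends a new row once all objects for that version exist.
func addCacheVersion(dir string, version int64) error {
	f, err := os.OpenFile(filepath.Join(dir, "versions"), os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = fmt.Fprintf(f, "%d\n", version)
	return err
}
```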
I think it might make sense to still keep the caches in different directories so that we don't have an ever-growing list of files to hardlink into place for each specialization -- cool to leave as is?
Yeah I think we'll come back to the updating cache later, we've got a few bridges to cross before nailing that down.
FYI, I'm going to push commits to this branch
Small changes to prep for caching daemon
…n demand

- a long running background process we'll deploy as a daemonset on k8s nodes
- moves the cache into place via hardlinking the golden copy from a shared location
- operates using the host filesystem, and thus has pretty privileged access
pkg/client/client.go (outdated):

```diff
 }

 func NewClientConn(conn *grpc.ClientConn) *Client {
-	return &Client{conn: conn, fs: pb.NewFsClient(conn)}
+	return &Client{conn: conn, fs: pb.NewFsClient(conn), cache: pb.NewCacheClient(conn)}
```
As I was playing around with this code, this ended up being a bit weird. We are using a single GRPC connection to talk both to the DL server and to the Cached server. But in reality, if you create a `NewClientConn` with a GRPC connection to the server, the cached client commands won't work, and vice versa. I'm going to refactor this a bit.
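One shape the refactor could take, sketched (names here are illustrative, not the PR's final API): give each server its own wrapper instead of one `Client` that only half-works against either endpoint.

```go
package client

import (
	"google.golang.org/grpc"

	"github.com/gadget-inc/dateilager/pkg/pb"
)

// FsClient talks to the DL server; CacheClient talks to the cached daemon.
// Each wraps its own connection, so neither can be pointed at the wrong endpoint.
type FsClient struct {
	conn *grpc.ClientConn
	fs   pb.FsClient
}

type CacheClient struct {
	conn  *grpc.ClientConn
	cache pb.CacheClient
}

func NewFsClientConn(conn *grpc.ClientConn) *FsClient {
	return &FsClient{conn: conn, fs: pb.NewFsClient(conn)}
}

func NewCacheClientConn(conn *grpc.ClientConn) *CacheClient {
	return &CacheClient{conn: conn, cache: pb.NewCacheClient(conn)}
}
```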
This adds a new binary that implements a caching daemon -- a long running agent that sits beside DL workloads that's able to very quickly populate a cache for those workloads. I called it `cached` ("cache-dee") in the sense that it is a daemon that caches. We intend to run this daemon in k8s as a daemonset, but that's in the next PR.

The fundamental approach with this thing is to hardlink the cache from a shared spot into each cache user's working directory. The DL client running in the cache user's workspace will then hardlink what it can from the cache, and download whatever else it needs.
More explicitly:

- `cached` hardlinks the golden copy of the cache into `/gadget/cache` within the container
- the DL client rebuilds `/gadget/app`, using `/gadget/cache` as the cache, hardlinking a second time from there.

Since hardlinks share content, this means that the cache is shared on disk between all the users, and we don't need to do an expensive file copy or take up more space. The downside is that they are actually the same files in each place they exist, so if one entry gets modified, the other would too. This is very bad, as it'd allow one sandboxed cache user to affect the other sandboxed users' view of the cache. So, we rely on UNIX permissions to ensure that each cache user can't modify the shared cache in prod.
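A sketch of that permissions safety net (paths and modes are illustrative): because hardlinked copies share the underlying inode, chmod'ing the golden copy read-only also protects every linked view of it.

```go
package main

import (
	"io/fs"
	"os"
	"path/filepath"
)

// makeCacheReadOnly walks the golden copy and strips write bits. Since a
// hardlink shares the inode (and therefore the mode) with the original,
// sandboxed users who receive hardlinks can read the cache but not modify it.
func makeCacheReadOnly(root string) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return os.Chmod(path, 0o755) // directories stay traversable
		}
		return os.Chmod(path, 0o444) // files become read-only everywhere they're linked
	})
}

func main() {
	if err := makeCacheReadOnly("/gadget/cache"); err != nil {
		panic(err)
	}
}
```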
So, in service of running this sucker as a daemonset in CI, this introduces `cached`. This lets us mess around with it locally and sets the stage for #89 where we k8s-ify this thing.
You can 🎩 if you want with:

- `make server`
- `cached` with `make cached`
- `make client-getcache-from-daemon` to have the client fetch the cache via the daemon
- `make client-rebuild-with-cache` to rebuild using the cache