# Readme: add explanations about design #243


We built Plz following these principles:

- Code and data must be stored for future reference.
- Data that isn't reproducible is worthless.
- You don't know the value of your data at the time of creation.
- Whatever part of the running environment can be captured by Plz, we capture
  it, so as to make jobs repeatable.
- Hardware is expensive.
- Code is a means to an end. What matters is the outcome you obtain from running
your code.
- Functionality is based on standard mechanisms like files and environment
variables. You don't need to add extra dependencies to your code or learn how
to read/write your data in specific ways.
- The tool must be flexible enough so that no unnecessary restrictions are
  imposed by its architecture. You should be able to do with Plz whatever you
  can do by running a program manually. It was surprising to find out how much
  of the friction around running jobs in the cloud could be solved just by
  tweaking the configuration, without requiring any changes to Plz code.

Plz is routinely used at `prodo.ai` to train ML models on AWS, some of which
take days to run on the most powerful instances available. We trust it to
start and terminate these instances as needed, and to manage our spot instances,
getting us a much better price than if we were using on-demand instances
all the time.

## How does Plz help?

If you didn't have Plz, these are the steps you'd need to run your code on an
AWS instance:

- go to the AWS console and start an instance (or create a launch template
  beforehand and then use the CLI)
- wait until the instance is up
- get the IP address of the instance from the console
- copy your code and data to the instance over ssh
- ssh to the instance and run your job, preferably inside Docker so that a
  dropped connection doesn't kill your job (but if you want Docker you have
  to maintain a `Dockerfile` and build the image)
- each time the connection drops or you turn off your computer, ssh in again.
  If you didn't use Docker, you've lost your terminal and your job very likely
  died with it
- watch your job until it finishes (and lose money whenever it has already
  finished but the instance keeps running because you didn't check often
  enough)
- copy your results back to your machine over ssh, being disciplined about
  where you store them and making sure you can link them to the (version of
  the) code that produced them if you'll have several runs that you want to
  compare. Or, if you started from a program that was running locally, change
  it to write to a non-ephemeral location
- if you care about your standard output/logs, gather and retrieve them somehow
- make a note of your results (like stats or accuracy), or copy the files
  containing them

All of that gets simplified to `plz run`. If you stopped the output of `plz run`
(by hitting Ctrl-C, or turning off your computer), you can run `plz output` to
get the output at any time.
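
As an illustrative sketch of that workflow (the log lines and details are
hypothetical; the exact behaviour depends on your configuration):

```
$ plz run          # snapshots your code, starts the instance, runs the job
...job logs stream to your terminal...
^C                 # Ctrl-C stops the output; the job keeps running
$ plz output       # later, from any machine: fetch the job's output
```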

If you want to rerun your job later (for instance, to try different parameters),
you would need to have saved a copy of the code (or have been very disciplined
with your git history, with tags or commits for every single one-line tweak you
try; more about that [below](#why-is-plz-the-way-it-is)), and possibly also
have the same data you used. You'd need to retrieve the code from wherever you
keep it (for instance, you may need to find the right git branch and switch to
it, possibly after creating a separate copy of the repo if you don't want to
interrupt what you are currently working on).
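
Compare that with rerunning under Plz (a sketch; consult the help for
`plz rerun` for the exact arguments it accepts, which are not assumed here):

```
# No digging through branches or saved copies of the code:
$ plz rerun        # relaunches a previous execution from its stored image
```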

Another important factor is that Plz gives you a standard way to run your code.
Just as seeing a Makefile tells you that you can type `make`, seeing a
`plz.config.json` tells you that you can do `plz run`. Your code can then be
launched from whatever machine your teammate happens to be sitting at
(especially if the job runs in the cloud). Teammates need to install `plz`,
sure, but your team will know how to do it after a couple of installs, and
that's one setup per team member instead of one setup per project.
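
For illustration only, here is the shape such a configuration file might have.
The field names below are hypothetical, not Plz's actual schema; see the Plz
documentation for the real one:

```
$ cat plz.config.json    # field names are illustrative, not the actual schema
{
  "name": "train-model",
  "command": "python train.py",
  "instance_type": "p2.xlarge"
}
```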

## Why is Plz the way it is?

This section is an attempt to describe the rationale behind the high-level
architecture of Plz.

- why Docker: it simplifies input and output, which results in concrete
  simplifications like log handling: we obtain a stream of logs from the running
  jobs just by calling the Docker API, with facilities to filter by time.
  Running commands over ssh requires either keeping the connection open so as
  to gather the output, or redirecting the output and reading it later from a
  file. In general, Docker provides not only isolation, but also an environment
  where the job runs autonomously with controlled inputs and outputs (see the
  sketch after this list)
- why not use git to store code snapshots (and to transfer code to the
  instance): because it's very common that users want to make changes that they
  don't necessarily want in their commit history. For instance, when users try
  to make their job run in the cloud, or to run it at a different scale than
  they are used to (for instance, with far more data than they use locally),
  they might try several one-line tweaks. The resulting commits (possibly
  paired with messages that would be meaningless in a month, like "Change
  foobar from 0 to 1") are hardly useful and pollute the repo history. Plz
  could also create a different branch for each job, but then (in order to
  allow for `plz rerun`) these branches would have to be kept, would be listed
  in `git branch`, etc. _A good summary answer to the question would be: users
  want to commit stuff that "works" (commits you can revert to, use for
  reference, etc.), and you don't know whether something works until you've run
  it._ The solution for code storage we implemented, using Docker images, is
  quite simple to implement and understand, as the Docker API allows you to
  just send the files as a tarball in order to create an image (see the sketch
  after this list). If we were using git, then for private repos we would need
  to handle git credentials on the instance, which would actually be more
  complicated than using Docker. Docker images are given a name so that they
  can be referenced later, making `plz rerun` easy to implement as well. The
  code can be retrieved by looking inside the image, which is a reliable source
  of truth, as it stores the code that was actually running
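
Both points lean on plain Docker mechanisms that you can try yourself from the
command line. A sketch (the names are placeholders, and this uses the Docker
CLI where Plz uses the API):

```
# Stream a job's logs, filtered by time, straight from Docker:
$ docker logs --since 10m some-job-container

# Create an image just by sending the files as a tarball
# (the tarball must contain a Dockerfile at its root):
$ tar -czf - . | docker build -t code-snapshot-example -
```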

### Could Plz be smaller?

- why do we need a controller/server: one reason is to manage locks (for
  instance, to avoid two job requests using the same instance). It's true that
  locking could be done just by using a redis server (so instead of a
  controller/Plz server, the CLI could maybe point to a redis server taking
  care of locks; see the first sketch after this list). But that would force
  the tool to assume that everyone uses it collaboratively (one could engineer
  an altered CLI that locks every instance, etc.). We make this assumption now,
  but we are not forced to keep it in the future. Another reason for a
  controller is a feature that we have considered for a while: rerunning jobs
  on spot instances that were terminated because they were outbid. To that end,
  something needs to be running permanently in the cloud, as there might not be
  a CLI running at the point in time when the spot instance is terminated. In
  general, if you want to do something serious with a bunch of instances that
  are running permanently, eventually you'll need a coordinator/controller.
  Even if the current features might not strictly require a controller, it's
  good that any features that do require it won't need a major refactor.
  Needless to say, a controller-less Plz cannot be obtained by just erasing the
  controller: a major effort would be needed so that the tasks done by the
  controller (setting inputs, collecting outputs, etc.) are performed by, for
  instance, a wrapper around the program being run by the user
- why collect information from the running program: while it would be possible
  to leave it to user programs to write to whatever non-ephemeral storage they
  choose, that would put a burden on the Plz user to change their program
  significantly with respect to a program they already run locally (for
  instance, using the AWS API to write to S3 instead of writing local files).
  With the current Plz mechanism, as long as there is a single point in your
  program where you can set the output directory (and if your program doesn't
  have such a point, it's a good idea to implement it anyway; see the second
  sketch after this list), you can write files and Plz will make them
  non-ephemeral for you. Also, with the current mechanism, team members know
  how to access the outcomes of your job even if they don't know the details
  (`plz output` for "blobs" and `plz measures` for structured outputs), and can
  read them using standard tools, as every computer setup can process JSON and
  files (as opposed to, say, running SQL queries in the cloud)
- why we manage the instances ourselves / why not Kubernetes: because
  autoscaling mechanisms (whether Kubernetes or autoscaling groups) do not
  cover the case of "interactive users" who want to run instances, see them
  spawn when they launch a job, and see them terminate when they stop it.
  Autoscaling mechanisms specify "cooling times" so that scaling changes don't
  happen all the time and degrade performance, but they make operations
  non-immediate and non-deterministic, and that can be really annoying when
  working interactively. We discovered all of this because our first attempt
  used AWS autoscaling groups, and that version of Plz was a pain to use and
  also to test manually ("did AWS get that we want to terminate this instance?
  Let's wait, sometimes it takes 5 minutes to take it down"). With respect to
  Kubernetes specifically, when we started Plz the AWS implementation of
  Kubernetes (EKS) wasn't there yet. A Kubernetes feature is in the works: we
  plan that users will be able to either specify a Kubernetes cluster to which
  the execution will be sent (to support the non-interactive case), or, as we
  currently do, specify an instance type, so that an instance is started and
  managed by Plz (to support the interactive case)
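
Two sketches for the points above. First, the controller-less locking
alternative mentioned in the first point, using redis directly (illustrative;
this is the design we decided against, not something Plz ships):

```
# Atomically claim an instance for 60 seconds; the command fails
# (returns nil) if someone else already holds the lock:
$ redis-cli SET lock:instance:i-0abc123 my-cli-id NX EX 60
```

Second, the single-point-of-output pattern that makes a program easy to run
under Plz (the variable and script names are hypothetical, chosen only to
illustrate the idea of one configurable output location):

```
# One knob controls where results go; locally it's just a directory,
# and under Plz the same files become non-ephemeral outputs.
OUTPUT_DIR="${OUTPUT_DIRECTORY:-./out}"   # hypothetical variable name
mkdir -p "$OUTPUT_DIR"
python train.py --results "$OUTPUT_DIR"   # hypothetical script and flag
```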

## Future work

In the future, Plz is intended to: