Run persistent containers for users with web frontends, like VS Code and Jupyter Notebooks.
PrairieLearn currently allows code editing using an in-page ACE Editor and can compile and test code via external graders, which take student code and instructor-provided test code and execute them in an instructor-defined container, returning the test results back to the student. While excellent for testing small code snippets, this is not a very flexible environment for writing and debugging more complex programs.
This RFC proposes to give students persistent remote containers (called Workspaces) to work in, configured by instructors to provide a per-question environment with a specific set of compilers, debuggers, editors, etc. This remote container would be accessed via a web-based frontend, such as VS Code or Jupyter Notebooks.
Goals:
- Workspaces should be instructor-defined on a per-question basis.
- Workspaces should launch from a PL question with a set of initial files in the home directory, which may be dynamically generated by the question server.py code.
- Workspaces should retain a readonly copy of these initial files to allow the student to reference/restore the initial state.
- The workspace should be accessible by the student via a web frontend (e.g., VS Code) that is served out of the workspace container.
- The student should have complete freedom to use the workspace as a development environment, including compiling and executing code.
- The files in the container home directory should be frequently autosaved to a persistent store, in case of container crashes (e.g., fork bombs).
- At any time, the student should be able to trigger a "Grade" action, which will pull an instructor-defined set of files from the container and run the usual PL grading code (either internal or external graders).
- Containers should auto-terminate after a period of inactivity (or be force-terminated by the student), but it should be possible to re-launch them with the persistent home-directory files restored.
The server architecture has three conceptual components. The MVP will implement them in two servers:
- PL main servers (possibly refactored in the future)
  - Web servers: render questions for the student with a "launch workspace" button in them.
  - Manager servers: coordinate the launching of workspaces and proxy all traffic from the student browser through to the host machines.
- Host servers: run the actual containers.
These three components are implemented within the main PL executable, but for deployment we can run different fleets of servers that use a config option to only turn on specific functionality.
- When we want to launch a container, we bundle files, upload to S3, download to worker, spin up a container with files mounted to a known good location.
- Run filesystem watcher on mounted directory.
- When files change, we’ll do two things:
  - We’ll upload the workspace state to S3.
  - We’ll push the submission state into the database (somewhere, tbd) and store errors too?
- We’ll serve the page as two parts:
  - An “outer part” that PrairieLearn controls - this gives us a place to show save status, and potentially show a grade button and immediate feedback in the long run. Can also show “I’ve bricked my container, pls help” button.
  - An “inner part”, which is the page served by the container.
- On the main server:
  - When we create a variant, we (maybe) create a workspace session (if it’s enabled for that question). This is at this point just an entry in a database table somewhere.
  - We render some kind of button to launch a workspace instance for that session.
  - The user clicks on that button.
  - We get a request for a particular workspace instance.
  - We check the authorization cookies to verify that the requesting user matches the authorized user for this workspace.
  - Routes:
    - /workspace/<workspace_id> (referred to as workspace_url later in this document) - serves basic outer frame markup
    - /workspace/<workspace_id>/frame/* - serves resources for outer frame
    - /workspace/<workspace_id>/heartbeat
    - /workspace/<workspace_id>/container/* - proxy * to inner frame
- On the host:
  - /workspace/<workspace_id>/container/* goes to the host that’s running this container.
  - Within the host, we’ll proxy that to the appropriate container.
  - Each container will probably need port 80 bound to some random, unique port that we can target for forwards.
  - The host will listen for three types of signals: launch, sync, and kill container.
- How to map requests to workspace hosts?
  - workspace_hosts table that stores current information about each host VM.
- How do we kill off old containers?
  - Containers are killed after either:
    - We don’t receive N heartbeats in a row.
    - The user hasn’t saved for X amount of time.
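The two kill conditions above can be sketched as a single check. This is only a sketch under stated assumptions: this document leaves N, X, and the heartbeat interval open, so the constants below are placeholders, and `WorkspaceActivity` and `should_kill` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder values: this RFC does not define N, X, or the heartbeat interval.
HEARTBEAT_INTERVAL_SEC = 60
MAX_MISSED_HEARTBEATS = 5       # "N"
MAX_IDLE_SEC = 2 * 60 * 60      # "X"


@dataclass
class WorkspaceActivity:
    last_heartbeat_at: float  # seconds since epoch
    last_save_at: float       # seconds since epoch


def should_kill(activity: WorkspaceActivity, now: float) -> bool:
    """Return True if either kill condition from the list above holds."""
    # Number of whole heartbeat intervals that passed without a heartbeat.
    missed = (now - activity.last_heartbeat_at) // HEARTBEAT_INTERVAL_SEC
    idle = now - activity.last_save_at
    return missed >= MAX_MISSED_HEARTBEATS or idle >= MAX_IDLE_SEC
```

Keeping the check a pure function of timestamps means it could run on either the manager or the host, which is still an open question here.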
We need to make sure that cookies are inaccessible both to client-side code (PrairieLearn#2503) and to the workspace container on the server side (we need to configure our proxy to strip out at least the Cookie header, if not more).
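A minimal sketch of the server-side half, assuming the proxy has access to the request headers as a plain mapping before forwarding. `strip_sensitive_headers` is a hypothetical helper, and the set of headers to drop beyond Cookie is not settled in this document.

```python
from typing import Dict

# Assumption: only "Cookie" is listed for now; the RFC says "if not more".
STRIPPED_HEADERS = {"cookie"}


def strip_sensitive_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the request headers that is safe to forward to a container."""
    return {
        name: value
        for name, value in headers.items()
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() not in STRIPPED_HEADERS
    }
```

For example, `strip_sensitive_headers({"Cookie": "session=secret", "Accept": "text/html"})` keeps only the `Accept` header.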
- questions
  - Add a new workspace_image column
  - Add a new workspace_graded_files column
  - Add a new workspace_port column
- variants
  - Add a workspace_id column
    - Consider adding a UNIQUE constraint on workspace_id
- workspace_hosts:
  - id: a unique ID for this host
  - instance_id: the AWS instance ID for this host
  - hostname: the hostname (IP address, DNS address, etc) for this host
- workspaces: new table
  - id: a unique ID for this workspace
  - s3_bucket: The S3 bucket that this workspace's state lives in
  - s3_root_key: The root "path" within the S3 bucket
  - workspace_host_id (nullable): The ID of the host that this workspace container is running on, if any
  - state: reflects the "state" of this workspace; one of the following:
    - uninitialized: no resources have been created for this workspace yet; can transition to stopped
    - stopped: S3 resources for the workspace exist, but it is not running on a particular machine; can transition to launching
    - launching: we are allocating a host for this workspace and starting the appropriate container on that host; can transition to running or stopped (if launching fails)
    - running: the container for this workspace is running; can transition to stopped
- workspace_logs: notable events/messages associated with a particular workspace (state transitions, errors, explicit restarts, etc.)
  - workspace_id: ID of the associated workspace
  - date: timestamp of the event
  - message: string message
  - level: the level of this particular log
Course staff will declare workspace config per question via workspaceOptions in info.json. To begin, the only options will be a Docker image, port number, and list of files to be graded:
{
"workspaceOptions": {
"image": "some-docker-image-name",
"port": 15000,
"gradedFiles": ["starter_code.h", "starter_code.c"]
}
}

The home directory in the workspace will be determined by the workspace directory inside a question directory. In the future, we'll add the ability to dynamically generate files via server.py and place them into the home directory. This is not part of the MVP.
questions
`-- myquestion
+-- info.json
+-- question.html
+-- server.py
+-- clientFilesQuestion
+-- tests
| +-- correct_answer.c
| `-- test_run.py
|
`-- workspace
+-- .bashrc
+-- starter_code.h
`-- starter_code.c
The workspace options will need to be synced to the questions table via the usual syncing code.
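As an illustration of that sync step, the sketch below maps workspaceOptions from info.json onto the three new questions columns. The function name and the validation rules are assumptions; the existing syncing code is not reproduced here.

```python
import json
from typing import Any, Dict, Optional


def workspace_columns(info_json: str) -> Optional[Dict[str, Any]]:
    """Map workspaceOptions to the new questions columns, or None if absent."""
    info = json.loads(info_json)
    options = info.get("workspaceOptions")
    if options is None:
        return None  # this question does not use a workspace
    image = options.get("image")
    port = options.get("port")
    graded_files = options.get("gradedFiles", [])
    # The checks below are assumed; the RFC does not specify validation.
    if not isinstance(image, str) or not image:
        raise ValueError("workspaceOptions.image must be a non-empty string")
    if not isinstance(port, int) or isinstance(port, bool) or not 0 < port < 65536:
        raise ValueError("workspaceOptions.port must be a valid port number")
    if not all(isinstance(name, str) for name in graded_files):
        raise ValueError("workspaceOptions.gradedFiles must be a list of strings")
    return {
        "workspace_image": image,
        "workspace_port": port,
        "workspace_graded_files": list(graded_files),
    }
```

A question without workspaceOptions yields `None`, which would leave workspace_image as NULL and so signal that no workspace needs to be created for its variants.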
What happens when we render a question with an associated workspace?
When a new variant of a question is created, the main server will create a corresponding workspace in the database associated with that particular variant. This database entry will contain a unique hash/id/something. However, we're not going to actually provision any containers, etc. for this workspace just yet. The state of the new workspace row will be uninitialized.
We'll introduce a new <pl-workspace> element that renders (to start with) a "Launch workspace" button. We should introduce a new workspace_url to data.options, and this element (or potentially other elements) can use this to render a button. workspace_url will be something of the form /workspace/[workspace_id].
Implementation notes:
- Workspace will be created and inserted into the DB in the variants_insert sproc
  - In variants_insert, query for questions.workspace_image to determine if it's necessary to create a workspace
- workspaceUrl should be generated in the _buildQuestionUrls function in lib/question.js
- data.options.workspace_url should be populated in the renderPanel function of question-servers/freeform.js; workspaceUrl should be available from the locals object like the other URLs
- <pl-workspace> element should read data.options.workspace_url and render a block-level link styled as a button that opens the workspace URL in a new tab
When this button is clicked, the URL at workspace_url will be opened in a new tab.
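A rough sketch of what the element's render step could look like, assuming the usual `render(element_html, data)` element entry point. The CSS classes and the choice to render nothing when no workspace_url is present are assumptions, not decisions made in this document.

```python
import html


def render(element_html: str, data: dict) -> str:
    """Render a block-level link, styled as a button, that opens the workspace."""
    url = data["options"].get("workspace_url")
    if url is None:
        return ""  # assumption: render nothing when the question has no workspace
    # target="_blank" opens the workspace in a new tab, as described above.
    return (
        f'<a class="btn btn-primary btn-block" href="{html.escape(url, quote=True)}" '
        'target="_blank" rel="noopener">Launch workspace</a>'
    )
```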
What happens when a user lands on a workspace_url?
workspace_url pages will be served by the main PrairieLearn server (someday, we could split this into a separate autoscaled component).
When the main server gets a request to this url, we'll first check if we have an existing instance of a workspace by checking the state column of the appropriate workspace. There will be three cases here:
- If the workspace state is uninitialized:
  - Respond to the request with the basic markup for the outer frame
  - Begin a transaction
    - Obtain a lock on the row in the workspaces table
    - Create archive of home directory
    - Upload to appropriate key in S3: workspace_{id}/. We'll save this archive twice under two names:
      - initial.tar.gz: the initial state (this can later be restored)
      - current.tar.gz: the "working" state of the workspace (this will be modified as the user interacts with the workspace)
    - Update workspace state to stopped
  - Commit the transaction, thereby releasing the lock
  - Send change:state websocket message to client
  - Begin a new transaction
    - Obtain a lock on the row in the workspaces table
    - Allocate a host for this workspace's container
    - Instruct host to begin loading images, S3 resources, etc
    - Update workspace state to launching
  - Commit the transaction, thereby releasing the lock
  - Send change:state websocket message to client
- If the workspace state is stopped:
  - Respond to the request with the basic markup for the outer frame
  - [Do whole launch container thing from the above section here; let's write this out in more detail later]
- If the workspace state is launching or running:
  - Respond to the request with the basic markup for the outer frame
In all three cases, the client will receive exactly the same outer frame markup. The "outer frame" will initially render a loading screen and set up a websocket connection to a PL server.
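The archive step for a not-yet-initialized workspace can be sketched as follows. This only builds the archive and the two S3 keys named above; the actual upload, the database transaction, and the row lock are left out, and both function names are hypothetical.

```python
import io
import tarfile
from pathlib import Path
from typing import Dict


def archive_home_directory(home_dir: Path) -> bytes:
    """Create a gzip-compressed tar archive of the initial home directory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # arcname="." stores paths relative to the home directory itself.
        archive.add(str(home_dir), arcname=".")
    return buffer.getvalue()


def initial_uploads(workspace_id: int, home_dir: Path) -> Dict[str, bytes]:
    """Return the S3 keys and bodies to upload for a new workspace."""
    body = archive_home_directory(home_dir)
    prefix = f"workspace_{workspace_id}/"
    return {
        prefix + "initial.tar.gz": body,  # restorable initial state
        prefix + "current.tar.gz": body,  # working state, overwritten on save
    }
```

Both keys start out with identical contents; only current.tar.gz would change as the student works.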
This is explicitly modeled on the existing external grading websocket code (externalGradingLiveUpdate.ejs). All efforts should be made to keep the workspaces implementation of websockets consistent with the external grading implementation, as it has proved very robust in production.
- init: sent from client to server to request initial state.
- change:state: sent from server to client to inform it of a state change.
  - uninitialized->stopped: We have created S3 resources for this workspace.
  - stopped->launching: We are allocating a host for this workspace and loading the necessary image and S3 resources to the host.
  - launching->running: The container for the workspace is running and ready to serve requests.
  - launching->stopped: We failed to start a container for the workspace.
  - running->stopped: The container for the workspace has stopped and cannot serve requests.
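The transitions listed above form a small state machine, which could be enforced with a lookup table. This is a sketch: the helper names and the shape of the change:state payload are assumptions, since the message format is not defined in this document.

```python
from typing import Dict, FrozenSet

# Allowed transitions, copied from the list above.
TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "uninitialized": frozenset({"stopped"}),
    "stopped": frozenset({"launching"}),
    "launching": frozenset({"running", "stopped"}),
    "running": frozenset({"stopped"}),
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the transition is not in the table."""
    if new not in TRANSITIONS.get(current, frozenset()):
        raise ValueError(f"invalid workspace state transition: {current} -> {new}")
    return new


def change_state_message(workspace_id: int, state: str) -> dict:
    """Build a change:state payload (the payload shape is an assumption)."""
    return {"event": "change:state", "workspace_id": workspace_id, "state": state}
```

Rejecting anything outside the table (for example stopped -> running) would surface bugs where a host reports a container as running without the launch step having been recorded.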
The client will be divided into two parts - the outer frame and the inner frame. The outer frame will render PrairieLearn-provided UI that is shared across all workspaces and will show things like a "restart container" button and the status of the container. The outer frame will also be responsible for rendering the inner frame, which is served from the workspace container.
The outer frame will include JavaScript that runs when the page "boots up". This JS will connect to the /workspace websocket and send an init event. The server will respond to this with the current state of the workspace (launching, running, or stopped). Future state changes will be delivered via change:state events as documented above.
If/when the status becomes running, we'll try to load the inner frame.
The "restart container" button should kill and relaunch the underlying workspace container. (This will be elaborated on later.)
The "Save" button should be removed from the PrairieLearn question page, and the "Save & grade" button should be renamed to just "Grade".
When the "Grade" button is clicked, the PrairieLearn web server will do one of three things:
- If the workspace state is uninitialized, that is an invalid submission.
- If the workspace state is launching or stopped, the PrairieLearn web server will pull the state of the workspace from S3.
  - If that request fails, this is an error.
- If the workspace state is running, the PrairieLearn web server will query the workspace host directly for its files.
  - If that request fails or times out (this is not an error), fall back to S3.
  - If the request to S3 fails, this is an error.
Once the PrairieLearn web server has the workspace state, it will create a submission with the gradedFiles (specified in info.json) saved to submitted_answer._files and kick off the normal grading process. This matches the PrairieLearn convention of storing files that is used by pl-file-upload, pl-file-editor, pl-file-preview, etc.
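The "Grade" decision tree above can be sketched with the two file sources passed in as callables. Everything here is illustrative: the function names are hypothetical, the name/base64-contents shape of each `_files` entry is an assumption about the existing convention, and silently skipping a missing graded file is a choice this document does not make.

```python
import base64
from typing import Callable, Dict, List

Files = Dict[str, bytes]  # file name -> contents


def fetch_workspace_files(
    state: str,
    fetch_from_host: Callable[[], Files],
    fetch_from_s3: Callable[[], Files],
) -> Files:
    """Pick the source of workspace files based on the workspace state."""
    if state == "uninitialized":
        raise ValueError("invalid submission: workspace is uninitialized")
    if state == "running":
        try:
            return fetch_from_host()
        except (ConnectionError, TimeoutError):
            pass  # not an error: fall back to S3
    # launching, stopped, or host fallback; an S3 failure propagates as an error.
    return fetch_from_s3()


def build_files_entry(files: Files, graded_files: List[str]) -> List[dict]:
    """Keep only gradedFiles, base64-encoded in a name/contents shape."""
    return [
        {"name": name, "contents": base64.b64encode(files[name]).decode("ascii")}
        for name in graded_files
        if name in files
    ]
```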
After the MVP, the UX could be improved to reduce the weirdness of needing to have two separate pages open at once, with editing and grading split between tabs.
There will be some number of EC2 instances responsible for running the containers that power workspaces. The workspace_hosts table will store metadata about each host, at a minimum the hostname where it can be reached. These will be referenced by the workspace_host_id column in the workspaces table.
Workspace hosts expose a simple API that allows them to be controlled and queried by a PrairieLearn web server. That API will be served from /api/v1/ and include the following routes:
- POST /workspace/<workspace_id>/launch: Begins the asynchronous process of launching a workspace container. This entails:
  - Pulling the Docker image
  - Pulling the workspace state from S3
  - Placing the workspace state into a directory on disk
  - Starting a container with the workspace state mounted as the home directory
  - Changing the workspace state to running
  - Emitting a change:state websocket event
- POST /workspace/<workspace_id>/stop: Tears down any resources associated with this container.
- GET /workspace/<workspace_id>/graded_files: Responds with a tarball including the set of graded files for this question.
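For the container-start step of the launch route, one possible shape is to build a `docker run` command line. This is a sketch only: the container naming scheme, the in-container home path, and the use of the CLI rather than the Docker API are all assumptions.

```python
from typing import List


def docker_run_args(
    workspace_id: int,
    image: str,
    container_port: int,
    host_port: int,
    state_dir: str,
    home_dir: str = "/home/user",  # assumption: the in-container home path is not specified
) -> List[str]:
    """Build the argv for starting a workspace container in the background."""
    return [
        "docker", "run",
        "--detach",
        "--name", f"workspace-{workspace_id}",           # hypothetical naming scheme
        "--publish", f"{host_port}:{container_port}",    # unique host port -> container port
        "--volume", f"{state_dir}:{home_dir}",           # workspace state as home directory
        image,
    ]
```

The resulting list could be handed to `subprocess.run` on the host; the host port is the unique per-container port that proxied requests are forwarded to.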
Workspace hosts will also respond to /workspace/<workspace_id>/container/*, which mirrors the route on the PrairieLearn web server. When a workspace host receives a request to that path, it forwards * to the workspace container for that ID.
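The forwarding rule can be sketched as a pure path-to-URL mapping. The assumptions here are that workspace IDs are numeric and that the host keeps a table from workspace ID to the local port bound for that container; `upstream_url` and the table are hypothetical.

```python
import re
from typing import Dict, Optional

CONTAINER_ROUTE = re.compile(r"^/workspace/(?P<workspace_id>\d+)/container(?P<rest>/.*)?$")


def upstream_url(path: str, ports: Dict[int, int]) -> Optional[str]:
    """Map a /workspace/<workspace_id>/container/* path to the container's local URL."""
    match = CONTAINER_ROUTE.match(path)
    if match is None:
        return None  # not a container route
    port = ports.get(int(match.group("workspace_id")))
    if port is None:
        return None  # this host is not running that workspace
    # Forward only the part matched by "*", defaulting to the container root.
    return f"http://127.0.0.1:{port}{match.group('rest') or '/'}"
```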
The workspace host will monitor the workspace state (which is mounted into the workspace container and will be written to when the workspace is saved). When the host detects a file change, it will upload the current workspace state to S3. The workspace host should check that it is still the current host for this workspace in workspace_host_id before syncing to S3.
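A sketch of the change detection and ownership guard described above. It polls and compares content hashes rather than using a real filesystem watcher, purely to keep the example self-contained; the `upload` callable and the way `current_host_id` is read from workspace_host_id are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Optional


def snapshot(directory: Path) -> Dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(directory)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def sync_if_changed(
    directory: Path,
    previous: Dict[str, str],
    this_host_id: int,
    current_host_id: Optional[int],  # value of workspace_host_id read from the database
    upload: Callable[[Path], None],
) -> Dict[str, str]:
    """Upload when files changed and this host still owns the workspace."""
    current = snapshot(directory)
    if current == previous:
        return previous  # nothing changed since the last check
    if current_host_id != this_host_id:
        return previous  # another host (or none) owns this workspace: do not sync
    upload(directory)
    return current
```

The ownership check is what prevents a stale host from overwriting newer state in S3 after the workspace has moved elsewhere.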
If the underlying container dies, we set the workspace_host_id for that container to NULL and update its state to stopped.
- Since PrairieLearn will be serving a bunch of different roles depending on context, PrairieLearn's server.js should be split up so that only code needed to serve a particular role is loaded. While we're refactoring, let's just make it better, do things async, etc.
- There’s a distinction between workspace state and submission state - the former can include arbitrary files, the latter just includes whatever the question specifies.
- Define specifics of what happens on the client
- Define specifics of what happens on a host
- Protocol for communication between workspace hosts and web servers
- Protocol for user-initiated restart of a container
- What does this look like when running locally?
- How do we toggle the server mode (web vs host)?
- Figure out what happens to websockets when workspace moves to new host
- Algorithm for placing containers on workspace hosts
- Algorithm/implementation for autoscaling host fleet
- Edge cases: timeouts on containers, etc.