Workspaces

Summary

Run persistent containers for users with web frontends, like VS Code and Jupyter Notebooks.

Motivation and background

PrairieLearn currently allows code editing using an in-page ACE Editor and can compile and test code via external graders, which take student code and instructor-provided test code and execute them in an instructor-defined container, returning the test results back to the student. While excellent for testing small code snippets, this is not a very flexible environment for writing and debugging more complex programs.

This RFC proposes to give students persistent remote containers (called Workspaces) to work in, configured by instructors to provide a per-question environment with a specific set of compilers, debuggers, editors, etc. This remote container would be accessed via a web-based frontend, such as VS Code or Jupyter Notebooks.

Goals:

  • Workspaces should be instructor-defined on a per-question basis.
  • Workspaces should launch from a PL question with a set of initial files in the home directory, which may be dynamically generated by the question server.py code.
    • Workspaces should retain a readonly copy of these initial files to allow the student to reference/restore the initial state.
  • The workspace should be accessible by the student via a web frontend (e.g., VS Code) that is served out of the workspace container.
  • The student should have complete freedom to use the workspace as a development environment, including compiling and executing code.
  • The files in the container home directory should be frequently autosaved to a persistent store, in case of container crashes (e.g., fork bombs).
  • At any time, the student should be able to trigger a "Grade" action, which will pull an instructor-defined set of files from the container and run the usual PL grading code (either internal or external graders).
  • Containers should auto-terminate after a period of inactivity (or be force-terminated by the student), but it should be possible to relaunch them with the persistent home-directory files restored.

Proposed solution

The server architecture has three conceptual components. The MVP will implement them in two servers:

  1. PL main servers (possibly refactored in the future)
    • Web servers: render questions for the student with a "launch workspace" button in them.
    • Manager servers: coordinate the launching of workspaces and proxy all traffic from the student browser through to the host machines.
  2. Host servers: run the actual containers.

These three components are implemented within the main PL executable, but for deployment we can run different fleets of servers that use a config option to only turn on specific functionality.

How this works

  • When we want to launch a container, we bundle the files, upload them to S3, download them to the worker, and spin up a container with the files mounted at a known-good location.
  • We run a filesystem watcher on the mounted directory.
  • When files change, we’ll do two things:
    • We’ll upload the workspace state to S3.
    • We’ll push the submission state into the database (location TBD). Whether to also store errors there is an open question.

Frontend

  • We’ll serve the page as two parts:
    • An “outer part” that PrairieLearn controls - this gives us a place to show save status, and potentially show a grade button and immediate feedback in the long run. Can also show “I’ve bricked my container, pls help” button.
    • An “inner part”, which is the page served by the container.

Workspace container orchestration

  • On the main server:
    • When we create a variant, we create a workspace session if workspaces are enabled for that question. At this point, the session is just an entry in a database table.
    • We render some kind of button to launch a workspace instance for that session.
    • The user clicks on that button.
    • We get a request for a particular workspace instance.
    • We check the authorization cookies to verify that the requesting user matches the authorized user for this workspace.
    • Routes:
      • /workspace/<workspace_id> (referred to as workspace_url later in this document) - serves basic outer frame markup
      • /workspace/<workspace_id>/frame/* - serves resources for outer frame
      • /workspace/<workspace_id>/heartbeat
      • /workspace/<workspace_id>/container/* - proxy * to inner frame
  • On the host:
    • /workspace/<workspace_id>/container/* goes to the host that’s running this container.
    • Within the host, we’ll proxy that to the appropriate container.
    • Each container will probably need port 80 bound to some random, unique port that we can target for forwards.
    • The host will listen for three types of signals: launch, sync, and kill container.
  • How to map requests to workspace hosts?
    • workspace_hosts table that stores current information about each host VM.
  • How do we kill off old containers?
    • Containers are killed after either:
      • We don’t receive N heartbeats in a row.
      • The user hasn’t saved for X amount of time.

We need to make sure that cookies are inaccessible both to client-side code (PrairieLearn#2503) and to the server running inside the workspace container. For the latter, the proxy must be configured to strip at least the Cookie header, if not more, before forwarding requests.
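As an illustration of the proxy-side requirement, a header filter might look like the sketch below. Only the Cookie header is named in this RFC; including `authorization` in the set is an assumption about what "more things" could mean, and the function name is hypothetical.

```python
# Headers removed before a request is forwarded to a workspace container.
# "cookie" comes from the RFC text; "authorization" is an assumed extra.
STRIPPED_HEADERS = {"cookie", "authorization"}


def strip_sensitive_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers without entries that could leak credentials.

    HTTP header names are case-insensitive, so names are compared in lowercase.
    """
    return {name: value for name, value in headers.items()
            if name.lower() not in STRIPPED_HEADERS}
```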

Design

Database

  • questions
    • Add a new workspace_image column
    • Add a new workspace_graded_files column
    • Add a new workspace_port column
  • variants
    • Add a workspace_id column
      • Consider adding a UNIQUE constraint on workspace_id
  • workspace_hosts:
    • id: a unique ID for this host
    • instance_id: the AWS instance ID for this host
    • hostname: the hostname (IP address, DNS address, etc) for this host
  • workspaces: new table
    • id: a unique ID for this workspace
    • s3_bucket: The S3 bucket that this workspace's state lives in
    • s3_root_key: The root "path" within the S3 bucket
    • workspace_host_id (nullable): The ID of the host that this workspace container is running on, if any
    • state: reflects the "state" of this workspace; one of the following:
      • uninitialized: no resources have been created for this workspace yet; can transition to stopped
      • stopped: S3 resources for the workspace exist, but it is not running on a particular machine; can transition to launching
      • launching: we are allocating a host for this workspace and starting the appropriate container on that host; can transition to running or stopped (if launching fails)
      • running: the container for this workspace is running; can transition to stopped
  • workspace_logs: notable events/messages associated with a particular workspace (state transitions, errors, explicit restarts, etc.)
    • workspace_id: ID of the associated workspace
    • date: timestamp of the event
    • message: string message
    • level: the level of this particular log

Questions

Course staff will declare workspace config per question via workspaceOptions in info.json. To begin, the only options will be a Docker image, port number, and list of files to be graded:

{
  "workspaceOptions": {
    "image": "some-docker-image-name",
    "port": 15000,
    "gradedFiles": ["starter_code.h", "starter_code.c"]
  }
}

The home directory in the workspace will be determined by the workspace directory inside a question directory. In the future, we'll add the ability to dynamically generate files via server.py and place them into the home directory. This is not part of the MVP.

questions
`-- myquestion
    +-- info.json
    +-- question.html
    +-- server.py
    +-- clientFilesQuestion
    +-- tests
    |   +-- correct_answer.c
    |   `-- test_run.py
    |
    `-- workspace
        +-- .bashrc
        +-- starter_code.h
        `-- starter_code.c

The workspace options will need to be synced to the questions table via the usual syncing code.
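One way to picture the sync step is as a mapping from workspaceOptions onto the proposed questions columns. The sketch below assumes that a question without workspaceOptions stores NULL (None) in all three columns; the function name is hypothetical and this is not the existing syncing code.

```python
import json


def workspace_columns(info_json_text: str) -> dict:
    """Map workspaceOptions from info.json onto the proposed questions columns."""
    info = json.loads(info_json_text)
    options = info.get("workspaceOptions")
    if options is None:
        # Assumed convention: no workspace means NULL in every workspace column.
        return {"workspace_image": None,
                "workspace_port": None,
                "workspace_graded_files": None}
    return {
        "workspace_image": options["image"],
        "workspace_port": int(options["port"]),
        "workspace_graded_files": list(options.get("gradedFiles", [])),
    }
```

Under that assumption, a non-NULL workspace_image is enough to tell that a question needs a workspace.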

Student-facing question interface

What happens when we render a question with an associated workspace?

When a new variant of a question is created, the main server will create a corresponding workspace in the database associated with that particular variant. This database entry will contain a unique hash/id/something. However, we're not going to actually provision any containers, etc. for this workspace just yet. The state of the new workspace row will be uninitialized.

We'll introduce a new <pl-workspace> element that renders (to start with) a "Launch workspace" button. We should introduce a new workspace_url to data.options, and this element (or potentially other elements) can use this to render a button. workspace_url will be something of the form /workspace/[workspace_id].

Implementation notes:

  • Workspace will be created and inserted into the DB in the variants_insert sproc
  • In variants_insert, query for questions.workspace_image to determine if it's necessary to create a workspace
  • workspaceUrl should be generated in the _buildQuestionUrls function in lib/question.js
  • data.options.workspace_url should be populated in the renderPanel function of question-servers/freeform.js; workspaceUrl should be available from the locals object like the other URLs
  • <pl-workspace> element should read data.options.workspace_url and render a block-level link styled as a button that opens the workspace URL in a new tab

When this button is clicked, the URL at workspace_url will be opened in a new tab.
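A rough sketch of the markup the element could produce is shown below. It is a standalone function rather than a real element entry point, and the Bootstrap class names and the `rel="noopener"` attribute are assumptions, not part of this RFC.

```python
import html


def render_workspace_button(data: dict) -> str:
    """Render a block-level link, styled as a button, that opens the workspace in a new tab."""
    url = data["options"]["workspace_url"]
    return (
        f'<a href="{html.escape(url, quote=True)}" '
        'class="btn btn-primary btn-block" '   # assumed styling classes
        'target="_blank" rel="noopener">Launch workspace</a>'
    )
```

Escaping the URL keeps the attribute well-formed even if workspace_url ever contains characters such as quotes or ampersands.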

Accessing a workspace

What happens when a user lands on a workspace_url?

workspace_url pages will be served by the main PrairieLearn server (someday, we could split this into a separate autoscaled component).

When the main server gets a request to this url, we'll first check if we have an existing instance of a workspace by checking the state column of the appropriate workspace. There will be three cases here:

Workspace is in state uninitialized

  • Respond to the request with the basic markup for the outer frame
  • Begin a transaction
    • Obtain a lock on the row in the workspaces table
    • Create archive of home directory
    • Upload to appropriate key in S3: workspace_{id}/. We'll save this archive twice under two names:
      • initial.tar.gz: the initial state (this can later be restored)
      • current.tar.gz: the "working" state of the workspace (this will be modified as the user interacts with the workspace)
    • Update workspace state to stopped
  • Commit the transaction, thereby releasing the lock
  • Send change:state websocket message to client
  • Begin a new transaction
    • Obtain a lock on the row in the workspaces table
    • Allocate a host for this workspace's container
    • Instruct host to begin loading images, S3 resources, etc
    • Update workspace state to launching
  • Commit the transaction, thereby releasing the lock
  • Send change:state websocket message to client
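The archive-and-upload step in the first transaction could be sketched as follows. The actual S3 upload is left out; `initial_upload_plan` is a hypothetical helper that only returns which keys would be written, using the workspace_{id}/ prefix and the two archive names described above.

```python
import io
import tarfile
from pathlib import Path


def archive_home_directory(home_dir: Path) -> bytes:
    """Pack the contents of home_dir into an in-memory .tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        # arcname="." stores paths relative to the home directory itself.
        tar.add(str(home_dir), arcname=".")
    return buffer.getvalue()


def initial_upload_plan(workspace_id: int, archive: bytes) -> dict[str, bytes]:
    """Return the S3 keys to write for a newly initialized workspace."""
    prefix = f"workspace_{workspace_id}/"
    return {
        prefix + "initial.tar.gz": archive,   # pristine copy, restorable later
        prefix + "current.tar.gz": archive,   # working copy, overwritten on sync
    }
```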

Workspace is in state stopped

  • Respond to the request with the basic markup for the outer frame
  • [Do whole launch container thing from the above section here; let's write this out in more detail later]

Workspace is in state launching or running

  • Respond to the request with the basic markup for the outer frame

In all three cases, the client will receive exactly the same outer frame markup. The "outer frame" will initially render a loading screen and set up a websocket connection to a PL server.

Websocket protocol

This is explicitly modeled on the existing external grading websocket code (externalGradingLiveUpdate.ejs). All efforts should be made to keep the workspaces implementation of websockets consistent with the external grading implementation, as it has proved very robust in production.

  • init: sent from client to server to request initial state.
  • change:state: sent from server to client to inform it of a state change.
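To make the two message types concrete, here is a transport-free sketch of the server side. The JSON envelope and its field names (`event`, `workspace_id`, `state`) are assumptions; the RFC only names the events, and the real implementation would follow whatever the external grading websocket code already does.

```python
import json


def change_state_message(workspace_id: str, state: str) -> str:
    """Encode a server-to-client change:state message (payload shape is assumed)."""
    return json.dumps({"event": "change:state",
                       "workspace_id": workspace_id,
                       "state": state})


def handle_client_event(event: str, workspace_id: str, current_state: str) -> str:
    """Answer a client event; only "init" is defined in the client-to-server direction."""
    if event != "init":
        raise ValueError(f"unsupported client event: {event!r}")
    # Replying to init with the current state reuses the change:state shape,
    # so the client needs only one code path for state updates.
    return change_state_message(workspace_id, current_state)
```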

State machine transitions

  • uninitialized -> stopped: We have created S3 resources for this workspace.
  • stopped -> launching: We are allocating a host for this workspace and loading the necessary image and S3 resources to the host.
  • launching -> running: The container for the workspace is running and ready to serve requests.
  • launching -> stopped: We failed to start a container for the workspace.
  • running -> stopped: The container for the workspace has stopped and cannot serve requests.
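The transitions listed above can be written down as a lookup table, which makes it easy to reject any transition that is not in the list. This is a sketch; the `WorkspaceState` enum and the `transition` helper are hypothetical names.

```python
from enum import Enum


class WorkspaceState(Enum):
    UNINITIALIZED = "uninitialized"
    STOPPED = "stopped"
    LAUNCHING = "launching"
    RUNNING = "running"


# Each state maps to the set of states it may move to.
ALLOWED_TRANSITIONS = {
    WorkspaceState.UNINITIALIZED: {WorkspaceState.STOPPED},
    WorkspaceState.STOPPED: {WorkspaceState.LAUNCHING},
    WorkspaceState.LAUNCHING: {WorkspaceState.RUNNING, WorkspaceState.STOPPED},
    WorkspaceState.RUNNING: {WorkspaceState.STOPPED},
}


def transition(current: WorkspaceState, target: WorkspaceState) -> WorkspaceState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```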

Client

The client will be divided into two parts - the outer frame and the inner frame. The outer frame will render PrairieLearn-provided UI that is shared across all workspaces and will show things like a "restart container" button and the status of the container. The outer frame will also be responsible for rendering the inner frame, which is served from the workspace container.

The outer frame will include JavaScript that runs when the page "boots up". This JS will connect to the /workspace websocket and send an init event. The server will respond to this with the current state of the workspace (launching, running, or stopped). Future state changes will be delivered via change:state events as documented above.

If/when the status becomes running, we'll try to load the inner frame.

The "restart container" button should kill and relaunch the underlying workspace container. (This will be elaborated on later.)

PrairieLearn question page

The "Save" button should be removed from the PrairieLearn question page, and the "Save & grade" button should be renamed to just "Grade".

When the "Grade" button is clicked, the PrairieLearn web server will do one of three things:

  • If the workspace state is uninitialized, that is an invalid submission.
  • If the workspace state is launching or stopped, the PrairieLearn web server will pull the state of the workspace from S3.
    • If that request fails, this is an error.
  • If the workspace state is running, the PrairieLearn web server will query the workspace host directly for its files.
    • If that request fails or times out (this is not an error), fall back to S3.
    • If the request to S3 fails, this is an error.
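The decision logic above can be sketched as a single function. The two fetchers are passed in as callables so the sketch stays independent of any real host or S3 client; `WorkspaceUnavailableError` and the choice of `ConnectionError`/`TimeoutError` as the "host request failed" signals are assumptions.

```python
from typing import Callable


class WorkspaceUnavailableError(Exception):
    """Raised when a submission cannot be graded from this workspace."""


def fetch_state_for_grading(state: str,
                            fetch_from_host: Callable[[], bytes],
                            fetch_from_s3: Callable[[], bytes]) -> bytes:
    """Return the workspace state to grade, following the rules listed above."""
    if state == "uninitialized":
        raise WorkspaceUnavailableError("invalid submission: workspace is uninitialized")
    if state == "running":
        try:
            return fetch_from_host()
        except (ConnectionError, TimeoutError):
            pass  # not an error: fall back to S3
    # launching, stopped, or host fallback; an exception raised here is a real error.
    return fetch_from_s3()
```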

Once the PrairieLearn web server has the workspace state, it will create a submission with the gradedFiles (specified in info.json) saved to submitted_answer._files and kick off the normal grading process. This matches the PrairieLearn convention of storing files that is used by pl-file-upload, pl-file-editor, pl-file-preview, etc.
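Building the _files list from a workspace tarball might look like the sketch below. It assumes each entry is a dict with a `name` and base64-encoded `contents`, in the style of the file elements mentioned above, and that archive paths may carry a leading "./"; the function name is hypothetical.

```python
import base64
import io
import tarfile


def build_submitted_files(archive: bytes, graded_files: list[str]) -> list[dict]:
    """Extract only the graded files from a .tar.gz archive into _files entries."""
    wanted = set(graded_files)
    entries = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            name = member.name.removeprefix("./")
            if member.isfile() and name in wanted:
                contents = tar.extractfile(member).read()
                entries.append({
                    "name": name,
                    "contents": base64.b64encode(contents).decode("ascii"),
                })
    return entries
```

Files in the workspace that are not listed in gradedFiles are ignored, which matches the distinction between workspace state and submission state.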

After the MVP, the UX could be improved to reduce the weirdness of needing to have two separate pages open at once, with editing and grading split between tabs.

Workspace hosts

There will be some number of EC2 instances responsible for running the containers that power workspaces. The workspace_hosts table will store metadata about each host, at a minimum the hostname where it can be reached. These will be referenced by the workspace_host_id column in the workspaces table.

Workspace hosts expose a simple API that allows them to be controlled and queried by a PrairieLearn web server. That API will be served from /api/v1/ and include the following routes:

  • POST /workspace/<workspace_id>/launch: Begins the asynchronous process of launching a workspace container. This entails:
    • Pulling the Docker image
    • Pulling the workspace state from S3
    • Placing the workspace state into a directory on disk
    • Starting a container with the workspace state mounted as the home directory
    • Changing the workspace state to running
    • Emitting a change:state websocket event
  • POST /workspace/<workspace_id>/stop: Tears down any resources associated with this container.
  • GET /workspace/<workspace_id>/graded_files: Responds with a tarball including the set of graded files for this question.

Workspace hosts will also respond to /workspace/<workspace_id>/container/*, which mirrors the route on the PrairieLearn web server. When a workspace host receives a request to that path, it forwards * to the workspace container for that ID.
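The forwarding rule can be illustrated with a small path resolver. This sketch assumes numeric workspace IDs, containers reachable on 127.0.0.1, and a hypothetical `ports` mapping from workspace ID to the host port bound for that container.

```python
import re
from typing import Optional

# Matches /workspace/<workspace_id>/container and anything below it.
CONTAINER_ROUTE = re.compile(
    r"^/workspace/(?P<workspace_id>\d+)/container(?P<rest>/.*)?$"
)


def resolve_container_url(path: str, ports: dict[str, int]) -> Optional[str]:
    """Translate a proxied path into a URL on the local container, or None."""
    match = CONTAINER_ROUTE.match(path)
    if match is None:
        return None          # not a container route
    port = ports.get(match.group("workspace_id"))
    if port is None:
        return None          # this host is not running that workspace
    # Only the part after /container is forwarded to the container.
    return f"http://127.0.0.1:{port}{match.group('rest') or '/'}"
```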

The workspace host will monitor the workspace state (which is mounted into the workspace container and will be written to when the workspace is saved). When the host detects a file change, it will upload the current workspace state to S3. The workspace host should check that it is still the current host for this workspace in workspace_host_id before syncing to S3.
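The sync guard described above could be sketched as follows. Note that this sketch swaps the filesystem watcher for a simpler polling approach that compares content hashes between snapshots; the function names and the string host IDs are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional


def snapshot(directory: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(directory)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*")) if path.is_file()
    }


def should_sync(previous: dict[str, str],
                current: dict[str, str],
                this_host_id: str,
                workspace_host_id: Optional[str]) -> bool:
    """Sync only if this host still owns the workspace and files have changed."""
    if workspace_host_id != this_host_id:
        return False   # the workspace has moved or stopped; do not overwrite S3
    return previous != current
```

Checking ownership first means a stale host never overwrites state that a newer host has already uploaded.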

If the underlying container dies, we set the workspace_host_id for that container to NULL and update its state to stopped.

Notes

  • Since PrairieLearn will be serving a bunch of different roles depending on context, PrairieLearn's server.js should be split up so that only code needed to serve a particular role is loaded. While we're refactoring, let's just make it better, do things async, etc.
  • There’s a distinction between workspace state and submission state - the former can include arbitrary files, the latter just includes whatever the question specifies.

Remaining work and open questions

  • Define specifics of what happens on the client
  • Define specifics of what happens on a host
  • Protocol for communication between workspace hosts and web servers
  • Protocol for user-initiated restart of a container
  • What does this look like when running locally?
  • How do we toggle the server mode (web vs host)?
  • Figure out what happens to websockets when workspace moves to new host
  • Algorithm for placing containers on workspace hosts
  • Algorithm/implementation for autoscaling host fleet
  • Edge cases: timeouts on containers, etc.