More insight into in-progress generation? #93

@domenic

Description

A possible issue we've seen in experiments with the prompt API is that it can take a long time to respond. There are two main contributors to this:

  • Other parts of the browser, or websites, may be using the language model, such that a submitted prompt is behind them in the queue.
    • Note that beyond queue position, there is also queue "volume": e.g., 2 large prompts ahead in the queue could take longer to clear than 5 tiny ones.
  • Processing the input to the point where the model is about to start outputting tokens can take a while. (Chrome's current model apparently processes on the order of hundreds of input tokens per second, so for medium-sized inputs this could be a few seconds.)

Note that these issues also occur for Chrome's implementation of the writing assistance APIs (summarizer, writer, rewriter).

It's not clear how much we can or should do about this.


We could consider exposing some insight into these delaying factors. Here is one proposal that exposes all available data:

const result = await session.prompt(messages, {
  monitor(m) {
    console.log(m.queuePosition);
    m.addEventListener("queuepositionchange", () => { /* ... */ });

    // and/or some measure of queue volume?
    console.log(m.queuedTokens);
    m.addEventListener("queuedtokenschange", () => { /* ... */ });

    m.addEventListener("inputprogress", progressEvent => {
      console.log(progressEvent.loaded); // number between 0 and 1
    });

    // Maybe we could have an outputprogress listener too, for people who want
    // to use prompt() instead of promptStreaming() but still get progress info.
  }
});

Note that exposing the accurate queue position, or queued tokens count, is a significant privacy leak. It allows monitoring the usage of language model-based APIs across all sites, and creates a trivial-to-exploit cross-site communications channel. This privacy leak might be acceptable in extensions (at least with some permissions granted), but is not acceptable on the web.
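To make the channel concrete, here is a rough sketch of how two cooperating sites could communicate through the shared queue. It uses the hypothetical queuedTokens value from the proposal above, which does not exist in any implementation; the threshold is an arbitrary placeholder.

// Site A: encodes a 1 bit by inflating queue volume, a 0 bit by staying idle.
async function sendBit(session, bit) {
  if (bit === 1) {
    // Fire-and-forget a large prompt so the shared queue volume rises.
    session.prompt("x".repeat(100_000)).catch(() => {});
  }
}

// Site B: decodes the bit by sampling queue volume via the proposed monitor.
async function receiveBit(session) {
  const VOLUME_THRESHOLD = 10_000; // arbitrary cut-off for this sketch
  let observed = 0;
  await session.prompt("ping", {
    monitor(m) {
      observed = m.queuedTokens; // hypothetical property from the proposal above
    },
  });
  return observed > VOLUME_THRESHOLD ? 1 : 0;
}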

Exposing input progress is probably not a significant privacy leak, as it basically reflects hardware capabilities, which websites can already benchmark (e.g., using WebGL or WebGPU calls).

Possible less-powerful versions of this that we could expose include:

  • A single boolean exposing whether the prompt is at the front of the queue, or not, instead of the accurate queue position. This might be noisy and constrained enough to ship on the web.
  • A static number giving the browser's estimate of how many input tokens it processes per second.
  • Summarizing this all into a single estimated number of milliseconds until output starts being produced. With appropriate bucketing and randomization, this might be noisy enough to expose safely; a rough sketch follows this list.
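
As a sketch of that last idea (the event name, field, and bucket size below are invented for illustration and are not part of any proposal text):

const result = await session.prompt(messages, {
  monitor(m) {
    // Hypothetical event: fires once the browser has an estimate of when
    // output will begin.
    m.addEventListener("estimatedoutputstart", e => {
      // e.estimatedMillis would be rounded to a coarse bucket (say, 500 ms)
      // and jittered before being exposed, to blunt its use as a cross-site
      // signal.
      console.log(`~${e.estimatedMillis} ms until output starts`);
    });
  }
});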

The reason I am unsure about all of these ideas is that I don't know what web developers would do with this information. It's not really clear how to turn these into something like a progress bar; even the estimated time proposal will end up being unreliable.

As far as I can tell, this sort of information is not available from other language model APIs.

Perhaps for that reason, in my interactions with language model-backed products I haven't seen any UI that appears to be based on this sort of information.

In the end, the only type of UI I see is generic "please wait..." indeterminate progress bars. And the existing API provides enough information for those: as long as the promise has not settled, or the stream has not started producing chunks, you can display those UIs.
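
For example, something like the following already works against the shipped API surface (showSpinner(), hideSpinner(), render(), and appendChunk() are placeholder UI helpers, not part of any API):

// Non-streaming: keep the indeterminate indicator up until the promise settles.
showSpinner();
try {
  const result = await session.prompt(messages);
  render(result);
} finally {
  hideSpinner();
}

// Streaming: drop the indicator as soon as the first chunk arrives.
showSpinner();
let first = true;
for await (const chunk of session.promptStreaming(messages)) {
  if (first) { hideSpinner(); first = false; }
  appendChunk(chunk);
}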
