Symlink data loaders to save disk space #1841

tel · 2024-11-24T20:41:12Z

tel
Nov 24, 2024

I'm using parameterized pages and data loaders to create a report for each experiment I run. Unfortunately, the simplest approach I've found duplicates each data file. These files can be quite large, so the duplication is fairly wasteful.

In particular, I've got an experimental pipeline that dumps data files for review into src/data/results/[date]/[name].parquet where [date] varies as the experiment is rerun and [name] is one of many named output files, each a few gigs.

To generate a report for each of these results, I have a folder src/reports/[date]/ with files report.md and [name].parquet.js. Inside report.md I access these parquet files via the sql front matter, like sql: { inputs: ./inputs.parquet }. The [name].parquet.js data loader then just looks like this:

import {parseArgs} from "node:util";
import {createReadStream} from "node:fs";
import {join} from "node:path";

// The route parameters arrive as --date and --name command-line flags.
const {values: {date, name}} = parseArgs({options: {name: {type: "string"}, date: {type: "string"}}});
// Locate the existing result file and stream it to stdout unchanged.
const filename = join("src", "data", "results", date, `${name}.parquet`);
createReadStream(filename).pipe(process.stdout);

Effectively, just cat.

As I understand it, this means that at compile time Observable will run each of these data loaders and make a copy of all of the parquet files referenced in report.md. This works as intended, but as these files are often multiple gigabytes and there are dozens of them, it's wasteful.

Is there a way to have these data loaders merely link to the already extant data files?

Alternatively, is there another way to make this kind of design work? Ideally with good compatibility for DuckDB and Mosaic.

Thanks

Edit:

An additional thing came up in my workflow which could make this useful. In particular, I often rerun experiments over and over. This updates the modified time of the input files, but not of the copies. Thus, the copies remain cached even after they've expired.

Answered by mbostock

Fil · 2024-11-25T09:24:21Z

Fil
Nov 25, 2024
Collaborator

If I understand correctly, you don't want to create these data files, since they already exist (and might even change over time).

In this case data loaders aren't useful, and instead you could reference the files directly with their full URL. The sql front-matter supports loading data from URLs such as https://example.com/reports/2024-12-01/inputs.parquet, which you could generate from your page loader at reports/[date]/index.md.

You'll have to figure out how to serve your parquet files from these URLs, but this is just standard web server config. You might also serve them from a different URL (https://data.example.com or even https://data.some-other-domain.com with proper CORS headers).
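As a rough illustration of the URL approach described above, the address of each parquet file could be derived from the experiment date and file name. This is only a sketch: the helper name parquetUrl, the base URL, and the reports/[date]/[name].parquet layout are assumptions chosen to match the example URL in this comment, not anything Framework provides.

```javascript
// Hypothetical helper: builds the absolute URL of one experiment's parquet file.
// The path layout (reports/<date>/<name>.parquet) is an assumption for illustration.
function parquetUrl(baseUrl, date, name) {
  // encodeURIComponent guards against characters that are not URL-safe.
  const path = `reports/${encodeURIComponent(date)}/${encodeURIComponent(name)}.parquet`;
  // baseUrl should end with "/" so that the relative path is appended to it.
  return new URL(path, baseUrl).href;
}

console.log(parquetUrl("https://example.com/", "2024-12-01", "inputs"));
// → https://example.com/reports/2024-12-01/inputs.parquet
```

The resulting string could then be passed to whatever loads the file, for example when DuckDB is initialized from JavaScript rather than from the SQL front matter.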

TBH I'm not really sure you need parametrized pages either in this case. You could have a single page with a drop-down menu input that lets you choose which experiment to load and display?

4 replies

tel Nov 25, 2024
Author

This is helpful, thank you for the suggestion!

I think I was biased toward using tools like FileAttachment and sql in the top matter, probably due to their availability in the documentation. It took me a few minutes of digging through source to discover how to fully get around using the top matter (see this issue). Thus far, I've also been avoiding writing an actual page loader as that seemed to be the most difficult approach and also will likely break any IDE integration I've got going.

Looking more closely at what I lose by giving up static declaration of these dependencies, I think there are two downsides:

1. static analysis drives the data loaders, which are unnecessary in this circumstance, and
2. static analysis sets up the page auto-refresh, so if I load these files via a URL the page will not auto-refresh when the data changes

If those are the only things I lose, then it's not a big deal. The first doesn't matter and the second is of minimal importance.

It's not clear to me how I could make a dropdown, however. I'd still need a data loader which basically calls ls on the results folder to get all the known experiment IDs. Observable won't be able to know that this loader is sensitive to those files, though, so it'll get cached and I'll have to manually bust that cache whenever new experiments are generated. I ran into this problem when exploring another approach and opened a second discussion topic on it.

So far, I've gotten this working by using parameterized pages along with a manual configuration of the DuckDB client as I can use observable.params in FileAttachment declarations.

Fil Nov 25, 2024
Collaborator

The “hot reloading” feature is only available in preview, with a goal to help the developer iterate quickly on their code — it's not meant to keep data fresh.

I think you will need to set up a system to either actively poll the data (check the list every few seconds, compare timestamps, etc.; I'm not sure if we already have an example of this approach) or, more simply, reload it when the user presses a "refresh data" button. Or it could be a combination of both (poll, and when there's an update show a button).
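A minimal sketch of the polling idea, under stated assumptions: fetchListing is a hypothetical async function that you would write yourself (for instance around HEAD requests that read a Last-Modified header) and that returns an object mapping file names to timestamps. Neither helper below is part of Framework.

```javascript
// Returns the names whose timestamp is new or different in `current`.
// Both arguments are plain objects mapping a file name to a timestamp string.
function changedEntries(previous, current) {
  return Object.keys(current).filter((name) => previous[name] !== current[name]);
}

// Calls `fetchListing` every `intervalMs` milliseconds and reports changes.
// `fetchListing` is a hypothetical async function supplied by the caller.
function pollForChanges(fetchListing, onChange, intervalMs = 5000) {
  let previous = null;
  async function tick() {
    try {
      const current = await fetchListing();
      if (previous !== null) {
        const changed = changedEntries(previous, current);
        if (changed.length > 0) onChange(changed);
      }
      previous = current;
    } catch (error) {
      console.warn("Polling failed; will retry on the next tick.", error);
    }
  }
  tick(); // take the first snapshot immediately
  const timer = setInterval(tick, intervalMs);
  return () => clearInterval(timer); // call the returned function to stop polling
}
```

Here onChange could simply reveal a "refresh data" button instead of reloading automatically, which corresponds to the combined approach.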

Another possibility is to subscribe to a push/pull mechanism such as a websocket.

Framework is compatible with any of these approaches, but there are no specific built-in helpers; in particular, that's not what data loaders are for. (Data loaders create “snapshots” at build time that give instant access to the minimal subset of information needed to make your charts.)

Since you are using DuckDB, a feature that could help you read the list of available files at runtime is S3 globbing. This assumes that your parquet files are served from an S3-type server.

tel Nov 26, 2024
Author

These are all running locally at the moment. Ideally I wouldn't need a further process running. Manual refreshes aren't so bad; it's just notable that this is lost. It might be worth a feature request: the ability for data loaders to explicitly state their dependencies so they can be rerun, or even just to be marked as never cached.

tel Nov 26, 2024
Author

Related #1848

mbostock · 2024-11-25T18:50:02Z

mbostock
Nov 25, 2024
Maintainer

Is the issue that you can’t reference a parameter in the SQL front matter, and thus can’t compute the relative path from src/reports/[date]/report.md to src/data/results/[date]/[name].parquet?

Can you instead generate the data in src/reports/[date]/[name].parquet, so you don’t need to copy it? Then you can use ./inputs.parquet rather than needing a data loader.

Alternatively, you could use JavaScript to initialize DuckDB instead of using the SQL front matter, as you describe in #1846.

2 replies

tel Nov 25, 2024
Author

I was trying to think about that approach. I generate many data folders but want to have them all share the same report structure. I could drop the data files in places that overlap with the parameterized folder structure. That might work.

I have generally been using the manual DuckDB setup to get around this. I wrote this question prior to figuring out how to install the DuckDB client both for sql literals and the Mosaic coordinator.

tel Nov 26, 2024
Author

This worked great!

The git ignore rules are a little tricky as I want it to ignore everything except a directory labeled [id]. This conflicts with the gitignore file pattern rules which use brackets to denote character range matches and, as far as I can tell so far, cannot be escaped. I ended up just explicitly blacklisting .parquet files.
