Merged
172 changes: 172 additions & 0 deletions spices/SPICE-0018-pkldoc-io-improvements.adoc
@@ -0,0 +1,172 @@
= `pkldoc` I/O Improvements

* Proposal: link:./SPICE-0018-pkldoc-io-improvements.adoc[SPICE-0018]
* Status: Accepted or Rejected
* Implemented in: Pkl <version>
* Category: Tooling

== Introduction

Re-generating pkldoc currently reads and writes too much metadata.
This causes storage, memory, and I/O to balloon for large enough documentation sites.

This SPICE describes a fix that reduces the number of metadata changes needed when re-generating documentation.

== Motivation

The following problems currently exist when generating pkldoc:

. In order to build the "known versions" and "known usages" metadata, the `package-data.json` file of every single package is required. This is an expensive operation, whose cost grows linearly with the number of package versions.
. When generating runtime data, every single runtime data file is re-written.
. When adding a new package version, the runtime data file for every historical version is changed. This results in too much I/O and too much storage use, because the runtime data file of every single version grows as the number of versions increases.

== Proposed Solution

A series of changes will be made to reduce the amount of data written, as well as the amount of I/O needed.

. Separate runtime data into package-level runtime data and package-version level runtime data.
. Generate package-level runtime data by consuming the previously generated package-level runtime data.
. Eliminate known subtype and known usage information at a cross-package level.
Contributor

This greatly limits the usefulness of this information. To my knowledge, Javadoc, KDoc, Scaladoc and Pkl’s IntelliJ plugin all support cross-package information.

Member Author

Pkldoc would still support cross-package information, but only "up" (who am I extending, what types am I using), just not "down" (who is using me).

As far as I know, neither Javadoc, KDoc, nor Scaladoc show information downwards.

IMO: I don't know how useful this information is, either. For commonly used enough types, this information becomes quite noisy.

Also: cross-package references like this are actually broken right now (they were intended to work, but they don't).
So, this actually isn't a regression.

Contributor

> As far as I know, neither Javadoc, KDoc, nor Scaladoc show information downwards.

From what I can see they all do, at least for subtypes.

Example (see “all known subinterfaces” etc.): https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/util/Collection.html

Member Author

Ah, interesting! TIL.

But, this type of information is still only maintained at some sort of local level. pkldoc is somewhat different, in that we can possibly generate docs of many, many packages. Imagine if the JDK's docs showed known subclasses of Collection for 3rd-party libs; that list would grow massively pretty quickly.

Contributor

@odenix odenix Aug 15, 2025

> But, this type of information is still only maintained at some sort of local level.

It’s as local or global as the user running Javadoc wants it to be. A Javadoc site can document any number of Java modules. It can also include information from any number of upstream Javadoc websites, which of course can’t depend on code documented by the current Javadoc site (only the other way around). I thought Pkldoc worked in the same way.

The JDK’s Javadoc documents dozens of modules, hundreds of packages, and thousands of classes. Even though every JDK version has its own Javadoc site, a single site is probably larger than Pkl Pantry’s multi-version Pkldoc.

Javadoc also tracks usages, but on a separate page: https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/util/class-use/Collection.html

If you’d like to reduce the amount of processing and information presented, one option is to track subtypes across packages and to drop usages altogether. This feels more useful to me than tracking some info across packages and other info only within a package.

Another option is to limit usage tracking to the package or package+module level.

Member Author

@bioball bioball Aug 15, 2025

pkldoc is somewhat different; in some cases, pkldoc sites serve as a central package documentation site (think https://pkg.go.dev/ or https://docs.rs). And in these cases, the site can be quite big; probably already several orders of magnitude larger than package_docs.

Contributor

@odenix odenix Aug 15, 2025

I see. “Arbitrarily” tracking some info only within packages is very surprising/limiting. If there’s no way around this, perhaps it should be reflected in the title, e.g., “subtypes within package”. Alternatively, this could be an option only used for central package doc sites.

Member Author

Yeah, I suppose we can have a flag that determines whether known types/usages are tracked across packages or not. And if not, we can change the label to "Subtypes within package", etc.

By the way, this is actually broken today (cross-package known usage/subtype doesn't work).

. Generate the search index by consuming the previously generated search index.
. Generate the main page by consuming only the "current" package-data.json files.

[#_detailed_design]
== Detailed design

=== Runtime data changes

Runtime data will be separated into package-version level runtime data and package level runtime data.

The package level runtime data will be written to a directory called `_`, whereas package-version level runtime data will be written to a directory whose name matches the version.

Here is an example folder structure, showing the `_` and `4.5.6` directories, which hold the package level and package-version level runtime data files, respectively.

[source]
----
data/my/package/
├── 4.5.6
│   ├── MyModule
│   │   ├── MyClass.json
│   │   └── index.json
│   └── index.json
└── _
    ├── MyModule
    │   ├── MyClass.json
    │   └── index.json
    └── index.json
----
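
The layout above can be sketched as a small path helper. This is only an illustration: `runtimeDataPaths`, `PACKAGE_LEVEL_DIR`, and the `dataRoot` parameter are hypothetical names, not part of pkldoc.

```javascript
// Directory name that holds package level runtime data (from this SPICE).
const PACKAGE_LEVEL_DIR = "_";

// Hypothetical helper: returns the two index.json paths for a package version.
function runtimeDataPaths(dataRoot, packageName, version) {
  const packageDir = `${dataRoot}/${packageName}`;
  return {
    // Shared by all versions of the package.
    packageLevel: `${packageDir}/${PACKAGE_LEVEL_DIR}/index.json`,
    // Specific to one version of the package.
    packageVersionLevel: `${packageDir}/${version}/index.json`,
  };
}

const paths = runtimeDataPaths("data", "my/package", "4.5.6");
console.log(paths.packageLevel);
console.log(paths.packageVersionLevel);
```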

There are three types of information captured:

1. Known versions
2. Known subtypes
3. Known usages

The known versions data will be recorded in the package level JSON runtime data.
The other two kinds of data will be recorded in the package-version level runtime data.

The known-subtypes and known-usages relationships will only be recorded inter-package.
Contributor

Did you mean “intra-package”?

Member Author

Yup, thanks!

The known-usages at the package level will be preserved.

When generating documentation for new package versions, any dependencies that are used which are also part of the site will have their known-usage information updated.

==== Runtime data format changes

The existing runtime data files will be used to generate new runtime data files.
To improve machine readability of these files, they are changed from `.js` to `.json` files.
Contributor

Does this change turn these files into a public API? Is all contained information relevant/suitable for a public API? Have you considered having a separate public API?

Member Author

pkldoc itself doesn't really care whether these should be a public API or not.
I guess whether these URLs end up being a "public API" depends on who is running the web server.

For our own package docs site, I don't think we would support this being used as an API.

BTW: one alternative that I thought about for managing metadata is to use a database (e.g. provide a jdbc connection string when running pkldoc). However, this is a much more involved change that would be quite a big rewrite of pkldoc. I can add some notes about this in the "alternatives considered".

We're also considering having a package index at some point, which would enable use-cases like: "bump my package to the latest version". This is orthogonal to pkldoc, though.

Contributor

I don’t (only) mean if these files are served but if a (local) consumer can rely on their presence and format (since you mentioned “improve machine readability”). I’m asking because I’m not sure if pkldoc’s internal metadata format should double as a public format with strict compatibility guarantees etc.

Member Author

Gotcha.

No; these files are only meant for pkldoc itself to consume. It's not meant to be an API for any other use-cases, and it's possible that a future migration will change the current JSON files in a breaking way.


Previous:

.index.js
[source,js]
----
runtimeData.links('known-versions', '[{ "text": "4.5.6", "href": "../4.5.6/index.html"}]');
----

New:

.index.json
[source,json]
----
{
  "knownVersions": [
    {
      "text": "4.5.6",
      "href": "../4.5.6/index.html"
    }
  ]
}
----
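
To illustrate how the two formats relate, the following sketch converts one legacy `runtimeData.links(...)` line into the new JSON shape. The helper name `legacyLinksToJson` and the regular expression are assumptions made for this example; this SPICE does not specify how old files are parsed.

```javascript
// Hypothetical helper, not part of pkldoc. It assumes the legacy payload is a
// single-quoted JSON string that contains no escaped single quotes.
function legacyLinksToJson(line) {
  // Matches: runtimeData.links('<id>', '<json array>');
  const match = /^runtimeData\.links\('([^']+)',\s*'(.*)'\);?\s*$/.exec(line.trim());
  if (match === null) {
    throw new Error(`Unrecognized runtime data line: ${line}`);
  }
  const [, id, payload] = match;
  // Turn the legacy id into a camelCase key: 'known-versions' -> 'knownVersions'.
  const key = id.replace(/-([a-z])/g, (_, c) => c.toUpperCase());
  return { [key]: JSON.parse(payload) };
}

const legacy =
  `runtimeData.links('known-versions', '[{ "text": "4.5.6", "href": "../4.5.6/index.html"}]');`;
const converted = legacyLinksToJson(legacy);
console.log(JSON.stringify(converted, null, 2));
```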

==== HTML/JS changes

The generated page HTML and `pkldoc.js` file will be changed to accommodate this.

This extra script will be added to every HTML head:

[source,html]
----
<script type="module">
import perPackageData from "../../data/com.package1/_/index.json" with { type: "json" }
import perPackageVersionData from "../../data/com.package1/1.2.3/index.json" with { type: "json" }

runtimeData.knownVersions(perPackageData.knownVersions, "1.2.3");
runtimeData.knownUsagesOrSubtypes("known-subtypes", perPackageVersionData.knownSubtypes);
runtimeData.knownUsagesOrSubtypes("known-usages", perPackageVersionData.knownUsages);
</script>
----

This script uses https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules#loading_non-javascript_resources[JavaScript modules] to load JSON as an import.
This feature is supported in all major browsers.

The pkldoc.js file will be changed accordingly.

==== Generating new runtime data

To generate new runtime data, the per-package information and per-package-version data need to be updated.

To generate new package level runtime data, the existing package level runtime data is read and parsed.
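
A minimal sketch of this read-modify-write step, assuming the package level `index.json` has the `knownVersions` shape from the format example in this document. The helper `addKnownVersion`, the append-at-end ordering, and the duplicate check are hypothetical; this SPICE does not define them.

```javascript
// Hypothetical helper: returns updated package level runtime data without
// reading any package-version level files.
function addKnownVersion(packageData, version) {
  const knownVersions = packageData.knownVersions ?? [];
  // Re-generating docs for an existing version must not add a duplicate.
  if (knownVersions.some((entry) => entry.text === version)) {
    return { ...packageData, knownVersions };
  }
  const entry = { text: version, href: `../${version}/index.html` };
  return { ...packageData, knownVersions: [...knownVersions, entry] };
}

// Stand-in for the previously generated data/my/package/_/index.json.
const previous = {
  knownVersions: [{ text: "4.5.6", href: "../4.5.6/index.html" }],
};
const updated = addKnownVersion(previous, "4.6.0");
console.log(updated.knownVersions.map((entry) => entry.text));
```

With this approach, adding a version touches one small file per package, regardless of how many historical versions exist.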

=== Search index generation changes

Instead of parsing all existing `package-data.json` files, the new search index is generated by consuming the previous search index.
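
One way to picture this incremental update is as a merge. The sketch below is an assumption-heavy illustration: the entry shape (`packageName`, `name`, `href`), the helper `mergeSearchIndex`, and the rule that a regenerated package replaces its old entries are all hypothetical and not taken from pkldoc.

```javascript
// Hypothetical entry shape: { packageName, name, href }.
function mergeSearchIndex(previousEntries, regeneratedEntries) {
  const regeneratedPackages = new Set(
    regeneratedEntries.map((entry) => entry.packageName)
  );
  // Keep entries of packages that were not regenerated in this run.
  const kept = previousEntries.filter(
    (entry) => !regeneratedPackages.has(entry.packageName)
  );
  return [...kept, ...regeneratedEntries];
}

const previousIndex = [
  { packageName: "com.package1", name: "MyClass", href: "com.package1/1.2.3/MyModule/MyClass.html" },
  { packageName: "com.package2", name: "Other", href: "com.package2/1.0.0/Mod/Other.html" },
];
const regenerated = [
  { packageName: "com.package1", name: "MyClass", href: "com.package1/1.2.4/MyModule/MyClass.html" },
];
const mergedIndex = mergeSearchIndex(previousIndex, regenerated);
console.log(mergedIndex.map((entry) => entry.href));
```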

=== Main page generation changes

The main page is generated by parsing the "current" `package-data.json` files.

[#_data_migration]
=== Data migration

The structure of a pkldoc website changes.
To record the data version, a schema version file is introduced at path `.pkldoc/VERSION`.

If this file is missing, a pkldoc website is considered to be version 1.
The changes described by this SPICE are captured as version 2.

If `pkldoc` is run against a version 1 website, it throws an error.

A new migration command is introduced, which migrates an existing pkldoc website from version 1 to version 2.

This command is invoked with the `--migrate` flag.

[source,shell]
----
pkldoc --migrate
----
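
The version check described in this section can be sketched as follows. This is not the pkldoc implementation: the function names, the error wording, and the assumption that `.pkldoc/VERSION` contains a plain integer are hypothetical.

```javascript
// Schema version introduced by this SPICE.
const CURRENT_SITE_VERSION = 2;

// `versionFileContents` is the text of `.pkldoc/VERSION`, or `undefined`
// if the file does not exist. A missing file means version 1.
function parseSiteVersion(versionFileContents) {
  if (versionFileContents === undefined) {
    return 1;
  }
  const version = Number.parseInt(versionFileContents.trim(), 10);
  if (!Number.isInteger(version) || version < 1) {
    throw new Error(`Invalid .pkldoc/VERSION contents: ${versionFileContents}`);
  }
  return version;
}

// Hypothetical guard run before generating docs into an existing website.
function assertSiteIsCurrent(versionFileContents) {
  const version = parseSiteVersion(versionFileContents);
  if (version < CURRENT_SITE_VERSION) {
    throw new Error(
      `pkldoc website is version ${version}; run 'pkldoc --migrate' first.`
    );
  }
  return version;
}
```

Keeping the parsing separate from the guard would let a migration command reuse `parseSiteVersion` to decide whether there is anything to migrate.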

== Compatibility

This is a breaking change, with a migration needed (see <<_data_migration>>).

== Future directions

N/A

== Alternatives considered

N/A

== Acknowledgements

N/A