Skip to content

feat: store richtext data as HTML rather than a Prosemirror tree#1239

Merged
tefkah merged 46 commits into
mainfrom
tfk/html-richtext
May 29, 2025
Merged

feat: store richtext data as HTML rather than a Prosemirror tree#1239
tefkah merged 46 commits into
mainfrom
tfk/html-richtext

Conversation

@tefkah

@tefkah tefkah commented May 12, 2025

Copy link
Copy Markdown
Member

Issue(s) Resolved

High-level Explanation of PR

Alternatively/in addition, i have a triplet of videos.

  1. just goes over the text of this PR (optional)
summary.mp4
  1. shows the serialization/deserialization working with the editor (optional)
demo.mp4
  1. shows/goes over the changes to the context editor (recommended. gets cut off slightly at the end)
editor-stuff.mp4

This PR changes how we store RichText: rather than storing the full ProseMirror tree, we first convert it to HTML and only store that.

Some reasons to do this:

  • Easier to convert format: rather than having to create $N$ prosemirror-to-X tools, we only need to use the (native) DOMSerializer, and then are able to create (much easier) $N-1$ (the 1 being HTML) html-to-X tools. This could integrate better with our existing rehype/remark converters.
  • Much easier to display a "pretty good" version of RichText. Rather than any clients (like the site builder) needing to ship and implement some prosemirror-to-html functionality, we can "just" do a quick __dangerouslySetInnerHTML to get 85% of the way there
  • Much easier to index for fulltext search! kind of trivial to remove all the <> from HTML and index that, much harder to do so for a json blob.
  • html is cool, and prosemirror is cringe

Schema

The RichText schema is now just a plain String schema. I wanted to add some regex maybe but decided against it, as we could still handle "" as valid HTML.

FTS

Previously the prosemirror tree was indexed for full text search extremely naively. I wanted to make the body more searchable, so we now do something like

  • on update/creation, check if pubvalue is RichText
  • if so, remove all <.*>, and index the remainder
  • now only the content of the body gets searched

This could be improved by weighting eg headings and non-headings differently, which is something we should do in the future. That way results will be ranked higher if eg ponies occurs in an h1 rather than in a span tag.

Where does the conversion happen

This was the most difficult part to figure out for me and I've gone back and forth quite a few times on where exactly the stored HTML should be converted to ProseMirror and vice versa when the created ProseMirror tree from the editor should be converted back to HTML again to be stored. At first I had this all on the client, then I had it on server and then I had it on the client again.

Finally, I decided that the conversion from ProseMirror to HTML should happen on the server again, but that this conversion only happens on pages/actions dealing with the ContextEditor (ie the PubEditor). Everywhere else treats a RichText field as HTML.

Without trying to go on a tangent, the reasoning is roughly

  • only the things that need to think about prosemirror think about it
  • converting ProseMirror to HTML on the client can be slow, especially if we were to do that on onChange. so a clean "ContextEditor converts back and forth between HTML and ProseMirror` doesn't really cut it

Hacks

(see 3rd video for demo/some explanation)

Although previously this was meant as an optimization in the case that the Context Editor was converting between HTML and ProseMirror, I kind of left it in because it does make the Editor faster.
Instead of tracking the state of the Editor in the form by doing field.onChange, we now instead expose a ref which retrieves the value of the Context Editor when it is needed: on submission time of the form.

This is kind of evil and not at all React-y, but it does make the editor much faster, at least twice as fast for big documents.

I think we should keep it even though it kind of goes against our React-y model, as we can still treat everything inside of the ContextEditor React-y. One reason I think why this is justified is because the Context Editor is not really a controlled component, but with the onChange hook it kind of makes it seem like it is, ie we don't pass the field.value back into the ProseMirror Editor. The Context Editor controls its own state anyway and we just need a way to get that state out, because we can't manipulate it outside of the Context Editor and expect it to stay the same anyway.

Now this might be a problem in the future and we might want to revert it when dealing with updates to transcluded values, eg the ContextAtoms. Maybe we do want to update the value of a transcluded starter:abstract in the Editor when we change the starter:abstract value in the form field above. But we can burn that bridge when we get to it!

Test Plan

  1. Go to legacy

  2. CMD+K Ponies, go to the one pub that pubs up

  3. Observe the beautiful html content

  4. Create a new pub

  5. Try to cram as many different nodes in the richtext body as you can

  6. Save + reload

  7. hopefully they stay the same!

Screenshots (if applicable)

Notes

@github-actions

github-actions Bot commented May 12, 2025

Copy link
Copy Markdown
PR Preview Action v1.6.1
Preview removed because the pull request was closed.
2025-05-29 18:30 UTC

@tefkah tefkah added the preview Auto-deploys a preview application label May 12, 2025
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 12, 2025 17:23 Inactive

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically this allows for much better indexing of html by first stripping all the tags essentially.
this could be improved by creating to sets to index: headings and body, and weighting the headings more heavily in the search algo. i didn't do that for now as that's kinda overkill.

],
toDOM: () => ["math-inline", { class: "math-node" }, 0],
toDOM: (node: Node) => renderMath(node, "math-inline"),
} satisfies NodeSpec;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im very happy that this works: basically math is now stored as MathML! this would maybe prove to be a problem, but luckily katex stores the raw "input" as an <annotation> node, which we can use to still easily parse the HTML (katex does not provide any mathml-to-katex functionality).

we could also still allow maybe raw katex as the body, that would make importing custom html a bit easier, eg

<math-display>
     \int_a^b x^2dx
</math-display>

Comment on lines +14 to +23
// this is not nice, i would like to avoid manually calling `document.createElement`
// as we now need to keep track of setting `global.document` when this is called on the server
// forturnately, we only really server render the HTML when importing Legacy text content
// could be easily solved if `toDOM` was passed the document object, but ProseMirror does not seem to intend providing this
// https://discuss.prosemirror.net/t/getting-a-hold-of-the-document-used-by-prosemirror-from-todom/8392
if (!global.document) {
throw new Error(
"document not found. Trying to serialize math in a non-browser environment. To do this, set `global.document` to eg a `JSDOM` document before serializing."
);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very annoying, if you know of a better way to have prosemirror convert a raw html string to a node let me know!

Comment on lines 22 to 43

export function getJsonSchemaByCoreSchemaType(coreSchemaType: CoreSchemaType, config?: unknown) {
switch (coreSchemaType) {
case CoreSchemaType.Boolean:
return Boolean;
case CoreSchemaType.DateTime:
return DateTime;
case CoreSchemaType.Email:
return Email;
case CoreSchemaType.FileUpload:
return FileUpload;
case CoreSchemaType.MemberId:
return MemberId;
case CoreSchemaType.Null:
return Null;
case CoreSchemaType.Number:
return Number;
case CoreSchemaType.NumericArray:
return getNumericArrayWithMinMax(config);
case CoreSchemaType.RichText:
return RichText;
case CoreSchemaType.String:
return String;
case CoreSchemaType.StringArray:
return getStringArrayWithMinMax(config);
case CoreSchemaType.URL:
return URL;
case CoreSchemaType.Vector3:
return Vector3;
default:
const _exhaustiveCheck: never = coreSchemaType;
return _exhaustiveCheck;
}
const schemaMap = {
[CoreSchemaType.Boolean]: () => Boolean,
[CoreSchemaType.DateTime]: () => DateTime,
[CoreSchemaType.Email]: () => Email,
[CoreSchemaType.FileUpload]: () => FileUpload,
[CoreSchemaType.MemberId]: () => MemberId,
[CoreSchemaType.Null]: () => Null,
[CoreSchemaType.Number]: () => Number,
[CoreSchemaType.NumericArray]: (config) => getNumericArrayWithMinMax(config),
[CoreSchemaType.RichText]: () => RichText,
[CoreSchemaType.String]: () => String,
[CoreSchemaType.StringArray]: (config) => getStringArrayWithMinMax(config),
[CoreSchemaType.URL]: () => URL,
[CoreSchemaType.Vector3]: () => Vector3,
} as const satisfies Record<CoreSchemaType, (config?: unknown) => (typeof Schemas)[CoreSchemaType]>;

export function getJsonSchemaByCoreSchemaType<C extends CoreSchemaType>(
coreSchemaType: C,
config?: unknown
) {
return schemaMap[coreSchemaType](config) as ReturnType<(typeof schemaMap)[C]>;
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this just provides better type info

Comment on lines +36 to +84
export function ShowMore({
children,
className,
variant,
showMoreText = "Show more",
showLessText = "Show less",
animate = true,
...props
}: ShowMoreProps) {
const [expanded, setExpanded] = useState(false);

const toggleExpanded = () => setExpanded((prev) => !prev);

return (
<div className={cn("w-fit", className)} {...props}>
<div
className={cn(
showMoreVariants({ variant }),
expanded && "max-h-full",
animate && "transition-[max-height] duration-300 ease-in-out",
!expanded &&
"after:absolute after:bottom-0 after:left-0 after:h-16 after:w-full after:bg-gradient-to-t after:from-background after:to-transparent"
)}
>
{children}
</div>
<div className="mt-2 flex justify-center">
<Button
variant="ghost"
size="sm"
onClick={toggleExpanded}
className="flex items-center gap-1"
>
{expanded ? (
<>
{showLessText}
<ChevronUp size={16} />
</>
) : (
<>
{showMoreText}
<ChevronDown size={16} />
</>
)}
</Button>
</div>
</div>
);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: ideally this does not show "show more" if the thing is smaller than the max size

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this should maybe be a proper <details><summary> element for accessibilty

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is this article, which is one of the more complex ones we have JOTE
ideally we would have a nice large body that tests every feature of the editor

https://journal.trialanderror.org/pub/jointregeneration/release/6

@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 22, 2025 12:41 Inactive
}
const value = getter.current?.getCurrentState();

valuesPayload[slug] = value && serializeProseMirrorDoc(value.doc);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could also just convert it to HTML here, let me know your opinion on that! that way, the server action does not need to be able to handle prosemirror at all, and we can maybe simplify the validation code somewhat

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for converting it on the client! That should take some responsibility/load off the server.

Comment on lines +7 to 30
import { htmlToProsemirrorServer, prosemirrorToHTMLServer } from "../editor/serialize-server";

const validateAgainstContextEditorSchema = (value: unknown) => {
try {
if (typeof value === "string") {
const node = htmlToProsemirrorServer(value);

node.check();
// return renderNodeToHTML(node);
return { success: true, value };
}

const node = baseSchema.nodeFromJSON(value);

// TODO: reenable this
node.check();
return true;
} catch {
return false;

const html = prosemirrorToHTMLServer(node);

return { success: true, value: html };
} catch (e) {
return { success: false, error: e };
}
};

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is somewhat of a big change, as it turns the validation into parsing + coercion. i don't think this is incorrect per se, but it's quite a large change.

this is not necessary per se anymore: we could, on form submission, make the client convert the prosemirror tree to html instead. let me know your opinion here!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy either way. I definitely like the idea of the client doing the ProseMirror->HTML conversion if that would take load off of the server. But I'm okay with this too.

Comment on lines +2 to +22
CREATE OR REPLACE FUNCTION strip_html_tags(text_with_tags text)
RETURNS text AS $$
BEGIN
RETURN regexp_replace(
regexp_replace(
regexp_replace(
text_with_tags,
'<[^>]+>', -- removes HTML tags
' ',
'gi'
),
'&[^;]+;', -- removes HTML entities
' ',
'gi'
),
'\s+', -- collapse multiple spaces
' ',
'g'
);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

immutable means its a pure function. postgres allegedly does some stuff with that info

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i can legally include this, bc the license is CC BY. might need to include a header that does the attribution

Comment on lines +98 to +110
// this is slightly evil and should not be taken as an example of good api design
// it basically allows you to retrieve the value of the editor from the parent component
// whenever you want, rather than having to do it through `onChange` (which would also cause a re-render)
// this makes (some) sense in this case bc the editor is almost fully uncontrolled:
// you cannot pass in an `EditorState` that holds the actual value
useImperativeHandle(
props.getterRef,
() => ({
getCurrentState: () => editorState,
}),
[editorState]
);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nifty, but a bit hacky. as i mention in the body, we might want to make the context editor more of a controlled component in the future in order to sync transluded pubfields, eg if i change abstract:title in the form i might expect it to change automatically here. we should first figure out transclusion tho

Comment on lines +23 to +31
let toBeProcessedContent =
node instanceof Node ? node.content : Node.fromJSON(baseSchema, node).content;

const fragment = DOMSerializer.fromSchema(baseSchema).serializeFragment(toBeProcessedContent, {
document: doc,
});

const container = doc.createElement("div");
container.appendChild(fragment);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no sure if there's a more elegant way to do this

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From about ten minutes of searching, no, it doesn't seem like there's a much better way.

Comment on lines +36 to +84
export function ShowMore({
children,
className,
variant,
showMoreText = "Show more",
showLessText = "Show less",
animate = true,
...props
}: ShowMoreProps) {
const [expanded, setExpanded] = useState(false);

const toggleExpanded = () => setExpanded((prev) => !prev);

return (
<div className={cn("w-fit", className)} {...props}>
<div
className={cn(
showMoreVariants({ variant }),
expanded && "max-h-full",
animate && "transition-[max-height] duration-300 ease-in-out",
!expanded &&
"after:absolute after:bottom-0 after:left-0 after:h-16 after:w-full after:bg-gradient-to-t after:from-background after:to-transparent"
)}
>
{children}
</div>
<div className="mt-2 flex justify-center">
<Button
variant="ghost"
size="sm"
onClick={toggleExpanded}
className="flex items-center gap-1"
>
{expanded ? (
<>
{showLessText}
<ChevronUp size={16} />
</>
) : (
<>
{showMoreText}
<ChevronDown size={16} />
</>
)}
</Button>
</div>
</div>
);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this should maybe be a proper <details><summary> element for accessibilty

@tefkah tefkah marked this pull request as ready for review May 22, 2025 13:29
@tefkah tefkah requested review from 3mcd, allisonking and kalilsn and removed request for 3mcd May 22, 2025 13:29
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 08:22 Inactive
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 10:05 Inactive
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 10:13 Inactive
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 10:18 Inactive
Minio's latest client (https://github.com/minio/minio/releases/tag/RELEASE.2025-05-24T17-08-30Z) had a breaking change which made the old `mc config add` no longer work
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 10:54 Inactive
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 13:31 Inactive
@3mcd 3mcd temporarily deployed to gh-654103159-pr-1239 May 27, 2025 14:55 Inactive

@3mcd 3mcd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work man. No big requests, just a question or two!

* Refs which are able to retrieve the current state of the context editor
* on demand, rather than having the form state manage the value (which is slower)
*/
contextEditorGetters: ContextEditorGetters;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comments!

Comment thread core/app/components/forms/elements/ContextEditorElement.tsx Outdated
}
const value = getter.current?.getCurrentState();

valuesPayload[slug] = value && serializeProseMirrorDoc(value.doc);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for converting it on the client! That should take some responsibility/load off the server.

const isNew = !v.valueId;
const isChanged = v.valueId && !isEqualWith(arrayDefaults[v.valueId], v);
return isNew || isChanged;
});

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you break this up into multiple statements instead of cramming it into a single, harder to read ternary!

opts?: {
toJson?: boolean;
}
): T => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotta say, I'm a fan of using type aliases even for function with small signatures. While this isn't difficult to read, it does take extra time to scan/parse for me personally. I'd much rather both the generic and opts signatures be lifted into a type alias above. Feel free to ignore -- just my personal preference!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bc it's you i changed it <3

but i agree in the case of the generic here: that was kind of hard to read.
for one-of types like the opts here i personally find just inlining it easier to read, but i don't feel that strongly about it!

Comment thread core/lib/server/pub.ts
}

return q.execute();
return autoCache(dbQuery).execute();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +7 to 30
import { htmlToProsemirrorServer, prosemirrorToHTMLServer } from "../editor/serialize-server";

const validateAgainstContextEditorSchema = (value: unknown) => {
try {
if (typeof value === "string") {
const node = htmlToProsemirrorServer(value);

node.check();
// return renderNodeToHTML(node);
return { success: true, value };
}

const node = baseSchema.nodeFromJSON(value);

// TODO: reenable this
node.check();
return true;
} catch {
return false;

const html = prosemirrorToHTMLServer(node);

return { success: true, value: html };
} catch (e) {
return { success: false, error: e };
}
};

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy either way. I definitely like the idea of the client doing the ProseMirror->HTML conversion if that would take load off of the server. But I'm okay with this too.

Comment on lines +23 to +31
let toBeProcessedContent =
node instanceof Node ? node.content : Node.fromJSON(baseSchema, node).content;

const fragment = DOMSerializer.fromSchema(baseSchema).serializeFragment(toBeProcessedContent, {
document: doc,
});

const container = doc.createElement("div");
container.appendChild(fragment);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From about ten minutes of searching, no, it doesn't seem like there's a much better way.

@3mcd 3mcd had a problem deploying to gh-654103159-pr-1239 May 29, 2025 18:18 Error
@tefkah tefkah merged commit b5f6f5d into main May 29, 2025
17 checks passed
@tefkah tefkah deleted the tfk/html-richtext branch May 29, 2025 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preview Auto-deploys a preview application

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants