feat: store richtext data as HTML rather than a Prosemirror tree#1239
Conversation
|
There was a problem hiding this comment.
basically this allows for much better indexing of html by first stripping all the tags essentially.
this could be improved by creating to sets to index: headings and body, and weighting the headings more heavily in the search algo. i didn't do that for now as that's kinda overkill.
| ], | ||
| toDOM: () => ["math-inline", { class: "math-node" }, 0], | ||
| toDOM: (node: Node) => renderMath(node, "math-inline"), | ||
| } satisfies NodeSpec; |
There was a problem hiding this comment.
im very happy that this works: basically math is now stored as MathML! this would maybe prove to be a problem, but luckily katex stores the raw "input" as an <annotation> node, which we can use to still easily parse the HTML (katex does not provide any mathml-to-katex functionality).
we could also still allow maybe raw katex as the body, that would make importing custom html a bit easier, eg
<math-display>
\int_a^b x^2dx
</math-display>| // this is not nice, i would like to avoid manually calling `document.createElement` | ||
| // as we now need to keep track of setting `global.document` when this is called on the server | ||
| // forturnately, we only really server render the HTML when importing Legacy text content | ||
| // could be easily solved if `toDOM` was passed the document object, but ProseMirror does not seem to intend providing this | ||
| // https://discuss.prosemirror.net/t/getting-a-hold-of-the-document-used-by-prosemirror-from-todom/8392 | ||
| if (!global.document) { | ||
| throw new Error( | ||
| "document not found. Trying to serialize math in a non-browser environment. To do this, set `global.document` to eg a `JSDOM` document before serializing." | ||
| ); | ||
| } |
There was a problem hiding this comment.
this is very annoying, if you know of a better way to have prosemirror convert a raw html string to a node let me know!
|
|
||
| export function getJsonSchemaByCoreSchemaType(coreSchemaType: CoreSchemaType, config?: unknown) { | ||
| switch (coreSchemaType) { | ||
| case CoreSchemaType.Boolean: | ||
| return Boolean; | ||
| case CoreSchemaType.DateTime: | ||
| return DateTime; | ||
| case CoreSchemaType.Email: | ||
| return Email; | ||
| case CoreSchemaType.FileUpload: | ||
| return FileUpload; | ||
| case CoreSchemaType.MemberId: | ||
| return MemberId; | ||
| case CoreSchemaType.Null: | ||
| return Null; | ||
| case CoreSchemaType.Number: | ||
| return Number; | ||
| case CoreSchemaType.NumericArray: | ||
| return getNumericArrayWithMinMax(config); | ||
| case CoreSchemaType.RichText: | ||
| return RichText; | ||
| case CoreSchemaType.String: | ||
| return String; | ||
| case CoreSchemaType.StringArray: | ||
| return getStringArrayWithMinMax(config); | ||
| case CoreSchemaType.URL: | ||
| return URL; | ||
| case CoreSchemaType.Vector3: | ||
| return Vector3; | ||
| default: | ||
| const _exhaustiveCheck: never = coreSchemaType; | ||
| return _exhaustiveCheck; | ||
| } | ||
| const schemaMap = { | ||
| [CoreSchemaType.Boolean]: () => Boolean, | ||
| [CoreSchemaType.DateTime]: () => DateTime, | ||
| [CoreSchemaType.Email]: () => Email, | ||
| [CoreSchemaType.FileUpload]: () => FileUpload, | ||
| [CoreSchemaType.MemberId]: () => MemberId, | ||
| [CoreSchemaType.Null]: () => Null, | ||
| [CoreSchemaType.Number]: () => Number, | ||
| [CoreSchemaType.NumericArray]: (config) => getNumericArrayWithMinMax(config), | ||
| [CoreSchemaType.RichText]: () => RichText, | ||
| [CoreSchemaType.String]: () => String, | ||
| [CoreSchemaType.StringArray]: (config) => getStringArrayWithMinMax(config), | ||
| [CoreSchemaType.URL]: () => URL, | ||
| [CoreSchemaType.Vector3]: () => Vector3, | ||
| } as const satisfies Record<CoreSchemaType, (config?: unknown) => (typeof Schemas)[CoreSchemaType]>; | ||
|
|
||
| export function getJsonSchemaByCoreSchemaType<C extends CoreSchemaType>( | ||
| coreSchemaType: C, | ||
| config?: unknown | ||
| ) { | ||
| return schemaMap[coreSchemaType](config) as ReturnType<(typeof schemaMap)[C]>; | ||
| } |
There was a problem hiding this comment.
this just provides better type info
| export function ShowMore({ | ||
| children, | ||
| className, | ||
| variant, | ||
| showMoreText = "Show more", | ||
| showLessText = "Show less", | ||
| animate = true, | ||
| ...props | ||
| }: ShowMoreProps) { | ||
| const [expanded, setExpanded] = useState(false); | ||
|
|
||
| const toggleExpanded = () => setExpanded((prev) => !prev); | ||
|
|
||
| return ( | ||
| <div className={cn("w-fit", className)} {...props}> | ||
| <div | ||
| className={cn( | ||
| showMoreVariants({ variant }), | ||
| expanded && "max-h-full", | ||
| animate && "transition-[max-height] duration-300 ease-in-out", | ||
| !expanded && | ||
| "after:absolute after:bottom-0 after:left-0 after:h-16 after:w-full after:bg-gradient-to-t after:from-background after:to-transparent" | ||
| )} | ||
| > | ||
| {children} | ||
| </div> | ||
| <div className="mt-2 flex justify-center"> | ||
| <Button | ||
| variant="ghost" | ||
| size="sm" | ||
| onClick={toggleExpanded} | ||
| className="flex items-center gap-1" | ||
| > | ||
| {expanded ? ( | ||
| <> | ||
| {showLessText} | ||
| <ChevronUp size={16} /> | ||
| </> | ||
| ) : ( | ||
| <> | ||
| {showMoreText} | ||
| <ChevronDown size={16} /> | ||
| </> | ||
| )} | ||
| </Button> | ||
| </div> | ||
| </div> | ||
| ); | ||
| } |
There was a problem hiding this comment.
todo: ideally this does not show "show more" if the thing is smaller than the max size
There was a problem hiding this comment.
also this should maybe be a proper <details><summary> element for accessibilty
There was a problem hiding this comment.
this is this article, which is one of the more complex ones we have JOTE
ideally we would have a nice large body that tests every feature of the editor
https://journal.trialanderror.org/pub/jointregeneration/release/6
| } | ||
| const value = getter.current?.getCurrentState(); | ||
|
|
||
| valuesPayload[slug] = value && serializeProseMirrorDoc(value.doc); |
There was a problem hiding this comment.
we could also just convert it to HTML here, let me know your opinion on that! that way, the server action does not need to be able to handle prosemirror at all, and we can maybe simplify the validation code somewhat
There was a problem hiding this comment.
I'm all for converting it on the client! That should take some responsibility/load off the server.
| import { htmlToProsemirrorServer, prosemirrorToHTMLServer } from "../editor/serialize-server"; | ||
|
|
||
| const validateAgainstContextEditorSchema = (value: unknown) => { | ||
| try { | ||
| if (typeof value === "string") { | ||
| const node = htmlToProsemirrorServer(value); | ||
|
|
||
| node.check(); | ||
| // return renderNodeToHTML(node); | ||
| return { success: true, value }; | ||
| } | ||
|
|
||
| const node = baseSchema.nodeFromJSON(value); | ||
|
|
||
| // TODO: reenable this | ||
| node.check(); | ||
| return true; | ||
| } catch { | ||
| return false; | ||
|
|
||
| const html = prosemirrorToHTMLServer(node); | ||
|
|
||
| return { success: true, value: html }; | ||
| } catch (e) { | ||
| return { success: false, error: e }; | ||
| } | ||
| }; |
There was a problem hiding this comment.
this is somewhat of a big change, as it turns the validation into parsing + coercion. i don't think this is incorrect per se, but it's quite a large change.
this is not necessary per se anymore: we could, on form submission, make the client convert the prosemirror tree to html instead. let me know your opinion here!
There was a problem hiding this comment.
I'm happy either way. I definitely like the idea of the client doing the ProseMirror->HTML conversion if that would take load off of the server. But I'm okay with this too.
| CREATE OR REPLACE FUNCTION strip_html_tags(text_with_tags text) | ||
| RETURNS text AS $$ | ||
| BEGIN | ||
| RETURN regexp_replace( | ||
| regexp_replace( | ||
| regexp_replace( | ||
| text_with_tags, | ||
| '<[^>]+>', -- removes HTML tags | ||
| ' ', | ||
| 'gi' | ||
| ), | ||
| '&[^;]+;', -- removes HTML entities | ||
| ' ', | ||
| 'gi' | ||
| ), | ||
| '\s+', -- collapse multiple spaces | ||
| ' ', | ||
| 'g' | ||
| ); | ||
| END; | ||
| $$ LANGUAGE plpgsql IMMUTABLE; |
There was a problem hiding this comment.
immutable means its a pure function. postgres allegedly does some stuff with that info
There was a problem hiding this comment.
i think i can legally include this, bc the license is CC BY. might need to include a header that does the attribution
| // this is slightly evil and should not be taken as an example of good api design | ||
| // it basically allows you to retrieve the value of the editor from the parent component | ||
| // whenever you want, rather than having to do it through `onChange` (which would also cause a re-render) | ||
| // this makes (some) sense in this case bc the editor is almost fully uncontrolled: | ||
| // you cannot pass in an `EditorState` that holds the actual value | ||
| useImperativeHandle( | ||
| props.getterRef, | ||
| () => ({ | ||
| getCurrentState: () => editorState, | ||
| }), | ||
| [editorState] | ||
| ); | ||
|
|
There was a problem hiding this comment.
this is nifty, but a bit hacky. as i mention in the body, we might want to make the context editor more of a controlled component in the future in order to sync transluded pubfields, eg if i change abstract:title in the form i might expect it to change automatically here. we should first figure out transclusion tho
| let toBeProcessedContent = | ||
| node instanceof Node ? node.content : Node.fromJSON(baseSchema, node).content; | ||
|
|
||
| const fragment = DOMSerializer.fromSchema(baseSchema).serializeFragment(toBeProcessedContent, { | ||
| document: doc, | ||
| }); | ||
|
|
||
| const container = doc.createElement("div"); | ||
| container.appendChild(fragment); |
There was a problem hiding this comment.
no sure if there's a more elegant way to do this
There was a problem hiding this comment.
From about ten minutes of searching, no, it doesn't seem like there's a much better way.
| export function ShowMore({ | ||
| children, | ||
| className, | ||
| variant, | ||
| showMoreText = "Show more", | ||
| showLessText = "Show less", | ||
| animate = true, | ||
| ...props | ||
| }: ShowMoreProps) { | ||
| const [expanded, setExpanded] = useState(false); | ||
|
|
||
| const toggleExpanded = () => setExpanded((prev) => !prev); | ||
|
|
||
| return ( | ||
| <div className={cn("w-fit", className)} {...props}> | ||
| <div | ||
| className={cn( | ||
| showMoreVariants({ variant }), | ||
| expanded && "max-h-full", | ||
| animate && "transition-[max-height] duration-300 ease-in-out", | ||
| !expanded && | ||
| "after:absolute after:bottom-0 after:left-0 after:h-16 after:w-full after:bg-gradient-to-t after:from-background after:to-transparent" | ||
| )} | ||
| > | ||
| {children} | ||
| </div> | ||
| <div className="mt-2 flex justify-center"> | ||
| <Button | ||
| variant="ghost" | ||
| size="sm" | ||
| onClick={toggleExpanded} | ||
| className="flex items-center gap-1" | ||
| > | ||
| {expanded ? ( | ||
| <> | ||
| {showLessText} | ||
| <ChevronUp size={16} /> | ||
| </> | ||
| ) : ( | ||
| <> | ||
| {showMoreText} | ||
| <ChevronDown size={16} /> | ||
| </> | ||
| )} | ||
| </Button> | ||
| </div> | ||
| </div> | ||
| ); | ||
| } |
There was a problem hiding this comment.
also this should maybe be a proper <details><summary> element for accessibilty
Minio's latest client (https://github.com/minio/minio/releases/tag/RELEASE.2025-05-24T17-08-30Z) had a breaking change which made the old `mc config add` no longer work
3mcd
left a comment
There was a problem hiding this comment.
Awesome work man. No big requests, just a question or two!
| * Refs which are able to retrieve the current state of the context editor | ||
| * on demand, rather than having the form state manage the value (which is slower) | ||
| */ | ||
| contextEditorGetters: ContextEditorGetters; |
| } | ||
| const value = getter.current?.getCurrentState(); | ||
|
|
||
| valuesPayload[slug] = value && serializeProseMirrorDoc(value.doc); |
There was a problem hiding this comment.
I'm all for converting it on the client! That should take some responsibility/load off the server.
| const isNew = !v.valueId; | ||
| const isChanged = v.valueId && !isEqualWith(arrayDefaults[v.valueId], v); | ||
| return isNew || isChanged; | ||
| }); |
There was a problem hiding this comment.
I like how you break this up into multiple statements instead of cramming it into a single, harder to read ternary!
| opts?: { | ||
| toJson?: boolean; | ||
| } | ||
| ): T => { |
There was a problem hiding this comment.
I gotta say, I'm a fan of using type aliases even for function with small signatures. While this isn't difficult to read, it does take extra time to scan/parse for me personally. I'd much rather both the generic and opts signatures be lifted into a type alias above. Feel free to ignore -- just my personal preference!
There was a problem hiding this comment.
bc it's you i changed it <3
but i agree in the case of the generic here: that was kind of hard to read.
for one-of types like the opts here i personally find just inlining it easier to read, but i don't feel that strongly about it!
| } | ||
|
|
||
| return q.execute(); | ||
| return autoCache(dbQuery).execute(); |
| import { htmlToProsemirrorServer, prosemirrorToHTMLServer } from "../editor/serialize-server"; | ||
|
|
||
| const validateAgainstContextEditorSchema = (value: unknown) => { | ||
| try { | ||
| if (typeof value === "string") { | ||
| const node = htmlToProsemirrorServer(value); | ||
|
|
||
| node.check(); | ||
| // return renderNodeToHTML(node); | ||
| return { success: true, value }; | ||
| } | ||
|
|
||
| const node = baseSchema.nodeFromJSON(value); | ||
|
|
||
| // TODO: reenable this | ||
| node.check(); | ||
| return true; | ||
| } catch { | ||
| return false; | ||
|
|
||
| const html = prosemirrorToHTMLServer(node); | ||
|
|
||
| return { success: true, value: html }; | ||
| } catch (e) { | ||
| return { success: false, error: e }; | ||
| } | ||
| }; |
There was a problem hiding this comment.
I'm happy either way. I definitely like the idea of the client doing the ProseMirror->HTML conversion if that would take load off of the server. But I'm okay with this too.
| let toBeProcessedContent = | ||
| node instanceof Node ? node.content : Node.fromJSON(baseSchema, node).content; | ||
|
|
||
| const fragment = DOMSerializer.fromSchema(baseSchema).serializeFragment(toBeProcessedContent, { | ||
| document: doc, | ||
| }); | ||
|
|
||
| const container = doc.createElement("div"); | ||
| container.appendChild(fragment); |
There was a problem hiding this comment.
From about ten minutes of searching, no, it doesn't seem like there's a much better way.
Issue(s) Resolved
High-level Explanation of PR
Alternatively/in addition, i have a triplet of videos.
summary.mp4
demo.mp4
editor-stuff.mp4
This PR changes how we store
RichText: rather than storing the fullProseMirrortree, we first convert it toHTMLand only store that.Some reasons to do this:
prosemirror-to-Xtools, we only need to use the (native)DOMSerializer, and then are able to create (much easier)html-to-Xtools. This could integrate better with our existingrehype/remarkconverters.RichText. Rather than any clients (like the site builder) needing to ship and implement someprosemirror-to-htmlfunctionality, we can "just" do a quick__dangerouslySetInnerHTMLto get 85% of the way there<>from HTML and index that, much harder to do so for a json blob.Schema
The
RichTextschema is now just a plainStringschema. I wanted to add some regex maybe but decided against it, as we could still handle""as valid HTML.FTS
Previously the prosemirror tree was indexed for full text search extremely naively. I wanted to make the body more searchable, so we now do something like
<.*>, and index the remainderThis could be improved by weighting eg headings and non-headings differently, which is something we should do in the future. That way results will be ranked higher if eg
poniesoccurs in anh1rather than in aspantag.Where does the conversion happen
This was the most difficult part to figure out for me and I've gone back and forth quite a few times on where exactly the stored HTML should be converted to
ProseMirrorand vice versa when the createdProseMirrortree from the editor should be converted back to HTML again to be stored. At first I had this all on the client, then I had it on server and then I had it on the client again.Finally, I decided that the conversion from
ProseMirrorto HTML should happen on the server again, but that this conversion only happens on pages/actions dealing with the ContextEditor (ie thePubEditor). Everywhere else treats aRichTextfield as HTML.Without trying to go on a tangent, the reasoning is roughly
ProseMirrorto HTML on the client can be slow, especially if we were to do that ononChange. so a clean "ContextEditor converts back and forth between HTML and ProseMirror` doesn't really cut itHacks
(see 3rd video for demo/some explanation)
Although previously this was meant as an optimization in the case that the Context Editor was converting between HTML and
ProseMirror, I kind of left it in because it does make the Editor faster.Instead of tracking the state of the Editor in the form by doing
field.onChange, we now instead expose arefwhich retrieves the value of the Context Editor when it is needed: on submission time of the form.This is kind of evil and not at all React-y, but it does make the editor much faster, at least twice as fast for big documents.
I think we should keep it even though it kind of goes against our React-y model, as we can still treat everything inside of the ContextEditor React-y. One reason I think why this is justified is because the Context Editor is not really a controlled component, but with the
onChangehook it kind of makes it seem like it is, ie we don't pass thefield.valueback into theProseMirrorEditor. The Context Editor controls its own state anyway and we just need a way to get that state out, because we can't manipulate it outside of the Context Editor and expect it to stay the same anyway.Now this might be a problem in the future and we might want to revert it when dealing with updates to transcluded values, eg the
ContextAtoms. Maybe we do want to update the value of a transcludedstarter:abstractin the Editor when we change thestarter:abstractvalue in the form field above. But we can burn that bridge when we get to it!Test Plan
Go to legacy
CMD+KPonies, go to the one pub that pubs upObserve the beautiful html content
Create a new pub
Try to cram as many different nodes in the richtext body as you can
Save + reload
hopefully they stay the same!
Screenshots (if applicable)
Notes