Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content: Handle multiple math blocks in <p> #1156

Merged
merged 2 commits into from
Feb 14, 2025

Conversation

rajveermalviya
Copy link
Member

Fixes: #1130

@rajveermalviya rajveermalviya added the maintainer review PR ready for review by Zulip maintainers label Dec 13, 2024
@PIG208
Copy link
Member

PIG208 commented Jan 2, 2025

My test message renders like this:

Screenshot

image

I think this is a result of parsing each span as a MathBlockNode, which is probably fine, but visually looks a bit off with all the borders. Making an alternative design should be out of scope for the fix, though.

// it seems like a glitch in the server's Markdown processing,
// so hopefully there just aren't any further such glitches.
if (child case dom.Element(localName: 'br')) continue;
if (child case dom.Text(text: '\n' || '\n\n')) continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this no longer checks if the <br>\n is at the end. Would it be unexpected to have them appear in the middle?

\n\n on the other hand, appear to be somehow separate from the quirk commented here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separated the handling of both these cases.

'<span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal">b</span></span></span></span></span></p>', [
MathBlockNode(texSource: 'a'),
MathBlockNode(texSource: 'b'),
]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also have a test with multiple math blocks in a quote.

@PIG208
Copy link
Member

PIG208 commented Jan 2, 2025

Thanks for the PR! Apparently I've missed this one as it has been sitting for some time. Overall this works fine for me. Left some comments.

@rajveermalviya rajveermalviya force-pushed the pr-multiple-tex-nodes branch 2 times, most recently from 9d222da to f4dde28 Compare January 6, 2025 17:49
@rajveermalviya
Copy link
Member Author

Thanks for the review @PIG208. Pushed a new revision, PTAL.

@rajveermalviya rajveermalviya requested a review from PIG208 January 6, 2025 17:51
Copy link
Member

@PIG208 PIG208 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Left some small comments.

@@ -522,6 +539,26 @@ class ContentExample {
'<br>\n</p>\n</blockquote>',
[QuotationNode([MathBlockNode(texSource: r'\lambda')])]);

static const mathBlocksMultipleInQuote = ContentExample(
'math blocks, multiple in quote',
"````quote\n```math\na\n\nb```````",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a \n between ``` and ````

"````quote\n```math\na\n\nb```\n````"

// each node is interleaved by '\n\n'. Whitespaces are ignored in HTML
// on web but each node has `display: block`, which renders each node
// on a newline. So do the same here.
if (child case dom.Text(text: '\n\n')) continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how the first sentence is relevant. Regarding "do the same here", continue itself does not seem to address the part about rendering each node with display: block, as that is more relevant to blocks.add(MathBlockNode(texSource: texSource, debugHtmlNode: debugHtmlNode));.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more info about how it tries to replicate web behavior.

@PIG208 PIG208 added the integration review Added by maintainers when PR may be ready for integration label Jan 8, 2025
@PIG208 PIG208 removed their assignment Jan 8, 2025
@PIG208 PIG208 requested a review from gnprice January 8, 2025 01:59
@PIG208
Copy link
Member

PIG208 commented Jan 8, 2025

Looks good to me! Thanks for the update.

@PIG208 PIG208 removed the maintainer review PR ready for review by Zulip maintainers label Jan 8, 2025
Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajveermalviya for taking care of this, and thanks @PIG208 for the previous reviews!

Generally this looks good. Comments below, including one fun edge case in the actual content syntax.

Comment on lines 1685 to 1700
result.addAll(parseMathBlocks(node));
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state this loop maintains is a bit more complex than this — there's result but also imageNodes, and when adding to result we need to flush imageNodes first, like is done at the bottom of the loop body.

For an example message where that makes a difference:
https://chat.zulip.org/#narrow/channel/7-test-here/topic/Greg/near/2035891

https://upload.wikimedia.org/wikipedia/commons/7/78/Verregende_bloem_van_een_Helenium_%27El_Dorado%27._22-07-2023._%28d.j.b%29.jpg
```math
a
```
https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Zaadpluizen_van_een_Clematis_texensis_%27Princess_Diana%27._18-07-2023_%28actm.%29_02.jpg/1280px-Zaadpluizen_van_een_Clematis_texensis_%27Princess_Diana%27._18-07-2023_%28actm.%29_02.jpg

It should be image, then math block, then image, but this version makes it math block, then a row with both images.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For another comparison, see the _isPossibleInlineNode case in the function above this, which has a loop with a very similar structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of which, let's make the same changes in that other method parseImplicitParagraphBlockContentList as in this method parseBlockContentList, so that they remain as similar as possible and differ only in the ways they need to differ for their different jobs.

Probably there should be a refactoring in this code's future so that these two methods have less duplication. But that refactoring will be easier if we minimize the extent they've diverged from each other by then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think it's currently not possible to exercise the case where this logic would matter in parseImplicitParagraphBlockContentList, because it seems to be currently not possible to get a math block inside a list item in Zulip. But CommonMark allows it — see this example — and in principle it's a useful capability which probably should be allowed, so perhaps in the future it will be.)

blocks.add(UnimplementedBlockContentNode(htmlNode: child));
continue;
}
blocks.add(MathBlockNode(texSource: texSource, debugHtmlNode: debugHtmlNode));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debugHtmlNode should be child (or null, depending), rather than element, because child is what this particular MathBlockNode is parsed from.

@@ -1451,6 +1451,61 @@ class _ZulipContentParser {
return tableNode ?? UnimplementedBlockContentNode(htmlNode: tableElement);
}

List<BlockContentNode> parseMathBlocks(dom.Element element) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
List<BlockContentNode> parseMathBlocks(dom.Element element) {
List<BlockContentNode> parseMathBlocks(dom.NodeList nodes) {

After my preceding comment about debugHtmlNode, this function doesn't actually make any use of element except via element.nodes, so this simplifies things a bit.

})());
final debugHtmlNode = kDebugMode ? element : null;

final blocks = <BlockContentNode>[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My performance-paranoia for this parsing code doesn't like the idea of allocating this list, just to immediately empty it into the surrounding list, particularly in the common case where there's only going to be a single math block in the <p> element. (And here "common case" means "everything but the weird edge case".)

But I think avoiding that would require a larger refactor, so let's go ahead with this way. It's probably fine (that is, it's probably not a big performance impact in practice). And at least it only affects messages with math blocks in them, which isn't a super common feature in the first place.

Comment on lines 1488 to 1496
if (child case dom.Text(text: '\n\n')) continue;

if (child is! dom.Element) {
blocks.add(UnimplementedBlockContentNode(htmlNode: child));
continue;
}
if (child.className != 'katex-display') {
blocks.add(UnimplementedBlockContentNode(htmlNode: child));
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My performance-paranoia about this code also doesn't like having all these checks for the first child, where we already know the answers to them; particularly since that'll almost always be the only child.

But for this one, there is a clean solution, so let's do that. Before the loop, operate on just the first child; it'll need an as dom.Element, but none of the rest of these checks. And it'll duplicate the last few lines of the loop body, but that's OK.

Then the loop runs from 1 to < length, so will normally have no iterations at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the suggested optimization for the first child.

@rajveermalviya
Copy link
Member Author

Thanks for the review @gnprice! Pushed an update, PTAL.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! This logic generally all looks good; small comments below.

Would you also split the changes into two commits? I think that would help with thinking through exactly what cases are affected. Specifically:

  • A first commit is nearly NFC, and moves the logic from parseBlockContent up to its callers. I think this can be NFC apart from some unimplemented-content cases. (It wouldn't change which possible messages contain unimplemented content, but might change the details of what the unimplemented nodes look like.)
  • Then the second commit makes the desired substantive change here, supporting multiple blocks in one p element.

if (child case dom.Element(localName: 'span', className: 'katex-display')) {
final texSource = _parseMath(child, block: true);
if (texSource != null) {
result.add(MathBlockNode(texSource: texSource, debugHtmlNode: child));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, #1156 (comment) was maybe ambiguous: this should be child when in debug mode (rather than the parent node as it was in the previous revision), but debugHtmlNode should always be null when not in debug mode.

That way we avoid holding onto these DOM nodes after we're done parsing the content — I think they could be a significant amount of memory.

// The case with the `<br>\n` can happen when at the end of a quote;
// it seems like a glitch in the server's Markdown processing,
// so hopefully there just aren't any further such glitches.
return parseMathBlock(child);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this the only call site to parseMathBlock? That can be removed, then.

(Doing so in the same commit as the other changes is helpful for understanding those other changes, even, because it makes visible in the same diff the old code that the new code corresponds to.)

Comment on lines 1494 to 1490
// on web but each node has `display: block`, which renders each node
// on a newline. Since the emitted MathBlockNode are BlockContentNode,
// we skip these newlines here to replicate the same behavior as on web.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// on web but each node has `display: block`, which renders each node
// on a newline. Since the emitted MathBlockNode are BlockContentNode,
// we skip these newlines here to replicate the same behavior as on web.
// on web but each node has `display: block`, which renders each node
// on a new line. Since the emitted MathBlockNode are BlockContentNode,
// we skip these newlines here to replicate the same behavior as on web.

A "newline" is a particular character (or in some contexts a particular pair of characters); a "new line" is just a line that's new (for any of the meanings that "line" or "new" could have, depending on context). So "newline" is right in the last sentence here but not the one before it.

Comment on lines 1645 to 1643
if (node case
dom.Element(localName: 'p', nodes: [
dom.Element(localName: 'span', className: 'katex-display'), ...])) {
if (currentParagraph.isNotEmpty) consumeParagraph();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: condition should be indented a level deeper than body (so it doesn't look like the start of the body)

Comment on lines 1645 to 1647
if (node case
dom.Element(localName: 'p', nodes: [
dom.Element(localName: 'span', className: 'katex-display'), ...])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check className on the p element, too, like the existing logic did. In general our parser keeps to specific expectations of how className should look, as well as localName, in order for some HTML to be recognized as a pattern it knows about.

So combined with the nit about indentation:

Suggested change
if (node case
dom.Element(localName: 'p', nodes: [
dom.Element(localName: 'span', className: 'katex-display'), ...])) {
if (node case dom.Element(localName: 'p', className: '', nodes: [
dom.Element(localName: 'span', className: 'katex-display'), ...])) {

Comment on lines 1480 to 1482
if (nodes
case [..., dom.Element(localName: 'br'), dom.Text(text: '\n')]
) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this fits on one line, right?

@rajveermalviya rajveermalviya force-pushed the pr-multiple-tex-nodes branch 3 times, most recently from 664ab54 to b6036c9 Compare February 13, 2025 07:27
@rajveermalviya
Copy link
Member Author

Thanks for the review @gnprice! Pushed an update, PTAL.

Copy link
Member

@gnprice gnprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision! This all looks great; just a couple of small comments below about the commit structure.

Comment on lines 1603 to 1607
} else {
if (currentParagraph.isNotEmpty) consumeParagraph();
if (imageNodes.isNotEmpty) consumeImageNodes();
result.add(UnimplementedBlockContentNode(htmlNode: node));
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this first commit is not quite NFC because the behavior changes in this case. In the old code, I believe this would end up with a ParagraphNode containing an UnimplementedInlineContentNode (and maybe other nodes), instead of with this one UnimplementedBlockContentNode.

That's the non-NFC-ness I was expecting at #1156 (review) . The change is fine; just the commit message should be adjusted: remove the NFC tag, and have the body explain that it's NFC except for this particular behavior change.

(I did work through the logic just now and satisfy myself that there's no other behavior change in this commit. So this splitting of commits was helpful, thanks.)

Comment on lines 1599 to 1602
if (currentParagraph.isNotEmpty) consumeParagraph();
if (imageNodes.isNotEmpty) consumeImageNodes();
result.add(parseMathBlock(child));
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about moving the two "if not empty, consume" lines above this if/else, and the continue below it?

That would deduplicate them between the if and the else cases, showing that those parts don't vary. And it would also reduce the diff in the second commit.

@rajveermalviya
Copy link
Member Author

Thanks for the review @gnprice! Pushed an update, PTAL.

@gnprice
Copy link
Member

gnprice commented Feb 14, 2025

Thanks! Looks good; merging.

Prepares for zulip#1130, this commit is almost NFC with only
difference that, in an error case we previously emitted a
`ParagraphNode` containing an `UnimplementedInlineContentNode`
(along with any adjacent nodes), now we emit a single
`UnimplementedBlockContentNode` instead.
@gnprice gnprice force-pushed the pr-multiple-tex-nodes branch from 1d737e4 to 866faf5 Compare February 14, 2025 23:46
@gnprice gnprice merged commit 866faf5 into zulip:main Feb 14, 2025
@rajveermalviya rajveermalviya deleted the pr-multiple-tex-nodes branch February 17, 2025 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration review Added by maintainers when PR may be ready for integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle TeX blocks with blank lines
3 participants