Add HTML parsing features #11
Open
wants to merge 35 commits into
base: main
35 commits
67cd354 Apply 8.1 hotfixes from unmerged patch (mallardduck, Oct 2, 2022)
8c15b01 Initial HTML replacer code (mallardduck, Oct 2, 2022)
6a8cb4d remove unused property (mallardduck, Oct 2, 2022)
b8af7e5 generate new emoji bytes (mallardduck, Oct 2, 2022)
b8bedbf clean up code (mallardduck, Oct 2, 2022)
bb84284 Add test to cover image alt/title attributes (mallardduck, Oct 2, 2022)
eb121c0 refactor to use XPath to solve filtering text nodes problem (mallardduck, Oct 2, 2022)
92d9136 Remove try-guy now that it's unused (mallardduck, Oct 2, 2022)
4538750 refactor to ensure we allow HTML fragments too (mallardduck, Oct 4, 2022)
cd6e190 refactor tests to split up HTML pages and HTML fragments (mallardduck, Oct 4, 2022)
60c987f Use internal tag as means of warning? (mallardduck, Oct 4, 2022)
f7616c0 Refactor method name to slightly better option (mallardduck, Oct 4, 2022)
e818b5c fix code styles (mallardduck, Oct 4, 2022)
b1f83c7 make styleCI happy (mallardduck, Oct 4, 2022)
29f7d0a Refactor to fix missed fragments and expand tests (mallardduck, Oct 4, 2022)
eeb5f0a reorder code (mallardduck, Oct 4, 2022)
fcdd93d Refactor new tests and add failing tests for current issues. (mallardduck, Oct 17, 2022)
2d77cdc fix styles (mallardduck, Oct 17, 2022)
e0f2540 track the Pest helper file (mallardduck, Oct 17, 2022)
bc61a6a fix pest file styles (mallardduck, Oct 17, 2022)
b3a57be Add tests that cover the edge case I've been chasing (mallardduck, Oct 17, 2022)
62fdc48 Refactor how HTML fragments are handled (mallardduck, Oct 17, 2022)
3dd8fb5 Ensure extra spaces are not added (mallardduck, Oct 17, 2022)
b2bc8bb Update tests with fixed results (mallardduck, Oct 17, 2022)
2e3278d Manually correct snapshots to desired state (mallardduck, Oct 17, 2022)
55badec Skip HTML fragment tests that cause errors (mallardduck, Oct 17, 2022)
c446d6f Refactor exception (mallardduck, Oct 17, 2022)
1b5b3e0 remove dumper from composer file (mallardduck, Oct 17, 2022)
2ee59bf Always use static builder method instead of new (mallardduck, Oct 17, 2022)
891c25f Improve fragment parsing and enable more tests (mallardduck, Oct 17, 2022)
2b77947 Correct HTML pages without meta charset tag (mallardduck, Oct 17, 2022)
d5b6869 refactor UTF8 tag adding and enable test (mallardduck, Oct 17, 2022)
894b79f Add test to cover when incorrect content type is corrected (mallardduck, Oct 17, 2022)
590ac97 Add ext-dom to suggested (mallardduck, Oct 17, 2022)
4468c8e adjust styles (mallardduck, Oct 17, 2022)
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
 /composer.lock
 /vendor/
+.phpunit.result.cache
14 changes: 10 additions & 4 deletions composer.json
@@ -22,13 +22,16 @@
         "ext-mbstring": "*"
     },
     "require-dev": {
-        "pestphp/pest": "^0.3.0",
+        "pestphp/pest": "^1.21",
         "s9e/regexp-builder": "^1.4",
         "spatie/emoji": "^2.3.0",
-        "spatie/pest-plugin-snapshots": "^1.0"
+        "spatie/pest-plugin-snapshots": "^1.0",
+        "wa72/htmlpagedom": "^2.0 || ^3.0"
     },
     "suggest": {
-        "spatie/emoji": "*"
+        "ext-dom": "*",
+        "spatie/emoji": "*",
+        "wa72/htmlpagedom": "*"
     },
     "minimum-stability": "dev",
     "prefer-stable": true,
@@ -38,7 +41,10 @@
         }
     },
     "config": {
-        "sort-packages": true
+        "sort-packages": true,
+        "allow-plugins": {
+            "pestphp/pest-plugin": true
+        }
     },
     "scripts": {
         "generate": "php ./generate.php",
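The `suggest` entries above make `wa72/htmlpagedom` an optional dependency, and the `HtmlReplacer` constructor later in this diff guards against its absence with `class_exists`. A minimal sketch of that guard pattern in isolation is shown here; the helper name `assertOptionalClassInstalled` and the package name `vendor/missing` are hypothetical and not part of this PR.

```php
<?php

// Hypothetical helper, not part of the package: fail fast when an optional
// ("suggest"-ed) dependency has not been installed.
function assertOptionalClassInstalled(string $class, string $package): void
{
    if (! class_exists($class)) {
        throw new RuntimeException(
            sprintf('Cannot use this feature unless `%s` is installed.', $package)
        );
    }
}

// stdClass always exists, so this call passes silently.
assertOptionalClassInstalled(stdClass::class, 'php');

// A class that cannot be autoloaded triggers the exception.
try {
    assertOptionalClassInstalled('Vendor\\Missing\\Thing', 'vendor/missing');
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Checking in the constructor means a missing package surfaces as soon as the object is built, rather than on the first call into the library.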
9 changes: 9 additions & 0 deletions src/Exceptions/NoTextChildrenException.php
@@ -0,0 +1,9 @@
<?php

namespace Astrotomic\Twemoji\Exceptions;

use Exception;

class NoTextChildrenException extends Exception
{
}
128 changes: 128 additions & 0 deletions src/HtmlReplacer.php
@@ -0,0 +1,128 @@
<?php

namespace Astrotomic\Twemoji;

use Astrotomic\Twemoji\Concerns\Configurable;
use Astrotomic\Twemoji\Exceptions\NoTextChildrenException;
use DOMDocument;
use RuntimeException;
use Wa72\HtmlPageDom\HtmlPageCrawler;

/**
 * @internal This class is marked as Internal as it is considered Experimental. Code subject to change until warning removed.
 */
class HtmlReplacer
{
    use Configurable;

    private const UTF8_META = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />';

    private const FRAGMENT_TEMPLATE = <<<'HTML'
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    </head>
    <body id="wrapper-template">
    %s
    </body>
    </html>
    HTML;

    public function __construct()
    {
        if (! class_exists(HtmlPageCrawler::class)) {
            throw new RuntimeException(
                sprintf('Cannot use %s method unless `wa72/htmlpagedom` is installed.', __METHOD__)
            );
        }
    }

    public function parse(string $html): string
Author

If we need to support full HTML docs and HTML fragments, then this method should:

  1. Immediately determine if the input $html is a full DOM page, then
  2. either use HtmlPage (used here) and work based on the Body, or
  3. use the HtmlPageCrawler to parse the fragment.

Member

I think that in PHP, partial HTML is more common than a full document, unless you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.

Author

Addressed this by using the more general HTML parser, then adding a step that checks whether the input HTML is a full page/document and, if so, selects the body from it. As I was already replacing based on an HTML fragment (the body), supporting fragments as input was rather simple.
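
For illustration, the document-versus-fragment decision discussed in this thread can be sketched without any library. This is a simplified stand-in that assumes a full document starts with a doctype or an `<html>` tag; it is not how `HtmlPageCrawler::isHtmlDocument()` is implemented, and `looksLikeFullDocument` is a hypothetical name.

```php
<?php

// Hypothetical heuristic: treat input as a full document only when it opens
// with <!DOCTYPE html ...> or an <html> tag (leading whitespace allowed).
// Leading comments or a byte-order mark are deliberately not handled.
function looksLikeFullDocument(string $html): bool
{
    return preg_match('/^\s*(?:<!DOCTYPE\s+html\b|<html[\s>])/i', $html) === 1;
}

var_dump(looksLikeFullDocument('<!DOCTYPE html><html><body></body></html>'));
var_dump(looksLikeFullDocument('<p>just a fragment</p>'));
```

A caller could branch on this result the same way `parse()` branches on `isHtmlDocument()`: transform only the body of a document, or hand a fragment to a separate code path.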

    {
        // Parse the HTML page or fragment...
        $parsedHtmlRoot = HtmlPageCrawler::create($html);

        if ($parsedHtmlRoot->isHtmlDocument()) {
            // We will only transform the body...
            $parsedHtml = $parsedHtmlRoot->filter('body');
        } else {
            return $this->parseFragment($html);
        }

        try {
            $this->findAndTwmojifyTextNodes($parsedHtml);
        } catch (NoTextChildrenException $e) {
            return $html;
        }

        // Find the page head and check if meta header should be added
        $htmlHead = $parsedHtmlRoot->filter('head');
        $addHeader = false;
        if ($htmlHead->getNode(0)->hasChildNodes()) {
            $contentTypeMeta = $htmlHead->children('meta[http-equiv="content-type"][content]');
            $metaNode = $contentTypeMeta->getNode(0);
            if (
                $metaNode === null ||
                iterator_to_array($metaNode->attributes)['content']->textContent !== 'text/html; charset=utf-8'
            ) {
                $this->addUtf8MetaTag($htmlHead);
                $contentTypeMeta->remove();
            }
        } else {
            $this->addUtf8MetaTag($htmlHead);
        }

        return $parsedHtmlRoot->saveHTML();
    }

    public function parseFragment(string $html): string
    {
        $wrappedFragment = sprintf(static::FRAGMENT_TEMPLATE, $html);

        $parsedHtmlRoot = HtmlPageCrawler::create($wrappedFragment);
        $parsedHtml = $parsedHtmlRoot->filter('body');

        try {
            $this->findAndTwmojifyTextNodes($parsedHtml);
        } catch (NoTextChildrenException $e) {
            return $html;
        }

        return trim($parsedHtmlRoot->filter('body')->getInnerHtml());
    }

    /**
     * @throws NoTextChildrenException
     */
    private function findAndTwmojifyTextNodes(HtmlPageCrawler $htmlContent): HtmlPageCrawler
    {
        // Use xpath to filter only the "TextNodes" within every "Element"
        $textNodes = $htmlContent->filterXPath('.//*[normalize-space(text())]');

        // If the filtered DOM fragment doesn't have TextNode children, return the input HTML.
        if ($textNodes->count() === 0) {
            throw new NoTextChildrenException();
        }

        $textNodes->each(function (HtmlPageCrawler $node) {
            $twemojiContent = (new EmojiText($node->getInnerHtml()))
                ->base($this->base)
                ->type($this->type)
                ->toHtml();
            $node->makeEmpty()->setInnerHtml($twemojiContent);

            return $node;
        });

        return $textNodes;
    }

    private function addUtf8MetaTag($htmlHead): void
    {
        $doc = new DOMDocument();
        $setUtf8Meta = $doc->createDocumentFragment();
        $setUtf8Meta->appendXML(static::UTF8_META);
        $htmlHead->append($setUtf8Meta);
    }
}
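
The XPath expression `.//*[normalize-space(text())]` used in `findAndTwmojifyTextNodes` selects elements whose first direct text child is non-blank, which is why attribute values such as an image's `alt` are never touched. The sketch below tries the same expression with plain `ext-dom` instead of `HtmlPageCrawler`; the function name `elementsWithDirectText` is hypothetical, and ASCII input is used so that character encoding does not come into play.

```php
<?php

// Hypothetical helper: return the tag names of all elements inside the given
// fragment that the XPath predicate from HtmlReplacer would select.
function elementsWithDirectText(string $fragment): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<body>' . $fragment . '</body>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $body = $doc->getElementsByTagName('body')->item(0);

    $names = [];
    foreach ($xpath->query('.//*[normalize-space(text())]', $body) as $element) {
        $names[] = $element->nodeName;
    }

    return $names;
}

print_r(elementsWithDirectText('<p>hi</p><img src="" alt="x"><div><span>yo</span></div>'));
```

In this input the `<div>` is skipped because it has no direct text child, and the `<img>` is skipped because attributes are not text nodes. Note that in XPath 1.0 `normalize-space(text())` evaluates only the first text child of each element, so an element whose first text child is pure whitespace would not match even if later text children are non-blank.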
3 changes: 2 additions & 1 deletion src/Twemoji.php
@@ -25,7 +25,7 @@ public function __construct(array $codepoints)

     public static function emoji(string $emoji): self
     {
-        $chars = preg_split('//u', $emoji, null, PREG_SPLIT_NO_EMPTY);
+        $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);

         $codepoints = array_map(
             fn (string $code): string => dechex(mb_ord($code)),
@@ -58,6 +58,7 @@ public function url(): string
         );
     }

+    #[\ReturnTypeWillChange]
     public function jsonSerialize()
     {
         return $this->url();
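
The `preg_split` change above replaces a `null` limit with `-1`, since passing `null` to an integer parameter of a built-in function is deprecated as of PHP 8.1. A minimal standalone sketch of the same split-and-convert step follows; the function name `codepointsOf` is hypothetical and only mirrors the lines shown in the hunk.

```php
<?php

// Hypothetical helper mirroring the hunk above: split a UTF-8 string into
// code points and return each one as lowercase hexadecimal.
function codepointsOf(string $emoji): array
{
    // A limit of -1 means "no limit"; the empty pattern with the /u flag
    // splits between every UTF-8 code point.
    $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);

    return array_map(
        fn (string $char): string => dechex(mb_ord($char, 'UTF-8')),
        $chars
    );
}

// U+1F680 ROCKET, written as an escape so the source stays ASCII.
print_r(codepointsOf("\u{1F680}"));
```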
2 changes: 1 addition & 1 deletion src/emoji_bytes.regexp

Large diffs are not rendered by default.

10 changes: 10 additions & 0 deletions tests/Pest.php
@@ -0,0 +1,10 @@
<?php

use Astrotomic\Twemoji\HtmlReplacer;

function htmlReplacerPngParser(string $html): string
{
    $htmlReplacer = (new HtmlReplacer())->png();

    return $htmlReplacer->parse($html);
}
78 changes: 78 additions & 0 deletions tests/Unit/HtmlReplacerFragmentTest.php
@@ -0,0 +1,78 @@
<?php

use function Spatie\Snapshots\assertMatchesTextSnapshot;

it('can convert a single emoji paragraph', function () {
assertMatchesTextSnapshot(htmlReplacerPngParser('<p>🚀</p>'));
});

it('will not convert an emoji within HTML attributes', function () {
assertMatchesTextSnapshot(htmlReplacerPngParser('<img src="" alt="🎉"/>'));
});

it('will not convert an emoji within SCRIPT tags', function () {
assertMatchesTextSnapshot(htmlReplacerPngParser("<script>document.innerHTML = '🤷‍♂️';</script>"));
});

it('can convert many Emoji in an HTML comment section', function () {
$commentsHtml = <<<'HTML'
<section class="comment-box">
<div class="comment-content">
<h2>Time for a ElePHPant RAVE!</h2>
<p>🐘🐘🐘🐘</p>
<p>🐘🐘🐘</p>
<p>🐘🐘🐘🐘🐘</p>
<p>🐘🐘</p>
</div>
<section class="sub-comments">
<section class="comment-box">
<div class="comment-content">
<h2>Time for a cRUSTation RAVE!</h2>
<p>🦀🦀🦀🦀</p>
<p>🦀🦀</p>
<p>🦀🦀🦀🦀</p>
<p>🦀</p>
</div>
</section>
<section class="comment-box">
<div class="comment-content">
<p>but what if the crabs and elephants rave together?!</p>
</div>
</section>
</section>
</section>
HTML;
assertMatchesTextSnapshot(htmlReplacerPngParser($commentsHtml));
});

it('can convert many Emoji in an HTML article', function () {
$commentsHtml = <<<'HTML'
<article>
<p>Lorem 😂😂 ipsum 🕵️‍♂️dolor sit✍️ amet, consectetur adipiscing😇😇🤙 elit, sed do eiusmod🥰 tempor 😤😤🏳️‍🌈incididunt ut 👏labore 👏et👏 dolore 👏magna👏 aliqua.</p>
<p>Ut enim ad minim 🐵✊🏿veniam,❤️😤😫😩💦💦 quis nostrud 👿🤮exercitation ullamco 🧠👮🏿‍♀️🅱️laboris nisi ut aliquip❗️ ex ea commodo consequat.</p>
<p>💯Duis aute💦😂😂😂 irure dolor 👳🏻‍♂️🗿in reprehenderit 🤖👻👎in voluptate velit esse cillum dolore 🙏🙏eu fugiat🤔 nulla pariatur.</p>
<p>🙅‍♀️🙅‍♀️Excepteur sint occaecat🤷‍♀️🤦‍♀️ cupidatat💅 non💃 proident,👨‍👧 sunt🤗 in culpa😥😰😨 qui officia🤩🤩 deserunt mollit 🧐anim id est laborum.🤔🤔</p>
</article>
HTML;
assertMatchesTextSnapshot(htmlReplacerPngParser($commentsHtml));
});

it('can handle text with an outer P tag', function () {
$textContent = '<p>This is some fancy-💃 Markdown/WYSIWYG text with surrounding &lt;p&gt; tags enabled. 🎉</p>';
assertMatchesTextSnapshot(htmlReplacerPngParser($textContent));
});

it('can handle text with an outer P tag and CODE tag', function () {
$textContent = '<p>This is some fancy-💃 Markdown/WYSIWYG text with surrounding <code>&lt;p&gt;</code> tags enabled. 🎉</p>';
assertMatchesTextSnapshot(htmlReplacerPngParser($textContent));
});

it('can handle text without outer P tag and escaped HTML', function () {
$textContent = 'This is some fancy-💃 Markdown/WYSIWYG text with surrounding &lt;p&gt; tags disabled. 🎉';
assertMatchesTextSnapshot(htmlReplacerPngParser($textContent));
});

it('can handle text without outer P tag but inner HTML', function () {
$textContent = 'This is some fancy-💃 Markdown/WYSIWYG text with surrounding <code><p></code> tags disabled. 🎉';
assertMatchesTextSnapshot(htmlReplacerPngParser($textContent));
})->skip('Fails: Mutates the code content to close the p tag');
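
The skip reason on the last test is consistent with how HTML parsers generally repair unbalanced markup when a fragment is wrapped in a document, parsed, and serialized again. The wrap-and-extract round trip that `parseFragment` performs can be sketched with plain `ext-dom` as below. This is an approximation under stated assumptions: `roundTripFragment` is a hypothetical name, the template is a shortened version of `FRAGMENT_TEMPLATE`, and `DOMDocument` stands in for `HtmlPageCrawler`, so output for malformed or non-ASCII input may differ from the library's.

```php
<?php

// Hypothetical helper: wrap a fragment in a full document, parse it, and
// return the serialized children of <body>.
function roundTripFragment(string $fragment): string
{
    $template = '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
        . '</head><body>%s</body></html>';

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    $doc->loadHTML(sprintf($template, $fragment));
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);

    // Serialize each child of <body> to rebuild the fragment's inner HTML.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    return trim($inner);
}

echo roundTripFragment('<p>hello</p>'), PHP_EOL;
```

Well-formed markup survives the round trip unchanged, whereas a stray opening tag gives the parser something to repair, which is the kind of mutation the skipped test describes.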