Skip to content

[html] Various performance optimizations #2019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

moffatman
Copy link

Should be twice as fast according to my use cases. Happy to discuss what you need for testing, (micro)benchmarking, preserving public API for old slow functions, or whatever else to merge it. I know I still need to do stuff like bump version, changelog, etc.

This was originally at dart-archive/html#251

That version has proper separation of each commit which may be more helpful, it's been squashed for the repo migration.


  • I’ve reviewed the contributor guide and applied the relevant portions to this PR.

Copy link

github-actions bot commented Feb 15, 2025

PR Health

Breaking changes ⚠️
Package Change Current Version New Version Needed Version Looking good?
html Breaking 0.15.5+1 0.15.6 0.16.0
Got "0.15.6" expected >= "0.16.0" (breaking changes)
⚠️

This check can be disabled by tagging the PR with skip-breaking-check.

Changelog Entry ✔️
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

Coverage ⚠️
File Coverage
pkgs/html/lib/parser.dart 💔 99 % ⬇️ 0 %
pkgs/html/lib/src/constants.dart 💚 93 % ⬆️ 0 %
pkgs/html/lib/src/html_input_stream.dart 💚 80 % ⬆️ 11 %
pkgs/html/lib/src/tokenizer.dart 💚 100 %
pkgs/html/lib/src/treebuilder.dart 💚 95 %
pkgs/html/lib/src/trie.dart 💔 Not covered
pkgs/html/tool/generate_trie.dart 💔 Not covered

This check for test coverage is informational (issues shown here will not fail the PR).

This check can be disabled by tagging the PR with skip-coverage-check.

API leaks ⚠️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
html HtmlTokenizer
Token
HtmlInputStream
TagToken
DoctypeToken
StringToken
TreeBuilder
ActiveFormattingElements
StartTagToken
TagAttribute
CommentToken
CharactersToken
SpaceCharactersToken
EndTagToken

This check can be disabled by tagging the PR with skip-leaking-check.

License Headers ⚠️
// Copyright (c) 2025, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
pkgs/html/lib/parser.dart
pkgs/html/lib/src/constants.dart
pkgs/html/lib/src/html_input_stream.dart
pkgs/html/lib/src/tokenizer.dart
pkgs/html/lib/src/treebuilder.dart
pkgs/html/test/parser_feature_test.dart
pkgs/html/test/trie_test.dart
pkgs/html/tool/generate_trie.dart

All source files should start with a license header.

Unrelated files missing license headers
Files
pkgs/bazel_worker/benchmark/benchmark.dart
pkgs/bazel_worker/example/client.dart
pkgs/bazel_worker/example/worker.dart
pkgs/benchmark_harness/integration_test/perf_benchmark_test.dart
pkgs/boolean_selector/example/example.dart
pkgs/clock/lib/clock.dart
pkgs/clock/lib/src/clock.dart
pkgs/clock/lib/src/default.dart
pkgs/clock/lib/src/stopwatch.dart
pkgs/clock/lib/src/utils.dart
pkgs/clock/test/clock_test.dart
pkgs/clock/test/default_test.dart
pkgs/clock/test/stopwatch_test.dart
pkgs/clock/test/utils.dart
pkgs/coverage/lib/src/coverage_options.dart
pkgs/coverage/test/collect_coverage_config_test.dart
pkgs/coverage/test/config_file_locator_test.dart
pkgs/html/example/main.dart
pkgs/html/lib/dom.dart
pkgs/html/lib/dom_parsing.dart
pkgs/html/lib/html_escape.dart
pkgs/html/lib/src/encoding_parser.dart
pkgs/html/lib/src/list_proxy.dart
pkgs/html/lib/src/query_selector.dart
pkgs/html/lib/src/token.dart
pkgs/html/lib/src/utils.dart
pkgs/html/test/dom_test.dart
pkgs/html/test/parser_test.dart
pkgs/html/test/query_selector_test.dart
pkgs/html/test/selectors/level1_baseline_test.dart
pkgs/html/test/selectors/level1_lib.dart
pkgs/html/test/selectors/selectors.dart
pkgs/html/test/support.dart
pkgs/html/test/tokenizer_test.dart
pkgs/pubspec_parse/test/git_uri_test.dart
pkgs/stack_trace/example/example.dart
pkgs/watcher/test/custom_watcher_factory_test.dart
pkgs/yaml_edit/example/example.dart

This check can be disabled by tagging the PR with skip-license-check.

@mosuem mosuem requested a review from HosseinYousefi April 17, 2025 12:41
Copy link
Member

@HosseinYousefi HosseinYousefi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! A 2x Improvement sounds great. I had some minor comments and you'll need to up the version to 0.15.6 and add a changelog and it's good to go!

@@ -0,0 +1,18 @@
import 'package:dart_style/dart_style.dart';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should live in html/tool/.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

'const entitiesTrieRoot = $root;'.replaceAll('{}', '<int, dynamic>{}');
final formatted = DartFormatter().format(source);
// trimRight() because print adds its own newline
print(formatted.trimRight());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of printing to stdout, this could override lib/src/trie.dart

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}
final source =
'const entitiesTrieRoot = $root;'.replaceAll('{}', '<int, dynamic>{}');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a // AUTO GENERATED by 'tool/generate_trie.dart'. DO NOT EDIT! to the beginning of the file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (final entity in entities.keys) {
var node = root;
for (final charCode in entity.codeUnits) {
node = (node[charCode] ??= <int, dynamic>{}) as Map<int, dynamic>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good enough approach. Another way would be to create a 2d array of STATE * ALPHABET which would be around 500KBs of memory but faster to traverse. Best is using double array tries but that's more complicated to implement. A potential fun project for later perhaps since a 2x improvement is great already!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never heard of those, interesting

@@ -424,7 +468,60 @@ const List<String> tableInsertModeElements = [
];

// TODO(jmesserly): remove these in favor of the test functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the TODO comment mentions, a possible improvement here is to have a test function that only checks 2 integer bounds instead of using Set.contains. Could be done in a separate PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it doesn't quite contain all letters right? Since there are some symbols between upper and lowercase letters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well then 4 integer bound checks!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -77,9 +77,6 @@ On line 4, column 3 of ParseError: Unexpected DOCTYPE. Ignored.
expect(parser.errors.length, 1);
final error = parser.errors[0];
expect(error.errorCode, 'unexpected-doctype');
expect(error.span!.start.line, 3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite understand why we deleted these.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I no longer generate fileInfo unless generateSpans=true, we don't get any error.span any more for this test which has generateSpans=false

HosseinYousefi

This comment was marked as duplicate.

@HosseinYousefi
Copy link
Member

HosseinYousefi commented Apr 17, 2025

Somehow my comments were published twice, so I marked one as duplicated, and now there is only one! So please expand the comments under the first section that is marked as duplicated! I don't know how I can unmark it... Yay! Found an "unhide" option!

'const entitiesTrieRoot = $root;'.replaceAll('{}', '<int, dynamic>{}');
final formatted = DartFormatter().format(source);
final htmlDir = File(Platform.script.path).parent.parent;
File('${htmlDir.path}/lib/src/trie.dart').writeAsStringSync(formatted);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work on windows? You can use join from package:path to make it platform independent:

  File(p.join(htmlDir.path, 'lib', 'src', 'trie.dart').writeAsStringSync(formatted);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it seems that this file is not formatted using dart format causing the CI to complain.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link

github-actions bot commented Apr 22, 2025

Package publishing

Package Version Status Publish tag (post-merge)
package:bazel_worker 1.1.3 already published at pub.dev
package:benchmark_harness 2.3.1 already published at pub.dev
package:boolean_selector 2.1.2 already published at pub.dev
package:browser_launcher 1.1.3 already published at pub.dev
package:cli_config 0.2.1-wip WIP (no publish necessary)
package:cli_util 0.4.2 already published at pub.dev
package:clock 1.1.2 already published at pub.dev
package:code_builder 4.10.2-wip WIP (no publish necessary)
package:coverage 1.12.0 already published at pub.dev
package:csslib 1.0.2 already published at pub.dev
package:extension_discovery 2.1.0 already published at pub.dev
package:file 7.0.2-wip WIP (no publish necessary)
package:file_testing 3.1.0-wip WIP (no publish necessary)
package:glob 2.1.3 already published at pub.dev
package:graphs 2.3.3-wip WIP (no publish necessary)
package:html 0.15.6 ready to publish html-v0.15.6
package:io 1.1.0-wip WIP (no publish necessary)
package:json_rpc_2 4.0.0 ready to publish json_rpc_2-v4.0.0
package:markdown 7.3.1-wip WIP (no publish necessary)
package:mime 2.0.0 already published at pub.dev
package:oauth2 2.0.4-wip WIP (no publish necessary)
package:package_config 2.3.0-wip WIP (no publish necessary)
package:pool 1.5.2-wip WIP (no publish necessary)
package:pub_semver 2.2.0 already published at pub.dev
package:pubspec_parse 1.5.0 already published at pub.dev
package:source_map_stack_trace 2.1.3-wip WIP (no publish necessary)
package:source_maps 0.10.14-wip WIP (no publish necessary)
package:source_span 1.10.1 already published at pub.dev
package:sse 4.1.8 ready to publish sse-v4.1.8
package:stack_trace 1.12.1 already published at pub.dev
package:stream_channel 2.1.4 already published at pub.dev
package:stream_transform 2.1.2-wip WIP (no publish necessary)
package:string_scanner 1.4.1 already published at pub.dev
package:term_glyph 1.2.3-wip WIP (no publish necessary)
package:test_reflective_loader 0.2.3 already published at pub.dev
package:timing 1.0.2 already published at pub.dev
package:unified_analytics 8.0.1 already published at pub.dev
package:watcher 1.1.1 already published at pub.dev
package:yaml 3.1.3 already published at pub.dev
package:yaml_edit 2.2.2 already published at pub.dev

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

- Format generate_trie.dart
- Use path join() in generate_trie.dart
- Add charsUntilAsciiLetter()
- Move all codes to Charcode {}
This is a noticeable speedup by scanning and avoiding creating a new string first.
@moffatman
Copy link
Author

I found another 40% speedup via some new commits

String toAsciiLowerCase() =>
String.fromCharCodes(codeUnits.map(_asciiToLower));
String toAsciiLowerCase() {
if (codeUnits.any((c) => c >= Charcode.kUpperA && c <= Charcode.kUpperZ)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These conditions can live as functions: codeUnits.any(isUpperCaseCode)

static const int kGreaterThan = 0x3E;

/// A
static const int kUpperA = 0x41;
Copy link
Member

@HosseinYousefi HosseinYousefi Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We usually don't use kFooBar when identifying constants in Dart, instead we just use fooBar.
https://dart.dev/effective-dart/style#dont-use-prefix-letters

: String.fromCharCode(_chars[_offset]);
final charCode = _chars[_offset];
if (charCode < 256) {
return asciiCharacters[charCode];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does changing this to String.fromCharCode(charCode) affect the performance much?

@@ -107,8 +107,9 @@ class HtmlInputStream {
}
}

final isSurrogatePair =
i + 1 < charsLength && _isLeadSurrogate(c) && _isTrailSurrogate(rawChars[i + 1]);
final isSurrogatePair = _isLeadSurrogate(c) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also order it in the original way but when you think about the fact that most characters aren't surrogate, this makes sense as it saves a check per character! Good find!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants