Implement parent_nodes + nth_child #25

spicychickensauce · 2025-09-23T09:55:14Z

Hi there 👋

I have a particular problem I'm trying to solve: I need to construct a chain of nth-child() CSS selectors that uniquely selects an element which I obtained by other means. (Context: I'm working on a test framework that randomly interacts with a page.)
I couldn't find a way to do that with the current API. The only potential way would be to traverse the whole HTML tree, collecting the path along the way, until I happen to encounter the desired element.

Instead, I went ahead and implemented parent_nodes/1 and equals?/2, which are enough to implement what I needed (see get_css_path in the test).
I think they would be good additions to the API of LazyHtml.
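
To make the idea concrete, here is a minimal C++ sketch on a hypothetical toy tree. It is not the get_css_path helper from the test and it does not use LazyHTML's or lexbor's API; the names `Node`, `append_child`, `nth_child` and `css_path` are invented for illustration. It assumes that only elements count toward the nth-child() index and that the root is written as a bare tag.

```cpp
#include <string>
#include <vector>

// Hypothetical toy DOM node; not the lexbor or LazyHTML representation.
struct Node {
    std::string tag;                // empty for text and comment nodes
    Node* parent = nullptr;
    std::vector<Node*> children;
};

// Attach child to parent, keeping both links consistent.
inline void append_child(Node& parent, Node& child) {
    child.parent = &parent;
    parent.children.push_back(&child);
}

// 1-based position among element siblings, as :nth-child() counts it.
// Returns 1 for a node without a parent.
inline int nth_child(const Node& node) {
    if (node.parent == nullptr) return 1;
    int index = 0;
    for (const Node* sibling : node.parent->children) {
        if (sibling->tag.empty()) continue;  // text and comments do not count
        ++index;
        if (sibling == &node) return index;
    }
    return index;  // not reached for an element with consistent links
}

// Walk up from the target, prepending one step per ancestor level.
inline std::string css_path(const Node& target) {
    std::string path;
    for (const Node* node = &target; node != nullptr; node = node->parent) {
        std::string step = node->tag;
        if (node->parent != nullptr) {
            step += ":nth-child(" + std::to_string(nth_child(*node)) + ")";
        }
        path = path.empty() ? step : step + " > " + path;
    }
    return path;
}
```

The walk only needs a parent lookup and a sibling index, which is why those two operations are enough and a full tree traversal is not.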

I also added the parent_node!/1 helper, but I'm not so sure if that should be part of the API.

Also, there seems to be no proper way to filter out text nodes and comment nodes.
The only way I found was LazyHTML.tag(n) == [], which feels a bit hacky. Maybe child_nodes/1 should accept a type filter? Or there could be a type/1 function?

I know I should have opened an issue first to discuss if you're even interested in this, but it was too much fun writing some c++ for a change, I couldn't resist 😄

jonatanklosko

Hey @spicychickensauce, thanks for the PR, I dropped a few comments regarding the API :)

lib/lazy_html.ex

jonatanklosko · 2025-09-24T09:39:16Z

lib/lazy_html.ex

+
+  """
+  @spec parent_nodes(t()) :: t()
+  def parent_nodes(lazy_html) do


The convention we follow is that names are from the perspective of a single node, so this should be parent_node. If the given %LazyHTML{} holds multiple nodes, it's just a batched version. We should not deduplicate nodes for the same reason.

Ok, will do 👍 (plus remove the singular helpers)

We should not deduplicate nodes for the same reason.

Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?
If so, I would strongly disagree, this seems pretty useless.
Also, if we remove equals? then a user has no way to remove those duplicates.

Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?

Correct.

In your use case it seems you target a specific element, so parent would always return either 1 or 0 elements.

The reason is API consistency: %LazyHTML{} holds a flat list of nodes and is effectively a batch, so conceptually each operation applies to a single element, but is batched if there are multiple elements in the list.

My mental model for LazyHTML is a document + a set of selected nodes. Or alternatively, a document plus the result of a CSS selector.

I would argue that a %LazyHTML{} that holds duplicate nodes should be an invalid state.
No CSS selector returns the same node multiple times.
E.g. for this html:

<div> <span>1</span> <span>2</span> </div>

I think that query(html, "div") and query(html, "span") |> parent_node() should be equal.

Also, what about getting siblings?
Without deduplication, query(html, "span") |> parent_node() |> child_nodes() will return 4 elements.
And it gets worse if I want grand-siblings etc.

I don't think this would be inconsistent with the API, other things that operate in a batch do return a list (apart from child_nodes).

    fragment =
      LazyHTML.from_fragment(~S"""
      <div>
        <div>1</div>
        <div>2</div>
      </div>
      """)

    fragment |> LazyHTML.query("div") |> LazyHTML.query("div")

Currently this returns a %LazyHTML{} that includes "1" and "2" twice. So the current interpretation is not a set. It's a fair argument to say this behaviour is weird, on the other hand it's a contrived example.

@josevalim do you have an opinion here, should we always return a set of nodes?

I am fine with treating it as a set. I assume this can be done cheaply?

Whenever we build a new list (query, child_nodes), we will need an extra unordered_set to keep track of which elements we have already included in the new list. It only stores pointers, so it seems fine to me.
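
A rough C++ sketch of that bookkeeping, assuming a hypothetical `Elem` type with only a parent link (the actual code in c_src/lazy_html.cpp is not shown in this thread and may differ). `batched_parents` is the plain batched behaviour, `unique_parents` adds the `std::unordered_set` of pointers:

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical minimal node for illustration; only the parent link matters here.
struct Elem {
    Elem* parent = nullptr;
};

// Batched lookup without deduplication: one result per input node that has a parent.
inline std::vector<Elem*> batched_parents(const std::vector<Elem*>& nodes) {
    std::vector<Elem*> result;
    for (Elem* node : nodes) {
        if (node->parent != nullptr) result.push_back(node->parent);
    }
    return result;
}

// Same lookup, but each parent is kept once, in first-seen order.
inline std::vector<Elem*> unique_parents(const std::vector<Elem*>& nodes) {
    std::vector<Elem*> result;
    std::unordered_set<Elem*> seen;  // stores pointers only
    for (Elem* node : nodes) {
        Elem* parent = node->parent;
        if (parent == nullptr) continue;   // nodes without a parent yield nothing
        if (seen.insert(parent).second) {  // true only on first insertion
            result.push_back(parent);
        }
    }
    return result;
}
```

For two sibling spans under one div, the first function yields the div twice and the second yields it once. The vector preserves the order, and the set is only used for the membership check.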

Ohh. Yeah, that is definitely surprising to me. Especially since it already filters out empty results..

lib/lazy_html.ex

jonatanklosko · 2025-09-24T09:45:36Z

lib/lazy_html.ex

+  The root node is always <html>, even if initialized via `from_fragment/1`:
+
+      iex> lazy_html = LazyHTML.from_fragment(~S|<div>root</div>|)
+      iex> LazyHTML.parent_nodes(lazy_html)
+      #LazyHTML<
+        1 node (from selector)
+        #1
+        <html><div>root</div></html>
+      >


I don't think that should be the case. For the end user, we should make it such that the fragment root has no parent.

Yeah, this was more accidental given my c++ implementation.

I'll check if I find out how to differentiate from_fragment vs from_document in c++.

I looked into this, and I now think the current behavior is correct and preferable.

The reason is that I could no longer write the get_css_path function without knowing if the node I'm passing in is part of a document or a fragment.
With the current behavior I can treat them both the same. The reason is that the CSS selector for fragments operates as if the fragment were inside a root node, which makes sense, as the root of a CSS selector has to be a single node.

If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.

@jonatanklosko do you agree? If so I think this PR is ready and I'll remove the get_css_path test

I don't think it's correct if LazyHTML.from_fragment("<div></div>") |> LazyHTML.parent_node() returns more content.

If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.

Typically it's something for the API user to track, since they are the one calling either from_fragment or from_document.

Hmm, ok. I think I can work around it by checking if the last parent is an html tag.

@jonatanklosko I have changed the implementation in my latest commit. You were right, this is better.

I couldn't find a way to identify whether a document is a fragment in lexbor, so I tracked it manually at creation time. I think this is correct now, see the new tests.

@jonatanklosko Sorry for the ping. If you just haven't gotten around to it yet, no worries, take your time.
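
As an illustration of tracking this at creation time, here is a small C++ sketch. It is only an assumption about the shape of such a fix: `DomNode`, `Document`, `is_fragment` and `visible_parent` are hypothetical names, not the PR's code or lexbor's API.

```cpp
#include <string>

// Hypothetical toy node; not the lexbor representation.
struct DomNode {
    std::string tag;
    DomNode* parent = nullptr;
};

// Hypothetical document handle that remembers how it was created.
struct Document {
    DomNode* root = nullptr;   // for fragments: the synthetic <html> wrapper
    bool is_fragment = false;  // recorded when the document is created
};

// Parent as seen by the API user: the synthetic wrapper of a fragment is hidden,
// so a fragment root reports no parent.
inline DomNode* visible_parent(const Document& doc, const DomNode& node) {
    DomNode* parent = node.parent;
    if (doc.is_fragment && parent == doc.root) return nullptr;
    return parent;
}
```

With this shape, the same node layout gives different answers depending on the flag: a top-level div in a fragment has no parent, while in a full document its parent is still returned.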

I believe this was the last open issue, so I'm waiting on your approval here. After that I'll remove the "nth-child selector" test and then I think this PR is ready for a final review.

test/lazy_html_test.exs

jonatanklosko · 2025-09-24T09:53:38Z

Also, there seems to be no proper way to filter out text nodes and comment nodes.
The only way I found was LazyHTML.tag(n) == [], which feels a bit hacky. Maybe child_nodes/1 should accept a type filter? Or there could be a type/1 function?

You can do LazyHTML.filter(lazy_html, "*"), which filters the node list to only include elements matching the selector, and this selector matches all elements. Rather than a type argument, we would have a separate child_elements function, but so far we are holding off on inflating the API in this way.
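
The effect of filtering with "*" can be pictured with a small C++ sketch. The `NodeKind` enum, `Child` struct and `elements_only` function are hypothetical and only model the idea that a universal selector matches elements but never text or comment nodes; they are not how LazyHTML implements filter.

```cpp
#include <vector>

// Hypothetical node kinds for illustration.
enum class NodeKind { Element, Text, Comment };

struct Child {
    NodeKind kind;
};

// Keep only element nodes, preserving order; text and comments are dropped.
inline std::vector<const Child*> elements_only(const std::vector<const Child*>& nodes) {
    std::vector<const Child*> result;
    for (const Child* node : nodes) {
        if (node->kind == NodeKind::Element) result.push_back(node);
    }
    return result;
}
```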

test/lazy_html_test.exs

spicychickensauce · 2025-09-24T12:36:53Z

Thanks @jonatanklosko and @josevalim for your review and feedback, greatly appreciated!
I'll address the points that need no further discussion and have left comments on the other points.

c_src/lazy_html.cpp

jonatanklosko

@spicychickensauce sorry for the delay, I dropped a few small comments and we can ship it :)

lib/lazy_html.ex

test/lazy_html_test.exs

c_src/lazy_html.cpp

test/lazy_html_test.exs

@jonatanklosko

Apply suggestion from @jonatanklosko. Co-authored-by: Jonatan Kłosko

jonatanklosko

Thank you :)

spicychickensauce · 2025-10-03T14:31:40Z

Nice, thank you!

spicychickensauce added 4 commits September 23, 2025 11:32

Implement parent_nodes

ad68952

Implement equals?

f52b687

Implement parent_node helper

922e53b

Test construction of css path from node

f6a4964

jonatanklosko reviewed Sep 24, 2025

josevalim reviewed Sep 24, 2025

test/lazy_html_test.exs (outdated, resolved)

spicychickensauce added 2 commits September 24, 2025 14:44

Use singular parent_node as per library convention

a594dd9

Avoid using numbers as ids

50c0727

jonatanklosko reviewed Sep 24, 2025

c_src/lazy_html.cpp (outdated, resolved)

spicychickensauce force-pushed the construct-css-path branch from 5fcffdd to 07e1eec Compare September 25, 2025 09:23

spicychickensauce added 4 commits September 26, 2025 11:02

Implement nth_child

a98b1bc

Simplify get_css_path by using nth_child

83845cd

Use unordered_set instead of set

8669e1d

Remove equals?

7e9d14e

spicychickensauce force-pushed the construct-css-path branch from 07e1eec to 7e9d14e Compare September 26, 2025 09:05

spicychickensauce changed the title ~~Implement parent_nodes + equals?~~ Implement parent_nodes + nth_child Sep 26, 2025

jonatanklosko mentioned this pull request Sep 26, 2025

Always track unique nodes #26

Open

Make parent of fragment root nil instead of html

0594c3f

jonatanklosko reviewed Oct 3, 2025

lib/lazy_html.ex (outdated, resolved)

test/lazy_html_test.exs (outdated, resolved)

test/lazy_html_test.exs (outdated, resolved)

c_src/lazy_html.cpp (outdated, resolved)

test/lazy_html_test.exs (resolved)

spicychickensauce and others added 6 commits October 3, 2025 15:01

Improve documentation of nth_child

e2dbf91

Apply suggestion from @jonatanklosko. Co-authored-by: Jonatan Kłosko

Inline boolean expression

a4e7d69

Rename test helper to ancestor_chain

1b73ce7

Remove API guidance test function

323deb7

Don't include self in ancestor_chain

9062d95

Remove unnecessary flat_map

82de179

spicychickensauce force-pushed the construct-css-path branch from 0928fb2 to 82de179 Compare October 3, 2025 13:41

jonatanklosko approved these changes Oct 3, 2025

jonatanklosko merged commit ab877f2 into dashbitco:main Oct 3, 2025
6 checks passed
