spicychickensauce
Contributor

Hi there 👋

I have a particular problem which I'm trying to solve. I need to construct a chain of nth-child() CSS selectors that uniquely selects an element I obtained by other means. (For context, I'm working on a test framework that randomly interacts with a page.)
I couldn't find a way to do that using the current API. The only potential way to do it would be to traverse the whole html tree while collecting the path along the way until I randomly encounter the desired element.

Instead, I went ahead and implemented parent_nodes/1 and equals?/2, which are enough to implement what I needed (see get_css_path in the test).
I think they would be good additions to the API of LazyHtml.
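
For readers following along, the idea can be sketched on a toy tree. This is a minimal illustration, not the PR's get_css_path and not LazyHTML or lexbor code: the `Node` struct and the `append`, `nth_child`, and `css_path` helpers are hypothetical names, and pointer comparison stands in for equals?/2.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy DOM node for illustration only (hypothetical, not a lexbor type).
struct Node {
  std::string tag;  // empty for text and comment nodes
  Node* parent = nullptr;
  std::vector<Node*> children;
};

// Attaches `child` under `parent` and returns it, to keep tree setup short.
Node* append(Node* parent, Node* child) {
  child->parent = parent;
  parent->children.push_back(child);
  return child;
}

// 1-based position of `node` among its parent's element children.
// :nth-child() counts elements only, so text and comment nodes are skipped.
// Assumes `node` is an element that has a parent.
int nth_child(const Node* node) {
  int index = 0;
  for (const Node* sibling : node->parent->children) {
    if (!sibling->tag.empty()) index++;
    if (sibling == node) break;  // pointer identity plays the role of equals?/2
  }
  return index;
}

// Walks up the parent chain, emitting one selector step per level,
// then joins the steps from the root down with " > ".
std::string css_path(const Node* node) {
  std::vector<std::string> steps;
  for (; node->parent != nullptr; node = node->parent) {
    steps.push_back(node->tag + ":nth-child(" +
                    std::to_string(nth_child(node)) + ")");
  }
  steps.push_back(node->tag);  // the root has no siblings to disambiguate
  std::reverse(steps.begin(), steps.end());

  std::string path;
  for (std::size_t i = 0; i < steps.size(); i++) {
    if (i > 0) path += " > ";
    path += steps[i];
  }
  return path;
}
```

The walk only needs two things from the node API: a way to reach the parent and a way to tell which sibling is the starting node, which matches the two functions proposed above.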

I also added the parent_node!/1 helper, but I'm not so sure if that should be part of the API.

Also, there seems to be no proper way to filter out text nodes and comment nodes.
The only way I found was LazyHTML.tag(n) == [], which feels a bit hacky. Maybe child_nodes/1 should accept a type filter? Or there could be a type/1 function?


I know I should have opened an issue first to discuss if you're even interested in this, but it was too much fun writing some c++ for a change, I couldn't resist 😄

Member

@jonatanklosko jonatanklosko left a comment

Hey @spicychickensauce, thanks for the PR, I dropped a few comments regarding the API :)

lib/lazy_html.ex Outdated
"""
@spec parent_nodes(t()) :: t()
def parent_nodes(lazy_html) do
Member

The convention we follow is that the names are from the perspective of a single node, so this should be parent_node. If the given %LazyHTML{} holds multiple nodes, it's just a batched version. We should not deduplicate nodes for the same reason.

Contributor Author

Ok, will do 👍 (plus remove the singular helpers)

Contributor Author

> We should not deduplicate nodes for the same reason.

Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?
If so, I would strongly disagree, this seems pretty useless.
Also, if we remove equals? then a user has no way to remove those duplicates.

Member

> Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?

Correct.

In your use case it seems you target a specific element, so parent would always return either 1 or 0 elements.

The reason is API consistency: %LazyHTML{} holds a flat list of nodes and is effectively a batch, so conceptually each operation applies to a single element, but is batched if there are multiple elements in the list.

Contributor Author

My mental model for LazyHTML is a document + a set of selected nodes. Or alternatively, a document plus the result of a CSS selector.

I would argue that a %LazyHTML{} that holds duplicate nodes should be an invalid state.
No CSS selector returns the same node more than once.
E.g. for this html:

<div>
  <span>1</span>
  <span>2</span>
</div>

I think that query(html, "div") and query(html, "span") |> parent_node() should be equal.

Also, what about getting siblings?
Without deduplication, query(html, "span") |> parent_node() |> child_nodes() will return 4 elements.
And it gets worse if I want grand-siblings etc.

I don't think this would be inconsistent with the API, other things that operate in a batch do return a list (apart from child_nodes).

Member

fragment = LazyHTML.from_fragment(~S"""
<div>
  <div>1</div>
  <div>2</div>
</div>
""")

fragment |> LazyHTML.query("div") |> LazyHTML.query("div")

Currently this returns a %LazyHTML{} that includes "1" and "2" twice. So the current interpretation is not a set. It's a fair argument to say this behaviour is weird; on the other hand, it's a contrived example.

@josevalim do you have an opinion here, should we always return a set of nodes?

Member

I am fine with treating it as a set. I assume that can be done cheaply?

Member

@jonatanklosko jonatanklosko Sep 24, 2025

Whenever we build a new list (query, child_nodes), we will need an extra unordered_set to keep track of which elements we have already included in the new list. It only stores pointers, so it seems fine to me.
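
A minimal sketch of that bookkeeping, assuming a toy `Node` with a parent pointer (hypothetical, not the lexbor node type) and a batched parent lookup as the operation that builds the new list. The result keeps first-seen order, and the set holds nothing but pointers.

```cpp
#include <unordered_set>
#include <vector>

// Toy node for illustration only (hypothetical, not a lexbor type).
struct Node {
  Node* parent = nullptr;
};

// Batched parent lookup over a flat node list.
// Each distinct parent is emitted once, in the order it is first seen.
std::vector<Node*> parent_nodes(const std::vector<Node*>& nodes) {
  std::vector<Node*> parents;
  std::unordered_set<Node*> seen;  // pointers only, no node data is copied

  for (Node* node : nodes) {
    Node* parent = node->parent;
    if (parent == nullptr) continue;  // a root contributes nothing

    // insert() reports whether the pointer was newly added.
    if (seen.insert(parent).second) {
      parents.push_back(parent);
    }
  }
  return parents;
}
```

With two sibling nodes as input, this returns their shared parent once instead of twice, at the cost of one hash-set insertion per input node.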

Contributor Author

Ohh. Yeah, that is definitely surprising to me, especially since it already filters out empty results.

lib/lazy_html.ex Outdated
Comment on lines 374 to 382
The root node is always <html>, even if initialized via `from_fragment/1`:
iex> lazy_html = LazyHTML.from_fragment(~S|<div>root</div>|)
iex> LazyHTML.parent_nodes(lazy_html)
#LazyHTML<
1 node (from selector)
#1
<html><div>root</div></html>
>
Member

I don't think that should be the case. For the end user, we should make it such that the fragment root has no parent.

Contributor Author

Yeah, this was more accidental given my c++ implementation.

I'll check whether I can find a way to differentiate from_fragment vs from_document in C++.

Contributor Author

I looked into this, and I now think the current behavior is correct and preferable.

The reason is that I could no longer write the get_css_path function without knowing if the node I'm passing in is part of a document or a fragment.
With the current behavior I can treat them both the same. The reason is that the CSS selector for fragments operates as if the fragment were inside a root node, which makes sense, as the root of a CSS selector has to be a single node.

If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.

Contributor Author

@jonatanklosko do you agree? If so I think this PR is ready and I'll remove the get_css_path test

Member

I don't think it's correct if LazyHTML.from_fragment("<div></div>") |> LazyHTML.parent_node() returns more content.

> If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.

Typically it's something for the API user to track, since they are the one calling either from_fragment or from_document.

Contributor Author

Hmm, ok. I think I can work around it by checking if the last parent is an html tag.

Contributor Author

@jonatanklosko I have changed the implementation in my latest commit. You were right, this is better.

I couldn't find a way to identify if a document is a fragment or not in lexbor, so I tracked it manually at creation time. I think this is correct now, see the new tests.
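
One way to picture tracking this at creation time, as a hedged sketch rather than the actual commit: the `Document` and `Node` structs and `visible_parent` are hypothetical names, and the sketch assumes that fragment nodes sit under a synthetic root added by the parser, as the earlier `<html>` example in this thread suggests.

```cpp
// Toy node for illustration only (hypothetical, not a lexbor type).
struct Node {
  Node* parent = nullptr;
};

// Hypothetical document handle; the flag is recorded once, when the
// document is created from either a full document or a fragment.
struct Document {
  Node* root;        // for fragments: the synthetic wrapper around the content
  bool is_fragment;  // true when built from a fragment
};

// Returns the parent as the API user should see it, or nullptr if none.
// For fragments, the synthetic root is hidden, so top-level fragment
// nodes report no parent.
const Node* visible_parent(const Document& doc, const Node* node) {
  const Node* parent = node->parent;
  if (doc.is_fragment && parent == doc.root) return nullptr;
  return parent;
}
```

The same tree then yields different answers depending only on the recorded flag: a top-level node has no visible parent in a fragment, but reports the root in a full document.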

Contributor Author

@jonatanklosko Sorry for the ping. If you just haven't gotten around to it yet, no worries, take your time.

I believe this was the last open issue, so I'm waiting on your approval here. After that I'll remove the "nth-child selector" test and then I think this PR is ready for a final review.

@jonatanklosko
Member

> Also, there seems to be no proper way to filter out text nodes and comment nodes.
> The only way I found was LazyHTML.tag(n) == [], which feels a bit hacky. Maybe child_nodes/1 should accept a type filter? Or there could be a type/1 function?

You can do LazyHTML.filter(lazy_html, "*"), which filters the node list to only include elements matching the selector and this selector matches all of the elements. Rather than a type argument, we would have a separate child_elements function, but so far we are holding off with inflating the API in this way.

@spicychickensauce
Contributor Author

Thanks @jonatanklosko and @josevalim for your review and feedback, greatly appreciated!
I'll address the points that need no further discussion and have left comments on the other points.

@spicychickensauce spicychickensauce changed the title from "Implement parent_nodes + equals?" to "Implement parent_nodes + nth_child" Sep 26, 2025
Member

@jonatanklosko jonatanklosko left a comment

@spicychickensauce sorry for the delay, I dropped a few small comments and we can ship it :)

Member

@jonatanklosko jonatanklosko left a comment

Thank you :)

@jonatanklosko jonatanklosko merged commit ab877f2 into dashbitco:main Oct 3, 2025
6 checks passed
@spicychickensauce
Contributor Author

Nice, thank you!

3 participants