-
Notifications
You must be signed in to change notification settings - Fork 7
Implement parent_nodes + nth_child #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement parent_nodes + nth_child #25
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @spicychickensauce, thanks for the PR, I dropped a few comments regarding the API :)
lib/lazy_html.ex
Outdated
""" | ||
@spec parent_nodes(t()) :: t() | ||
def parent_nodes(lazy_html) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention we follow is that the names are from perspective of a single node, so this should be parent_node
. If the given %LazyHTML{}
holds multiple nodes, it's just a batched version. We should not deduplicate nodes for the same reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do 👍 (plus remove the singular helpers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not deduplicate nodes for the same reason.
Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?
If so, I would strongly disagree, this seems pretty useless.
Also, if we remove equals?
then a use has no was to remove those duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, do you mean if I have multiple elements on the same level and I call parent_node(same_level_nodes) I should get back the same parent node n times?
Correct.
In your use case it seems you target a specific element, so parent would always return either 1 or 0 elements.
The reason is API consistency, %LazyHTML{}
holds a flat list of nodes and it is effectively a batch, so conceptually each operation applies to a single element, but is batched if there are multiple elements in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mental modal for LazyHTML is a document + a set of selected nodes. Or alternatively, a document plus the result of a CSS selector.
I would argue that a %LazyHTML{}
that holds duplicate nodes should be an invalid state.
There is no css selector that returns multiple times the same node.
E.g. for this html:
<div>
<span>1</span>
<span>2</span>
</div>
I think that query(html, "div")
and query(html, "span") |> parent_node()
should be equal.
Also, what about getting siblings?
Without deduplication, query(html, "span") |> parent_node() |> child_nodes()
will return 4 elements.
And it gets worse if I want grand-siblings etc.
I don't think this would be inconsistent with the API, other things that operate in a batch do return a list (apart from child_nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fragment = LazyHTML.from_fragment(~S"""
<div>
<div>1</div>
<div>2</div>
</div>
""")
fragment |> LazyHTML.query("div") |> LazyHTML.query("div")
Currently this returns %LazyHTML{}
that includes "1" and "2" twice. So the current interpretation is not a set. It's a fair argument to say this behaviour is weird, on the other hand it's a contrived example.
@josevalim do you have an opinion here, should we always return a set of nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with treating it as a set, I assume such can be done cheaply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever building new list (query
, child_nodes
), we will need an extra unordered_set
to keep track of which element we already included in the new list. It only stores pointers, so it seems fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh. Yeah, that is definitely surprising to me. Especially since it already filters out empty results..
lib/lazy_html.ex
Outdated
The root node is always <html>, even if initialized via `from_fragment/1`: | ||
iex> lazy_html = LazyHTML.from_fragment(~S|<div>root</div>|) | ||
iex> LazyHTML.parent_nodes(lazy_html) | ||
#LazyHTML< | ||
1 node (from selector) | ||
#1 | ||
<html><div>root</div></html> | ||
> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that should be the case, for the end user we should make it such that the fragment root has no parent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this was more accidental given my c++ implementation.
I'll check if I find out how to differentiate from_fragment
vs from_document
in c++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, and I now think the current behavior is correct and preferable.
The reason is that I could no longer write the get_css_path
function without knowing if the node
I'm passing in is part of a document or a fragment.
With the current behavior I can treat them both the same. The reason is that the css selector for fragments operates as if the fragment was inside an root node. Which makes sense, as the root of a css selector has to be a single node.
If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonatanklosko do you agree? If so I think this PR is ready and I'll remove the get_css_path
test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's correct if LazyHTML.from_fragment("<div></div>") |> LazyHTML.parent_node()
returns more content.
If you still think I should change it, then we need to add a new function that allows identifying how a LazyHTML was constructed.
Typically it's something for the API user to track, since they are the one calling either from_fragment
or from_document
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok. I think I can work around it by checking if the last parent is an html
tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonatanklosko I have changed the implementation in my latest commit. You were right, this is better.
I couldn't find a way to identify if a document is a fragment or not in lexbor, so I tracked in manually at creation time. I think this is correct now, see the new tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonatanklosko Sorry for the ping. If just haven't gotten around to it yet, no worries, take your time.
I believe this was the last open issue, so I'm waiting on your approval here. After that I'll remove the "nth-child selector" test and then I think this PR is ready for a final review.
You can do |
Thanks @jonatanklosko and @josevalim for your review and feedback, greatly appreciated! |
5fcffdd
to
07e1eec
Compare
07e1eec
to
7e9d14e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spicychickensauce sorry for the delay, I dropped a few small comments and we can ship it :)
Apply suggestion from @jonatanklosko Co-authored-by: Jonatan Kłosko <[email protected]>
0928fb2
to
82de179
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you :)
Nice, thanks you! |
Hi there 👋
I have a particular problem which I'm trying to solve. I need to construct a chain of
nth-child()
css selector which will uniquely select an element which I got via other means. (Context is I'm working on a test framework which randomly interacts with a page).I couldn't find a way to do that using the current API. The only potential way to do it would be to traverse the whole html tree while collecting the path along the way until I randomly encounter the desired element.
Instead, I went ahead and implemented
parent_nodes/1
andequals?/2
, which are enough to implement what I needed (seeget_css_path
in the test).I think they would be good additions to the API of LazyHtml.
I also added the
parent_node!/1
helper, but I'm not so sure if that should be part of the API.Also, there seems to be no proper way to filter out text nodes and comment nodes.
The only way I found was
LazyHTML.tag(n) == []
, which feels a bit hacky. Maybechild_nodes/1
should accept a type filter? Or there could be atype/1
function?I know I should have opened an issue first to discuss if you're even interested in this, but it was too much fun writing some c++ for a change, I couldn't resist 😄