feat: add utility for load and parse Sitemap and SitemapRequestLoader
#1169
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Pull Request Overview
This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.
- Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
- Refactored endpoint functions to include consistent parameters (scope, _receive, send).
- Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/unit/server.py | Refactored endpoint function signatures and routing logic; added new sitemap endpoint. |
tests/unit/_utils/test_sitemap.py | Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios. |
SitemapRequestLoader
Pull Request Overview
This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:
- Refactoring the server routing to support dynamic endpoint functions with a unified signature.
- Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
- Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
File | Description |
---|---|
tests/unit/server.py | Refactored endpoint routing to use a path-to-handler mapping. |
tests/unit/request_loaders/test_sitemap_request_loader.py | New tests ensuring proper sitemap request loader functionality. |
tests/unit/_utils/test_sitemap.py | Extensive tests for sitemap parsing and various sitemap formats. |
src/crawlee/request_loaders/_sitemap_request_loader.py | New implementation of SitemapRequestLoader with background sitemap loading. |
src/crawlee/request_loaders/__init__.py | Updated `__all__` to export SitemapRequestLoader. |
src/crawlee/_utils/robots.py | Extended RobotsTxtFile to support sitemap parsing and URL extraction. |
Comments suppressed due to low confidence (2)
tests/unit/server.py:120
- Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts
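To make the routing concern concrete, here is a minimal sketch of a path-to-handler mapping keyed on a single path segment. The handler names, the `ROUTES` table, and the choice of the first segment as the key are assumptions for illustration only; the actual test server uses async ASGI endpoints and `yarl.URL(...).parts`.

```python
from collections.abc import Callable


# Hypothetical synchronous handlers standing in for the real async endpoints.
def robots_endpoint() -> str:
    return 'robots'


def sitemap_endpoint() -> str:
    return 'sitemap'


def default_endpoint() -> str:
    return 'default'


# Hypothetical routing table: first path segment -> handler.
ROUTES: dict[str, Callable[[], str]] = {
    'robots.txt': robots_endpoint,
    'sitemap.xml': sitemap_endpoint,
}


def resolve(path: str) -> Callable[[], str]:
    """Pick a handler using only the first non-empty path segment."""
    segments = [part for part in path.split('/') if part]
    key = segments[0] if segments else ''
    return ROUTES.get(key, default_endpoint)
```

With this kind of lookup, `/sitemap.xml/extra` still reaches the sitemap handler, while `/nested/robots.txt` falls through to the default, which is the sort of case worth checking against the previous prefix-based behavior.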
src/crawlee/_utils/robots.py:89
- The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:
Seems promising, thanks 🙂
src/crawlee/_utils/robots.py
Outdated
        return await Sitemap.load(sitemaps, proxy_url)

    async def parse_urls_from_sitemaps(self) -> list[str]:
        """Parse the URLs from the sitemaps in the robots.txt file and return a list of `Sitemap` instances."""
I think it should return URLs found in those sitemaps, right?
Yep. Thanks for catching that.
src/crawlee/_utils/sitemap.py
Outdated
    emit_nested_sitemaps: bool
    max_depth: int
    sitemap_retries: int
    timeout: float | None
We usually do `timedelta` for timeouts.
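A minimal sketch of what that suggestion could look like, assuming the options live in a small container. The class name `ParseOptionsSketch`, the default values, and the `timeout_seconds` helper are hypothetical and not taken from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ParseOptionsSketch:
    """Hypothetical options container; defaults are illustrative only."""

    emit_nested_sitemaps: bool = False
    max_depth: int = 10
    sitemap_retries: int = 3
    # A timedelta carries its unit with it, unlike a bare float.
    timeout: timedelta | None = timedelta(seconds=30)


def timeout_seconds(options: ParseOptionsSketch) -> float | None:
    """Convert to float seconds only at the boundary where a client needs it."""
    if options.timeout is None:
        return None
    return options.timeout.total_seconds()
```

The conversion happens once, at the point where an HTTP client expects seconds, so callers never have to guess whether a number means seconds or milliseconds.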
tests/unit/server.py
Outdated
    """Handle requests for the robots.txt file."""
    await send_html_response(send, ROBOTS_TXT)


async def get_sitemap_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
If the endpoint just echoes whatever you send into it, maybe it doesn't need to be restricted to sitemaps? Wouldn't an all-purpose echo endpoint make more sense?
Great idea, thanks.
But I also kept the paths with `sitemap` in the URL, as it is important to see that `.xml`, `.xml.gz` and `.txt` sitemaps are processed correctly.
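For reference, a general-purpose echo endpoint could look roughly like the sketch below. It assumes the payload arrives in a `content` query parameter and is returned as plain text; the function name, the parameter name, and the response headers are assumptions, not the test server's actual code.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any
from urllib.parse import parse_qs

# Simplified ASGI callable types for this sketch.
Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]


async def echo_content_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    """Echo back the `content` query parameter (hypothetical parameter name)."""
    query = parse_qs(scope.get('query_string', b'').decode())
    content = query.get('content', [''])[0]

    await send(
        {
            'type': 'http.response.start',
            'status': 200,
            'headers': [(b'content-type', b'text/plain; charset=utf-8')],
        }
    )
    await send({'type': 'http.response.body', 'body': content.encode()})
```

Because the handler ignores the path, it can still be mounted under paths ending in `.xml`, `.xml.gz`, or `.txt`, which keeps the format-specific test cases mentioned above.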
src/crawlee/_utils/sitemap.py
Outdated
        self._handler.items.clear()

    except Exception as e:
        logger.warning(f'Failed to parse XML data chunk: {e}')
Perhaps we could show the whole stack trace using `exc_info`?
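A small sketch of that suggestion using the standard `logging` module. The `parse_chunk` function and the logger name are hypothetical stand-ins for the parser code in the diff.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger('sitemap_sketch')  # hypothetical logger name


def parse_chunk(data: str) -> bool:
    """Try to parse an XML snippet; log the full traceback on failure."""
    try:
        ET.fromstring(data)
    except Exception:
        # exc_info=True attaches the active exception and its stack trace
        # to the log record, so the message itself can stay short.
        logger.warning('Failed to parse XML data chunk', exc_info=True)
        return False
    return True
```

With `exc_info=True` the exception no longer needs to be interpolated into the message, and the default formatter appends the traceback after it.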
        # Loading state
        self._loading_task = asyncio.create_task(self._load_sitemaps())
        self._loading_finished = False
Can't you use `self._loading_task.done()` instead of making another property?
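A minimal sketch of the idea, assuming the loader starts its background task inside a running event loop. `LoaderSketch` and its `_load` body are hypothetical placeholders for the real loader.

```python
import asyncio


class LoaderSketch:
    """Hypothetical stand-in for a loader with a background loading task."""

    def __init__(self) -> None:
        # Must be constructed while an event loop is running.
        self._loading_task = asyncio.create_task(self._load())

    async def _load(self) -> None:
        await asyncio.sleep(0)  # placeholder for the actual sitemap loading

    @property
    def loading_finished(self) -> bool:
        # Derived from the task itself, so there is no separate flag to keep in sync.
        return self._loading_task.done()


async def main() -> tuple[bool, bool]:
    loader = LoaderSketch()
    before = loader.loading_finished  # task is scheduled but has not run yet
    await loader._loading_task
    after = loader.loading_finished
    return before, after
```

One thing to keep in mind: `Task.done()` is also true when the task was cancelled or raised, so code that needs to distinguish a successful load would still have to inspect the task result.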
@docs_group('Classes')
class SitemapRequestLoader(RequestLoader):
So the class doesn't have any automatically persisted state, right? Can you make a follow-up issue to implement that?
        self,
        sitemap_urls: list[str],
        *,
        proxy_url: str | None = None,
I'd prefer to accept an `HttpClient` instance here instead of just using `httpx`. I know that the Python `HttpClient` doesn't support streaming yet, but that's something we should fix anyway 😁
Yes, I would also prefer for sitemap's utils to use `HttpClient`.
So it's time to add a `stream` method 😄
> So it's time to add a `stream` method 😄

It won't get easier later 🙂 If you can do it in a separate PR, please do.
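To illustrate why a streaming client matters for sitemap parsing, here is a client-agnostic sketch: the parser consumes any async iterator of byte chunks, so it does not care whether they come from `httpx` or from a future `HttpClient` stream method. The names `iter_sitemap_urls`, `fake_stream`, and the example URLs are hypothetical, and the sketch uses the standard library's `XMLPullParser` rather than the parser in this PR.

```python
import asyncio
from collections.abc import AsyncIterator, Iterator
from xml.etree.ElementTree import XMLPullParser

LOC_TAG = '{http://www.sitemaps.org/schemas/sitemap/0.9}loc'


def _drain_locs(parser: XMLPullParser) -> Iterator[str]:
    """Yield the text of every <loc> element completed so far."""
    for _event, element in parser.read_events():
        if element.tag == LOC_TAG and element.text:
            yield element.text.strip()


async def iter_sitemap_urls(chunks: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Incrementally parse sitemap XML from a stream of byte chunks."""
    parser = XMLPullParser(events=('end',))
    async for chunk in chunks:
        parser.feed(chunk)
        for url in _drain_locs(parser):
            yield url
    # Closing flushes anything the underlying parser may still be buffering.
    parser.close()
    for url in _drain_locs(parser):
        yield url


async def fake_stream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a streamed HTTP response body."""
    data = (
        b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b'<url><loc>https://example.com/a</loc></url>'
        b'<url><loc>https://example.com/b</loc></url>'
        b'</urlset>'
    )
    for start in range(0, len(data), 16):
        yield data[start : start + 16]


async def collect() -> list[str]:
    return [url async for url in iter_sitemap_urls(fake_stream())]
```

Keeping the parser behind an async-iterator boundary means the HTTP layer can be swapped later without touching the parsing code.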
Description
- `SitemapRequestLoader` for comfortable working with `Sitemap` and easy integration into the framework
- `Sitemap`, loads, and stream parsing

Issues
- `Sitemap` parser utility #1161

Testing
- `SitemapRequestLoader`