
feat: add utility for load and parse Sitemap and SitemapRequestLoader #1169


Open
@Mantisus wants to merge 17 commits into master

Conversation

@Mantisus (Collaborator) commented on Apr 22, 2025

Description

  • Add SitemapRequestLoader for convenient work with sitemaps and easy integration into the framework (see the usage sketch below)
  • Add a utility for working with sitemaps that handles loading and streaming parsing
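
For context, a minimal usage sketch follows. It assumes SitemapRequestLoader accepts a list of sitemap URLs and exposes the standard RequestLoader interface (fetch_next_request, mark_request_as_handled, is_finished); the exact constructor arguments may differ from the final implementation.

import asyncio

from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Hypothetical constructor call: the final signature may take additional options
    # (proxy URL, include/exclude globs, buffer size, etc.).
    loader = SitemapRequestLoader(['https://crawlee.dev/sitemap.xml'])

    # Drain the loader like any other RequestLoader.
    while not await loader.is_finished():
        request = await loader.fetch_next_request()
        if request is None:
            break
        print(request.url)
        await loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())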

Issues

Testing

  • Add tests for SitemapRequestLoader
  • Add new endpoints to the uvicorn test server for sitemap tests

@Mantisus requested a review from Copilot on April 22, 2025 23:42
@Copilot (Contributor) left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@Mantisus self-assigned this on Apr 22, 2025
@Mantisus requested a review from Copilot on May 30, 2025 12:16
@Copilot (Contributor) left a comment

Pull Request Overview

This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.

  • Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
  • Refactored endpoint functions to include consistent parameters (scope, _receive, send).
  • Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • tests/unit/server.py: Refactored endpoint function signatures and routing logic; added new sitemap endpoint.
  • tests/unit/_utils/test_sitemap.py: Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios.
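
For illustration, the dictionary-based routing described above might look roughly like this. This is a sketch only; the handler names, response helper, and fallback behavior are assumptions, not the actual contents of tests/unit/server.py.

from typing import Any, Awaitable, Callable

Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]
Endpoint = Callable[[dict[str, Any], Receive, Send], Awaitable[None]]


async def _send_response(send: Send, body: bytes, content_type: bytes = b'text/plain') -> None:
    # Minimal ASGI response helper used by all endpoints below.
    await send({
        'type': 'http.response.start',
        'status': 200,
        'headers': [(b'content-type', content_type)],
    })
    await send({'type': 'http.response.body', 'body': body})


async def robots_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'User-agent: *\nAllow: /\n')


async def sitemap_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'<?xml version="1.0" encoding="UTF-8"?><urlset/>', b'application/xml')


async def fallback_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'Hello, world!')


# Dictionary mapping the first path segment to its endpoint handler.
ROUTES: dict[str, Endpoint] = {
    'robots.txt': robots_endpoint,
    'sitemap.xml': sitemap_endpoint,
}


async def app(scope: dict[str, Any], receive: Receive, send: Send) -> None:
    path_parts = [part for part in scope['path'].split('/') if part]
    handler = ROUTES.get(path_parts[0], fallback_endpoint) if path_parts else fallback_endpoint
    await handler(scope, receive, send)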

@Mantisus changed the title from "feat: add Sitemap Utility" to "feat: add utility for load and parse Sitemap and SitemapRequestLoader" on Jun 3, 2025
@Mantisus requested a review from Copilot on June 3, 2025 18:22
@Copilot (Contributor) left a comment

Pull Request Overview

This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:

  • Refactoring the server routing to support dynamic endpoint functions with a unified signature.
  • Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
  • Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

  • tests/unit/server.py: Refactored endpoint routing to use a path-to-handler mapping.
  • tests/unit/request_loaders/test_sitemap_request_loader.py: New tests ensuring proper sitemap request loader functionality.
  • tests/unit/_utils/test_sitemap.py: Extensive tests for sitemap parsing and various sitemap formats.
  • src/crawlee/request_loaders/_sitemap_request_loader.py: New implementation of SitemapRequestLoader with background sitemap loading.
  • src/crawlee/request_loaders/__init__.py: Updated __all__ to export SitemapRequestLoader.
  • src/crawlee/_utils/robots.py: Extended RobotsTxtFile to support sitemap parsing and URL extraction.
Comments suppressed due to low confidence (2)

tests/unit/server.py:120

  • Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts

src/crawlee/_utils/robots.py:89

  • The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:
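
A docstring aligned with the actual return type could read, for example:

async def parse_sitemaps(self) -> Sitemap:
    """Load the sitemaps referenced in the robots.txt file and return them combined into a single `Sitemap` instance."""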

@Mantisus marked this pull request as ready for review on June 3, 2025 18:44
@Mantisus requested reviews from janbuchar and vdusek on June 3, 2025 18:44
@janbuchar (Collaborator) left a comment

Seems promising, thanks 🙂

return await Sitemap.load(sitemaps, proxy_url)

async def parse_urls_from_sitemaps(self) -> list[str]:
"""Parse the URLs from the sitemaps in the robots.txt file and return a list of `Sitemap` instances."""
Collaborator:

I think it should return URLs found in those sitemaps, right?

@Mantisus (Author):

Yep. Thanks for catching that.
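
A corrected version might look like this. It is a sketch: it assumes the combined Sitemap object exposes the collected URLs, and the attribute name is illustrative.

async def parse_urls_from_sitemaps(self) -> list[str]:
    """Parse the sitemaps referenced in the robots.txt file and return the URLs found in them."""
    sitemap = await self.parse_sitemaps()
    return sitemap.urls  # assumed attribute holding the extracted URLs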

emit_nested_sitemaps: bool
max_depth: int
sitemap_retries: int
timeout: float | None
Collaborator:

We usually do timedelta for timeouts.
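
For example, the option could be carried as a timedelta and converted to seconds only at the HTTP-client boundary. The container class and its defaults are made up for the sketch; only the field names mirror the snippet above.

from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SitemapLoadOptions:  # illustrative container for the fields quoted above
    emit_nested_sitemaps: bool = False
    max_depth: int = 0
    sitemap_retries: int = 3
    timeout: timedelta | None = None


options = SitemapLoadOptions(timeout=timedelta(seconds=30))
# Convert only where a float is actually required, e.g. for an HTTP request timeout.
timeout_seconds = options.timeout.total_seconds() if options.timeout is not None else None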

"""Handle requests for the robots.txt file."""
await send_html_response(send, ROBOTS_TXT)


async def get_sitemap_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
Collaborator:

If the endpoint just echoes whatever you send into it, maybe it doesn't need to be restricted to sitemaps? Wouldn't an all-purpose echo endpoint make more sense?

@Mantisus (Author):

Great idea, thanks.

But I've kept the paths with "sitemap" in the URL, since it's important to verify that .xml, .xml.gz, and .txt sitemaps are processed correctly.
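
A general-purpose echo endpoint could look roughly like this. This is a sketch; the query-parameter names are assumptions, not the code in tests/unit/server.py. Keeping "sitemap" and the .xml/.xml.gz/.txt suffixes in the test URLs still exercises format detection.

from typing import Any, Awaitable, Callable
from urllib.parse import parse_qs

Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]


async def echo_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    """Echo back the body given in the `content` query parameter with the requested `content-type`."""
    query = parse_qs(scope.get('query_string', b'').decode())
    body = query.get('content', [''])[0].encode()
    content_type = query.get('content-type', ['text/plain'])[0].encode()
    await send({
        'type': 'http.response.start',
        'status': 200,
        'headers': [(b'content-type', content_type)],
    })
    await send({'type': 'http.response.body', 'body': body})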

self._handler.items.clear()

except Exception as e:
logger.warning(f'Failed to parse XML data chunk: {e}')
Collaborator:

Perhaps we could show the whole stack trace using exc_info?
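
That would be a small change, e.g.:

import logging

logger = logging.getLogger(__name__)

try:
    ...  # parse the XML data chunk
except Exception:
    # exc_info=True attaches the full traceback to the log record.
    logger.warning('Failed to parse XML data chunk', exc_info=True)

Alternatively, logger.exception(...) attaches the traceback as well, but logs at ERROR level.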


# Loading state
self._loading_task = asyncio.create_task(self._load_sitemaps())
self._loading_finished = False
Collaborator:

Can't you use self._loading_task.done() instead of making another property?
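
A sketch of that simplification (the class name is a stand-in, and the object is assumed to be constructed from within async code, as the real loader is):

import asyncio


class _SitemapLoadingSketch:  # illustrative stand-in for the loader internals
    def __init__(self) -> None:
        # No separate `_loading_finished` flag; the task itself carries that state.
        self._loading_task = asyncio.create_task(self._load_sitemaps())

    async def _load_sitemaps(self) -> None:
        await asyncio.sleep(0)  # placeholder for the real background loading

    @property
    def _loading_finished(self) -> bool:
        # Task.done() becomes True once the coroutine returned, raised, or was cancelled.
        return self._loading_task.done()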



@docs_group('Classes')
class SitemapRequestLoader(RequestLoader):
Collaborator:

So the class doesn't have any automatically persisted state, right? Can you make a follow up issue to implement that?

self,
sitemap_urls: list[str],
*,
proxy_url: str | None = None,
Collaborator:

I'd prefer to accept an HttpClient instance here instead of just using httpx. I know that the Python HttpClient doesn't support streaming yet, but that's something we should fix anyway 😁

@Mantisus (Author):

Yes, I would also prefer the sitemap utils to use HttpClient.

So it's time to add a stream method 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"So it's time to add a stream method 😄"

It won't get easier later 🙂 If you can do it in a separate PR, please do.
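
A sketch of the suggested constructor, once HttpClient gains a streaming method. Everything beyond sitemap_urls and proxy_url is an assumption about naming, not the final API; HttpClient here is crawlee's abstract HTTP client.

from __future__ import annotations

from crawlee.http_clients import HttpClient


class SitemapRequestLoaderSketch:  # illustrative stand-in, not the actual implementation
    def __init__(
        self,
        sitemap_urls: list[str],
        http_client: HttpClient,  # injected client instead of a hard httpx dependency
        *,
        proxy_url: str | None = None,
        max_buffer_size: int = 200,  # assumed option
    ) -> None:
        self._sitemap_urls = sitemap_urls
        # The injected client would need a streaming fetch method (planned as a separate PR)
        # so large sitemaps can be parsed incrementally instead of buffered whole.
        self._http_client = http_client
        self._proxy_url = proxy_url
        self._max_buffer_size = max_buffer_size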

Development

Successfully merging this pull request may close these issues.

add Sitemap parser utility