
feat: add utility for load and parse Sitemap and SitemapRequestLoader #1169


Open
@Mantisus wants to merge 17 commits into master

Conversation

@Mantisus (Collaborator) commented on Apr 22, 2025

Description

  • Add SitemapRequestLoader for convenient work with sitemaps and easy integration into the framework (see the usage sketch below)
  • Add a utility for working with sitemaps that handles loading and streaming parsing
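
For context, a minimal usage sketch follows. It assumes SitemapRequestLoader accepts a list of sitemap URLs and exposes the standard RequestLoader interface (fetch_next_request, mark_request_as_handled, is_finished); the exact constructor arguments may differ from the final implementation.

import asyncio

from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Hypothetical constructor call: the final signature may take additional options
    # (proxy URL, include/exclude globs, buffer size, etc.).
    loader = SitemapRequestLoader(['https://crawlee.dev/sitemap.xml'])

    # Drain the loader like any other RequestLoader.
    while not await loader.is_finished():
        request = await loader.fetch_next_request()
        if request is None:
            break
        print(request.url)
        await loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())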

Issues

Testing

  • Add tests for SitemapRequestLoader
  • Add new endpoints to the uvicorn test server for sitemap tests

@Mantisus requested a review from Copilot on April 22, 2025 23:42
@Copilot (Contributor) left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@Mantisus self-assigned this on Apr 22, 2025
@Mantisus requested a review from Copilot on May 30, 2025 12:16
@Copilot (Contributor) left a comment

Pull Request Overview

This PR introduces a sitemap utility feature that integrates new routing logic for various sitemap formats and refactors endpoint signatures for consistency.

  • Updated request routing in tests/unit/server.py to use a dictionary mapping paths to endpoint handler functions.
  • Refactored endpoint functions to include consistent parameters (scope, _receive, send).
  • Added a new get_sitemap_endpoint to serve sitemap content and implemented extensive tests in tests/unit/_utils/test_sitemap.py.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • tests/unit/server.py: Refactored endpoint function signatures and routing logic; added new sitemap endpoint.
  • tests/unit/_utils/test_sitemap.py: Added comprehensive tests covering XML, gzipped, plain text, and invalid sitemap scenarios.
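
For illustration, the dictionary-based routing described above might look roughly like this. This is a sketch only; the handler names, response helper, and fallback behavior are assumptions, not the actual contents of tests/unit/server.py.

from typing import Any, Awaitable, Callable

Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]
Endpoint = Callable[[dict[str, Any], Receive, Send], Awaitable[None]]


async def _send_response(send: Send, body: bytes, content_type: bytes = b'text/plain') -> None:
    # Minimal ASGI response helper used by all endpoints below.
    await send({
        'type': 'http.response.start',
        'status': 200,
        'headers': [(b'content-type', content_type)],
    })
    await send({'type': 'http.response.body', 'body': body})


async def robots_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'User-agent: *\nAllow: /\n')


async def sitemap_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'<?xml version="1.0" encoding="UTF-8"?><urlset/>', b'application/xml')


async def fallback_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    await _send_response(send, b'Hello, world!')


# Dictionary mapping the first path segment to its endpoint handler.
ROUTES: dict[str, Endpoint] = {
    'robots.txt': robots_endpoint,
    'sitemap.xml': sitemap_endpoint,
}


async def app(scope: dict[str, Any], receive: Receive, send: Send) -> None:
    path_parts = [part for part in scope['path'].split('/') if part]
    handler = ROUTES.get(path_parts[0], fallback_endpoint) if path_parts else fallback_endpoint
    await handler(scope, receive, send)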

@Mantisus changed the title from "feat: add Sitemap Utility" to "feat: add utility for load and parse Sitemap and SitemapRequestLoader" on Jun 3, 2025
@Mantisus requested a review from Copilot on June 3, 2025 18:22
@Copilot (Contributor) left a comment

Pull Request Overview

This PR introduces a new utility for loading and parsing sitemaps and adds the SitemapRequestLoader to facilitate integrating sitemap-based requests into the framework. Key changes include:

  • Refactoring the server routing to support dynamic endpoint functions with a unified signature.
  • Adding comprehensive tests for sitemap loading, including gzip and plain text variants.
  • Implementing the SitemapRequestLoader and integrating it with the existing request loader framework.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

  • tests/unit/server.py: Refactored endpoint routing to use a path-to-handler mapping.
  • tests/unit/request_loaders/test_sitemap_request_loader.py: New tests ensuring proper sitemap request loader functionality.
  • tests/unit/_utils/test_sitemap.py: Extensive tests for sitemap parsing and various sitemap formats.
  • src/crawlee/request_loaders/_sitemap_request_loader.py: New implementation of SitemapRequestLoader with background sitemap loading.
  • src/crawlee/request_loaders/__init__.py: Updated __all__ to export SitemapRequestLoader.
  • src/crawlee/_utils/robots.py: Extended RobotsTxtFile to support sitemap parsing and URL extraction.
Comments suppressed due to low confidence (2)

tests/unit/server.py:120

  • Switching from prefix-based matching to extracting a specific part from the URL may affect routing behavior; please verify that this logic meets all desired routing cases (e.g. deeper nested paths).
path_parts = URL(scope['path']).parts

src/crawlee/_utils/robots.py:89

  • The docstring for 'parse_sitemaps' indicates it returns a list of Sitemap instances, but the implementation returns a single Sitemap instance; please update the docstring to accurately reflect the return type.
async def parse_sitemaps(self) -> Sitemap:
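
A docstring aligned with the actual return type could read, for example:

async def parse_sitemaps(self) -> Sitemap:
    """Load the sitemaps referenced in the robots.txt file and return them combined into a single `Sitemap` instance."""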

@Mantisus marked this pull request as ready for review on June 3, 2025 18:44
@Mantisus requested reviews from janbuchar and vdusek on June 3, 2025 18:44
@janbuchar (Collaborator) left a comment

Seems promising, thanks 🙂

return await Sitemap.load(sitemaps, proxy_url)

async def parse_urls_from_sitemaps(self) -> list[str]:
"""Parse the URLs from the sitemaps in the robots.txt file and return a list of `Sitemap` instances."""
Collaborator:

I think it should return URLs found in those sitemaps, right?

@Mantisus (Author):

Yep. Thanks for catching that.
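
A corrected version might look like this. It is a sketch: it assumes the combined Sitemap object exposes the collected URLs, and the attribute name is illustrative.

async def parse_urls_from_sitemaps(self) -> list[str]:
    """Parse the sitemaps referenced in the robots.txt file and return the URLs found in them."""
    sitemap = await self.parse_sitemaps()
    return sitemap.urls  # assumed attribute holding the extracted URLs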

emit_nested_sitemaps: bool
max_depth: int
sitemap_retries: int
timeout: float | None
Collaborator:

We usually do timedelta for timeouts.
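
For example, the option could be carried as a timedelta and converted to seconds only at the HTTP-client boundary. The container class and its defaults are made up for the sketch; only the field names mirror the snippet above.

from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SitemapLoadOptions:  # illustrative container for the fields quoted above
    emit_nested_sitemaps: bool = False
    max_depth: int = 0
    sitemap_retries: int = 3
    timeout: timedelta | None = None


options = SitemapLoadOptions(timeout=timedelta(seconds=30))
# Convert only where a float is actually required, e.g. for an HTTP request timeout.
timeout_seconds = options.timeout.total_seconds() if options.timeout is not None else None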

"""Handle requests for the robots.txt file."""
await send_html_response(send, ROBOTS_TXT)


async def get_sitemap_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
Collaborator:

If the endpoint just echoes whatever you send into it, maybe it doesn't need to be restricted to sitemaps? Wouldn't an all-purpose echo endpoint make more sense?

@Mantisus (Author):

Great idea, thanks.

But I've kept the paths with "sitemap" in the URL, since it's important to verify that .xml, .xml.gz, and .txt sitemaps are processed correctly.
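
A general-purpose echo endpoint could look roughly like this. This is a sketch; the query-parameter names are assumptions, not the code in tests/unit/server.py. Keeping "sitemap" and the .xml/.xml.gz/.txt suffixes in the test URLs still exercises format detection.

from typing import Any, Awaitable, Callable
from urllib.parse import parse_qs

Receive = Callable[[], Awaitable[dict[str, Any]]]
Send = Callable[[dict[str, Any]], Awaitable[None]]


async def echo_endpoint(scope: dict[str, Any], _receive: Receive, send: Send) -> None:
    """Echo back the body given in the `content` query parameter with the requested `content-type`."""
    query = parse_qs(scope.get('query_string', b'').decode())
    body = query.get('content', [''])[0].encode()
    content_type = query.get('content-type', ['text/plain'])[0].encode()
    await send({
        'type': 'http.response.start',
        'status': 200,
        'headers': [(b'content-type', content_type)],
    })
    await send({'type': 'http.response.body', 'body': body})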

self._handler.items.clear()

except Exception as e:
logger.warning(f'Failed to parse XML data chunk: {e}')
Collaborator:

Perhaps we could show the whole stack trace using exc_info?
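
That would be a small change, e.g.:

import logging

logger = logging.getLogger(__name__)

try:
    ...  # parse the XML data chunk
except Exception:
    # exc_info=True attaches the full traceback to the log record.
    logger.warning('Failed to parse XML data chunk', exc_info=True)

Alternatively, logger.exception(...) attaches the traceback as well, but logs at ERROR level.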


# Loading state
self._loading_task = asyncio.create_task(self._load_sitemaps())
self._loading_finished = False
Collaborator:

Can't you use self._loading_task.done() instead of making another property?
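
A sketch of that simplification (the class name is a stand-in, and the object is assumed to be constructed from within async code, as the real loader is):

import asyncio


class _SitemapLoadingSketch:  # illustrative stand-in for the loader internals
    def __init__(self) -> None:
        # No separate `_loading_finished` flag; the task itself carries that state.
        self._loading_task = asyncio.create_task(self._load_sitemaps())

    async def _load_sitemaps(self) -> None:
        await asyncio.sleep(0)  # placeholder for the real background loading

    @property
    def _loading_finished(self) -> bool:
        # Task.done() becomes True once the coroutine returned, raised, or was cancelled.
        return self._loading_task.done()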



@docs_group('Classes')
class SitemapRequestLoader(RequestLoader):
Collaborator:

So the class doesn't have any automatically persisted state, right? Can you make a follow up issue to implement that?

self,
sitemap_urls: list[str],
*,
proxy_url: str | None = None,
Collaborator:

I'd prefer to accept an HttpClient instance here instead of just using httpx. I know that the Python HttpClient doesn't support streaming yet, but that's something we should fix anyway 😁

@Mantisus (Author):

Yes, I would also prefer the sitemap utils to use HttpClient.

So it's time to add a stream method 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"So it's time to add a stream method 😄"

It won't get easier later 🙂 If you can do it in a separate PR, please do.
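
A sketch of the suggested constructor, once HttpClient gains a streaming method. Everything beyond sitemap_urls and proxy_url is an assumption about naming, not the final API; HttpClient here is crawlee's abstract HTTP client.

from __future__ import annotations

from crawlee.http_clients import HttpClient


class SitemapRequestLoaderSketch:  # illustrative stand-in, not the actual implementation
    def __init__(
        self,
        sitemap_urls: list[str],
        http_client: HttpClient,  # injected client instead of a hard httpx dependency
        *,
        proxy_url: str | None = None,
        max_buffer_size: int = 200,  # assumed option
    ) -> None:
        self._sitemap_urls = sitemap_urls
        # The injected client would need a streaming fetch method (planned as a separate PR)
        # so large sitemaps can be parsed incrementally instead of buffered whole.
        self._http_client = http_client
        self._proxy_url = proxy_url
        self._max_buffer_size = max_buffer_size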

Development

Successfully merging this pull request may close these issues.

add Sitemap parser utility