feat: Add ErrorSnapshotter to ErrorTracker #1125

Merged
vdusek merged 18 commits into master from error-snapshotter
Apr 7, 2025

Pijukatel (Collaborator) commented Mar 31, 2025

Description

Added ErrorSnapshotter, which takes a page snapshot (screenshot or HTML) the first time each unique error is encountered.
Added documentation describing how to use it.
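
To illustrate the "snapshot on first unique error" idea described above, here is a minimal standalone sketch. It does not use crawlee; `SnapshotSketch`, `FirstErrorSnapshotter`, the in-memory `store`, the key naming, and the rule that errors are unique by type and message are all assumptions made for illustration, not the merged implementation.

```python
import asyncio
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Union


@dataclass
class SnapshotSketch:
    """Hypothetical snapshot payload; either part may be missing."""

    html: Optional[str] = None
    screenshot: Optional[bytes] = None


@dataclass
class FirstErrorSnapshotter:
    """Hypothetical helper: stores a snapshot once per unique error signature."""

    store: Dict[str, Union[str, bytes]] = field(default_factory=dict)
    seen: Set[str] = field(default_factory=set)

    async def capture(self, error: Exception, snapshot: SnapshotSketch) -> bool:
        # Assumption: two errors are "the same" if type and message match.
        signature = f'{type(error).__name__}:{error}'
        if signature in self.seen:
            return False  # This error was already snapshotted; skip it.
        self.seen.add(signature)
        # Short, stable key so stored names stay readable.
        key = hashlib.sha1(signature.encode()).hexdigest()[:8]
        if snapshot.html is not None:
            self.store[f'ERROR_SNAPSHOT_{key}.html'] = snapshot.html
        if snapshot.screenshot is not None:
            self.store[f'ERROR_SNAPSHOT_{key}.jpg'] = snapshot.screenshot
        return True


async def demo() -> FirstErrorSnapshotter:
    snapshotter = FirstErrorSnapshotter()
    page = SnapshotSketch(html='<html><body>broken</body></html>')
    await snapshotter.capture(ValueError('selector not found'), page)
    # Same error again: no second snapshot is stored.
    await snapshotter.capture(ValueError('selector not found'), page)
    return snapshotter
```

Running `asyncio.run(demo())` leaves a single `.html` entry in `store`, because the second, identical error is recognized and skipped.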

Issues

Testing

Added unit tests.
Example PlaywrightCrawler based actor run with ErrorSnapshotter: https://console.apify.com/actors/C0lWh1UCQvgdArp6R/runs/UNuaiRWBDgxiJau0U#storage


github-actions bot added this to the 111th sprint - Tooling team milestone Mar 31, 2025
github-actions bot added the t-tooling and tested labels Mar 31, 2025
Pijukatel requested a review from Copilot March 31, 2025 15:04
Copilot AI left a comment

Pull Request Overview

This PR implements an ErrorSnapshotter to capture page snapshots (HTML and screenshot) on the first encountered error and updates error tracking to support asynchronous error handling with snapshot capture. Key changes include:

  • Introducing the ErrorSnapshotter class and integrating it within ErrorTracker.
  • Updating tests for both Playwright and HTTP crawlers to validate snapshot functionality.
  • Refactoring error tracking calls to be asynchronous across the codebase.
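
The last bullet follows from the first: capturing a snapshot is an awaitable operation, so the tracker method that triggers it has to be awaited too. A minimal sketch of that shape, assuming a hypothetical `SketchErrorTracker` with an optional `on_first_error` hook (neither name comes from the PR):

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, List, Optional

# Hypothetical hook type: called once per new error group.
SnapshotHook = Callable[[Exception], Awaitable[None]]


class SketchErrorTracker:
    """Hypothetical tracker: counts errors per group, awaits a hook on first sight."""

    def __init__(self, on_first_error: Optional[SnapshotHook] = None) -> None:
        self.counts = Counter()
        self._on_first_error = on_first_error

    async def add(self, error: Exception) -> None:
        # Assumption: errors are grouped by type name and message.
        group = f'{type(error).__name__}: {error}'
        self.counts[group] += 1
        # Only the first occurrence of a group triggers the awaitable hook.
        if self.counts[group] == 1 and self._on_first_error is not None:
            await self._on_first_error(error)


async def demo() -> List[str]:
    captured: List[str] = []

    async def fake_snapshot(error: Exception) -> None:
        # Stand-in for real snapshot capture, which would do I/O.
        captured.append(str(error))

    tracker = SketchErrorTracker(on_first_error=fake_snapshot)
    await tracker.add(ValueError('timeout'))
    await tracker.add(ValueError('timeout'))  # repeat: counted, hook not called
    await tracker.add(RuntimeError('blocked'))
    return captured
```

Callers that previously invoked the tracker synchronously would, under this sketch, need to `await tracker.add(...)` instead.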

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/unit/server_endpoints.py | Adds HTML response constants for server endpoint tests. |
| tests/unit/server.py | Refactors inline HTML responses to use defined constants. |
| tests/unit/crawlers/_playwright/test_playwright_crawler.py | Adds tests for snapshot retrieval and error snapshots in the Playwright crawler. |
| tests/unit/crawlers/_http/test_http_crawler.py | Adds tests to verify snapshot functionality in the HTTP crawler and updates error snapshot test. |
| tests/unit/_statistics/test_error_tracker.py | Updates tests to use async error tracker methods. |
| src/crawlee/statistics/_statistics.py | Introduces a new parameter to control error snapshot saving. |
| src/crawlee/statistics/_error_tracker.py | Refactors the error tracker to support async snapshot capture via ErrorSnapshotter. |
| src/crawlee/statistics/_error_snapshotter.py | Implements the ErrorSnapshotter class to capture and store HTML and JPEG snapshots. |
| src/crawlee/crawlers/_playwright/_playwright_pre_nav_crawling_context.py | Adds a get_snapshot method to capture page content and screenshot. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Modifies context yielding to capture errors for early snapshot collection. |
| src/crawlee/crawlers/_basic/_context_pipeline.py | Updates pipeline middleware signature to support exception propagation. |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Updates error tracker calls to be asynchronous in retry and failure scenarios. |
| src/crawlee/crawlers/_abstract_http/_http_crawling_context.py | Adds a get_snapshot method returning HTML from HTTP responses. |
| src/crawlee/_types.py | Defines the PageSnapshot data class and updates the BasicCrawlingContext interface. |
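
The summary mentions a PageSnapshot data class and per-context get_snapshot methods. The sketch below shows one way such a contract could look; it is not the code from this PR. The class names, the field names, and `FakePage` (a stand-in exposing only awaitable `content()` and `screenshot()` methods) are assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PageSnapshotSketch:
    """Hypothetical snapshot container; fields are assumed, not taken from the PR."""

    screenshot: Optional[bytes] = None
    html: Optional[str] = None


class HttpContextSketch:
    """HTTP-style context: only the response HTML is available."""

    def __init__(self, body: bytes) -> None:
        self._body = body

    async def get_snapshot(self) -> PageSnapshotSketch:
        return PageSnapshotSketch(html=self._body.decode('utf-8', errors='replace'))


class FakePage:
    """Stand-in for a browser page, so the sketch runs without a browser."""

    async def content(self) -> str:
        return '<html><body>rendered</body></html>'

    async def screenshot(self) -> bytes:
        return b'\xff\xd8fake-jpeg-bytes'


class BrowserContextSketch:
    """Browser-style context: both rendered HTML and a screenshot are available."""

    def __init__(self, page: FakePage) -> None:
        self._page = page

    async def get_snapshot(self) -> PageSnapshotSketch:
        return PageSnapshotSketch(
            screenshot=await self._page.screenshot(),
            html=await self._page.content(),
        )


async def demo() -> Tuple[PageSnapshotSketch, PageSnapshotSketch]:
    http_snapshot = await HttpContextSketch(b'<p>hi</p>').get_snapshot()
    browser_snapshot = await BrowserContextSketch(FakePage()).get_snapshot()
    return http_snapshot, browser_snapshot
```

Giving both contexts the same method name lets error-handling code request a snapshot without knowing which crawler produced the context; the HTTP variant simply leaves `screenshot` empty.
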
Comments suppressed due to low confidence (2)

tests/unit/crawlers/_http/test_http_crawler.py:712

  • [nitpick] The use of the variable 'key_info' after the for-loop may be unclear. Consider capturing the key from the dictionary explicitly for clarity.
assert key_info.key.endswith('.html')
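
One way to act on this nitpick, sketched with a hypothetical `KeyInfoSketch` record rather than the real test fixtures: collect the matching keys explicitly instead of relying on the loop variable that survives the for-loop. This changes the check slightly (it looks at every listed key, not only the last one), so it is an assumption about the test's intent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyInfoSketch:
    """Hypothetical stand-in for an item returned when listing stored keys."""

    key: str


def html_snapshot_keys(key_infos: List[KeyInfoSketch]) -> List[str]:
    # Bind the keys explicitly instead of reading a leaked loop variable.
    return [info.key for info in key_infos if info.key.endswith('.html')]


listed = [KeyInfoSketch('ERROR_SNAPSHOT_ab12.html')]
html_keys = html_snapshot_keys(listed)
assert len(html_keys) == 1, 'expected exactly one HTML snapshot key'
```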

src/crawlee/statistics/_error_tracker.py:73

  • The variable 'new_error_group_message' is initialized to an empty string and never updated, which could lead to confusion. Consider removing it or updating its value if it is intended for wildcard similarity matching.
new_error_group_message = ''  # In case of wildcard similarity match

Pijukatel added the enhancement label Apr 1, 2025
Pijukatel changed the title from "feat: Add ErrorSnapshotter" to "feat: Add ErrorSnapshotter to ErrorTracker" Apr 1, 2025
Pijukatel marked this pull request as ready for review April 1, 2025 11:05
Pijukatel requested review from vdusek and janbuchar April 1, 2025 11:05
vdusek (Collaborator) left a comment

LGTM

vdusek merged commit 9666092 into master Apr 7, 2025
23 checks passed
vdusek deleted the error-snapshotter branch April 7, 2025 16:50
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement ErrorTracker
3 participants