refactor!: Introduce new storage client system #1194
---
id: upgrading-to-v1
title: Upgrading to v1
---

This page summarizes the breaking changes between Crawlee for Python v0.6 and v1.0.

## Storage clients

In v1.0, we are introducing a new storage client system. We have completely reworked the storage client interface, making it much simpler to write your own storage clients. This allows you to easily store your request queues, key-value stores, and datasets in various destinations.

### New storage clients

Previously, the `MemoryStorageClient` handled both in-memory storage and file system persistence, depending on configuration. In v1.0, we've split this into two dedicated classes:

- `MemoryStorageClient` - stores all data in memory only.
- `FileSystemStorageClient` - persists data on the file system, with in-memory caching for improved performance.

For details about the new interface, see the `BaseStorageClient` documentation. You can also check out the [Storage clients guide](https://crawlee.dev/python/docs/guides/) for more information on available storage clients and instructions on writing your own.

### Memory storage client

Before:

```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient

configuration = Configuration(persist_storage=False)
storage_client = MemoryStorageClient.from_config(configuration)
```

Now:

```python
from crawlee.storage_clients import MemoryStorageClient

storage_client = MemoryStorageClient()
```

### File-system storage client

Before:

```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient

configuration = Configuration(persist_storage=True)
storage_client = MemoryStorageClient.from_config(configuration)
```

Now:

```python
from crawlee.storage_clients import FileSystemStorageClient

storage_client = FileSystemStorageClient()
```

The way you register storage clients remains the same:

```python
from crawlee import service_locator
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient

storage_client = MemoryStorageClient()

# Either via the service locator:
service_locator.set_storage_client(storage_client)

# Or provide it directly to the crawler:
crawler = ParselCrawler(storage_client=storage_client)
```

### Breaking changes

> Reviewer comment: It'd be fair to mention that when you call for example …

The `persist_storage` and `persist_metadata` fields have been removed from the `Configuration` class. Persistence is now determined solely by the storage client class you use.

### Writing custom storage clients

The storage client interface has been fully reworked. Collection storage clients have been removed - now there is one storage client class per storage type (`RequestQueue`, `KeyValueStore`, and `Dataset`). Writing your own storage clients is now much simpler, allowing you to store your request queues, key-value stores, and datasets in any destination you choose.
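
To illustrate the one-client-per-storage-type shape, here is a minimal, hypothetical in-memory dataset client. The class and method names below are illustrative only and are not imported from Crawlee - consult the `BaseStorageClient` documentation for the actual abstract interface you need to implement.

```python
from __future__ import annotations

from typing import Any


class InMemoryDatasetClient:
    """A hypothetical, minimal dataset client that keeps items in memory.

    Illustrates the "one client class per storage type" idea from v1.0;
    the real abstract interface is documented under `BaseStorageClient`.
    """

    def __init__(self) -> None:
        self._items: list[dict[str, Any]] = []

    async def push_data(self, data: dict[str, Any] | list[dict[str, Any]]) -> None:
        """Append one item or a batch of items to the dataset."""
        self._items.extend(data if isinstance(data, list) else [data])

    async def get_data(self) -> list[dict[str, Any]]:
        """Return a copy of all stored items."""
        return list(self._items)

    async def purge(self) -> None:
        """Drop all stored items."""
        self._items.clear()
```

A file-system or cloud-backed client would implement the same small surface, just with a different persistence destination.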

## Dataset

- There are two new methods:
  - `purge`
  - `list_items`
- The `from_storage_object` method has been removed - use the `open` method with `name` or `id` instead.
- The `get_info` and `storage_object` properties have been replaced by the new `metadata` property.
- The `set_metadata` method has been removed.
- The `write_to_json` and `write_to_csv` methods have been removed - use `export_to` instead.
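
The consolidation of `write_to_json` and `write_to_csv` into a single method can be sketched with a standalone helper. This is not Crawlee's implementation - just an illustration of the pattern of one export method with a format switch:

```python
import csv
import io
import json
from typing import Any


def export_items(items: list[dict[str, Any]], content_type: str) -> str:
    """Serialize dataset items to a string in the requested format.

    A simplified stand-in for the idea behind `export_to`: one entry point
    with a format parameter instead of one method per output format.
    """
    if content_type == "json":
        return json.dumps(items, indent=2)
    if content_type == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)
        return buffer.getvalue()
    raise ValueError(f"Unsupported content type: {content_type}")
```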

## Key-value store

- There are three new methods:
  - `purge`
  - `delete_value`
  - `list_keys`
- The `from_storage_object` method has been removed - use the `open` method with `name` or `id` instead.
- The `get_info` and `storage_object` properties have been replaced by the new `metadata` property.
- The `set_metadata` method has been removed.
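
The semantics of the three new methods can be sketched with a dict-backed toy store. Again, this is a hypothetical illustration, not Crawlee's `KeyValueStore` API:

```python
from typing import Any


class SimpleKeyValueStore:
    """A hypothetical dict-backed store illustrating the v1.0 additions."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def set_value(self, key: str, value: Any) -> None:
        self._data[key] = value

    def delete_value(self, key: str) -> None:
        """Remove a single record by key (new in v1.0)."""
        self._data.pop(key, None)

    def list_keys(self) -> list[str]:
        """Return all record keys (new in v1.0)."""
        return sorted(self._data)

    def purge(self) -> None:
        """Drop every record in the store (new in v1.0)."""
        self._data.clear()
```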

## Request queue

- There are two new methods:
  - `purge`
  - `add_requests` (renamed from `add_requests_batched`)
- The `from_storage_object` method has been removed - use the `open` method with `name` or `id` instead.
- The `get_info` and `storage_object` properties have been replaced by the new `metadata` property.
- The `set_metadata` method has been removed.
- `resource_directory` has been removed from `RequestQueueMetadata` - use the `path_to_...` property instead.
- The `RequestQueueHead` model has been replaced with `RequestQueueHeadWithLocks`.
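
The batch-oriented `add_requests` semantics, with deduplication on a unique key, can be sketched as follows. This is a hypothetical in-memory analogue, not the real `RequestQueue`:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRequestQueue:
    """A hypothetical queue illustrating `add_requests`: one batch-accepting
    method that skips requests whose unique key was already seen."""

    _pending: list[str] = field(default_factory=list)
    _seen: set[str] = field(default_factory=set)

    def add_requests(self, unique_keys: list[str]) -> int:
        """Enqueue a batch, deduplicating by key; return how many were added."""
        added = 0
        for key in unique_keys:
            if key not in self._seen:
                self._seen.add(key)
                self._pending.append(key)
                added += 1
        return added

    def purge(self) -> None:
        """Drop pending requests and the dedup set (new in v1.0)."""
        self._pending.clear()
        self._seen.clear()
```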

```diff
@@ -1,3 +1,4 @@
+from __future__ import annotations
 
 METADATA_FILENAME = '__metadata__.json'
 """The name of the metadata file for storage clients."""
```

> Review discussion on the `model_config` change: Why the … / Because of the persistence of … / Wouldn't a subclass of the model be more robust?

````diff
@@ -158,7 +158,23 @@ class Request(BaseModel):
     ```
     """
 
-    model_config = ConfigDict(populate_by_name=True)
+    model_config = ConfigDict(populate_by_name=True, extra='allow')
 
+    id: str
+    """A unique identifier for the request. Note that this is not used for deduplication, and should not be confused
+    with `unique_key`."""
+
+    unique_key: Annotated[str, Field(alias='uniqueKey')]
+    """A unique key identifying the request. Two requests with the same `unique_key` are considered as pointing
+    to the same URL.
+
+    If `unique_key` is not provided, then it is automatically generated by normalizing the URL.
+    For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `unique_key`
+    of `http://www.example.com/something`.
+
+    Pass an arbitrary non-empty text value to the `unique_key` property to override the default behavior
+    and specify which URLs shall be considered equal.
+    """
+
     url: Annotated[str, BeforeValidator(validate_http_url), Field()]
     """The URL of the web page to crawl. Must be a valid HTTP or HTTPS URL, and may include query parameters
@@ -207,22 +223,6 @@ class Request(BaseModel):
     handled_at: Annotated[datetime | None, Field(alias='handledAt')] = None
     """Timestamp when the request was handled."""
 
-    unique_key: Annotated[str, Field(alias='uniqueKey')]
-    """A unique key identifying the request. Two requests with the same `unique_key` are considered as pointing
-    to the same URL.
-
-    If `unique_key` is not provided, then it is automatically generated by normalizing the URL.
-    For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `unique_key`
-    of `http://www.example.com/something`.
-
-    Pass an arbitrary non-empty text value to the `unique_key` property
-    to override the default behavior and specify which URLs shall be considered equal.
-    """
-
-    id: str
-    """A unique identifier for the request. Note that this is not used for deduplication, and should not be confused
-    with `unique_key`."""
-
     @classmethod
     def from_url(
         cls,
@@ -398,6 +398,11 @@ def forefront(self) -> bool:
     def forefront(self, new_value: bool) -> None:
         self.crawlee_data.forefront = new_value
 
+    @property
+    def was_already_handled(self) -> bool:
+        """Indicates whether the request was handled."""
+        return self.handled_at is not None
+
 
 class RequestWithLock(Request):
     """A crawling request with information about locks."""
````