refactor!: Introduce new storage client system #1194
Publishing some comments, not finished with reviewing the whole change yet.
That was a loooot of work. Very nice!
I think it is good. Please just add a section to upgrading_to_v0x.md summarizing all the breaking changes in this PR.
That's excellent work!
I like this a lot. I do need to revisit the request queue related code though, it feels like we're throwing out the baby with the bathwater.
Pull Request Overview
This PR refactors the storage client system by removing legacy implementations and utilities, consolidating configuration, and updating documentation and examples to use the new clients.
- Consolidated storage-related settings in `Configuration` and removed deprecated options.
- Replaced legacy file utilities with `infer_mime_type`, `atomic_write`, and export-to-stream functions.
- Updated service locator to default to `FileSystemStorageClient` and revised examples to use new storage clients.
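The overview names an `atomic_write` helper without showing it. As a rough illustration of the general technique only (write to a temporary file, then rename it over the target), here is a minimal sketch. The function name `atomic_write_sketch` and its signature are assumptions made for this example and are not taken from the PR.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_sketch(path: Path, data: str) -> None:
    """Hypothetical helper: write `data` so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        # os.replace overwrites an existing target; the swap is atomic on POSIX.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename happened.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Calling `atomic_write_sketch(Path("storage/record.json"), '{"a": 1}')` twice in a row leaves exactly one file holding the last payload, with no leftover temporary files.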
Reviewed Changes
Copilot reviewed 92 out of 92 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
src/crawlee/configuration.py | Simplified storage configuration fields to align with new clients. |
src/crawlee/_utils/file.py | Removed old file helpers; added atomic writes, MIME inference, and stream exports. |
src/crawlee/_service_locator.py | Changed default storage client to `FileSystemStorageClient`. |
docs/deployment/code_examples/google/google_example.py | Updated cloud function example to use `MemoryStorageClient`. |
docs/guides/request_loaders.mdx | Documentation updated to reflect `handled_count` and `total_count` properties. |
Comments suppressed due to low confidence (1)
docs/deployment/code_examples/google/google_example.py:19
- The example uses `timedelta` but does not import it. Add `from datetime import timedelta` at the top of the file to avoid a NameError.
request_handler_timeout=timedelta(seconds=30),
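The fix suggested in the suppressed comment amounts to one import. A minimal sketch, where `crawler_kwargs` is a hypothetical stand-in for the crawler constructor call in the example file:

```python
from datetime import timedelta  # the import the review comment asks for

# Same keyword argument as in the flagged line. Without the import above,
# evaluating `timedelta(...)` raises NameError.
crawler_kwargs = {"request_handler_timeout": timedelta(seconds=30)}
```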
Description
- The `Configuration.persist_storage` and `Configuration.persist_metadata` options were removed.
- Storage client parameters (e.g. `purge_on_start`, or `token` and `base_api_url` for the Apify client) are configured via the `Configuration`.
- Storages and their clients provide a `purge` method (which clears all items but preserves the storage and metadata) and a `drop` method (which removes the entire storage, metadata included).

Dataset
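For a dataset, the difference between `purge` and `drop` can be sketched with a toy in-memory class. `ToyDataset` and its registry are hypothetical and written only for this illustration; they are not the crawlee `Dataset`, and the real methods may differ in signature and behavior.

```python
class ToyDataset:
    """Hypothetical in-memory stand-in, not the crawlee `Dataset` class."""

    # Registry of opened datasets, keyed by name (assumption for this sketch).
    _opened: dict = {}

    def __init__(self, name: str) -> None:
        self.name = name
        self.items: list = []
        self.metadata = {"name": name, "item_count": 0}

    @classmethod
    def open(cls, name: str) -> "ToyDataset":
        # Return the existing dataset for this name, or create it.
        if name not in cls._opened:
            cls._opened[name] = cls(name)
        return cls._opened[name]

    def push_data(self, item: dict) -> None:
        self.items.append(item)
        self.metadata["item_count"] = len(self.items)

    def purge(self) -> None:
        # Clear the items, but keep the storage and its metadata.
        self.items.clear()
        self.metadata["item_count"] = 0

    def drop(self) -> None:
        # Remove the whole storage, metadata included.
        self.items.clear()
        self.metadata = {}
        self._opened.pop(self.name, None)


dataset = ToyDataset.open("results")
dataset.push_data({"url": "https://example.com"})

dataset.purge()
still_registered_after_purge = "results" in ToyDataset._opened

dataset.drop()
still_registered_after_drop = "results" in ToyDataset._opened
```

After `purge` the dataset is empty but can still be found under its name; after `drop` it is gone, and opening the same name again creates a fresh one.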
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `push_data`
- `get_data`
- `iterate_items`
- `list_items` (new method)
- `export_to`

- `from_storage_object` method has been removed; use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)
- `write_to_json` method has been removed; use `export_to` instead
- `write_to_csv` method has been removed; use `export_to` instead

Key-value store
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `get_value`
- `set_value`
- `delete_value` (new method; the Apify platform's `set_value` supports setting an empty value for a key, so having a separate method for deleting is useful)
- `iterate_keys`
- `list_keys` (new method)
- `get_public_url`
- `get_auto_saved_value`
- `persist_autosaved_values`

- `from_storage_object` method has been removed; use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)

Request queue
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `add_request`
- `add_requests_batched` -> `add_requests`
- `fetch_next_request`
- `get_request`
- `mark_request_as_handled`
- `reclaim_request`
- `is_empty`
- `is_finished`

- `from_storage_object` method has been removed; use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)
- `get_handled_count` method has been removed; use `metadata.handled_request_count` instead.
- `get_total_count` method has been removed; use `metadata.total_request_count` instead.
- `resource_directory` from the `RequestQueueMetadata` was removed; use the `path_to...` property instead.
- `RequestQueueHead` model has been removed; use `RequestQueueHeadWithLocks` instead.
- `add_requests` contains a `forefront` arg (the Apify API supports it)

BaseDatasetClient
- `metadata`
- `open`
- `purge`
- `drop`
- `push_data`
- `get_data`
- `iterate_items`
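The member list above can be read as an interface that every dataset backend implements. A minimal sketch of that idea follows, using an abstract base class and one in-memory implementation. Only the method names come from the list; the class names, the signatures, and the synchronous style are assumptions for this example and do not reproduce the actual `BaseDatasetClient`.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator, Optional


class DatasetClientSketch(ABC):
    """Hypothetical interface; method names mirror the list, signatures are assumed."""

    @property
    @abstractmethod
    def metadata(self) -> dict: ...

    @classmethod
    @abstractmethod
    def open(cls, name: str) -> "DatasetClientSketch": ...

    @abstractmethod
    def purge(self) -> None: ...

    @abstractmethod
    def drop(self) -> None: ...

    @abstractmethod
    def push_data(self, data: dict) -> None: ...

    @abstractmethod
    def get_data(self, offset: int = 0, limit: Optional[int] = None) -> list: ...

    @abstractmethod
    def iterate_items(self) -> Iterator[Any]: ...


class InMemoryDatasetClientSketch(DatasetClientSketch):
    """Toy backend that keeps items in a Python list."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._items: list = []
        self._dropped = False

    @property
    def metadata(self) -> dict:
        return {"name": self._name, "item_count": len(self._items)}

    @classmethod
    def open(cls, name: str) -> "InMemoryDatasetClientSketch":
        return cls(name)

    def purge(self) -> None:
        # Remove the items; the client stays usable.
        self._items.clear()

    def drop(self) -> None:
        # Remove the items and mark the storage as gone.
        self._items.clear()
        self._dropped = True

    def push_data(self, data: dict) -> None:
        self._items.append(data)

    def get_data(self, offset: int = 0, limit: Optional[int] = None) -> list:
        end = None if limit is None else offset + limit
        return self._items[offset:end]

    def iterate_items(self) -> Iterator[Any]:
        yield from self._items
```

With a shared interface of this kind, the same storage test can be run against each backend by swapping only the class that is instantiated.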
BaseKeyValueStoreClient
- `metadata`
- `open`
- `purge`
- `drop`
- `get_value`
- `set_value`
- `delete_value`
- `iterate_keys`
- `get_public_url`
BaseRequestQueueClient
- `metadata`
- `open`
- `purge`
- `drop`
- `add_requests_batch` -> `add_batch_of_requests` (one backend method for 2 frontend methods)
- `get_request`
- `fetch_next_request`
- `mark_request_as_handled`
- `reclaim_request`
- `is_empty`

- `RequestQueueHeadWithLocks` -> `RequestQueueHead`
- `BatchRequestsOperationResponse` -> `AddRequestsResponse`
- `_sequence` field in the FS Request)

Issues
- `MemoryStorageClient` and `FilesystemStorageClient` #92
- `creation_management` module #147
- `push_data` annotations to use `JsonSerializable` type #1191

Testing
`file-system` and `memory`), ensuring every storage test runs against every client implementation.

Checklist