-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slugify and truncate the default collection name #106
Open
Gallaecio
wants to merge
3
commits into
zytedata:main
Choose a base branch
from
Gallaecio:slufigy-collection-ids
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one may be problematic for another reason - all spiders with unicode names of the same length would silently get the same collection name, and reuse the fingerprint DB unintentionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do not care about human readability of the collection names, we could replace characters with their UTF-8 hexadecimal (e.g. å → C3A5) or some other Unicode character ID.
Otherwise, I think we are back to slugify or at least Unidecode, with the caveat that the results may change over time since we have no control over those libraries. We could pin them in
zyte-spider-templates-project
, but eventually some user may find this problematic.We could also use the spider numeric ID, since collections are project-specific. We can extract it easily from the middle of the job ID. To keep backward-compatibility with 0.11, we could do this only on spiders with spider names that are invalid collection names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder what are the restrictions on the spider names. The solution could also be to bring the formats closer together (e.g. add more validation for virtual spider names).
Hm, using text-unidecode could be pretty stable. Unlike unidecode or python-slugify, it didn't have a release in many-many years :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nice! And its maintainer feels familiar :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about just disabling the default name creation and require an explicit collection name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think it is worth considering. While it is a slightly worse initial user experience, it is the most reliable approach, no need for us to worry about any long-term issues with the default name generation.
We could make it a backward-compatible change by making it required only in the UI with some JSON schema customization, and logging a deprecation error (because warnings are not visible in the UI easily) to encourage users to set it on spiders created with 0.11.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unsure about requiring a collection name - it's reliable, but it's an additional step. I wonder how often would user face issues in practice with the current approach (unidecode), and if it worths an additional step.
What do you think about logging a warning if the collection name is not set explicitly, and the resulting name is mangled? With a suggestion to set it explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should help, yes. Though with the current UI, I wonder how many people will notice the warning. On the other hand, I expect much fewer people to actually run into an issue.