Add multiple mirror sync with extensive include/exclude pattern matching #639

Draft: wants to merge 36 commits into main
Conversation

YYYasin19
Collaborator

Hi,

this is a draft PR to check out how much effort it would be to

  • allow quetz to mirror packages from multiple channels
  • define more elaborate include/exclude list patterns on mirror channels

We've had the need for these features for a long time and, until now, worked around them with internal tooling.

Reasoning

Mirror servers are used not only for load distribution but also for managing access to packages.
There are several common cases where more control over the sync feature matters:

  • A company provides access to a mirror server that contains only permitted package versions (e.g. because other versions have outstanding security checks or licensing issues)
  • A software company distributes IP (aka code) as packages to customers and needs to share this IP

Changes

The changes needed to allow all of this are:

  • A more fine-grained filter on packages (e.g. my_cool_package=1.5.*)
  • Sync should also remove packages when they're not in the includelist anymore
  • Allow a channel to mirror multiple channels
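
To make the first two bullets concrete, here is a minimal sketch using only the standard library. It is not the implementation in this PR: `matches_spec` and `plan_sync` are hypothetical names, and the `name=version_glob` syntax is a simplification assumed for illustration.

```python
from fnmatch import fnmatchcase


def matches_spec(name: str, version: str, spec: str) -> bool:
    """Return True if (name, version) satisfies a simplified 'name[=version_glob]' spec."""
    spec_name, _, spec_version = spec.partition("=")
    if name != spec_name:
        return False
    # A spec without a version part matches every version of the package.
    return not spec_version or fnmatchcase(version, spec_version)


def plan_sync(remote, local, includelist):
    """Split packages into those to add to the mirror and those to remove from it.

    `remote` and `local` are sets of (name, version) tuples (an assumed shape).
    """
    allowed = {p for p in remote if any(matches_spec(*p, s) for s in includelist)}
    # Anything on the mirror that is no longer allowed gets removed.
    return allowed - local, local - allowed


remote = {("my_cool_package", "1.5.2"), ("my_cool_package", "2.0.0"), ("numpy", "1.23.4")}
local = {("my_cool_package", "1.4.0")}
to_add, to_remove = plan_sync(remote, local, ["my_cool_package=1.5.*"])
```

With this includelist, only `my_cool_package` 1.5.2 is scheduled for download, and the previously mirrored 1.4.0 is scheduled for removal because it no longer matches.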

I'm happy to discuss other (maybe even easier) ways of implementing this! 😊

@codecov-commenter

Codecov Report

Patch coverage: 85.71% and project coverage change: -0.05 ⚠️

Comparison is base (282f8ab) 83.14% compared to head (77ddf40) 83.09%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #639      +/-   ##
==========================================
- Coverage   83.14%   83.09%   -0.05%     
==========================================
  Files          78       78              
  Lines        6153     6153              
==========================================
- Hits         5116     5113       -3     
- Misses       1037     1040       +3     
Impacted Files              Coverage Δ
quetz/frontend.py           37.03% <0.00%> (ø)
quetz/tasks/mirror.py       88.27% <75.00%> (-1.04%) ⬇️
quetz/cli.py                74.68% <80.00%> (ø)
quetz/utils.py              81.86% <81.48%> (ø)
quetz/dao.py                88.54% <100.00%> (ø)
quetz/db_models.py          95.18% <100.00%> (ø)
quetz/main.py               86.46% <100.00%> (ø)
quetz/rest_models.py        99.41% <100.00%> (ø)
quetz/tasks/assertions.py   75.00% <100.00%> (ø)


quetz/utils.py Outdated
@@ -39,6 +39,47 @@ def check_package_membership(package_name, includelist, excludelist):
return True


def _include_pattern_match(name, version, build, pattern):
Member

This looks a lot like a MatchSpec - I think we should just use that :)

That means we should either depend on conda or vendor the MatchSpec part from conda into quetz. I think there were also discussions on making MatchSpec standalone but not sure if they ever got anywhere.

Collaborator Author

Nice catch!
We actually have conda as a dependency already because of conda-build and mamba apparently.

I'm now using MatchSpec and since it also works for just the package name, we might be able to get rid of the additional fields include_pattern_list and just use the include_list again, making the transition even simpler.

@janjagusch added the enhancement label Jun 29, 2023
@janjagusch changed the title from "multiple mirror sync with extensive include/exclude pattern matching" to "Add multiple mirror sync with extensive include/exclude pattern matching" Jun 29, 2023
@YYYasin19
Collaborator Author

rebased ✅

Collaborator

@AndreasAlbertQC AndreasAlbertQC left a comment

@YYYasin19 thank you for this PR, great stuff! I reviewed ~half of the PR, feel free to already answer whatever makes sense. I hope to be able to pick up reviewing the rest of the PR later today.

docs/source/using/mirroring.rst (resolved)
docs/source/using/mirroring.rst (outdated, resolved)
@@ -20,7 +20,7 @@ check-imports = ["quetz"]
ignore = ["W004"]

[tool.tbump.version]
- current = "0.9.2"
+ current = "0.10.0"
Collaborator

My intuition here is that the version bump should not be part of this PR, but should be determined once the release is made. @janjagusch opinions?

Collaborator Author

Yes, I only used it for deployment on the dev environment, so we can revert this before merging.

@@ -1,2 +1,2 @@
- version_info = (0, 9, 2, "", "")
+ version_info = (0, 10, 0, "", "")
Collaborator

Same question as above.

quetz/dao.py (outdated, resolved)
@@ -235,38 +261,24 @@ def handle_repodata_package(
total_size += size
file.file.seek(0)

dao.assert_size_limits(channel_name, total_size)
# create package in database
Collaborator

Should this be uncommented?

Collaborator Author

The problem is that if we already create packages from the channel metadata here (which seems to be some kind of shortcut?), then we have problems filtering them out later.

Some possible solutions:

  1. call this code if no includelist/excludelist is set -> then we already sync everything
  2. do the filtering based on includelist/excludelist here as well -> meh

Collaborator Author

I've implemented no. 1 for now

quetz/tasks/mirror.py (resolved)
quetz/tasks/mirror.py (resolved)
quetz/utils.py (outdated, resolved)

"""
examples:
- includelist: ["numpy", "pandas"]
Collaborator

The list mode is just syntactic sugar so users with a single mirror channel don't need to type a dict, right? I feel like the extra complication might not be worth it in this case, and I would prefer to have a single way of doing this rather than two ever so slightly different ones. What do you think?

Collaborator Author

I could also use the list when I mirror multiple channels but still have one list of packages I want to let through or exclude.
Also, it makes this stay backward-compatible 😬
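
One way to keep both forms without two code paths downstream is to normalize early. The following is only a sketch of that idea, not code from this PR: `normalize_package_list` is a hypothetical helper, and treating a channel that is missing from the dict as having an empty list is an assumption.

```python
from typing import Dict, List, Union

PackageList = Union[List[str], Dict[str, List[str]]]


def normalize_package_list(
    spec: PackageList, mirror_channels: List[str]
) -> Dict[str, List[str]]:
    """Turn either accepted form into a dict keyed by mirrored channel."""
    if isinstance(spec, dict):
        # Dict form: per-channel lists; channels without an entry get [] (assumption).
        return {ch: list(spec.get(ch, [])) for ch in mirror_channels}
    # List form: the same list applies to every mirrored channel.
    return {ch: list(spec) for ch in mirror_channels}


shared = normalize_package_list(["numpy", "pandas"], ["conda-forge", "internal"])
per_channel = normalize_package_list({"conda-forge": ["numpy"]}, ["conda-forge", "internal"])
```

The rest of the sync code would then only ever see the dict form, while existing configurations that use a plain list keep working.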

Collaborator

@AndreasAlbertQC AndreasAlbertQC left a comment

Here's the second half. Let's meet up and talk some time.

quetz/tasks/mirror.py (outdated, resolved)
is_uptodate = None
for _check in version_checks:
- is_uptodate = _check(package_name, metadata)
+ is_uptodate = _check(repo_package_name, metadata)
Collaborator

Why repo_package_name instead of package_name?

Collaborator Author

I think repo_package_name makes it clearer that it's something like numpy-1.23.4-py39hefdcf20_0.tar.bz2 and not just numpy.

quetz/tasks/mirror.py (outdated, resolved)
quetz/tasks/mirror.py (outdated, resolved)
"""
logger.debug(f"Removing {len(remove_batch)} packages: {remove_batch}")
removal_performed = False
package_specs_remove = set([p[1].split("-")[0] for p in remove_batch])
Collaborator

Does this work if the package name contains hyphens? E.g. abseil-cpp?

Collaborator Author

It actually doesn't!
Are you aware of any tooling that might support this and prevent me from writing some weird unreadable regex that no one wants to touch again?

Collaborator Author

Alright, I've used Dist from conda.
-> We'll need to have a discussion about whether we want to include conda as a dependency (which it already is in our dev environment) or vendor it.
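
For reference, a dependency-free alternative to the `split("-")[0]` approach, sketched under the assumption that filenames follow the usual conda `name-version-build` layout, where only the name may contain hyphens. `parse_package_filename` and `PackageFile` are hypothetical names, not part of quetz or conda.

```python
from typing import NamedTuple


class PackageFile(NamedTuple):
    name: str
    version: str
    build: str


def parse_package_filename(filename: str) -> PackageFile:
    """Split 'name-version-build.ext' into its parts, keeping hyphens in the name."""
    for ext in (".tar.bz2", ".conda"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"unsupported package extension: {filename}")
    # Split from the right: the last two fields are version and build,
    # everything before them is the (possibly hyphenated) package name.
    name, version, build = stem.rsplit("-", 2)
    return PackageFile(name, version, build)


abseil = parse_package_filename("abseil-cpp-20230125.3-h13dd4ca_0.conda")
numpy_pkg = parse_package_filename("numpy-1.23.4-py39hefdcf20_0.tar.bz2")
```

Splitting from the right handles names like abseil-cpp, which the original left-to-right split truncates to abseil.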

remote_repo = RemoteRepository(new_channel.mirror_channel_url, session)

- user_id = auth.assert_user()
+ auth.assert_user()
Collaborator

Why do we not need the user ID here anymore?

Collaborator Author

Because it was needed for create_packages_from_channeldata (will comment on that below)

@YYYasin19
Collaborator Author

Idea: building on the "register yourself as a mirror on the source channel" feature, we could also implement a trigger that sends a sync action to each "listening" mirror so it updates itself with the new package. For example: mirror1 mirrors the channel channel1. When we add a package to channel1, it sends a synchronize_packages action to all its mirrors.


T = TypeVar('T')
URLType = Annotated[str, Field(pattern="^(http|https)://.+")]
URLType = NewType('URLType', constr(pattern="^(http|https)://.+|None|^(\\[.*\\])+$"))
Collaborator

Why do you have a URL type and define a pattern? https://docs.pydantic.dev/latest/usage/types/urls/

Collaborator Author

Hm, my thought was that due to implementation of AnyHttpUrl in Pydantic

AnyHttpUrl = Annotated[Url, UrlConstraints(allowed_schemes=['http', 'https'])]

this won't work as well.
But afaik, they only require Python 3.7 which we do as well.

Labels
enhancement New feature or request
6 participants