Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiple mirror sync with extensive include/exclude pattern matching #639

Draft
wants to merge 36 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
cb4f948
start: allow multiple mirrors per package, extensive include/exclude …
YYYasin19 Jun 1, 2023
267b94a
use condas MatchSpec utility
YYYasin19 Jun 9, 2023
557cec0
package spec relies on = while our repo package names use '-'
YYYasin19 Jun 9, 2023
df8a217
validate that package was therefore before delete attempt is issued
YYYasin19 Jun 9, 2023
5815902
update: include condas matchspec
YYYasin19 Jun 15, 2023
de27969
refactor: extract repodata retrieval || add: package matching on more…
YYYasin19 Jun 15, 2023
dbd3f75
add: support for excludelist
YYYasin19 Jun 29, 2023
1d4af06
add: new database model for accessing channel urls
YYYasin19 Jun 29, 2023
0f07838
fix: not set mirror_channel_url
YYYasin19 Jun 29, 2023
a5e3847
fix: mirror_channel_urlS
YYYasin19 Jun 29, 2023
8217368
add: new tests for validating pattern checking
YYYasin19 Jun 29, 2023
c3c5467
refactor: channel membership is completely refactored
YYYasin19 Jul 18, 2023
b2af339
add + refactor: option to remove local packages that would _not_ fit …
YYYasin19 Jul 18, 2023
76e7c29
refactoring: clean up code, extract methods, document some others
YYYasin19 Aug 17, 2023
1f6be80
debug: update version to validate deployment
YYYasin19 Aug 22, 2023
8b4b965
add custom URLType for validating urls
YYYasin19 Aug 29, 2023
223e29a
fix rest models
YYYasin19 Aug 29, 2023
3ca604c
Update docs/source/using/mirroring.rst
YYYasin19 Sep 8, 2023
854d471
Update quetz/dao.py
YYYasin19 Sep 8, 2023
cf5cc7a
fix
YYYasin19 Sep 8, 2023
f6480b7
validate that proxy mode only allows a single mirror url
YYYasin19 Sep 8, 2023
e0f3bf1
add more commments
YYYasin19 Sep 8, 2023
ce554b6
add docs
YYYasin19 Sep 8, 2023
299d89b
add comment
YYYasin19 Sep 8, 2023
1afd522
rename method
YYYasin19 Sep 8, 2023
0c83779
add type hints
YYYasin19 Sep 8, 2023
a3b39db
use dist for parsing
YYYasin19 Sep 8, 2023
fc5c4bf
sooner package creation based on cond
YYYasin19 Sep 8, 2023
ff90938
simplify package removal from database
YYYasin19 Sep 8, 2023
5f951d0
update comments
YYYasin19 Sep 8, 2023
4108791
replaced Pydantic Field Validation with Annotated by constr
YYYasin19 Sep 8, 2023
2ed0e86
use enum instead of concrete values for mirror mode
YYYasin19 Sep 8, 2023
183f482
use enum instead of concrete values for mirror mode
YYYasin19 Sep 8, 2023
b51ef67
add docstring
YYYasin19 Sep 8, 2023
5ded7dd
add docstring
YYYasin19 Sep 8, 2023
c0acdbc
fix import
YYYasin19 Sep 8, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 17 additions & 3 deletions docs/source/using/mirroring.rst
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,18 @@ Creating a mirror channel is similar to creating proxy channels except that you
"mirror_mode": "mirror"
}

To mirror packages from multiple source channels, provide a list of URLs in the ``mirror_channel_url`` attribute:

.. code:: json

{
"name": "mirror-channel",
"private": false,
"mirror_channel_url": ["https://conda.anaconda.org/btel", "https://conda.anaconda.org/conda-forge"],
"mirror_mode": "mirror"
}

Note that setting multiple mirror channel urls will work for mirror channels only. Proxy channels can only mirror a single channel and will therefore use only the first url in the list.
YYYasin19 marked this conversation as resolved.
Show resolved Hide resolved


.. code:: bash
Expand All @@ -80,7 +92,7 @@ Mirror channels are read only (you can not add or change packages in these chann

curl ${QUETZ_HOST}/api/channels/mirror-channel/packages

You can also postpone the synchronising the channel by adding ``{"actions": []}`` to the request:
You can also postpone the synchronisation of the channel by adding ``{"actions": []}`` to the request:

.. code:: bash

Expand All @@ -94,6 +106,8 @@ You can also postpone the synchronising the channel by adding ``{"actions": []}`
"mirror_mode":"mirror",
"actions": []}'

Otherwise, this will be done automatically after the channel is created.


Synchronising mirror channel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Expand All @@ -118,8 +132,8 @@ If you don't want to mirror all packages for a channel or can't mirror packages
:excludelist: Don't download packages in list.
:proxylist: Parse package metadata, but redirect downloads to upstream server for packages in list.


It's possible to change metadata after creating a channel using the PATCH ``/api​/channels​/{channel_name}`` endpoint:
After creating a mirror channel you can only modify some attributes, e.g. ``private``, ``size_limit``, ``metadata`` and ``ttl``.
The ``metadata`` attribute` holds the ``includelist``, ``excludelist`` or ``proxylist`` attributes, which can therefore be changed after creating a channel using the PATCH ``/api​/channels​/{channel_name}`` endpoint:

.. code:: json

Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ check-imports = ["quetz"]
ignore = ["W004"]

[tool.tbump.version]
current = "0.9.2"
current = "0.10.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my intuition here would be that the version bump should not be part of this PR, but determined once the release is made. @janjagusch opinions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just used it for deployment reasons on the dev environment so we can revert this before merging

regex = '''
(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)
((?P<channel>a|b|rc|.dev)(?P<release>\d+))?
Expand Down
2 changes: 1 addition & 1 deletion quetz/_version.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
version_info = (0, 9, 2, "", "")
version_info = (0, 10, 0, "", "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same q as above

__version__ = '.'.join(filter(lambda s: len(s) > 0, map(str, version_info)))
28 changes: 24 additions & 4 deletions quetz/dao.py
Original file line number Diff line number Diff line change
Expand Up @@ -311,10 +311,12 @@ def create_channel(
"only ASCII characters should be used in channel name"
)

# note: due to backwards compatibility, the rest model still calls it
# 'mirror_channel_url' but the db model calls it 'mirror_channel_urls'
channel = Channel(
name=data.name,
description=data.description,
mirror_channel_url=data.mirror_channel_url,
mirror_channel_urls=data.mirror_channel_url,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require a DB migration for existing databases, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, looks a bit complicated but it used the setter even when called over the constructor so this part

    @mirror_channel_urls.setter
    def mirror_channel_urls(self, value):
        print("yo im being set with ", value)
        if isinstance(value, str):
            self.mirror_channel_url = value
        elif isinstance(value, list):
            self.mirror_channel_url = ";".join(value)
        else:
            self.mirror_channel_url = None  # type: ignore

is actually run which just populates the mirror_channel_url again.

mirror_mode=data.mirror_mode,
private=data.private,
ttl=data.ttl,
Expand All @@ -332,6 +334,19 @@ def create_channel(

return channel

def remove_package(self, package_name: str, channel_name: Optional[str] = None):
"""
Remove package from database but only from the given channel if specified.
Due to cascading behaviour, this will also remove all package versions and
package members.
"""
query = self.db.query(Package).filter(Package.name == package_name)
if channel_name:
query = query.filter(Package.channel_name == channel_name)

query.delete()
self.db.commit()

def cleanup_channel_db(
self,
channel_name: str,
Expand All @@ -344,7 +359,7 @@ def cleanup_channel_db(
Package.channel_name == channel_name
)
if package_name:
all_packages = all_packages.filter(
all_packages = all_packages.join(PackageVersion).filter(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what this change does?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you use an attribute of another table without specifying a join on that table, sqlalchemy will create a cross join.

SELECT ... 
FROM packages, package_versions # should packages JOIN package_versions ON...
WHERE packages.channel_name = ? AND package_versions.package_name = ?

I can't remember the exact details, but this caused the wrong packages to be removed.

PackageVersion.package_name == package_name
)
for each_package in all_packages:
Expand Down Expand Up @@ -379,7 +394,7 @@ def cleanup_channel_db(
Package.channel_name == channel_name
)
if package_name:
all_packages = all_packages.filter(
all_packages = all_packages.join(PackageVersion).filter(
PackageVersion.package_name == package_name
)
for each_package in all_packages:
Expand Down Expand Up @@ -412,7 +427,7 @@ def cleanup_channel_db(
Package.channel_name == channel_name
)
if package_name:
all_packages = all_packages.filter(
all_packages = all_packages.join(PackageVersion).filter(
PackageVersion.package_name == package_name
)
for x, each_package in enumerate(all_packages):
Expand Down Expand Up @@ -1009,6 +1024,11 @@ def get_channel_datas(self, channel_name: str):
)

def assert_size_limits(self, channel_name: str, size: int):
"""
validate that adding a package of size `size` to channel `channel_name`
YYYasin19 marked this conversation as resolved.
Show resolved Hide resolved
does not exceed the channel size limit.
raises: QuotaError
"""
channel_size, channel_size_limit = (
self.db.query(Channel.size, Channel.size_limit)
.filter(Channel.name == channel_name)
Expand Down
21 changes: 20 additions & 1 deletion quetz/db_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,7 @@ class Package(Base):
),
viewonly=True,
lazy="select",
cascade="all,delete-orphan",
)

@property
Expand Down Expand Up @@ -200,14 +201,32 @@ class Channel(Base):
)
description = Column(String)
private = Column(Boolean, default=False)
mirror_channel_url = Column(String)
mirror_channel_url = Column("mirror_channel_url", String)
mirror_mode = Column(String)
channel_metadata = Column(String, server_default='{}', nullable=False)
timestamp_mirror_sync = Column(Integer, default=0)
size = Column(BigInteger, default=0)
size_limit = Column(BigInteger, default=None)
ttl = Column(Integer, server_default=f'{60 * 60 * 10}', nullable=False) # 10 hours

@property
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this code serves to work around the fact that you want to have a list in python but the DB schema currently has a string. Wouldn't it be better to change the data type in the DB?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to change the database type since it would also complicate the schema for such a small change.
Obviously we don't have any concrete numbers but I guess most users would only want to mirror one other channel instead of multiple ones.

Are you in favor of having a mapping to a list of mirror channel urls?

def mirror_channel_urls(self):
if self.mirror_channel_url is None:
return []
elif ";" in self.mirror_channel_url:
return self.mirror_channel_url.split(";")
else:
return [self.mirror_channel_url]

@mirror_channel_urls.setter
def mirror_channel_urls(self, value):
if isinstance(value, str):
self.mirror_channel_url = value
elif isinstance(value, list):
self.mirror_channel_url = ";".join(value)
else:
self.mirror_channel_url = None # type: ignore

packages = relationship(
'Package', back_populates='channel', cascade="all,delete", uselist=True
)
Expand Down
2 changes: 1 addition & 1 deletion quetz/deps.py
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ def __call__(

auth.assert_channel_read(channel)

mirror_url = channel.mirror_channel_url
mirror_url = channel.mirror_channel_urls

is_proxy = mirror_url and channel.mirror_mode == "proxy"
is_mirror = mirror_url and channel.mirror_mode == "mirror"
Expand Down
80 changes: 58 additions & 22 deletions quetz/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -704,9 +704,15 @@ def post_channel(
if not new_channel.mirror_channel_url:
auth.assert_create_channel()

is_mirror = new_channel.mirror_channel_url and new_channel.mirror_mode == "mirror"
is_mirror = (
new_channel.mirror_channel_url
and new_channel.mirror_mode == rest_models.MirrorMode.mirror
)

is_proxy = new_channel.mirror_channel_url and new_channel.mirror_mode == "proxy"
is_proxy = (
new_channel.mirror_channel_url
and new_channel.mirror_mode == rest_models.MirrorMode.proxy
)

if is_mirror:
auth.assert_create_mirror_channel()
Expand Down Expand Up @@ -742,29 +748,32 @@ def post_channel(
else:
size_limit = None

# create database model (channel) from rest model (new_channel)
channel = dao.create_channel(new_channel, user_id, authorization.OWNER, size_limit)
pkgstore.create_channel(new_channel.name)
if not is_proxy:
indexing.update_indexes(dao, pkgstore, new_channel.name)

# register mirror
if is_mirror and register_mirror:
mirror_url = str(new_channel.mirror_channel_url)
mirror_url = mirror_url.replace("get", "api/channels")
headers = {"x-api-key": mirror_api_key} if mirror_api_key else {}
api_endpoint = str(request.url.replace(query=None)) + '/' + new_channel.name
request.url
response = session.post(
mirror_url + '/mirrors',
json={
"url": api_endpoint.replace("api/channels", "get"),
"api_endpoint": api_endpoint,
"metrics_endpoint": api_endpoint.replace("api", "metrics"),
},
headers=headers,
)
if response.status_code != 201:
logger.warning(f"could not register mirror due to error {response.text}")
for mirror_url in channel.mirror_channel_urls:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind expanding the comments here to document what the code does and why? All of this only works / makes snese if the mirrored channel is on another quetz instance, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my understanding as well.

mirror_url = mirror_url.replace("get", "api/channels")
headers = {"x-api-key": mirror_api_key} if mirror_api_key else {}
api_endpoint = str(request.url.replace(query=None)) + '/' + new_channel.name
request.url
response = session.post(
mirror_url + '/mirrors',
json={
"url": api_endpoint.replace("api/channels", "get"),
"api_endpoint": api_endpoint,
"metrics_endpoint": api_endpoint.replace("api", "metrics"),
},
headers=headers,
)
if response.status_code != 201:
logger.warning(
f"could not register mirror due to error {response.text}"
)

for action in actions:
task.execute_channel_action(
Expand Down Expand Up @@ -1479,6 +1488,25 @@ def _assert_filename_package_name_consistent(file_name: str, package_name: str):
after=after_log(logger, logging.WARNING),
)
def _extract_and_upload_package(file, channel_name, channel_proxylist):
"""
Extracts information from a conda package and uploads it to a package store.

This is automatically reried on error (indicating an unsuccessful upload)
The wait time between retries is exponentially increasing.

Parameters:
file (object): The conda package file object to be uploaded.
channel_name (str): The name of the channel where the package will be uploaded.
channel_proxylist (list): A list of package names that are proxied.
If the package is in this list, it will not be uploaded.

Returns:
conda_info: contains extracted information from the conda package.

Raises:
Exception: If there is an error in extracting conda info from the package
or in uploading the package to the store.
"""
try:
conda_info = CondaInfo(file.file, file.filename)
except Exception as e:
Expand Down Expand Up @@ -1798,14 +1826,22 @@ def serve_path(
except ValueError:
pass

if is_package_request and channel.mirror_channel_url:
if is_package_request and channel.mirror_channel_urls:
# if we exclude the package from syncing, redirect to original URL
# use the first mirror url entry for the redirect
channel_proxylist = json.loads(channel.channel_metadata).get('proxylist', [])
if channel_proxylist and package_name and package_name in channel_proxylist:
return RedirectResponse(f"{channel.mirror_channel_url}/{path}")
# proxying packages only works for the first url in the list
redirect_url = channel.mirror_channel_urls[0]
return RedirectResponse(f"{redirect_url}/{path}")

if channel.mirror_channel_url and channel.mirror_mode == "proxy":
repository = RemoteRepository(channel.mirror_channel_url, session)
# note: proxy mode only works with one mirror url (checked on channel creation)
if (
channel.mirror_channel_urls
and channel.mirror_mode == rest_models.MirrorMode.proxy
):
proxy_url = channel.mirror_channel_urls[0]
repository = RemoteRepository(proxy_url, session)
if not pkgstore.file_exists(channel.name, path):
download_remote_file(repository, pkgstore, channel.name, path)
elif path.endswith(".json"):
Expand Down
4 changes: 4 additions & 0 deletions quetz/pkgstores.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,10 @@ def support_redirect(self) -> bool:

@abc.abstractmethod
def create_channel(self, name):
"""
create backend resources for a channel with a given name,
e.g. create a directory or bucket.
"""
pass

@abc.abstractmethod
Expand Down
Loading