-
Notifications
You must be signed in to change notification settings - Fork 29
feat(vcs): new data model #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: vcs-staging
Are you sure you want to change the base?
feat(vcs): new data model #192
Conversation
* Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the `github_...` tables to `vcs_...`, adding a new column to the relevant tables to identify which provider the records relate to, and more. * Added an Alembic migration, including moving the repository data from `oauthclient_remoteaccount` to the `vcs_repositories` table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs. * Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs. * This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful. * This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the `vcs-staging` branch and then merge them all into `master` once we have a fully release-ready prototype. At that point, we will create a squash commit.
9f1e07b
to
449f41d
Compare
79ba5a6
to
fc8faf7
Compare
fc8faf7
to
66c42c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?
"provider_id", | ||
name="uq_vcs_repositories_provider_provider_id", | ||
), | ||
# Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I commented this because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models but I'm not super experienced with these so I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id
, we should probably do it.
Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the For example the |
existing_type=sa.Integer(), | ||
existing_nullable=True, | ||
) | ||
op.alter_column( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of this column and why we modify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stores the provider-specific ID of the webhook if the repository has been activated. On GitHub, this ID is always an integer so until now we have stored it as an integer. It also happens to be an integer on GitLab. But we don't know that this will be the case for all VCSes so we change it here to be stored as a string which is more flexible.
op.add_column( | ||
"vcs_repositories", | ||
sa.Column( | ||
"default_branch", sa.String(255), nullable=False, server_default="master" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this information used later on? why do we need to specify the default branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it to be able to generate a new file link for creating the CITATION.cff
file, which is shown in the UI as a matter of convenience. It's not absolutely essential though.
op.add_column( | ||
# Nullable for now (see below) | ||
"vcs_repositories", | ||
sa.Column("html_url", sa.String(10000), nullable=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need html_url? what other urls we might keep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just called it this to distinguish it from a potential api_url
, this way we clarify what the name refers to. This URL points to the page in the VCS provider's UI that corresponds to the repository.
invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py
Outdated
Show resolved
Hide resolved
"vcs_repositories", sa.Column("license_spdx", sa.String(255), nullable=True) | ||
) | ||
op.alter_column("vcs_repositories", "user_id", new_column_name="enabled_by_id") | ||
op.drop_index("ix_github_repositories_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it OK to drop these indices? especially the id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have two indexes:
ix_github_repositories_name
on the repo nameix_github_repositories_github_id
on the repo's provider (GitHub) ID
We are replacing them with these two:
uq_vcs_repositories_provider_provider_id
on the combination ofprovider
andprovider_id
, since each repository must have a unique ID within the context of a provider.uq_vcs_repositories_provider_name
since each repository must have a unique full name (e.g.inveniosoftware/invenio-github
) within the context of a provider.
op.alter_column( | ||
"vcs_repositories", | ||
"github_id", | ||
new_column_name="provider_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is the id of the repository supplied by the specific provider?
if this is the case, my first thought was that provider_id
means we assign an identifier to a provider (as in (github, 1), (gitlab, 2)...
) so the name of the column might not be descriptive enough to remove the ambiguity...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it's been mentioned before the provider
and provider_id
have confusing names, so it's probably worth changing them. Maybe id_from_provider
instead of provider_id
or something similar?
# | ||
# We need to recreate the SQLAlchemy models for `RemoteAccount` and `Repository` here but | ||
# in a much more lightweight way. We cannot simply import the models because (a) they depend | ||
# on the full Invenio app being initialised and all extensions available and (b) we need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand why we replicate oauth remote account, won't this recipe fail the moment we try to upgrade an existing instance? or this is not "really" creating the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just creating an SQLAlchemy model of the table so we can interact with it in a similar way to the rest of our codebase. It doesn't actually attempt to create or modify the table itself.
The alternative is using raw SQL to read/insert rows, which would be confusing for such a complex migration.
op.create_table( | ||
"vcs_repository_users", | ||
sa.Column("repository_id", UUIDType(), primary_key=True), | ||
sa.Column("user_id", sa.Integer(), primary_key=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it our user_id or user id from the VCS provider? I guess from vcs judging from the FK constraints, but can we be more explicit. Also from our previous experiences, having int type on ids can be very problematic, I would suggest another approach. What if some vcs provider has alphanumeric user ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's our ID in this case, this table is storing which Invenio users have access to which repo. The foreign key maps it to accounts_user.id
. Hence also why it's an int.
I agree the naming is confusing, maybe accounts_user_id
or something similar would work better?
…on for orphaned repos
RELEASE_STATUS_ICON = { | ||
"RECEIVED": "spinner loading icon", | ||
"PROCESSING": "spinner loading icon", | ||
"PUBLISHED": "check icon", | ||
"FAILED": "times icon", | ||
"DELETED": "times icon", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure about this given that it highly depends on the CSS frameworks in use.
Can we map this in the front end instead, via an overridable Jinja macro for example?
RELEASE_STATUS_COLOR = { | ||
"RECEIVED": "warning", | ||
"PROCESSING": "warning", | ||
"PUBLISHED": "positive", | ||
"FAILED": "negative", | ||
"DELETED": "negative", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above here
"provider_id", | ||
name="uq_vcs_repositories_provider_provider_id", | ||
), | ||
# Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id
, we should probably do it.
# Relationships | ||
# | ||
users = db.relationship(User, secondary=repository_user_association) | ||
enabled_by_user = db.relationship(User, foreign_keys=[enabled_by_user_id]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this record the last user who enabled, in case there are multiple disable/enable actions?
def add_user(self, user_id: int): | ||
"""Add permission for a user to access the repository.""" | ||
user = User(id=user_id) | ||
user = db.session.merge(user) | ||
self.users.append(user) | ||
|
||
def remove_user(self, user_id: int): | ||
"""Remove permission for a user to access the repository.""" | ||
user = User(id=user_id) | ||
user = db.session.merge(user) | ||
self.users.remove(user) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand these methods, what happens with self.users
?
To fetch an existing user, we normally do something like this:
with db.session.no_autoflush:
user = current_datastore.get_user(...)
The no_autoflush
is needed because if by any chance the user obj is modified, it will be persisted in the DB.
if provider_id: | ||
repo = cls.query.filter( | ||
Repository.provider_id == provider_id, Repository.provider == provider | ||
).one_or_none() | ||
if not repo and full_name is not None: | ||
repo = cls.query.filter( | ||
Repository.full_name == full_name, Repository.provider == provider | ||
).one_or_none() | ||
|
||
return repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if provider_id: | |
repo = cls.query.filter( | |
Repository.provider_id == provider_id, Repository.provider == provider | |
).one_or_none() | |
if not repo and full_name is not None: | |
repo = cls.query.filter( | |
Repository.full_name == full_name, Repository.provider == provider | |
).one_or_none() | |
return repo | |
if provider_id: | |
.... | |
elif not repo and full_name is not None: | |
... | |
else: | |
raise .... ? | |
return repo |
|
||
def upgrade(): | ||
"""Upgrade database.""" | ||
op.rename_table("github_repositories", "vcs_repositories") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: given the complexity of this migration, would it be maybe easier to create a new table, copy over the data, and delete the old one instead?
Closes #188
Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the
github_...
tables tovcs_...
, adding a new column to the relevant tables to identify which provider the records relate to, and more.Added an Alembic migration, including moving the repository data from
oauthclient_remoteaccount
to thevcs_repositories
table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs.Edit: see here for the upgrade guide for large instances.
We can improve the performance of this migration when perf(models): change
extra_data
toJSONB
invenio-oauthclient#360 is merged (assuming users run the migration in that PR before this one). But that's not essential.Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the
vcs-staging
branch and then merge them all intomaster
once we have a fully release-ready prototype. At that point, we will create a squash commit.