Member

@palkerecsenyi commented Sep 25, 2025

Closes #188


  • Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the github_... tables to vcs_..., adding a new column to the relevant tables to identify which provider the records relate to, and more.

  • Added an Alembic migration, which includes moving the repository data from oauthclient_remoteaccount to the vcs_repositories table, a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The docs will clarify when to use the automated migration and when to use the manual one.

  • Added a repo-user many-to-many mapping table. Because repos are no longer stored in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced by code that will be included in other PRs.

  • This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.

  • This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the vcs-staging branch and then merge them all into master once we have a fully release-ready prototype. At that point, we will create a squash commit.
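
As a rough illustration of the data move described in the second and third bullets, the sketch below copies repositories out of a JSON blob on the remote-account rows into a repository table and a repo-user mapping table. It uses the standard-library sqlite3 module rather than Alembic, and the `extra_data` layout (`{"repos": {...}}`), the trimmed-down column sets, and the `migrate_repos` helper are all assumptions made for the example, not the migration in this PR.

```python
import json
import sqlite3
import uuid


def migrate_repos(conn: sqlite3.Connection, provider: str = "github") -> int:
    """Copy repos from remote-account JSON into relational tables (hypothetical helper)."""
    moved = 0
    rows = conn.execute(
        "SELECT user_id, extra_data FROM oauthclient_remoteaccount"
    ).fetchall()
    for user_id, extra_data in rows:
        # Assumed layout: {"repos": {"<provider id>": {"full_name": ...}}}
        repos = json.loads(extra_data or "{}").get("repos", {})
        for provider_id, repo in repos.items():
            found = conn.execute(
                "SELECT id FROM vcs_repositories "
                "WHERE provider = ? AND provider_id = ?",
                (provider, str(provider_id)),
            ).fetchone()
            if found is None:
                repo_id = str(uuid.uuid4())
                conn.execute(
                    "INSERT INTO vcs_repositories "
                    "(id, provider, provider_id, full_name) VALUES (?, ?, ?, ?)",
                    (repo_id, provider, str(provider_id), repo["full_name"]),
                )
                moved += 1
            else:
                repo_id = found[0]
            # Record that this Invenio user has access to the repository.
            conn.execute(
                "INSERT OR IGNORE INTO vcs_repository_users "
                "(repository_id, user_id) VALUES (?, ?)",
                (repo_id, user_id),
            )
    conn.commit()
    return moved


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE oauthclient_remoteaccount (user_id INTEGER, extra_data TEXT);
    CREATE TABLE vcs_repositories (
        id TEXT PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        full_name TEXT NOT NULL,
        UNIQUE (provider, provider_id)
    );
    CREATE TABLE vcs_repository_users (
        repository_id TEXT NOT NULL,
        user_id INTEGER NOT NULL,
        PRIMARY KEY (repository_id, user_id)
    );
    """
)
# Two users who can both see the same repository.
shared = json.dumps({"repos": {"42": {"full_name": "org/shared"}}})
conn.execute("INSERT INTO oauthclient_remoteaccount VALUES (1, ?)", (shared,))
conn.execute("INSERT INTO oauthclient_remoteaccount VALUES (2, ?)", (shared,))
moved = migrate_repos(conn)
link_count = conn.execute("SELECT COUNT(*) FROM vcs_repository_users").fetchone()[0]
```

In this sketch a repository seen by two users is stored once and linked twice, which is the point of the mapping table. A real migration would also have to deal with batching and locking, which is presumably where the manual guide comes in.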

@palkerecsenyi changed the title from "WIP: feat(vcs): new data model" to "feat(vcs): new data model" on Sep 25, 2025
Member

@zzacharo left a comment

@palkerecsenyi I don't see anything worrying, but it is difficult to judge without the tests and the functional usage of the model... We might need to revisit this across the next PRs. @slint, do you see any major issues?

"provider_id",
name="uq_vcs_repositories_provider_provider_id",
),
# Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),
Member

leftover?

Member Author

I think I commented this out because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models, but I don't have much experience with these, so I can't be sure.

Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id, we should probably add the index.
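
One data point for this question: many databases back a unique constraint with an index of their own (PostgreSQL and SQLite both do), so a lookup on both columns may already be covered without the commented-out `Index`. The sketch below checks this in SQLite only, using the standard-library sqlite3 module and a cut-down table that is an assumption for the example. It says nothing about filters on `provider_id` alone, where `provider` being the leading column matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        CONSTRAINT uq_vcs_repositories_provider_provider_id
            UNIQUE (provider, provider_id)
    )
    """
)

# Indexes SQLite created by itself; origin "u" marks a UNIQUE constraint.
implicit = [
    row[1]
    for row in conn.execute("PRAGMA index_list(vcs_repositories)")
    if row[3] == "u"
]

# Ask the planner how it would run a lookup on both columns.
plan = " ".join(
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM vcs_repositories "
        "WHERE provider = ? AND provider_id = ?",
        ("github", "1"),
    )
)
uses_implicit_index = any(name in plan for name in implicit)
```

If `uses_implicit_index` is true, the planner is already serving the two-column lookup from the constraint's index, and an extra index on the same column pair would mostly add write overhead.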

@palkerecsenyi
Member Author

> @palkerecsenyi I don't see anything worrying, but it is difficult to judge without the tests and the functional usage of the model... We might need to revisit this across the next PRs. @slint, do you see any major issues?

Yes, it's quite an annoying way to review, sadly. If it helps, you can see the non-fragmented diff of all the code on the master branch of my fork, which is kept up to date with the fragmented PRs.

For example the models.py file: master...palkerecsenyi:invenio-vcs:master#diff-a232ee65b447a8d90fbac12501761c411764f3570061d1b18e3e8181668fcc39

existing_type=sa.Integer(),
existing_nullable=True,
)
op.alter_column(
Contributor

@kpsherva Oct 22, 2025

What is the purpose of this column, and why do we modify it?

Member Author

This stores the provider-specific ID of the webhook if the repository has been activated. On GitHub, this ID is always an integer, so until now we have stored it as an integer. It also happens to be an integer on GitLab. We can't assume this holds for every VCS, though, so we change it here to a string, which is more flexible.

op.add_column(
    "vcs_repositories",
    sa.Column(
        "default_branch", sa.String(255), nullable=False, server_default="master"
Contributor

How is this information used later on? Why do we need to specify the default branch?

Member Author

We need it to be able to generate a new file link for creating the CITATION.cff file, which is shown in the UI as a matter of convenience. It's not absolutely essential though.

https://github.com/palkerecsenyi/invenio-vcs/blob/b9c8884f99435c900234c1ebeb5abcb59c24b238/invenio_vcs/views/vcs.py#L135-L137
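
To make the use of `default_branch` concrete, here is a minimal sketch of how such a link could be assembled from the repository's `html_url` and `default_branch`. The `new_file_url` helper is hypothetical and is not the code at the link above; the `<html_url>/new/<branch>?filename=...` pattern is GitHub's URL scheme, and other providers would need their own.

```python
from urllib.parse import quote, urlencode


def new_file_url(
    html_url: str, default_branch: str, filename: str = "CITATION.cff"
) -> str:
    """Build a GitHub-style 'create new file' link (hypothetical helper)."""
    base = html_url.rstrip("/")
    # Branch names may contain "/" (e.g. "release/1.x"), so leave it unescaped.
    branch = quote(default_branch, safe="/")
    return f"{base}/new/{branch}?{urlencode({'filename': filename})}"


url = new_file_url("https://github.com/inveniosoftware/invenio-github", "master")
```

Without a stored default branch, the link would have to guess between names such as `master` and `main`, which is why the column is convenient even if not essential.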

op.add_column(
    # Nullable for now (see below)
    "vcs_repositories",
    sa.Column("html_url", sa.String(10000), nullable=True),
Contributor

Why do we need html_url? What other URLs might we keep?

Member Author

I called it this to distinguish it from a potential api_url, so the name makes clear what it refers to. This URL points to the page in the VCS provider's UI that corresponds to the repository.

"vcs_repositories", sa.Column("license_spdx", sa.String(255), nullable=True)
)
op.alter_column("vcs_repositories", "user_id", new_column_name="enabled_by_id")
op.drop_index("ix_github_repositories_name")
Contributor

Why is it OK to drop these indices, especially the one on the ID?

Member Author

We currently have two indexes:

  • ix_github_repositories_name on the repo name
  • ix_github_repositories_github_id on the repo's provider (GitHub) ID

We are replacing them with these two:

  • uq_vcs_repositories_provider_provider_id on the combination of provider and provider_id, since each repository must have a unique ID within the context of a provider.
  • uq_vcs_repositories_provider_name since each repository must have a unique full name (e.g. inveniosoftware/invenio-github) within the context of a provider.
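
The "unique within the context of a provider" rule can be tried out in isolation. The following is a small sketch using the standard-library sqlite3 module; the table is a cut-down stand-in for `vcs_repositories`, and the `full_name` column name is an assumption for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        full_name TEXT NOT NULL,
        CONSTRAINT uq_vcs_repositories_provider_provider_id
            UNIQUE (provider, provider_id),
        CONSTRAINT uq_vcs_repositories_provider_name
            UNIQUE (provider, full_name)
    )
    """
)
insert = (
    "INSERT INTO vcs_repositories (provider, provider_id, full_name) "
    "VALUES (?, ?, ?)"
)
conn.execute(insert, ("github", "1", "inveniosoftware/invenio-github"))
# Same ID and name under a different provider: allowed.
conn.execute(insert, ("gitlab", "1", "inveniosoftware/invenio-github"))

try:
    # Same provider and ID again: rejected by the first constraint.
    conn.execute(insert, ("github", "1", "inveniosoftware/other"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vcs_repositories").fetchone()[0]
```

The two composite constraints allow the same ID or name to appear once per provider, which a single-column unique index on the old `github_id` could not express.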

op.alter_column(
    "vcs_repositories",
    "github_id",
    new_column_name="provider_id",
Contributor

I guess this is the ID of the repository as supplied by the specific provider?
If so, my first reading was that provider_id is an identifier we assign to a provider (as in (github, 1), (gitlab, 2)...), so the column name might not be descriptive enough to remove the ambiguity...

Member Author

Yes, I think it's been mentioned before that provider and provider_id have confusing names, so it's probably worth changing them. Maybe id_from_provider instead of provider_id, or something similar?

#
# We need to recreate the SQLAlchemy models for `RemoteAccount` and `Repository` here but
# in a much more lightweight way. We cannot simply import the models because (a) they depend
# on the full Invenio app being initialised and all extensions available and (b) we need
Contributor

@kpsherva Oct 22, 2025

I don't fully understand why we replicate the OAuth remote account model. Won't this recipe fail the moment we try to upgrade an existing instance? Or is this not "really" creating the table?

Member Author

This is just creating an SQLAlchemy model of the table so we can interact with it in a similar way to the rest of our codebase. It doesn't actually attempt to create or modify the table itself.

The alternative is using raw SQL to read/insert rows, which would be confusing for such a complex migration.

op.create_table(
    "vcs_repository_users",
    sa.Column("repository_id", UUIDType(), primary_key=True),
    sa.Column("user_id", sa.Integer(), primary_key=True),
Contributor

Is it our user_id or the user ID from the VCS provider? I guess from the VCS, judging by the FK constraints, but can we be more explicit? Also, from our previous experience, having an int type on IDs can be very problematic; I would suggest another approach. What if some VCS provider has alphanumeric user IDs?

Member Author

@palkerecsenyi Oct 22, 2025

It's our ID in this case: this table stores which Invenio users have access to which repo. The foreign key maps it to accounts_user.id, which is also why it's an int.

I agree the naming is confusing. Maybe accounts_user_id or something similar would work better?

Comment on lines +31 to +37
RELEASE_STATUS_ICON = {
    "RECEIVED": "spinner loading icon",
    "PROCESSING": "spinner loading icon",
    "PUBLISHED": "check icon",
    "FAILED": "times icon",
    "DELETED": "times icon",
}

I am not 100% sure about this, given that it depends heavily on the CSS framework in use.
Can we map this in the front end instead, for example via an overridable Jinja macro?

Comment on lines +39 to +45
RELEASE_STATUS_COLOR = {
    "RECEIVED": "warning",
    "PROCESSING": "warning",
    "PUBLISHED": "positive",
    "FAILED": "negative",
    "DELETED": "negative",
}

Same comment as above here

# Relationships
#
users = db.relationship(User, secondary=repository_user_association)
enabled_by_user = db.relationship(User, foreign_keys=[enabled_by_user_id])

Does this record the last user who enabled the repository, in case there are multiple disable/enable actions?

Comment on lines +206 to +216
def add_user(self, user_id: int):
    """Add permission for a user to access the repository."""
    user = User(id=user_id)
    user = db.session.merge(user)
    self.users.append(user)

def remove_user(self, user_id: int):
    """Remove permission for a user to access the repository."""
    user = User(id=user_id)
    user = db.session.merge(user)
    self.users.remove(user)

I don't understand these methods. What happens with self.users?
To fetch an existing user, we normally do something like this:

with db.session.no_autoflush:
    user = current_datastore.get_user(...)

The no_autoflush is needed because otherwise, if the user object happens to be modified, the change would be persisted in the DB.

Comment on lines +233 to +242
if provider_id:
    repo = cls.query.filter(
        Repository.provider_id == provider_id, Repository.provider == provider
    ).one_or_none()
if not repo and full_name is not None:
    repo = cls.query.filter(
        Repository.full_name == full_name, Repository.provider == provider
    ).one_or_none()

return repo

Suggested change (replacing the lines quoted above):
if provider_id:
    ....
elif not repo and full_name is not None:
    ...
else:
    raise .... ?
return repo
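
One behavioural difference may be worth checking before adopting an if/elif form: the original code falls back to `full_name` when the `provider_id` lookup finds nothing, whereas an exclusive branch would not. The sketch below contrasts the two readings with a plain in-memory list instead of the SQLAlchemy query; `REPOS`, `_find`, `get_with_fallback`, and `get_exclusive` are hypothetical names used only for illustration.

```python
from typing import Optional

REPOS = [
    {"provider": "github", "provider_id": "1", "full_name": "org/repo"},
]


def _find(provider: str, **criteria: str) -> Optional[dict]:
    """Return the first repo matching the provider and all criteria, or None."""
    for repo in REPOS:
        if repo["provider"] == provider and all(
            repo[key] == value for key, value in criteria.items()
        ):
            return repo
    return None


def get_with_fallback(provider, provider_id=None, full_name=None):
    """Original shape: try the ID first, then fall back to the name."""
    repo = None
    if provider_id:
        repo = _find(provider, provider_id=provider_id)
    if not repo and full_name is not None:
        repo = _find(provider, full_name=full_name)
    return repo


def get_exclusive(provider, provider_id=None, full_name=None):
    """if/elif/else shape: use exactly one criterion, reject calls with neither."""
    if provider_id:
        return _find(provider, provider_id=provider_id)
    if full_name is not None:
        return _find(provider, full_name=full_name)
    raise ValueError("provider_id or full_name is required")


# An unknown ID combined with a valid name is only found by the fallback variant.
with_fallback = get_with_fallback("github", provider_id="999", full_name="org/repo")
exclusive = get_exclusive("github", provider_id="999", full_name="org/repo")
```

Which behaviour is wanted depends on how callers use the method, which this PR does not show.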


def upgrade():
    """Upgrade database."""
    op.rename_table("github_repositories", "vcs_repositories")

Question: given the complexity of this migration, would it maybe be easier to create a new table, copy over the data, and delete the old one instead?

Development

Successfully merging this pull request may close these issues.

Make invenio-github support other VCS providers

4 participants