feat(vcs): new data model #192

palkerecsenyi · 2025-09-25T10:10:42Z

Closes #188

- Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the `github_...` tables to `vcs_...`, adding a new column to the relevant tables to identify which provider the records relate to, and more.
- Added an Alembic migration, including moving the repository data from `oauthclient_remoteaccount` to the `vcs_repositories` table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The docs will clarify when to use the automated migration and when to use the manual one.
  - Edit: see here for the upgrade guide for large instances.
  - We can improve the performance of this migration once invenio-oauthclient#360 (perf(models): change extra_data to JSONB) is merged, assuming users run the migration in that PR before this one. That is not essential, though.
- Added a repo-user m-to-m mapping table. Because repos are no longer stored in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
- This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
- This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the `vcs-staging` branch and then merge them all into `master` once we have a fully release-ready prototype. At that point, we will create a squash commit.
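
To make the data move in the migration item more concrete, here is a minimal sketch using only Python's standard library and an in-memory SQLite database instead of Alembic. The layout of `extra_data` (a `repos` mapping keyed by repository ID), the reduced column set, and the names `create_schema` and `move_repos_out_of_extra_data` are assumptions made for illustration; the actual migration in this PR is more involved.

```python
import json
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    """Create heavily simplified stand-ins for the two tables (assumed columns)."""
    conn.executescript(
        """
        CREATE TABLE oauthclient_remoteaccount (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            extra_data TEXT
        );
        CREATE TABLE vcs_repositories (
            id TEXT PRIMARY KEY,
            provider TEXT NOT NULL,
            provider_id TEXT NOT NULL,
            name TEXT NOT NULL,
            UNIQUE (provider, provider_id)
        );
        """
    )


def move_repos_out_of_extra_data(conn: sqlite3.Connection, provider: str = "github") -> None:
    """Copy every repo found in the JSON blobs into its own row."""
    accounts = conn.execute(
        "SELECT extra_data FROM oauthclient_remoteaccount"
    ).fetchall()
    with conn:  # commit everything together, or roll back on error
        for (extra_data,) in accounts:
            repos = json.loads(extra_data or "{}").get("repos", {})
            for repo in repos.values():
                # Several users can see the same repo, so skip existing rows.
                conn.execute(
                    "INSERT OR IGNORE INTO vcs_repositories"
                    " (id, provider, provider_id, name) VALUES (?, ?, ?, ?)",
                    (str(uuid.uuid4()), provider, str(repo["id"]), repo["full_name"]),
                )


conn = sqlite3.connect(":memory:")
create_schema(conn)
shared = {"id": 1, "full_name": "inveniosoftware/invenio-github"}
own = {"id": 2, "full_name": "example/private-repo"}
conn.executemany(
    "INSERT INTO oauthclient_remoteaccount (user_id, extra_data) VALUES (?, ?)",
    [
        (10, json.dumps({"repos": {"1": shared}})),
        (11, json.dumps({"repos": {"1": shared, "2": own}})),
    ],
)
move_repos_out_of_extra_data(conn)
migrated = conn.execute(
    "SELECT provider, provider_id, name FROM vcs_repositories ORDER BY provider_id"
).fetchall()
```

In this sketch the repo shared by both accounts ends up as a single row, which is the situation the separate repo-user mapping table is meant to handle.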


zzacharo

@palkerecsenyi I don't see anything worrying, but it is also difficult to judge without the tests and the functional usage of the model... We might need to revisit this across the next PRs. @slint do you see any major issues?

zzacharo · 2025-10-15T12:41:31Z

invenio_vcs/models.py

+            "provider_id",
+            name="uq_vcs_repositories_provider_provider_id",
+        ),
+        # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),


I think I commented this out because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models, but I'm not very experienced with these, so I can't be sure.

Hard to say without seeing the rest of the code. If there is a large number of rows and we filter by provider_id, we should probably add it.

palkerecsenyi · 2025-10-15T13:12:09Z

> @palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?

Yes, indeed, it's sadly quite an annoying way to review. If it helps, you can see the non-fragmented diff of all the code on the master branch of my fork, which is kept up-to-date with the fragmented PRs.

For example, the models.py file: master...palkerecsenyi:invenio-vcs:master#diff-a232ee65b447a8d90fbac12501761c411764f3570061d1b18e3e8181668fcc39

kpsherva · 2025-10-22T09:45:04Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+        existing_type=sa.Integer(),
+        existing_nullable=True,
+    )
+    op.alter_column(


What is the purpose of this column, and why are we modifying it?

This stores the provider-specific ID of the webhook if the repository has been activated. On GitHub, this ID is always an integer so until now we have stored it as an integer. It also happens to be an integer on GitLab. But we don't know that this will be the case for all VCSes so we change it here to be stored as a string which is more flexible.
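
As a small illustration of the same idea on the application side, the sketch below normalizes whatever ID a provider returns into a string before it is stored. The helper name `normalize_hook_id` is hypothetical and not part of this PR.

```python
from typing import Optional, Union


def normalize_hook_id(raw: Union[int, str, None]) -> Optional[str]:
    """Return the provider's webhook ID as a string, or None if there is no hook."""
    if raw is None:
        return None
    return str(raw)


github_style = normalize_hook_id(12345)     # integer ID, as on GitHub and GitLab
opaque_style = normalize_hook_id("a1b2-c3")  # a provider using non-numeric IDs
no_hook = normalize_hook_id(None)            # repository not activated
```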

kpsherva · 2025-10-22T09:46:37Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.add_column(
+        "vcs_repositories",
+        sa.Column(
+            "default_branch", sa.String(255), nullable=False, server_default="master"


How is this information used later on? Why do we need to specify the default branch?

We need it to generate a "create new file" link for the CITATION.cff file, which is shown in the UI as a convenience. It's not absolutely essential, though.

https://github.com/palkerecsenyi/invenio-vcs/blob/b9c8884f99435c900234c1ebeb5abcb59c24b238/invenio_vcs/views/vcs.py#L135-L137
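
For illustration only, a link of this kind could be built roughly as follows. The `/new/<branch>?filename=...` layout is the GitHub-style URL and is an assumption here, as is the helper name `new_citation_file_url`; other providers use different URL schemes, and the linked view code remains the authoritative version.

```python
from urllib.parse import quote, urlencode


def new_citation_file_url(
    html_url: str, default_branch: str, filename: str = "CITATION.cff"
) -> str:
    """Build a GitHub-style "create new file" URL (assumed URL layout)."""
    base = html_url.rstrip("/")
    # Branch names may contain slashes (e.g. "release/1.x"), so keep them unescaped.
    branch = quote(default_branch, safe="/")
    return f"{base}/new/{branch}?{urlencode({'filename': filename})}"


link = new_citation_file_url(
    "https://github.com/inveniosoftware/invenio-github", "master"
)
```

Without a stored default branch, a helper like this would have to guess the branch name, which is why the column exists.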

kpsherva · 2025-10-22T09:47:04Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.add_column(
+        # Nullable for now (see below)
+        "vcs_repositories",
+        sa.Column("html_url", sa.String(10000), nullable=True),


Why do we need html_url? What other URLs might we keep?

I just called it this to distinguish it from a potential api_url, so that it is clear what the name refers to. This URL points to the page in the VCS provider's UI that corresponds to the repository.

kpsherva · 2025-10-22T09:48:43Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+        "vcs_repositories", sa.Column("license_spdx", sa.String(255), nullable=True)
+    )
+    op.alter_column("vcs_repositories", "user_id", new_column_name="enabled_by_id")
+    op.drop_index("ix_github_repositories_name")


Why is it OK to drop these indices, especially the one on the ID?

We currently have two indexes:

- ix_github_repositories_name on the repo name
- ix_github_repositories_github_id on the repo's provider (GitHub) ID

We are replacing them with these two:

- uq_vcs_repositories_provider_provider_id on the combination of provider and provider_id, since each repository must have a unique ID within the context of a provider.
- uq_vcs_repositories_provider_name since each repository must have a unique full name (e.g. inveniosoftware/invenio-github) within the context of a provider.
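
The behaviour of the two composite unique constraints can be sketched with the standard library's `sqlite3` module. The table below is a reduced stand-in with assumed column names, and `insert_repo` is a hypothetical helper, not code from this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        name TEXT NOT NULL,
        CONSTRAINT uq_vcs_repositories_provider_provider_id
            UNIQUE (provider, provider_id),
        CONSTRAINT uq_vcs_repositories_provider_name
            UNIQUE (provider, name)
    )
    """
)


def insert_repo(provider: str, provider_id: str, name: str) -> bool:
    """Insert a row; return False if a unique constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO vcs_repositories (provider, provider_id, name)"
                " VALUES (?, ?, ?)",
                (provider, provider_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    insert_repo("github", "1", "inveniosoftware/invenio-github"),
    # Same ID and name under another provider: allowed.
    insert_repo("gitlab", "1", "inveniosoftware/invenio-github"),
    # Same provider and ID again: rejected.
    insert_repo("github", "1", "inveniosoftware/other"),
    # Same provider and name again: rejected.
    insert_repo("github", "2", "inveniosoftware/invenio-github"),
]
```

In both SQLite and PostgreSQL a unique constraint is enforced through an index, so lookups on (provider, provider_id) should still be able to use one after the old indexes are dropped.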

kpsherva · 2025-10-22T09:52:24Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.alter_column(
+        "vcs_repositories",
+        "github_id",
+        new_column_name="provider_id",


I guess this is the ID of the repository supplied by the specific provider?
If so, my first thought was that provider_id meant an identifier we assign to a provider (as in (github, 1), (gitlab, 2)...), so the column name might not be descriptive enough to avoid that ambiguity.

Yes, I think it's been mentioned before that provider and provider_id have confusing names, so it's probably worth changing them. Maybe id_from_provider instead of provider_id, or something similar?

kpsherva · 2025-10-22T09:54:23Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    #
+    # We need to recreate the SQLAlchemy models for `RemoteAccount` and `Repository` here but
+    # in a much more lightweight way. We cannot simply import the models because (a) they depend
+    # on the full Invenio app being initialised and all extensions available and (b) we need


I don't fully understand why we replicate the OAuth remote account model. Won't this recipe fail the moment we try to upgrade an existing instance? Or is this not "really" creating the table?

This is just creating an SQLAlchemy model of the table so we can interact with it in a similar way to the rest of our codebase. It doesn't actually attempt to create or modify the table itself.

The alternative is using raw SQL to read/insert rows, which would be confusing for such a complex migration.

kpsherva · 2025-10-22T11:40:30Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.create_table(
+        "vcs_repository_users",
+        sa.Column("repository_id", UUIDType(), primary_key=True),
+        sa.Column("user_id", sa.Integer(), primary_key=True),


Is it our user_id or the user ID from the VCS provider? I guess it is from the VCS provider, judging from the FK constraints, but can we be more explicit? Also, from our previous experience, having an int type for IDs can be very problematic, so I would suggest another approach. What if some VCS provider has alphanumeric user IDs?

It's our ID in this case: this table stores which Invenio users have access to which repo. The foreign key maps it to accounts_user.id, which is also why it's an int.

I agree the naming is confusing. Maybe accounts_user_id or something similar would work better?
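
A reduced sketch of such an association table, again with `sqlite3` and assumed column sets, shows why `user_id` here is the Invenio account ID: the foreign key refuses any value that is not an existing `accounts_user.id`. The helpers `grant_access` and `repos_for_user` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE accounts_user (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE vcs_repositories (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE vcs_repository_users (
        repository_id TEXT NOT NULL REFERENCES vcs_repositories (id),
        user_id INTEGER NOT NULL REFERENCES accounts_user (id),
        PRIMARY KEY (repository_id, user_id)
    );
    INSERT INTO accounts_user VALUES (1, 'alice@example.org'), (2, 'bob@example.org');
    INSERT INTO vcs_repositories VALUES ('r1', 'org/one'), ('r2', 'org/two');
    """
)


def grant_access(repository_id: str, user_id: int) -> None:
    """Record that an Invenio user may access a repository."""
    with conn:
        conn.execute(
            "INSERT INTO vcs_repository_users (repository_id, user_id) VALUES (?, ?)",
            (repository_id, user_id),
        )


def repos_for_user(user_id: int) -> list:
    """Return the names of all repositories the user has access to."""
    rows = conn.execute(
        "SELECT r.name FROM vcs_repositories AS r"
        " JOIN vcs_repository_users AS ru ON ru.repository_id = r.id"
        " WHERE ru.user_id = ? ORDER BY r.name",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]


grant_access("r1", 1)
grant_access("r2", 1)
grant_access("r1", 2)
```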

ntarocco · 2025-10-22T16:04:42Z

invenio_vcs/models.py

+RELEASE_STATUS_ICON = {
+    "RECEIVED": "spinner loading icon",
+    "PROCESSING": "spinner loading icon",
+    "PUBLISHED": "check icon",
+    "FAILED": "times icon",
+    "DELETED": "times icon",
+}


I am not 100% sure about this given that it highly depends on the CSS frameworks in use.
Can we map this in the front end instead, via an overridable Jinja macro for example?

ntarocco · 2025-10-22T16:05:00Z

invenio_vcs/models.py

+RELEASE_STATUS_COLOR = {
+    "RECEIVED": "warning",
+    "PROCESSING": "warning",
+    "PUBLISHED": "positive",
+    "FAILED": "negative",
+    "DELETED": "negative",
+}


Same comment as above here
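
One possible shape for this, sketched in plain Python rather than as a Jinja macro: the model keeps only the status string, and a presentation-layer helper resolves it to framework-specific classes with overridable defaults. The names `DEFAULT_STATUS_STYLE` and `status_style` are hypothetical, and the default values simply reuse the ones from the two hunks above.

```python
from typing import Dict, Mapping, Optional

# Defaults for one CSS framework; an instance could override any entry.
DEFAULT_STATUS_STYLE: Dict[str, Dict[str, str]] = {
    "RECEIVED": {"icon": "spinner loading icon", "color": "warning"},
    "PROCESSING": {"icon": "spinner loading icon", "color": "warning"},
    "PUBLISHED": {"icon": "check icon", "color": "positive"},
    "FAILED": {"icon": "times icon", "color": "negative"},
    "DELETED": {"icon": "times icon", "color": "negative"},
}


def status_style(
    status: str, overrides: Optional[Mapping[str, Dict[str, str]]] = None
) -> Dict[str, str]:
    """Resolve a release status to presentation classes outside the model."""
    styles = {**DEFAULT_STATUS_STYLE, **(overrides or {})}
    # Unknown statuses get empty classes instead of raising in a template.
    return dict(styles.get(status, {"icon": "", "color": ""}))


default_published = status_style("PUBLISHED")
custom_failed = status_style(
    "FAILED", overrides={"FAILED": {"icon": "bi bi-x-circle", "color": "danger"}}
)
```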

ntarocco · 2025-10-22T16:08:01Z

invenio_vcs/models.py

+            "provider_id",
+            name="uq_vcs_repositories_provider_provider_id",
+        ),
+        # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),


Hard to say without seeing the rest of the code. If there is a large number of rows and we filter by provider_id, we should probably add it.

ntarocco · 2025-10-22T16:09:11Z

invenio_vcs/models.py

+    # Relationships
+    #
+    users = db.relationship(User, secondary=repository_user_association)
+    enabled_by_user = db.relationship(User, foreign_keys=[enabled_by_user_id])


Does this record the last user who enabled the repository, in case there are multiple disable/enable actions?

ntarocco · 2025-10-22T16:16:52Z

invenio_vcs/models.py

+    def add_user(self, user_id: int):
+        """Add permission for a user to access the repository."""
+        user = User(id=user_id)
+        user = db.session.merge(user)
+        self.users.append(user)
+
+    def remove_user(self, user_id: int):
+        """Remove permission for a user to access the repository."""
+        user = User(id=user_id)
+        user = db.session.merge(user)
+        self.users.remove(user)


I don't understand these methods. What happens with self.users?
To fetch an existing user, we normally do something like this:

    with db.session.no_autoflush:
        user = current_datastore.get_user(...)

The no_autoflush is needed because otherwise, if the user object happens to have been modified, the change will be persisted in the DB.

ntarocco · 2025-10-22T16:17:42Z

invenio_vcs/models.py

+        if provider_id:
+            repo = cls.query.filter(
+                Repository.provider_id == provider_id, Repository.provider == provider
+            ).one_or_none()
+        if not repo and full_name is not None:
+            repo = cls.query.filter(
+                Repository.full_name == full_name, Repository.provider == provider
+            ).one_or_none()
+
+        return repo


Suggested change

-        if provider_id:
-            repo = cls.query.filter(
-                Repository.provider_id == provider_id, Repository.provider == provider
-            ).one_or_none()
-        if not repo and full_name is not None:
-            repo = cls.query.filter(
-                Repository.full_name == full_name, Repository.provider == provider
-            ).one_or_none()
-
-        return repo
+        if provider_id:
+            ....
+        elif not repo and full_name is not None:
+            ...
+        else:
+            raise .... ?
+
+        return repo
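
The suggestion leaves its branches elided. One possible reading, sketched here against an in-memory list instead of the SQLAlchemy query, keeps the fallback from ID to full name that the original hunk has and adds an explicit error when neither identifier is given. `Repo` and `find_repo` are hypothetical stand-ins, and the choice of `ValueError` is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Repo:
    provider: str
    provider_id: str
    full_name: str


def find_repo(
    repos: Iterable[Repo],
    provider: str,
    provider_id: Optional[str] = None,
    full_name: Optional[str] = None,
) -> Optional[Repo]:
    """Look up a repo by provider-specific ID, falling back to its full name."""
    if provider_id is None and full_name is None:
        raise ValueError("Either provider_id or full_name is required")

    candidates = [r for r in repos if r.provider == provider]
    repo = None
    if provider_id is not None:
        repo = next((r for r in candidates if r.provider_id == provider_id), None)
    if repo is None and full_name is not None:
        repo = next((r for r in candidates if r.full_name == full_name), None)
    return repo


REPOS = [
    Repo("github", "1", "inveniosoftware/invenio-github"),
    Repo("gitlab", "1", "inveniosoftware/invenio-gitlab"),
]
by_id = find_repo(REPOS, "gitlab", provider_id="1")
by_name = find_repo(
    REPOS, "github", provider_id="404", full_name="inveniosoftware/invenio-github"
)
```

Initialising `repo` before the first branch also avoids referencing an unbound name when only `full_name` is passed.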

ntarocco · 2025-10-22T16:20:37Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+
+def upgrade():
+    """Upgrade database."""
+    op.rename_table("github_repositories", "vcs_repositories")


Question: given the complexity of this migration, would it maybe be easier to create a new table, copy over the data, and delete the old one instead?
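
For comparison, the create/copy/drop alternative raised in this question might look roughly like this. It is a sketch with `sqlite3` and a reduced, assumed column set (including the `hook` column name), not a proposal for the actual Alembic migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE github_repositories (
        id TEXT PRIMARY KEY,
        github_id INTEGER,
        name TEXT,
        hook INTEGER
    );
    INSERT INTO github_repositories VALUES
        ('a', 1, 'org/one', 42),
        ('b', 2, 'org/two', NULL);
    """
)


def copy_to_new_table(conn: sqlite3.Connection, provider: str = "github") -> None:
    """Create the new table, copy the rows across, then drop the old table."""
    with conn:
        conn.execute(
            """
            CREATE TABLE vcs_repositories (
                id TEXT PRIMARY KEY,
                provider TEXT NOT NULL,
                provider_id TEXT NOT NULL,
                name TEXT NOT NULL,
                hook TEXT,
                UNIQUE (provider, provider_id)
            )
            """
        )
        # Integer IDs become strings during the copy; NULL hooks stay NULL.
        conn.execute(
            """
            INSERT INTO vcs_repositories (id, provider, provider_id, name, hook)
            SELECT id, ?, CAST(github_id AS TEXT), name, CAST(hook AS TEXT)
            FROM github_repositories
            """,
            (provider,),
        )
        conn.execute("DROP TABLE github_repositories")


copy_to_new_table(conn)
copied = conn.execute(
    "SELECT provider, provider_id, name, hook FROM vcs_repositories ORDER BY id"
).fetchall()
```

The trade-off is that the new schema is declared in one place, at the cost of rewriting every row and recreating any foreign keys that point at the old table.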

palkerecsenyi changed the title ~~WIP: feat(vcs): new data model~~ feat(vcs): new data model Sep 25, 2025

palkerecsenyi force-pushed the data-layer branch from 9f1e07b to 449f41d on September 25, 2025 10:11

palkerecsenyi mentioned this pull request Aug 15, 2025

Make invenio-github support other VCS providers #188

palkerecsenyi linked an issue Sep 25, 2025 that may be closed by this pull request

Make invenio-github support other VCS providers #188

palkerecsenyi force-pushed the data-layer branch from 79ba5a6 to fc8faf7 on October 9, 2025 09:10

chore: pydoc

66c42c0

palkerecsenyi force-pushed the data-layer branch from fc8faf7 to 66c42c0 on October 9, 2025 15:57

zzacharo reviewed Oct 15, 2025

WIP: models: JSONB for errors column

bf91a21

WIP: chore: license

24cfce3

kpsherva reviewed Oct 22, 2025

kpsherva reviewed Oct 22, 2025

feat(models): rename enabled_by_id -> enabled_by_user_id, add migration for orphaned repos

f8a3d84

ntarocco reviewed Oct 22, 2025
