Generate a minimal repo #1335

Closed
hoijui opened this issue Sep 23, 2021 · 37 comments

@hoijui

hoijui commented Sep 23, 2021

... meaning: A repo that contains just a text file with the license IDs (one per line) - and maybe one like that with the exceptions.

The reason to have such a repo is so that it can easily be used as a sub-module in other projects that need to perform a basic validity check on license identifiers. For example, I am writing a small command-line tool that checks whether certain meta-data/build files use a correct SPDX license identifier. This tool also performs many other checks, and needs data to verify them, all of which comes from other places. The way it works: at compile-time, a local text file is read and converted into an internal, constant array of strings, which is then used for checking at run-time. If the file is available in a git sub-module, one gets reproducible builds, and one does not have to hack something together to download the file at compile-time, nor run extra scripts before compiling that would download the file.
This repo can be used for that, or the one already generated, but they are kind of overkill for that purpose, and if I did a similar thing for all the data I need for all the checks, a full recursive clone of the repo would become big, and might look confusing to the human eye.

If you do not consider this a meaningful/worthwhile thing to have, I would try to generate it myself, using a scheduled github action (running once a day or once a week or so).
If you do consider it worthwhile, I could try to make a pull request too, if you prefer that.
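
A minimal sketch of the run-time half of such a check, assuming a file with one identifier per line. The helper name `is_known_spdx_id` and the temporary sample file are illustrative only and not part of any SPDX tooling; the compile-time embedding step is not shown.

```shell
# Hypothetical helper: succeeds if $1 appears as a whole line in the list file $2.
is_known_spdx_id() {
    # -F: fixed string (IDs contain '.' and '+'), -x: whole-line match, -q: quiet
    grep -Fxq -- "$1" "$2"
}

# Illustrative stand-in for the proposed one-ID-per-line file.
list_file=$(mktemp)
printf '%s\n' 0BSD AAL Abstyles Adobe-2006 Adobe-Glyph > "$list_file"

for id in 0BSD Adobe-2006 NotALicense; do
    if is_known_spdx_id "$id" "$list_file"; then
        echo "$id: known"
    else
        echo "$id: unknown"
    fi
done
```

The whole-line match matters here: a plain substring search would accept `Adobe` because it is a prefix of `Adobe-2006`.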

@silverhook
Collaborator

Something like this would be useful also for REUSE – whether within SPDX or to collaborate on (CC @mxmehl)

@seabass-labrax
Contributor

I really like this idea, @hoijui - I'm always a fan of reproducible builds, and you've got a great point about using a Git submodule for this :) Any change that also improves REUSE is bound to be a good thing! @silverhook, are you imagining that this could replace the hard-coded licenses.json file in the REUSE Python tool?

As an alternative to a separate repository, we could also create a new branch containing only this minimal information in the license-list-data repository, to be used with a command like git submodule add -b plain-list https://github.com/spdx/license-list-data.

For some background, currently, a GitHub Actions workflow on this repository calls the Makefile, which in turn runs the LicenseListPublisher tool and deploys the output to the license-list-data repository. This happens on every push to this repository, so @swinslow manually tags our releases on both repositories.

@silverhook
Collaborator

@silverhook, are you imagining that this could replace the hard-coded licenses.json file in the REUSE Python tool?

I’m not the right person to answer that. I would defer this to @mxmehl and @carmenbianca. Just a thought because such a thing was discussed before.

@goneall
Member

goneall commented Sep 23, 2021

As an alternative to a separate repository, we could also create a new branch containing only this minimal information in the license-list-data repository

The license-list-data repository has several folders representing different data formats used by different tools.

There is currently a text folder which is similar to this request. The files are named using the exception and license ID's with a ".txt" extension.

Can this folder be used to satisfy the request? I realize it is not in its own repo, but you could reference just the files you need.

@silverhook
Collaborator

There is currently a text folder which is similar to this request. The files are named using the exception and license ID's with a ".txt" extension.

AFAIK, that is currently the place where the reuse tool pulls the license texts from to put in the LICENSES/ folder, yes.

Personally, I would prefer that to continue to be the case, but once or twice it was brought up (esp. in the LGPL-3.0 + GPL-3.0 text discussion, which let’s not discuss in this issue) whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?).

@goneall
Member

goneall commented Sep 23, 2021

once or twice it was brought up (esp. in the LGPL-3.0 + GPL-3.0 text discussion, which let’s not discuss in this issue)

Completely agree on not discussing - we've discussed that one enough - but I can understand why the issue was raised

whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?)

That's a good question for the legal team. I can provide information on where the text originates, but I'll leave the purpose and use to the larger legal community.

The text found in the above mentioned folder is an exact copy of the license test text associated with the license ID or a text rendered version of the license XML license text if no license test text is found. I'm not aware of any licenses where the license test text is missing, so almost all of the text is copied from there.

For new licenses, the submitter provides the license test text. It is typically (but not guaranteed to be) copied from the canonical license text which has the advantage of retaining all the original formatting which is lost in the XML rendering of text. When we converted over to the XML version of the license text, the license test text was copied over from the text files used to generate the spdx license list website. For many of these text files, we changed the formatting so it would render nicely on the web page but only whitespace characters were changed.

@silverhook
Collaborator

silverhook commented Sep 23, 2021

whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?)

That's a good question for the legal team. I can provide information on where the text originates, but I'll leave the purpose and use to the larger legal community.

Unfortunately I have had a hard time joining SPDX calls in the past few years, since they clash with some internal (MMO-)calls at work :/

I think between the XML templates in license-list-XML and the license-list-data we should be able to cover both the use case for matching for license scanners and the use case as a repository of (quasi-)canonical license texts reuse.

But yes, ultimately that is a question that needs to be addressed, and SPDX Legal team is probably the best start.

@hoijui
Author

hoijui commented Sep 23, 2021

the reason I do not want to use the https://github.com/spdx/license-list-data repo, is this:

git clone git@github.com:spdx/license-list-data.git
du -sh license-list-data/.git
234M	total

Though the source repo is already much better in this regard:

git clone git@github.com:spdx/license-list-XML.git
du -sh license-list-XML/.git
6.3M	total

The repo I am proposing would probably be in the ballpark of (wild guess) 1KB - 10KB.

If my tiny CLI project depends on a 234MB sub-module just for a list of a few hundred short strings, that would feel wrong to me. I might be able to live with the 6MB, even though it is still ~10 times the size of my project, and the license list is just one of such submodule dependencies.

@silverhook With another branch in the same repo, would it not still download the whole history when recursively checking out the repo?

If you think the use-case for this is too small, I am totally fine with that. I only know of mine and one other use-case, personally.

@silverhook
Collaborator

@hoijui Why don’t you just shallow clone? https://stackoverflow.com/a/1210012

Again, regarding how the reuse tool itself works and why, I’m not the right person to ask – @carmenbianca and @mxmehl would be the better people for that.

@hoijui
Author

hoijui commented Sep 23, 2021

That changes the size to 25MB and 2.5MB respectively. 2.5MB is better than 6MB.
It is still 4 times as big as my project, and I don't know if it could be applied for everyone for a submodule, without having to tell people to manually shallow clone.

@zvr
Member

zvr commented Sep 23, 2021

Of course, to actually check a license identifier, one should also specify which version of the License List should be consulted, so you also want to have the information about when each identifier was introduced, etc.
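
As a rough sketch of what a version-aware check could look like: assume a tab-separated table mapping each identifier to the list version that introduced it. The table format, the `Example-*` identifiers, their versions, and the helper name `known_in_version` are all made up for illustration; nothing here reflects a file that SPDX actually publishes.

```shell
# Hypothetical table: identifier, then the list version that introduced it.
table=$(mktemp)
printf '%s\t%s\n' Example-Old 1.0 Example-New 3.10 > "$table"

# Succeeds if ID $1 exists in list version $2 (major.minor) according to table $3.
known_in_version() {
    awk -F '\t' -v id="$1" -v want="$2" '
        $1 == id {
            split($2, have, "."); split(want, need, ".")
            # Compare major and minor numerically, so that 3.10 is newer than 3.9.
            found = (have[1] + 0 < need[1] + 0) ||
                    (have[1] + 0 == need[1] + 0 && have[2] + 0 <= need[2] + 0)
            exit
        }
        END { exit (found ? 0 : 1) }
    ' "$3"
}

known_in_version Example-New 3.9 "$table" || echo "Example-New is not in 3.9"
known_in_version Example-New 3.10 "$table" && echo "Example-New is in 3.10"
```

A real check would also need to account for deprecated identifiers, which this sketch ignores.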

@seabass-labrax
Contributor

@hoijui, could you try running git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data?

@mlinksva
Contributor

I think between the XML templates in license-list-XML and the license-list-data we should be able to cover both the use case for matching for license scanners and the use case as a repository of (quasi-)canonical license texts reuse.

This might be a tangential observation, but FWIW the files in license-list-XML and license-list-data (the text directory anyway) aren't 1:1 which is a minor annoyance (to get to a common set of files with the same names, munging is required), including at least:

  • XML has exceptions in a subdirectory, text doesn't, and not all exceptions have exception in the filename
  • text has a bunch of deprecated_ prefix filenames, some but not all of which map to + files in XML
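
The kind of munging meant here could be sketched as follows for the text directory, assuming file names of the form `<ID>.txt` with an optional `deprecated_` prefix. The helper name `id_from_text_filename` and the two sample paths are illustrative, and the sketch does not attempt to map deprecated names back to their XML counterparts.

```shell
# Hypothetical helper: derive an identifier from a file name in the text folder.
id_from_text_filename() {
    name=${1##*/}              # drop any directory part
    name=${name%.txt}          # drop the ".txt" extension
    name=${name#deprecated_}   # drop the "deprecated_" prefix, if present
    printf '%s\n' "$name"
}

id_from_text_filename "text/MIT.txt"
id_from_text_filename "text/deprecated_GPL-2.0+.txt"
```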

@silverhook
Collaborator

* XML has exceptions in a subdirectory, text doesn't, and not all exceptions have exception in the filename

* text has a bunch of deprecated_ prefix filenames, some but not all of which map to + files in XML

Those are really good points. If all files (if at all possible) had the SPDX ID as their basename, that would make things easier.

@hoijui
Author

hoijui commented Sep 24, 2021

git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data

$ git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data
$ du -sh license-list-data/.git
160K	total
$ ls license-list-data/
exceptions.txt  licenses.txt
$ cat license-list-data/licenses.txt | head -n 5
0BSD
AAL
Abstyles
Adobe-2006
Adobe-Glyph

That is how I imagined it, thank you @seabass-labrax ! :-)

So it does work as a separate branch in the same repo when cloning it manually. Still, I would not know how to use that in practice as a sub-module, without manual extra steps for everyone working on the project.
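
One possible way to avoid the manual steps is to record the shallow-clone hint in `.gitmodules` via the `submodule.<name>.shallow` setting. The sketch below is an offline demo under stated assumptions: the two local repositories, the submodule path `spdx-ids`, and the throwaway identity are stand-ins, and the `protocol.file.allow` override is only there so that the demo can use `file://` URLs. Behaviour with a branch that is not the remote's default branch may depend on the Git version and the server.

```shell
set -eu
work=$(mktemp -d)
# Throwaway identity and local-file transport, only needed for this offline demo.
demo="-c user.name=demo -c user.email=demo@example.invalid -c protocol.file.allow=always"

# Stand-in for the list repository (hypothetical content, two commits).
git init -q "$work/list"
printf '%s\n' 0BSD AAL > "$work/list/licenses.txt"
git -C "$work/list" add licenses.txt
git -C "$work/list" $demo commit -q -m "Add license ID list"
printf '%s\n' Abstyles >> "$work/list/licenses.txt"
git -C "$work/list" $demo commit -q -a -m "Add one more ID"
git -C "$work/list" branch -M simple-list

# Project that embeds the list as a submodule on that branch.
git init -q "$work/project"
cd "$work/project"
git $demo submodule --quiet add -b simple-list "file://$work/list" spdx-ids
# Record in .gitmodules that clones of this submodule should be shallow.
git config -f .gitmodules submodule.spdx-ids.shallow true
git add .gitmodules
git $demo commit -q -m "Add SPDX ID list as a shallow submodule"

# What a collaborator would run: one recursive clone, no extra steps.
git $demo clone -q --recurse-submodules "file://$work/project" "$work/fresh"
head -n 1 "$work/fresh/spdx-ids/licenses.txt"
# Number of commits fetched for the submodule in the fresh clone.
git -C "$work/fresh/spdx-ids" rev-list --count HEAD
```

The hint is stored in the superproject, so it travels with the repository rather than relying on each contributor's local configuration.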

@mxmehl

mxmehl commented Sep 24, 2021

I was told that one should "never change a running system", and the current workflow of the REUSE helper tool works well. It's a very lightweight procedure for us devs to update the licenses/exceptions:

https://github.com/fsfe/reuse-tool/blob/b7b9c7cbf5d3f69c5e98031a57e0ce08c07dc592/Makefile#L124-L129

So unless there are really good reasons for the tool to switch this behaviour, I'd prefer to not touch it.

@goneall
Member

goneall commented Sep 24, 2021

I was told that one should "never change a running system", and the current workflow of the REUSE helper tool works well.

Good point. Many of the data formats and file naming conventions used in the license-list-data repo were requests made by people writing tools that use the license data. Any change in these formats is likely to break at least one tool.

Even though I agree that the naming isn't consistent and there are probably better ways to organize the data, I would suggest we leave it as is unless there is a compelling reason to change it.

@goneall
Member

goneall commented Sep 24, 2021

@hoijui I would suggest you only use released versions of the license-list-data. They are all tagged, so you can just find the most recent release tag to use.

If you are using the master branch, you will have the very latest processed version of the license-list-XML repo which is subject to change. For example, we recently decided to change the SPDX ID of a submitted license. If you happened to use master between these changes, you would end up with a renamed SPDX ID which could break some tooling.

Release versions are much more stable - we take great care not to make changes which could break tooling after a release. This includes deprecating rather than deleting or renaming a license ID.
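
Finding the most recent release tag could look roughly like this. The sketch uses a throwaway local repository with made-up tags so that it runs offline; the actual tag names in license-list-data may differ, and the `v*` pattern is an assumption about the naming scheme.

```shell
set -eu
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "placeholder"
# Illustrative tags only.
for t in v3.3 v3.9 v3.10; do git -C "$repo" tag "$t"; done

# Newest "v"-prefixed tag by version order (so v3.10 sorts after v3.9).
latest=$(git -C "$repo" tag --list 'v*' --sort=-version:refname | head -n 1)
echo "$latest"
```

Sorting by `version:refname` rather than alphabetically avoids picking `v3.9` over `v3.10`.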

@hoijui
Author

hoijui commented Sep 24, 2021

Thanks for that info... I did not think about that yet. :-)

I think the way that makes the most sense is to initially generate a commit for .. maybe the last few release versions, and then one for each commit (that affects the files we end up generating for "my" minimal data repo), plus one for each coming release. We could then have a develop branch, tracking upstream master, and a master branch, as the repo's main branch, pointing to the latest release. And of course, mirror all the release tags.

@seabass-labrax , did you create your simple-list branch manually, or do you have scripts/a CI job for it already?

@seabass-labrax
Contributor

@seabass-labrax , did you create your simple-list branch manually, or do you have scripts/a CI job for it already?

I did it manually; although it is certainly possible to do it with a CI job. The only thing that needs to be done manually is creating the branch - it starts with an orphaned commit to keep the download size low.
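
For reference, creating such an orphaned branch can be sketched like this. It runs in a throwaway local repository; the file names, commit messages, and demo identity are placeholders rather than the exact steps used for the `simple-list` branch.

```shell
set -eu
repo=$(mktemp -d)
demo="-c user.name=demo -c user.email=demo@example.invalid"
git init -q "$repo"
cd "$repo"

# Stand-in for the existing, larger history.
echo "big data" > data.json
git add data.json
git $demo commit -q -m "Existing history"

# Start a branch with no parents, then clear the inherited index and files.
git checkout -q --orphan simple-list
git rm -rfq .
printf '%s\n' 0BSD AAL Abstyles > licenses.txt
git add licenses.txt
git $demo commit -q -m "Plain list of license IDs"

# The new branch carries exactly one commit, independent of the old history.
git rev-list --count simple-list
```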

In any case, would having pre-release license identifiers be useful? Having one commit per release would save some effort and complexity by not having to set multiple tags when releasing.

The most important question is probably whether to have a separate repository or a new branch within this repository. Both options have their relative benefits!

I'm keen to hear your views, @hoijui, @goneall :)

@hoijui
Author

hoijui commented Oct 21, 2021

sorry for the late answer @seabass-labrax.
I would go for a separate repo, again, because git and all its workflows and sub-module stuff .. are just not made, to deal with parts of a repo. That is the reason from my side, and for the SPDX guys here, it seems to me, they would also not want it here, as it could complicate things for them and introduce problems .. I don't see how, but I see little benefit in this being in the same repo anyway. modularization/UNIX philosophy would suggest a different repo, and it might even make it more easily visible/findable to others.

@hoijui
Author

hoijui commented Oct 21, 2021

working on it

@hoijui
Author

hoijui commented Oct 21, 2021

@goneall I just noticed, when listing all tags of this repo, that all the tags (there are only version tags) have a 'v' in front, except one, which is just 3.3, though there is also a v3.3. Maybe you want to remove the 3.3 one? (it is 3 commits ahead of v3.3).

@goneall
Member

goneall commented Oct 22, 2021

@hoijui Thanks for pointing this out.

@jlovejoy @swinslow It looks like the 3 commits are updates to the release notes. Any concern with me removing the 3.3 tag? We could move the v3.3 tag up to where 3.3 is currently, but I'm always cautious about rewriting history and changing tag locations, as it may cause issues with people's local copies.

For reference, here's a log of the commits between 3.3 and v3.3:

commit d0005747d99a593d6cbaf87d68644c0f76c4a5f3 (tag: 3.3)
Author: Jilayne Lovejoy <[email protected]>
Date:   Thu Nov 1 10:08:53 2018 -0600

    add link to comparison

    add link to comparison from last version to this version

commit 568d57c971c134b630a7cbce18ddc93db4529b86
Author: Jilayne Lovejoy <[email protected]>
Date:   Thu Nov 1 10:07:43 2018 -0600

    updated release notes

    regarding adding CC0-1.0 license to some files

commit 30e8dbf6efc23707b09030956080a2eef7c5c2b9
Author: Jilayne Lovejoy <[email protected]>
Date:   Wed Oct 31 16:50:35 2018 -0600

    create RELEASE-NOTES

    Creating release notes file

commit 63b75027f8618b6ce76b8b0ae911e38022b1f421 (tag: v3.3)
Author: Gary O'Neall <[email protected]>
Date:   Wed Oct 24 21:28:17 2018 -0700

@hoijui
Author

hoijui commented Oct 22, 2021

Another issue I came across, this time a bit bigger: #1345

I have the bulk of the code ready now, but I would like to use the same license as the original repo, so I might wait till it has one, or we discuss here what my repo should use (maybe just CC0-1.0?).

@goneall
Member

goneall commented Oct 22, 2021

I have the bulk of the code ready now, but I would like to use the same license as the original repo, so I might wait till it has one, or we discuss here what my repo should use (maybe just CC0-1.0?).

@hoijui This has been discussed by the legal team. There is debate on whether a top level license would apply to a list of license text - see #683 for one of the discussions.

@hoijui
Author

hoijui commented Oct 23, 2021

Thank you!
I will use CC0-1.0 for my repo then, as this is the license for the content, if I read that issue correctly.
I will make the repo REUSE compliant -> the license for each file is clear.

@zvr
Member

zvr commented Oct 23, 2021

@hoijui Please note that the CC0 is only for the files generated for this repository (e.g. Makefile, JSON data, etc.). It does not apply to the actual data, the license files themselves.

@hoijui
Author

hoijui commented Oct 23, 2021

@hoijui Please note that the CC0 is only for the files generated for this repository (e.g. Makefile, JSON data, etc.). It does not apply to the actual data, the license files themselves.

This comment in #683 says that the data is CC0, and I see no objection to it after that.
.. what is the correct license then for it?

@seabass-labrax
Contributor

We can't claim any copyright over the license texts themselves, as we didn't write them. However, we could maybe claim copyright over the metadata we create in this repository and which goes into the license-list-data repository. Therefore we disclaim any copyright we might hold with CC0. If the minimal repository you're creating just has a list of our identifiers, then you're not including the license texts themselves, so CC0 is fine :)

@hoijui
Author

hoijui commented Oct 23, 2021

thank you @seabass-labrax !! :-)
That clears everything up fine!
And yes, I am only using the identifiers from you, like in your repo. :-)

@hoijui
Author

hoijui commented Oct 24, 2021

It's done (let's say: beta):

The generated repo: https://github.com/hoijui/SPDX-identifiers
The repo that does the generating: https://github.com/hoijui/SPDX-identifiers-generator

It runs once each Wednesday morning, and if the original repo (this one) had a new release, creates a new release for the generated (small) repo, and pushes it.

NOTE: the generator repo shows as REUSE non-compliant because of a scanning bug in the REUSE tool: fsfe/reuse-tool#429

@seabass-labrax
Contributor

@hoijui, nice work! 😀 I'll make sure to mention your repository if there's a similar request in the future. Should this issue be closed, or are there further things to bring up on this subject?

@hoijui
Author

hoijui commented Dec 9, 2021

:-)
Can be closed.. thanks for everything @seabass-labrax !
.. I ended up not using it myself, cause I found a library that already does what I wanted (and more),
and has the data integrated. ;-)

@hoijui hoijui closed this as completed Dec 9, 2021
@jlovejoy
Member

jlovejoy commented Dec 9, 2021

what library was that (so we know as well?)

@hoijui
Author

hoijui commented Dec 9, 2021

@seabass-labrax
Contributor

ooh, nice link @hoijui! Thanks :)


8 participants