Generate a minimal repo #1335

hoijui · 2021-09-23T06:12:31Z

... meaning: A repo that contains just a text file with the license IDs (one per line) - and maybe one like that with the exceptions.

The reason to have such a repo is that it can easily be used as a sub-module in other projects that need to perform a basic validity check on license identifiers. For example, I am writing a small command-line tool that checks whether certain meta-data/build files use a correct SPDX license identifier. This tool also performs many other checks, and needs data to verify them, all of which comes from other places. The way it works: at compile-time, a local text file is read and converted into an internal, constant array of strings, which is then used for checking at run-time. If the file is available in a git sub-module, one gets reproducible builds, and one neither has to hack something together to download the file at compile-time, nor run extra scripts before compiling to download it.
This repo, or the one already generated from it, can be used for that, but both are kind of overkill for the purpose, and if I did a similar thing for all the data I need for all the checks, a full recursive clone of my project's repo would become big, and might look confusing to the human eye.

If you do not consider this a meaningful/worthwhile thing to have, I would try to generate it myself, using a scheduled github action (running once a day or once a week or so).
If you do consider it worthwhile, I could try to make a pull request too, if you prefer that.
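
As a rough illustration of the check described in this post (read a file with one identifier per line, keep the result as a constant set, test membership at run-time), here is a minimal Python sketch. The file name `licenses.txt`, the function names, and the exact, case-sensitive comparison are assumptions made for the example, not something the tool in question is known to do.

```python
import tempfile
from pathlib import Path


def load_identifiers(path):
    """Read one identifier per line and return them as an immutable set.

    Blank lines and surrounding whitespace are ignored.
    """
    text = Path(path).read_text(encoding="utf-8")
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


def is_known_license(identifier, known):
    """Exact-match membership test (a simplification chosen for this sketch)."""
    return identifier in known


# Demo with a throw-away file; in a real project the file would come
# from the sub-module checkout instead (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "licenses.txt"
    list_file.write_text("0BSD\nAAL\nAbstyles\n", encoding="utf-8")
    KNOWN = load_identifiers(list_file)

print(is_known_license("0BSD", KNOWN))
print(is_known_license("Not-A-License", KNOWN))
```

Loading the list once into a `frozenset` mirrors the "constant array" idea: the data is fixed after start-up and each lookup is a single hash probe.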

silverhook · 2021-09-23T08:34:06Z

Something like this would be useful also for REUSE – whether within SPDX or to collaborate on (CC @mxmehl)

seabass-labrax · 2021-09-23T14:23:01Z

I really like this idea, @hoijui - I'm always a fan of reproducible builds, and you've got a great point about using a Git submodule for this :) Any change that also improves REUSE is bound to be a good thing! @silverhook, are you imagining that this could replace the hard-coded licenses.json file in the REUSE Python tool?

As an alternative to a separate repository, we could also create a new branch containing only this minimal information in the license-list-data repository, to be used with a command like git submodule add -b plain-list https://github.com/spdx/license-list-data.

For some background, currently, a GitHub Actions workflow on this repository calls the Makefile, which in turn runs the LicenseListPublisher tool and deploys the output to the license-list-data repository. This happens on every push to this repository, so @swinslow manually tags our releases on both repositories.

silverhook · 2021-09-23T15:38:04Z

> @silverhook, are you imagining that this could replace the hard-coded licenses.json file in the REUSE Python tool?

I’m not the right person to answer that. I would defer this to @mxmehl and @carmenbianca. Just a thought because such a thing was discussed before.

goneall · 2021-09-23T15:45:48Z

> As an alternative to a separate repository, we could also create a new branch containing only this minimal information in the license-list-data repository

The license-list-data repository has several folders representing different data formats used by different tools.

There is currently a text folder which is similar to this request. The files are named using the exception and license IDs with a ".txt" extension.

Can this folder be used to satisfy the request? I realize it is not in its own repo, but you could reference just the files you need.

silverhook · 2021-09-23T16:37:13Z

> There is currently a text folder which is similar to this request. The files are named using the exception and license IDs with a ".txt" extension.

AFAIK, that is currently the place where the reuse tool pulls the license texts from to put in the LICENSES/ folder, yes.

Personally, I would prefer that to continue to be the case, but once or twice it was brought up (esp. in the LGPL-3.0 + GPL-3.0 text discussion, which let’s not discuss in this issue) whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?).

goneall · 2021-09-23T16:50:33Z

> once or twice it was brought up (esp. in the LGPL-3.0 + GPL-3.0 text discussion, which let’s not discuss in this issue)

Completely agree on not discussing - we've discussed that one enough - but I can understand why the issue was raised

> whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?)

That's a good question for the legal team. I can provide information on where the text originates, but I'll leave the purpose and use to the larger legal community.

The text found in the above-mentioned folder is an exact copy of the license test text associated with the license ID, or a text-rendered version of the XML license text if no license test text is found. I'm not aware of any licenses where the license test text is missing, so almost all of the text is copied from there.

For new licenses, the submitter provides the license test text. It is typically (but not guaranteed to be) copied from the canonical license text, which has the advantage of retaining all the original formatting that is lost in the XML rendering of the text. When we converted over to the XML version of the license text, the license test text was copied over from the text files used to generate the SPDX license list website. For many of these text files, we changed the formatting so it would render nicely on the web page, but only whitespace characters were changed.

silverhook · 2021-09-23T17:08:24Z

> whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?)
>
> That's a good question for the legal team. I can provide information on where the text originates, but I'll leave the purpose and use to the larger legal community.

Unfortunately I have had a hard time joining the SPDX calls in the past few years, since they clash with some internal (MMO-)calls at work :/

I think between the XML templates in license-list-XML and the license-list-data we should be able to cover both the use case for matching for license scanners and the use case as a repository of (quasi-)canonical license texts reuse.

But yes, ultimately that is a question that needs to be addressed, and the SPDX Legal team is probably the best place to start.

hoijui · 2021-09-23T17:25:17Z

The reason I do not want to use the https://github.com/spdx/license-list-data repo is this:

git clone git@github.com:spdx/license-list-data.git
du -sh license-list-data/.git
234M	total

Though the source repo is already much better in this regard:

git clone git@github.com:spdx/license-list-XML.git
du -sh license-list-XML/.git
6.3M	total

The repo I am proposing would probably be in the ballpark of (wild guess) 1KB - 10KB.

If my tiny CLI project depends on a 234MB sub-module just for a list of a few hundred short strings, that would feel wrong to me. I might be able to live with the 6MB, even though it is still ~10 times the size of my project, and the license list is just one of such submodule dependencies.

@silverhook With another branch in the same repo, would it not still download the whole history when recursively checking out the repo?

If you think the use-case for this is too small, I am totally fine with that. I only know of mine and one other use-case, personally.

silverhook · 2021-09-23T17:43:24Z

@hoijui Why don’t you just shallow clone? https://stackoverflow.com/a/1210012

Again, regarding how the reuse tool itself works and why, I’m not the right person to ask – @carmenbianca and @mxmehl would be the better people for that.

hoijui · 2021-09-23T18:02:07Z

That changes the size to 25MB and 2.5MB respectively. 2.5MB is better than 6MB.
It is still 4 times as big as my project, and I don't know whether it could be applied automatically for everyone using the submodule, without having to tell people to manually shallow clone.

zvr · 2021-09-23T19:11:03Z

Of course, to actually check a license identifier, one should also specify which version of the License List should be consulted, so you want to also have the information when each identifier was introduced, etc. etc.
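
To make the point about list versions concrete, a version-aware check could look roughly like the sketch below. The identifiers and the versions in `INTRODUCED_IN` are purely hypothetical placeholders, and the function names are invented for illustration; a real table would have to be derived from the License List release history.

```python
def parse_version(text):
    """Turn a dotted version such as "3.10" into a comparable tuple (3, 10)."""
    return tuple(int(part) for part in text.split("."))


# HYPOTHETICAL data: identifier -> License List version that first contained it.
INTRODUCED_IN = {
    "Example-Old": "1.0",
    "Example-New": "3.10",
}


def is_valid_for(identifier, list_version, introduced_in=INTRODUCED_IN):
    """Return True if `identifier` already existed in `list_version`."""
    first_seen = introduced_in.get(identifier)
    if first_seen is None:
        return False  # unknown in every version we have data for
    return parse_version(first_seen) <= parse_version(list_version)


print(is_valid_for("Example-New", "3.9"))
print(is_valid_for("Example-New", "3.14"))
```

Versions are compared as integer tuples rather than as strings, because a plain string comparison would sort "3.10" before "3.9".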

seabass-labrax · 2021-09-23T19:39:48Z

@hoijui, could you try running git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data?

mlinksva · 2021-09-23T21:14:46Z

> I think between the XML templates in license-list-XML and the license-list-data we should be able to cover both the use case for matching for license scanners and the use case as a repository of (quasi-)canonical license texts reuse.

This might be a tangential observation, but FWIW the files in license-list-XML and license-list-data (the text directory anyway) aren't 1:1 which is a minor annoyance (to get to a common set of files with the same names, munging is required), including at least:

* XML has exceptions in a subdirectory, text doesn't, and not all exceptions have exception in the filename
* text has a bunch of deprecated_ prefix filenames, some but not all of which map to + files in XML
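
The kind of munging mentioned here could be sketched as below. It only covers the two differences named in the list (a subdirectory on the XML side, a `deprecated_` prefix on the text side); the file names, the `.xml`/`.txt` suffixes as handled here, and the function name are assumptions for illustration, and the mapping of deprecated files to `+` files is deliberately left out because the comment says it is not regular.

```python
from pathlib import PurePosixPath

# Suffixes assumed for the two repositories in this sketch.
KNOWN_SUFFIXES = (".txt", ".xml")
DEPRECATED_PREFIX = "deprecated_"


def common_name(path):
    """Reduce a file path from either repository to a bare, comparable name."""
    name = PurePosixPath(path).name  # drops any directory such as "exceptions/"
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            # Strip only a known suffix, since identifiers themselves contain dots.
            name = name[: -len(suffix)]
            break
    if name.startswith(DEPRECATED_PREFIX):
        name = name[len(DEPRECATED_PREFIX):]
    return name


# Hypothetical file names, chosen only to exercise both rules.
print(common_name("exceptions/Example-exception.xml"))
print(common_name("deprecated_Example-1.0.txt"))
```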

silverhook · 2021-09-24T05:27:48Z

> * XML has exceptions in a subdirectory, text doesn't, and not all exceptions have exception in the filename
>
> * text has a bunch of deprecated_ prefix filenames, some but not all of which map to + files in XML

Those are really good points. If all files (if at all possible) had the SPDX ID as their basename, that would make things easier.

hoijui · 2021-09-24T06:44:09Z

> git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data

$ git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data
$ du -sh license-list-data/.git
160K	total
$ ls license-list-data/
exceptions.txt  licenses.txt
$ cat license-list-data/licenses.txt | head -n 5
0BSD
AAL
Abstyles
Adobe-2006
Adobe-Glyph

That is how I imagined it, thank you @seabass-labrax ! :-)

So it does work as a separate branch in the same repo when cloning it manually. Still, I would not know how to use that in practice as a sub-module, without manual extra steps for everyone working on the project.

mxmehl · 2021-09-24T10:32:28Z

I was told that one should "never change a running system", and the current workflow of the REUSE helper tool works well. It's a very lightweight procedure for us devs to update the licenses/exceptions:

https://github.com/fsfe/reuse-tool/blob/b7b9c7cbf5d3f69c5e98031a57e0ce08c07dc592/Makefile#L124-L129

So unless there are really good reasons for the tool to switch this behaviour, I'd prefer to not touch it.

goneall · 2021-09-24T19:10:19Z

> I was told that one should "never change a running system", and the current workflow of the REUSE helper tool works well.

Good point. Many of the data formats and file naming conventions used in the license-list-data repo were requests made by people writing tools that use the license data. Any change in these formats is likely to break at least one tool.

Even though I agree that the naming isn't consistent and there are probably better ways to organize the data, I would suggest we leave it as is unless there is a compelling reason to change it.

goneall · 2021-09-24T19:16:19Z

@hoijui I would suggest you only use released versions of the license-list-data. They are all tagged, so you can just find the most recent release tag to use.

If you are using the master branch, you will have the very latest processed version of the license-list-XML repo which is subject to change. For example, we recently decided to change the SPDX ID of a submitted license. If you happened to use master between these changes, you would end up with a renamed SPDX ID which could break some tooling.

Release versions are much more stable - we take great care not to make changes which could break tooling after a release. This includes deprecating rather than deleting or renaming a license ID.

hoijui · 2021-09-24T20:40:36Z

Thanks for that info... I did not think about that yet. :-)

I think the way that makes most sense is to initially generate a commit for maybe the last few release versions, and from then on one for each upstream commit (that affects the files we end up generating for "my" minimal data repo), plus one for each coming release. We could then have a develop branch, tracking upstream master, and a master branch, as the repo's main branch, pointing to the latest release. And of course, mirror all the release tags.

@seabass-labrax , did you create your simple-list branch manually, or do you have scripts/a CI job for it already?

seabass-labrax · 2021-10-05T14:05:36Z

> @seabass-labrax , did you create your simple-list branch manually, or do you have scripts/a CI job for it already?

I did it manually; although it is certainly possible to do it with a CI job. The only thing that needs to be done manually is creating the branch - it starts with an orphaned commit to keep the download size low.

In any case, would having pre-release license identifiers be useful? Having one commit per release would save some effort and complexity by not having to set multiple tags when releasing.

The most important question is probably whether to have a separate repository or a new branch within this repository. Both options have their relative benefits!

I'm keen to hear your views, @hoijui, @goneall :)

hoijui · 2021-10-21T21:16:11Z

Sorry for the late answer, @seabass-labrax.
I would go for a separate repo, again because git and all its workflows and sub-module stuff are just not made to deal with parts of a repo. That is the reason from my side. As for the SPDX guys here, it seems to me they would also not want it here, as it could complicate things for them and introduce problems. I don't see how, but I see little benefit in this being in the same repo anyway. Modularization and the UNIX philosophy would suggest a different repo, and it might even make it more easily visible/findable to others.

hoijui · 2021-10-21T21:55:11Z

working on it

hoijui · 2021-10-21T22:12:18Z

@goneall I just noticed, when listing all tags of this repo, that all the tags (there are only version tags) have a 'v' in front, except one, which is just 3.3, though there is also a v3.3. Maybe you want to remove the 3.3 one? (it is 3 commits ahead of v3.3).
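
A check like the one described here is easy to script. The sketch below takes a list of tag names (for instance the output lines of `git tag --list`) and reports tags without the `v` prefix, and which of those also exist in prefixed form. Only `3.3` and `v3.3` come from this thread; the other tag names and the function names are made up for the example.

```python
def unprefixed_tags(tags, prefix="v"):
    """Return the tags that do not start with the expected prefix."""
    return [tag for tag in tags if not tag.startswith(prefix)]


def duplicated_unprefixed(tags, prefix="v"):
    """Return unprefixed tags for which a prefixed twin also exists."""
    present = set(tags)
    return [tag for tag in unprefixed_tags(tags, prefix) if prefix + tag in present]


# Illustrative input; "v3.2" and "v3.4" are assumed names, not checked against the repo.
tags = ["v3.2", "3.3", "v3.3", "v3.4"]

print(unprefixed_tags(tags))
print(duplicated_unprefixed(tags))
```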

goneall · 2021-10-22T00:05:40Z

@hoijui Thanks for pointing this out.

@jlovejoy @swinslow It looks like the 3 commits are updates to the release notes. Any concern with me removing the 3.3 tag? We could move the v3.3 tag up to where 3.3 is currently, but I'm always cautious about rewriting history and changing tag locations, as it may cause issues with people's local copies.

For reference, here's a log of the commits between 3.3 and v3.3:

commit d0005747d99a593d6cbaf87d68644c0f76c4a5f3 (tag: 3.3)
Author: Jilayne Lovejoy <[email protected]>
Date:   Thu Nov 1 10:08:53 2018 -0600

    add link to comparison

    add link to comparison from last version to this version

commit 568d57c971c134b630a7cbce18ddc93db4529b86
Author: Jilayne Lovejoy <[email protected]>
Date:   Thu Nov 1 10:07:43 2018 -0600

    updated release notes

    regarding adding CC0-1.0 license to some files

commit 30e8dbf6efc23707b09030956080a2eef7c5c2b9
Author: Jilayne Lovejoy <[email protected]>
Date:   Wed Oct 31 16:50:35 2018 -0600

    create RELEASE-NOTES

    Creating release notes file

commit 63b75027f8618b6ce76b8b0ae911e38022b1f421 (tag: v3.3)
Author: Gary O'Neall <[email protected]>
Date:   Wed Oct 24 21:28:17 2018 -0700

hoijui · 2021-10-22T08:32:45Z

Another issue I came across, this time a bit bigger: #1345

I have the bulk of the code ready now, but I would like to use the same license as the original repo, so I might wait till it has one, or we discuss here what my repo should use (maybe just CC0-1.0?).

goneall · 2021-10-22T21:37:13Z

> I have the bulk of the code ready now, but I would like to use the same license as the original repo, so I might wait till it has one, or we discuss here what my repo should use (maybe just CC0-1.0?).

@hoijui This has been discussed by the legal team. There is debate on whether a top level license would apply to a list of license text - see #683 for one of the discussions.

hoijui · 2021-10-23T05:47:52Z

Thank you!
I will use CC0-1.0 for my repo then, as this is the license for the content, if I read that issue correctly.
I will make the repo REUSE compliant -> the license for each file is clear.

zvr · 2021-10-23T08:48:56Z

@hoijui Please note that the CC0 is only for the files generated for this repository (e.g. Makefile, JSON data, etc.). It does not apply to the actual data, the license files themselves.

hoijui · 2021-10-23T09:29:09Z

> @hoijui Please note that the CC0 is only for the files generated for this repository (e.g. Makefile, JSON data, etc.). It does not apply to the actual data, the license files themselves.

This comment in #683 says that the data is CC0, and I see no objection to it after that.
.. what is the correct license for it, then?

seabass-labrax · 2021-10-23T12:59:15Z

We can't claim any copyright over the license texts themselves, as we didn't write them. However, we could maybe claim copyright over the metadata we create in this repository and which goes into the license-list-data repository. Therefore we disclaim any copyright we might hold with CC0. If the minimal repository you're creating just has a list of our identifiers, then you're not including the license texts themselves, so CC0 is fine :)

hoijui · 2021-10-23T19:23:05Z

thank you @seabass-labrax !! :-)
That clears everything up fine!
And yes, I am only using the identifiers from you, like in your repo. :-)

hoijui · 2021-10-24T10:19:47Z

It's done (let's say: beta):

The generated repo: https://github.com/hoijui/SPDX-identifiers
The repo that does the generating: https://github.com/hoijui/SPDX-identifiers-generator

It runs once each Wednesday morning, and if the original repo (this one) had a new release, creates a new release for the generated (small) repo, and pushes it.

NOTE: the generator repo shows as REUSE non-compliant because of a scanning bug in the REUSE tool: fsfe/reuse-tool#429
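
For readers wondering how such a one-identifier-per-line file might be derived, here is a hedged sketch that extracts the IDs from a licenses.json-style document. It is not taken from the generator repo and may differ from what it actually does. The field names (`licenses`, `licenseId`, `isDeprecatedLicenseId`) reflect my understanding of the JSON published in license-list-data and should be verified against the real file; the sample document, the `Example-Old-1.0` entry, and the function names are invented for the example.

```python
import json

# Tiny stand-in for a licenses.json document (contents are illustrative only).
SAMPLE = """
{
  "licenseListVersion": "0.0-example",
  "licenses": [
    {"licenseId": "AAL", "isDeprecatedLicenseId": false},
    {"licenseId": "Example-Old-1.0", "isDeprecatedLicenseId": true},
    {"licenseId": "0BSD", "isDeprecatedLicenseId": false}
  ]
}
"""


def identifiers_from_json(document, include_deprecated=True):
    """Return the license IDs in the document, sorted case-insensitively."""
    data = json.loads(document)
    ids = [
        entry["licenseId"]
        for entry in data["licenses"]
        if include_deprecated or not entry.get("isDeprecatedLicenseId", False)
    ]
    return sorted(ids, key=str.lower)


def as_text_file(ids):
    """Render the IDs one per line, each line newline-terminated."""
    return "".join(f"{identifier}\n" for identifier in ids)


print(as_text_file(identifiers_from_json(SAMPLE, include_deprecated=False)), end="")
```

Whether deprecated identifiers belong in the list depends on the consumer, so the sketch leaves that as a parameter instead of deciding it.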

seabass-labrax · 2021-12-09T16:11:30Z

@hoijui, nice work! 😀 I'll make sure to mention your repository if there's a similar request in the future. Should this issue be closed, or are there further things to bring up on this subject?

hoijui · 2021-12-09T16:56:55Z

:-)
Can be closed.. thanks for everything @seabass-labrax !
.. I ended up not using it myself, cause I found a library that already does what I wanted (and more),
and has the data integrated. ;-)

jlovejoy · 2021-12-09T16:58:20Z

what library was that (so we know as well?)

hoijui · 2021-12-09T17:00:53Z

https://github.com/EmbarkStudios/spdx

seabass-labrax · 2021-12-09T17:24:00Z

ooh, nice link @hoijui! Thanks :)

jlovejoy added the technical issue label Oct 14, 2021

hoijui closed this as completed Dec 9, 2021
