Generate a minimal repo #1335
Comments
I really like this idea, @hoijui - I'm always a fan of reproducible builds, and you've got a great point about using a Git submodule for this :) Any change that also improves REUSE is bound to be a good thing! @silverhook, are you imagining that this could replace the hard-coded As an alternative to a separate repository, we could also create a new branch containing only this minimal information in the license-list-data repository, to be used with a command like For some background, currently, a GitHub Actions workflow on this repository calls the Makefile, which in turn runs the LicenseListPublisher tool and deploys the output to the license-list-data repository. This happens on every push to this repository, so @swinslow manually tags our releases on both repositories. |
I’m not the right person to answer that. I would defer this to @mxmehl and @carmenbianca. Just a thought because such a thing was discussed before. |
The license-list-data repository has several folders representing different data formats used by different tools. There is currently a text folder which is similar to this request. The files are named using the exception and license IDs with a ".txt" extension. Can this folder be used to satisfy the request? I realize it is not in its own repo, but you could reference just the files you need. |
AFAIK, that is currently the place where the Personally, I would prefer that continuing to be the case, but once or twice it was brought up (esp. in the LGPL-3.0 + GPL-3.0 text discussion, which let’s not discuss in this issue) whether that folder is intended to be re-used for simple matching or as actual license texts to save into your project, or both (or none?). |
Completely agree on not discussing - we've discussed that one enough - but I can understand why the issue was raised
That's a good question for the legal team. I can provide information on where the text originates, but I'll leave the purpose and use to the larger legal community. The text found in the above-mentioned folder is an exact copy of the license test text associated with the license ID, or a text-rendered version of the license XML text if no license test text is found. I'm not aware of any licenses where the license test text is missing, so almost all of the text is copied from there. For new licenses, the submitter provides the license test text. It is typically (but not guaranteed to be) copied from the canonical license text, which has the advantage of retaining all the original formatting that is lost in the XML rendering of the text. When we converted over to the XML version of the license text, the license test text was copied over from the text files used to generate the SPDX License List website. For many of these text files, we changed the formatting so it would render nicely on the web page, but only whitespace characters were changed. |
Unfortunately, I have had a hard time joining SPDX calls in the past years, since they clash with some internal (MMO-)calls at work :/ I think between the XML templates in But yes, ultimately that is a question that needs to be addressed, and the SPDX Legal team is probably the best start. |
The reason I do not want to use the https://github.com/spdx/license-list-data repo is this:
$ git clone git@github.com:spdx/license-list-data.git
$ du -sh license-list-data/.git
234M total
Though the source repo is already much better in this regard:
$ git clone git@github.com:spdx/license-list-XML.git
$ du -sh license-list-XML/.git
6.3M total
The repo I am proposing would probably be in the ballpark of (wild guess) 1KB - 10KB. If my tiny CLI project depends on a 234MB sub-module just for a list of a few hundred short strings, that would feel wrong to me. I might be able to live with the 6MB, even though it is still ~10 times the size of my project, and the license list is just one of several such submodule dependencies. @silverhook With another branch in the same repo, would it not still download the whole history when recursively checking out the repo? If you think the use-case for this is too small, I am totally fine with that. I only know of mine and one other use-case, personally. |
@hoijui Why don’t you just shallow clone? https://stackoverflow.com/a/1210012 Again, regarding how the |
That changes the size to 25MB and 2.5MB respectively. 2.5MB is better than 6MB. |
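To illustrate what the shallow clone suggested above actually changes, here is a minimal sketch that runs without network access. The throwaway `upstream` repository and its three commits are assumptions made purely for the demonstration; they stand in for a large repository such as license-list-data.

```shell
set -eu

# Hypothetical stand-in for a large upstream repository: a throwaway local
# repo with three commits, so the example runs without network access.
work=$(mktemp -d)
git init -q "$work/upstream"
for n in 1 2 3; do
    echo "revision $n" > "$work/upstream/licenses.txt"
    git -C "$work/upstream" add licenses.txt
    git -C "$work/upstream" -c user.name=Example -c user.email=example@example.org \
        commit -q -m "Revision $n"
done

# A full clone carries every commit; --depth=1 fetches only the newest one.
# The file:// URL matters here: for plain local paths git ignores --depth.
git clone -q "file://$work/upstream" "$work/full"
git clone -q --depth=1 "file://$work/upstream" "$work/shallow"

full_commits=$(git -C "$work/full" rev-list --count HEAD)
shallow_commits=$(git -C "$work/shallow" rev-list --count HEAD)
echo "full: $full_commits commits, shallow: $shallow_commits commit"
```

The saving grows with the length of the history, which would explain why the effect is much larger for license-list-data than for license-list-XML.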
Of course, to actually check a license identifier, one should also specify which version of the License List should be consulted, so you also want to have the information on when each identifier was introduced, etc. |
@hoijui, could you try running |
This might be a tangential observation, but FWIW the files in license-list-XML and license-list-data (the text directory anyway) aren't 1:1 which is a minor annoyance (to get to a common set of files with the same names, munging is required), including at least:
|
Those are really good points. If all files (if at all possible) had the SPDX ID as their basename, that would make things easier. |
$ git clone --depth=1 -b simple-list https://github.com/seabass-labrax/license-list-data
$ du -sh license-list-data/.git
160K total
$ ls license-list-data/
exceptions.txt licenses.txt
$ cat license-list-data/licenses.txt | head -n 5
0BSD
AAL
Abstyles
Adobe-2006
Adobe-Glyph
That is how I imagined it, thank you @seabass-labrax ! :-) So it does work as a separate branch in the same repo when cloning it manually. Still, I would not know how to use that in practice as a sub-module, without manual extra steps for everyone working on the project. |
I was told that one should "never change a running system", and the current workflow of the REUSE helper tool works well. It's a very lightweight procedure for us devs to update the licenses/exceptions: https://github.com/fsfe/reuse-tool/blob/b7b9c7cbf5d3f69c5e98031a57e0ce08c07dc592/Makefile#L124-L129 So unless there are really good reasons for the tool to switch this behaviour, I'd prefer to not touch it. |
Good point. Many of the data formats and file naming conventions used in the license-list-data repo were requests made by people writing tools that use the license data. Any change in these formats is likely to break at least one tool. Even though I agree that the naming isn't consistent and there are probably better ways to organize the data, I would suggest we leave it as is unless there is a compelling reason to change it. |
@hoijui I would suggest you only use released versions of the license-list-data. They are all tagged, so you can just find the most recent release tag to use. If you are using the master branch, you will have the very latest processed version of the license-list-XML repo which is subject to change. For example, we recently decided to change the SPDX ID of a submitted license. If you happened to use master between these changes, you would end up with a renamed SPDX ID which could break some tooling. Release versions are much more stable - we take great care not to make changes which could break tooling after a release. This includes deprecating rather than deleting or renaming a license ID. |
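Finding the most recent release tag can be scripted. The following is a sketch only: the local repository and the tag names `v3.9`, `v3.10` and `v3.2` are made up for illustration and say nothing about the real release history.

```shell
set -eu

# Throwaway local repository standing in for a clone of license-list-data.
# The tag names below are illustrative, not a statement about real releases.
repo=$(mktemp -d)
git init -q "$repo"
for tag in v3.9 v3.10 v3.2; do
    echo "$tag" > "$repo/version.txt"
    git -C "$repo" add version.txt
    git -C "$repo" -c user.name=Example -c user.email=example@example.org \
        commit -q -m "Release $tag"
    git -C "$repo" tag "$tag"
done

# Version-aware sort, newest first, so that v3.10 ranks above v3.9
# (a plain alphabetical sort would get this wrong).
latest=$(git -C "$repo" tag --list 'v*' --sort=-v:refname | head -n 1)
echo "latest release tag: $latest"

# Check out exactly that release instead of following the moving main branch.
git -C "$repo" checkout -q "$latest"
```

Sorting by version rather than by creation date keeps the result stable even if an older release line is tagged later.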
Thanks for that info... I did not think about that yet. :-) I think the way that makes most sense is to initially generate a commit for maybe the last few release versions, and then one for each commit (that affects the files we end up generating for "my" minimal data repo), plus one for each coming release. We could then have a develop branch, tracking upstream master, and a master branch, as the repo's main branch, pointing to the latest release. And of course, mirror all the release tags. @seabass-labrax , did you create your |
I did it manually; although it is certainly possible to do it with a CI job. The only thing that needs to be done manually is creating the branch - it starts with an orphaned commit to keep the download size low. In any case, would having pre-release license identifiers be useful? Having one commit per release would save some effort and complexity by not having to set multiple tags when releasing. The most important question is probably whether to have a separate repository or a new branch within this repository. Both options have their relative benefits! |
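For readers unfamiliar with orphaned commits, this is roughly what creating such a branch looks like. It is a sketch in a throwaway repository; the branch name `simple-list` is taken from the clone command earlier in the thread, while the file contents and commit messages are assumptions.

```shell
set -eu

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Helper so the example does not depend on a configured git identity.
commit() { git -c user.name=Example -c user.email=example@example.org commit -q -m "$1"; }

# Two commits of ordinary history on the default branch.
echo "full data, first revision" > data.txt
git add data.txt
commit "Add full data"
echo "full data, second revision" > data.txt
git add data.txt
commit "Update full data"

# Start a branch with no parent commit, then clear the files carried over
# from the previous branch.
git checkout -q --orphan simple-list
git rm -rf -q .

printf '0BSD\nAAL\nAbstyles\n' > licenses.txt
git add licenses.txt
commit "Add minimal identifier list"

# Only the new commit is reachable from the orphan branch.
orphan_commits=$(git rev-list --count simple-list)
echo "commits reachable from simple-list: $orphan_commits"
```

Because the branch shares no ancestry with the main history, a single-branch clone of it does not need to fetch any of the larger data.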
sorry for the late answer @seabass-labrax. |
working on it |
@goneall I just noticed, when listing all tags of this repo, that all the tags (there are only version tags) have a 'v' in front, except one, which is just |
@hoijui Thanks for pointing this out. @jlovejoy @swinslow It looks like the 3 commits are updates to the release notes. Any concern with me removing the 3.3 tag? We could move the v3.3 tag up to where 3.3 is currently, but I'm always cautious about rewriting history and changing tag locations, as it may cause issues with people's local copies. For reference, here's a log of the commits between 3.3 and v3.3:
|
Another issue I came across, this time a bit bigger: #1345 I have the bulk of the code ready now, but I would like to use the same license as the original repo, so I might wait till it has one, or we discuss here what my repo should use (maybe just |
@hoijui This has been discussed by the legal team. There is debate on whether a top level license would apply to a list of license text - see #683 for one of the discussions. |
Thank you! |
@hoijui Please note that the CC0 is only for the files generated for this repository (e.g. Makefile, JSON data, etc.). It does not apply to the actual data, the license files themselves. |
This comment in #683 says that the data is CC0, and I see no objection to it after that. |
We can't claim any copyright over the license texts themselves, as we didn't write them. However, we could maybe claim copyright over the metadata we create in this repository and which goes into the license-list-data repository. Therefore we disclaim any copyright we might hold with CC0. If the minimal repository you're creating just has a list of our identifiers, then you're not including the license texts themselves, so CC0 is fine :) |
thank you @seabass-labrax !! :-) |
It's done (let's say: beta): The generated repo: https://github.com/hoijui/SPDX-identifiers It runs once each Wednesday morning, and if the original repo (this one) has had a new release, it creates a new release for the generated (small) repo and pushes it. NOTE: the generator repo shows as REUSE non-compliant because of a scanning bug in the REUSE tool: fsfe/reuse-tool#429 |
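The "has upstream published a new release?" decision in such a scheduled job could look something like the sketch below. This is not the code of the actual generator; the helper names `latest_tag` and `make_repo`, the local stand-in repositories and their tag names are all hypothetical.

```shell
set -eu

# Print the newest v-prefixed tag of a repository without cloning it.
latest_tag() {
    git ls-remote --tags --refs --sort=-v:refname "$1" 'refs/tags/v*' \
        | head -n 1 | sed 's|.*refs/tags/||'
}

# Hypothetical helper: create a local repo with one commit per given tag.
make_repo() {  # usage: make_repo <dir> <tag>...
    dir=$1; shift
    git init -q "$dir"
    for tag in "$@"; do
        echo "$tag" > "$dir/version.txt"
        git -C "$dir" add version.txt
        git -C "$dir" -c user.name=Example -c user.email=example@example.org \
            commit -q -m "Release $tag"
        git -C "$dir" tag "$tag"
    done
}

# Local stand-ins for the upstream repo and the generated minimal repo.
work=$(mktemp -d)
make_repo "$work/upstream" v3.9 v3.10
make_repo "$work/generated" v3.9

upstream_latest=$(latest_tag "$work/upstream")
generated_latest=$(latest_tag "$work/generated")

# Only regenerate and publish when upstream is ahead of the generated repo.
if [ "$upstream_latest" != "$generated_latest" ]; then
    action="release $upstream_latest"
else
    action="nothing to do"
fi
echo "$action"
```

Using `git ls-remote` keeps the check cheap, since no history is downloaded just to learn that nothing has changed.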
@hoijui, nice work! 😀 I'll make sure to mention your repository if there's a similar request in the future. Should this issue be closed, or are there further things to bring up on this subject? |
:-) |
what library was that (so we know as well?) |
ooh, nice link @hoijui! Thanks :) |
... meaning: A repo that contains just a text file with the license IDs (one per line) - and maybe one like that with the exceptions.
The reason to have such a repo, is so it can easily be used as a sub-module in other projects, which need to perform a basic validity check on license identifiers. For example, I am writing a small command-line tool that checks certain meta-data/build files, if they use a correct SPDX license identifier. This tool also performs many other checks, and also needs data to verify them, which all comes from other places. The way it works: during compile-time, a local text-file is read, and converted into an internal, constant array of strings, which is then used for checking at run-time. If the file is available in a git sub-module, one gets reproducible builds, and one does not have to hack something together to download the file at compile-time, nor run extra scripts before compiling that would download the file.
This repo can be used for that, or the one already generated, but they are kind of overkill for that purpose, and if I did a similar thing for all the data I need for all the checks, a full recursive clone of the repo would become big, and might look confusing to the human eye.
If you do not consider this a meaningful/worthwhile thing to have, I would try to generate it myself, using a scheduled github action (running once a day or once a week or so).
If you do consider it worthwhile, I could try to make a pull request too, if you prefer that.
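As a rough illustration of the basic validity check described above, the following sketch tests an identifier against a one-per-line list. It does the lookup at run time with `grep` rather than compiling the list into a constant array as the tool in question does, and the sample file and the function names `is_known_license` and `check` are assumptions for the example.

```shell
set -eu

# Sample list in the one-identifier-per-line format; in a real project this
# file would come from the sub-module rather than being written here.
list=$(mktemp)
printf '%s\n' 0BSD AAL Abstyles Adobe-2006 Adobe-Glyph > "$list"

# Succeeds only if the identifier matches a whole line exactly.
# -F: fixed string (no regex), -x: whole-line match, -q: no output.
is_known_license() {
    grep -Fxq -- "$1" "$list"
}

check() {
    if is_known_license "$1"; then echo "$1: known"; else echo "$1: unknown"; fi
}

result_ok=$(check "0BSD")
result_bad=$(check "Adobe")
echo "$result_ok"
echo "$result_bad"
```

The whole-line match matters: without `-x`, a prefix such as `Adobe` would wrongly pass because it is contained in `Adobe-2006`.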