Commit 32a73ef

Copy archive state from 3/3/2023
Signed-off-by: Łukasz Gryglicki <[email protected]>
0 parents  commit 32a73ef

1,490 files changed, with 15,554,425 additions and 0 deletions

.gitattributes

+3
@@ -0,0 +1,3 @@
+src/github_users.json filter=lfs diff=lfs merge=lfs -text
+src/stripped.json filter=lfs diff=lfs merge=lfs -text
+src/affiliated.json filter=lfs diff=lfs merge=lfs -text

.github/dependabot.yml

+12
@@ -0,0 +1,12 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
+
+version: 2
+updates:
+  - package-ecosystem: "" # See documentation for possible values
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "weekly"
+dependabot.yml

.gitignore

+57
@@ -0,0 +1,57 @@
+*.pyc
+*~
+.venv
+*.swp
+*.swo
+src/clearbit_tools/all_clearbit_queries.csv
+src/clearbit_tools/cncf_enriched.csv
+src/clearbit_tools/input_enriched.csv
+src/clearbit_tools/new_round_enriched.csv
+src/clearbit_tools/unknown_emails_enriched.csv
+all.log
+git.log
+all.log.xz
+#all.txt
+all.csv
+database.dump
+datelc.csv
+errors.txt
+header
+x
+src/ghusers/*
+# Data files
+# *.txt
+# *.csv
+*.out
+# *.json
+*.log
+*.old
+*.dat
+*.db
+*.dump
+err
+out
+out1
+out2
+out.diff
+geodata.tsv
+geodata.tsv.xz
+partial.json
+backup.json
+affiliated.json
+stripped.json
+*.htm*
+src/check_spell
+src/mtp
+src/check_shas
+src/map_orgs
+src/get_aff_files
+src/git_logs/*.log
+src/git_logs/*.1
+src/git_logs/*.2
+src/flist.txt
+/src/git.log_*
+*.secret
+allCountries.zip
+allCountries.txt
+allCountries.tsv

ADD_PROJECT.md

+55
@@ -0,0 +1,55 @@
+# Add a non-CNCF project/org (the project must be open source) to generate affiliations for it
+1. Add the developers of your organization/project to be affiliated to `./developers_affiliations.txt` in the proper format. `cd src/`. Now generate a new email-map using `./import_affs.sh`, then: `mv email-map cncf-config/email-map`.
+For example:
+```
+developer1: email1@xyz, email2@abc, ...
+company1
+company2 until YYYY-MM-DD
+developer2: email3@xyz, email4@pqr, ...
+company3
+company4 until YYYY-MM-DD
+```
+2. Clone all repositories of the project into `~/dev/project_name/`. To clone them, you can either use the `cncf/velocity` project and write an SQL query in its BigQuery folder, or create a new shell script in `~/dev/cncf/gitdm/` named `clone_project_name.sh`.
+Copy and paste this code into that file:
+```
+#!/bin/bash
+mkdir ~/dev/project_name/ 2>/dev/null
+cd ~/dev/project_name || exit 1
+git clone github_repo_clone_url_for_your_project1 || exit 1
+git clone github_repo_clone_url_for_your_project2 || exit 1
+...
+echo "All project_name repos cloned"
+```
+Paste each repository's clone URL manually.
+Save the file and make it executable: `chmod +x ./clone_project_name.sh`.
+Then run the script: `./clone_project_name.sh`. This clones all repos into `~/dev/project_name/`.
+
+**Note**: replace `project_name` with your GitHub organization name.
+
+3. To generate the `git.log` file, use this command: `./all_repos_log.sh ~/dev/project_name/*`. Make it `uniq`.
+
+4. To run `cncf/gitdm` on the generated `git.log` file do: `~/dev/cncf/gitdm/cncfdm.py -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./src/ -t -z -d -D -U -u -o all.txt -x all.csv -a all_affs.csv > all.out`
+
+5. To generate human-readable text affiliation files: `SKIP_COMPANIES="(Unknown)" ./gen_aff_files.sh`
+
+6. If updating via `ghusers.sh` or `ghusers_cached.sh` (step 7), please update the `repos` array in `./ghusers.rb` with your org/project repo lists, then run `generate_actors.sh` too. Before that, make sure you have set up devstats, and update `./generate_actors.sh` after its first line with `sudo -u postgres psql -tA your_pg_database_name < ~/dev/go/src/devstats/util_sql/all_actors.sql > actors.txt`. Now run `./generate_actors.sh`.
+
+7. Consider `./ghusers_cached.sh` or `./ghusers.sh` (if you run this, then copy the resulting JSON somewhere and get 0-committers from the previous version to save GH API points). Sometimes you should just run `./ghusers.sh` without the cache.
+
+8. `ghusers_partially_cached.sh` will refetch repo metadata and commits and get user data from `github_users.json`, so you can save a lot of API points.
+
+9. To update (enhance) `github_users.json` with new affiliations, run `./enchance_json.sh`.
+
+10. To merge multiple GitHub logins data (for example propagate known affiliation to unknown or not found on the same GitHub login) run: `./merge_github_logins.sh`.
+11. Because this can find new affiliations, you can now use `./import_from_github_users.sh` to import back from `github_users.json` and then restart from step 3.
+
+12. Run `./correlation.sh` and examine its output `correlations.txt` to try to normalize company names and remove common suffixes like Ltd., Corp. and upper/lower case differences.
+
+13. Run `./lookup_json.sh` and examine its output JSONs - those GitHub profiles have some useful data directly available - this will save you some manual research work.
+
+14. ALWAYS run `./handle_forbidden_data.sh` before any commit to GitHub to remove any forbidden affiliations; please also see `FORBIDDEN_DATA.md`.
+
+15. You can use `./clear_affiliations_in_json.sh` to clear all affiliations on a generated `github_users.json`.
+
+16. You can create a smaller final JSON for `cncf/devstats` using `./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/devstats/github_users.json`.
+
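
To make the `developers_affiliations.txt` layout from step 1 of `ADD_PROJECT.md` concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by `import_affs.sh`; the names `Affiliation`, `Developer` and `parse_affiliations` are hypothetical, and it assumes a developer line is any line with a `:` followed by text containing `@`, while every other non-empty line is a company entry for the most recent developer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Affiliation:
    company: str
    until: Optional[str] = None  # "YYYY-MM-DD", or None when open-ended


@dataclass
class Developer:
    name: str
    emails: List[str]
    affiliations: List[Affiliation] = field(default_factory=list)


def parse_affiliations(text: str) -> List[Developer]:
    """Parse the example format; the header heuristic is an assumption."""
    developers: List[Developer] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        name, sep, rest = line.partition(":")
        if sep and "@" in rest:
            # Developer header: "name: email1, email2, ..."
            emails = [e.strip() for e in rest.split(",") if e.strip()]
            developers.append(Developer(name.strip(), emails))
        elif developers:
            # Company line: "company" or "company until YYYY-MM-DD"
            company, has_until, until = line.partition(" until ")
            developers[-1].affiliations.append(
                Affiliation(company.strip(), until.strip() if has_until else None)
            )
    return developers


SAMPLE = """\
developer1: email1@xyz, email2@abc
company1
company2 until 2020-01-01
developer2: email3@xyz
company3
"""

if __name__ == "__main__":
    for dev in parse_affiliations(SAMPLE):
        print(dev.name, dev.emails, [(a.company, a.until) for a in dev.affiliations])
```

A reader like this can help sanity-check a hand-edited file (for example, spotting a developer with no company lines) before running `./import_affs.sh`.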

FORBIDDEN_DATA.md

+12
@@ -0,0 +1,12 @@
+# How to remove affiliations data
+
+If you do not want your personal data like names and/or emails to be listed you can do the following.
+
+- Clone cncf/gitdm locally
+- `cd src/`
+- Run `./add_forbidden_data.rb 'youremail!domain.com'` or `./add_forbidden_data.rb 'YourName' '[email protected]' 'your!email.com'`.
+- The phrase to be removed should not contain the `,`, `;`, `'`, `"`, `/`, `\` characters.
+- The program will generate SHA256 hashes of the data provided as command line arguments and add them to the `cncf-config/forbidden.csv` file.
+- Create a PR with the updated `cncf-config/forbidden.csv` file. That way your sensitive data won't be visible in the PR.
+- We will run `./handle_forbidden_data.sh` on your PR, which will generate a report listing the files containing that information.
+- We will remove the requested information and merge your PR.
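
The hashing step described in `FORBIDDEN_DATA.md` can be illustrated with a short Python sketch. This is not the code of `add_forbidden_data.rb`: the function name `forbidden_hash` is hypothetical, and it assumes the phrase is hashed exactly as typed, as UTF-8 bytes, with no case or whitespace normalization. The real script may differ on those points.

```python
import hashlib

# Characters the document says a phrase must not contain.
DISALLOWED_CHARS = set(",;'\"/\\")


def forbidden_hash(phrase: str) -> str:
    """Return the SHA256 hex digest of a phrase (assumed: raw UTF-8 bytes)."""
    bad = DISALLOWED_CHARS.intersection(phrase)
    if bad:
        raise ValueError(f"phrase contains disallowed characters: {sorted(bad)}")
    return hashlib.sha256(phrase.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Only the digest would be stored, never the phrase itself.
    print(forbidden_hash("youremail!domain.com"))
```

Storing only the digest is what lets a PR against `cncf-config/forbidden.csv` stay free of the personal data it is meant to remove.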
