Research on mining Data Science repositories.
- Figshare: Extract the contents of `results.tar.gz` into the `output` directory (a sketch follows below), then jump to the Analyse results (in Jupyter) section.
- From scratch: Clone this repository, then follow the steps below to identify, clone, and analyse the repositories.
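A minimal sketch of the Figshare path, assuming `results.tar.gz` has already been downloaded from Figshare into the repository root (the archive layout may differ; adjust the target path accordingly):

```sh
# Unpack the Figshare archive into the output directory
mkdir -p output
tar -xzf results.tar.gz -C output
```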
If Docker is not present, install it first by following the official Docker install documentation.
We have four directories: `data`, `input_drive`, `input`, and `output`:

- The `data` folder holds project metadata fetched from GitHub (97 MB, committed to this Git repo for convenience)
- The `input_drive` folder is for the cloned repositories (4.6 TB in total, so we suggest using a network storage drive)
- The `symlink_input` task will create symlinks within the `input` folder to the `input_drive`
- The `output` directory holds metrics and the final analysis results (2 GB when compressed, shared on Figshare)
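One possible way to lay this out, assuming the large storage is mounted at `/mnt/storage` (the mount point and `mining-ds` path are assumptions, not part of the repository):

```sh
# Hypothetical layout: keep the bulky cloned repos on a network drive
# and point input_drive at it; data/ is already committed to this repo.
mkdir -p /mnt/storage/mining-ds/input_drive   # assumed network mount
ln -s /mnt/storage/mining-ds/input_drive input_drive
mkdir -p input output
```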
Go to https://github.com/settings/tokens/new to generate a new token with the permissions `public_repo` and `read:packages`, and update `mining_nlp_repositories/github.py` with your `ACCESS_TOKEN`.
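For example, the token could be dropped in from the command line; this is a sketch only and assumes `github.py` assigns the token on a line of the form `ACCESS_TOKEN = "..."` (if the file differs, edit it by hand instead):

```sh
# Replace the placeholder token assignment in github.py (assumed line format)
sed -i 's/^ACCESS_TOKEN = .*/ACCESS_TOKEN = "ghp_your_token_here"/' mining_nlp_repositories/github.py
```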
- To list all tasks: `surround run list`
- Build the docker image: `surround run build`
- Fetch project metadata from GitHub (requires a GitHub Access Token): `surround run fetch_data_science_projects` and `surround run fetch_non_data_science_projects`
- Clone projects from GitHub: `surround run clone_data_science_projects` and `surround run clone_non_data_science_projects`
  - Move `data/boa/cloned-repos` to `input_drive/cloned-repos/boa`
  - Move `data/non-data-science/cloned-repos` to `input_drive/cloned-repos/non-data-science`
  - Manually create `input_drive/cloned-repos/boa-zip-download` and extract any unclonable DS repos here
  - Manually create `input_drive/cloned-repos/non-data-science-zip-download` and extract any unclonable non-DS repos here
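  The manual moves above, written out as shell commands (a sketch; paths follow the directory layout described earlier):

  ```sh
  # Move the cloned repos onto the large input_drive and prepare the
  # directories for repos that had to be downloaded as zip archives instead
  mkdir -p input_drive/cloned-repos
  mv data/boa/cloned-repos input_drive/cloned-repos/boa
  mv data/non-data-science/cloned-repos input_drive/cloned-repos/non-data-science
  mkdir -p input_drive/cloned-repos/boa-zip-download
  mkdir -p input_drive/cloned-repos/non-data-science-zip-download
  ```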
- Specify the list of repositories to extract metrics for:
  - Run the `notebooks/create-lists-to-extract.ipynb` notebook and move the results from `data/selected` to `input_drive/selected`
  - Manually modify the lists as needed. E.g. `repo_ids_ds_chunk_000801-001552_filt.csv` excludes repo `858127` as it contains a file that causes Pylint to hang indefinitely.
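  One way such a filtered list could be produced, assuming the chunk files are plain one-repo-id-per-line CSVs (an assumption; the actual `_filt` list was prepared manually):

  ```sh
  # Drop repo 858127 (causes Pylint to hang) from the chunk before running analyse_pylint
  grep -v '^858127$' input_drive/selected/repo_ids_ds_chunk_000801-001552.csv \
    > input_drive/selected/repo_ids_ds_chunk_000801-001552_filt.csv
  ```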
- Populate the input directory with symlinks (requires repos in the `input_drive` directory): `surround run symlink_input`
- Extract metrics (requires the `input` directory to be populated):
  `surround run analyse_imports`, `surround run analyse_2to3`, `surround run analyse_pylint`, `surround run analyse_radon_cc`, `surround run analyse_loc`, `surround run analyse_git`

  Each of the analyse tasks supports an optional argument to limit the list of repositories analysed, e.g. `surround run analyse_pylint input/repos-ids.csv` (useful for splitting up large jobs). If not provided, all repos will be analysed.

  The exact commands used are listed below. Due to a limitation of Surround (Issue #230) it was necessary to call `doit` directly in order to run multiple Surround tasks simultaneously:

  ```sh
  mkdir -p output/ds-t1; nohup time doit --backend sqlite3 analyse_2to3 --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t1" > output/ds-t1/nohup.out &
  mkdir -p output/ds-t2; nohup time doit --backend sqlite3 analyse_imports --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t2" > output/ds-t2/nohup.out &
  mkdir -p output/ds-t3; nohup time doit --backend sqlite3 analyse_radon_cc --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t3" > output/ds-t3/nohup.out &
  # Skipped: Takes 302 hours:
  # mkdir -p output/ds-t4; nohup time doit --backend sqlite3 analyse_radon_raw --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t4" > output/ds-t4/nohup.out &
  mkdir -p output/ds-t5; nohup time doit --backend sqlite3 analyse_version --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t5" > output/ds-t5/nohup.out &
  mkdir -p output/ds-t6; nohup time doit --backend sqlite3 analyse_loc --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t6" > output/ds-t6/nohup.out &
  mkdir -p output/ds-t7; nohup time doit --backend sqlite3 analyse_git --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t7" > output/ds-t7/nohup.out &

  mkdir -p output/nonds-t1; nohup time doit --backend sqlite3 analyse_2to3 --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t1" > output/nonds-t1/nohup.out &
  mkdir -p output/nonds-t2; nohup time doit --backend sqlite3 analyse_imports --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t2" > output/nonds-t2/nohup.out &
  mkdir -p output/nonds-t3; nohup time doit --backend sqlite3 analyse_radon_cc --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t3" > output/nonds-t3/nohup.out &
  # Skipped: Hangs indefinitely on repo 67065438:
  # mkdir -p output/nonds-t4; nohup time doit --backend sqlite3 analyse_radon_raw --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t4" > output/nonds-t4/nohup.out &
  mkdir -p output/nonds-t5; nohup time doit --backend sqlite3 analyse_version --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t5" > output/nonds-t5/nohup.out &
  mkdir -p output/nonds-t6; nohup time doit --backend sqlite3 analyse_loc --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t6" > output/nonds-t6/nohup.out &
  mkdir -p output/nonds-t7; nohup time doit --backend sqlite3 analyse_git --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t7" > output/nonds-t7/nohup.out &

  mkdir -p output/ds-chunk11; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000001-000800.csv output/ds-chunk11" > output/ds-chunk11/nohup.out &
  # Revised: Hangs indefinitely on repo 858127:
  # mkdir -p output/ds-chunk2; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000801-001552.csv output/ds-chunk2" > output/ds-chunk2/nohup.out &
  mkdir -p output/ds-chunk13; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000801-001552_filt.csv output/ds-chunk13" > output/ds-chunk13/nohup.out &
  mkdir -p output/nonds-chunk11; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_000001-000800.csv output/nonds-chunk11" > output/nonds-chunk11/nohup.out &
  mkdir -p output/nonds-chunk12; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_000801-001600.csv output/nonds-chunk12" > output/nonds-chunk12/nohup.out &
  mkdir -p output/nonds-chunk13; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_001601-002400.csv output/nonds-chunk13" > output/nonds-chunk13/nohup.out &
  mkdir -p output/nonds-chunk14; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_002401-002511.csv output/nonds-chunk14" > output/nonds-chunk14/nohup.out &
  ```

  Each command takes between 1 hour (LOC over the DS repos) and 52 hours (Pylint over a chunk of 800 repos), and each may consume up to 8 GB of memory. (We assigned ~4 concurrent tasks to each node.)
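  Since each job runs in the background under `nohup`, progress can be checked by tailing the per-task log (paths as created by the commands above):

  ```sh
  # Watch a long-running task; Ctrl-C stops the tail, not the job
  tail -f output/ds-t1/nohup.out
  ```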
- Merge the chunks back together (results will be written to `output/merged`): `merge_chunks-cc.ipynb`, `merge_chunks-imports.ipynb`, `merge_chunks.ipynb`, `merge_chunks-loc.ipynb`, `merge_chunks-version.ipynb`, `merge_chunks-git.ipynb`
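  If you prefer to run the merge notebooks non-interactively, something like the following should work; it assumes Jupyter with `nbconvert` is available and that these notebooks live under `notebooks/` (both assumptions; running them interactively in Jupyter works just as well):

  ```sh
  # Execute each merge notebook in place, headlessly
  for nb in merge_chunks-cc merge_chunks-imports merge_chunks merge_chunks-loc merge_chunks-version merge_chunks-git; do
    jupyter nbconvert --to notebook --execute --inplace "notebooks/${nb}.ipynb"
  done
  ```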
- Analyse project imports and Python version (intermediate results, will be written to `output/notebooks_out`): `analyse_imports.ipynb`, `analyse_py_ver.ipynb`
- Refine the final selection of DS and non-DS repos to control for the distribution of stars, age, etc.: `distributions-sel.ipynb`
- Analyse differences between the final selection of DS versus non-DS repos: `ml-distribution.ipynb`
- Tables and figures for the paper will be exported to `output/notebooks_out`
- Remove the docker image: `surround run remove`
- The GitHub API pages its results, so the number of contributors is capped at 30 and should be interpreted as 30+. This does not affect the figure in the paper (as the axis is limited to 30).
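  For illustration, the GitHub REST API returns at most 30 items per page by default, so a single unpaginated request reports at most 30 contributors (the repository name below is a placeholder, and `jq` is assumed to be installed):

  ```sh
  # Default page size is 30, so this prints at most 30 even for larger projects
  curl -s "https://api.github.com/repos/OWNER/REPO/contributors" | jq length
  ```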
- The old project name was `mining_nlp_repositories`, as we initially trialled the analysis on a corpus of NLP projects. The new project name, Mining Data Science Repositories, reflects the broader scope of the project to include all types of DS repositories (but the source code still contains references to the old project name).