Scan a repository
- Install the dependencies (possibly using a virtualenv)
- Instantiate the client (either Postgres or SQLite)
from credentialdigger import PgClient
c = PgClient(dbhost='xxx.xxx.xxx.xxx', dbport=NUM, dbname='mydbname', dbuser='myusername', dbpassword='mypassword')
or
from credentialdigger import SqliteClient
c = SqliteClient(path='/path/to/data.db')
- [OPTIONAL] Add the repository
c.add_repo(url='https://github.com/user/repo')
- Launch the scan of the repo
new_discoveries = c.scan(repo_url=REPO_URL,
                         category=CATEGORY,
                         models=MODELS,
                         force=FORCE,
                         local_repo=LOCAL_REPO,
                         similarity=SIMILARITY,
                         git_username=GIT_USERNAME,
                         git_token=GIT_TOKEN,
                         debug=DEBUG)
- REPO_URL: the url of the repo we want to scan.
- CATEGORY: the category of rules to be used for the scan. If no category is selected, the scanner uses all the rules stored in the database.
- MODELS: a list of models to apply to auto-classify false positives (the models are applied in cascade, sequentially). If no models are specified, none are used. Refer to the Models page to learn more about models.
- FORCE: True if we want to force the complete scan of a repository. Otherwise, if the repository has already been scanned, only the new commits are considered.
- LOCAL_REPO: if True, get the repository from a local directory instead of the web
- GENERATOR [DEPRECATED IN v4.4]: True if we want to generate an adapted extractor for the snippet model. This only works if the SnippetModel is in MODELS, and if there are still discoveries to classify when the time for the SnippetModel comes
- SIMILARITY: if True, build the embedding model, compute and store embeddings of all discoveries, and allow for automatic update of similar discoveries
- DEBUG: True if we want visual feedback (progress bars) while the scan is in progress, False otherwise (the default choice)
- GIT_USERNAME: the git username to be used to authenticate (only enforced if the git_token is also set)
- GIT_TOKEN: git personal access token used to authenticate (needed for private repos or for some git servers where authentication is mandatory)
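The parameter rules above can be sketched as a small helper that assembles the keyword arguments before calling scan. This is only an illustration: build_scan_kwargs is a hypothetical function, not part of Credential Digger, and it simply omits any option the caller did not set so that the library's own defaults apply.

```python
# Hypothetical helper (not part of credentialdigger): collects the options
# described above into a dict that could be passed as c.scan(**kwargs).
ALLOWED_OPTIONS = {"category", "models", "force", "local_repo",
                   "similarity", "debug"}


def build_scan_kwargs(repo_url, git_username=None, git_token=None, **options):
    """Return keyword arguments for scan, keeping only what was set."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise TypeError(f"unsupported scan options: {sorted(unknown)}")

    kwargs = {"repo_url": repo_url, **options}
    # The username is only meaningful together with a token, so it is
    # dropped when no token is given.
    if git_token:
        kwargs["git_token"] = git_token
        if git_username:
            kwargs["git_username"] = git_username
    return kwargs


public_scan = build_scan_kwargs("https://github.com/user/repo",
                                git_username="myusername")
private_scan = build_scan_kwargs("https://github.com/user/repo",
                                 git_username="myusername",
                                 git_token="mytoken",
                                 force=True)
# With a real client: new_discoveries = c.scan(**private_scan)
```

Leaving unset options out of the dict avoids guessing the library's default values.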
new_discoveries is a list of ids of discoveries that have automatically been inserted into the db as new. If we set MODELS, then the discoveries classified as false positives are automatically updated in the db (as false_positive) without user intervention, and do not appear in new_discoveries.
new_discoveries are supposed to be analyzed manually by the user, and their state will be manually changed by the user.
for disc_id in new_discoveries:
    this_discovery = c.get_discovery(disc_id)
    # Analyze it
    # Change its state (if needed)
    c.update(disc_id, 'new state')
Refer to States for the states supported by the system.
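As a rough sketch of the manual review loop above, the snippet below pre-filters obvious placeholders before a human looks at the rest. StubClient is a hypothetical stand-in so the example runs without a database, and the 'snippet' and 'state' keys of a discovery are assumptions about its shape. The placeholder markers are arbitrary examples.

```python
class StubClient:
    """Hypothetical stand-in exposing the two client methods used above."""

    def __init__(self, discoveries):
        self._discoveries = discoveries

    def get_discovery(self, disc_id):
        return self._discoveries[disc_id]

    def update(self, disc_id, new_state):
        self._discoveries[disc_id]["state"] = new_state


def looks_like_placeholder(snippet):
    """Very rough heuristic: the snippet contains a well-known dummy value."""
    lowered = snippet.lower()
    return any(marker in lowered for marker in ("example", "changeme", "xxx"))


def triage(client, new_discoveries):
    """Mark obvious placeholders as false positives, return the other ids."""
    remaining = []
    for disc_id in new_discoveries:
        discovery = client.get_discovery(disc_id)
        if looks_like_placeholder(discovery["snippet"]):
            client.update(disc_id, "false_positive")
        else:
            remaining.append(disc_id)
    return remaining


stub = StubClient({
    1: {"snippet": "password = 'changeme'", "state": "new"},
    2: {"snippet": "api_key = 'A1b2C3d4E5f6'", "state": "new"},
})
still_new = triage(stub, [1, 2])  # ids left for manual analysis
```

With a real client, the same triage function could be called with c in place of stub, provided the discovery fields match the assumed shape.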
Credential Digger also provides a method to scan the whole repository at a given point in time, i.e., either at a specific commit id or at the last commit of a specific branch. A subsequent scan_snapshot on the same repository (but at a different commit id) will only take into consideration the diff between the new snapshot and the previously scanned one (unless the force parameter is set to True). Moreover, when scanning a snapshot, the timestamp for the last_scan of the repo is set to the timestamp of the commit id of the chosen snapshot instead of the date of the scan. This way, users get a more useful indication of the coverage of a repo.
After instantiating the client, this scan can be run as follows:
new_discoveries = c.scan_snapshot(repo_url=REPO_URL,
                                  branch_or_commit=BRANCH_OR_COMMIT,
                                  category=CATEGORY,
                                  models=MODELS,
                                  force=FORCE,
                                  similarity=SIMILARITY,
                                  git_username=GIT_USERNAME,
                                  git_token=GIT_TOKEN,
                                  debug=DEBUG,
                                  max_depth=MAX_DEPTH,
                                  ignore_list=IGNORE_LIST)
- CATEGORY, MODELS, SIMILARITY, GIT_USERNAME, GIT_TOKEN, and DEBUG work the same as in the scan method.
- BRANCH_OR_COMMIT: the branch name or the commit id at which the whole repository will be scanned.
- MAX_DEPTH: the maximum depth to which to traverse the subdirectory tree. A negative value will not affect the scan, i.e., no depth limit is applied. The default value is -1.
- IGNORE_LIST: a list of paths to ignore during the scan.
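To make the MAX_DEPTH and IGNORE_LIST semantics more concrete, here is a minimal sketch of a directory walk that honors both. It is not Credential Digger's implementation: list_files is a hypothetical function, and it assumes ignore paths are given relative to the repository root and that depth 0 means the root directory itself.

```python
import os
import tempfile


def list_files(root, max_depth=-1, ignore_list=()):
    """Return sorted relative file paths under root, pruning as requested."""
    ignored = {os.path.normpath(p) for p in ignore_list}
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        # Editing dirnames in place tells os.walk which subdirectories to
        # skip: ignored ones, and everything below max_depth (if not negative).
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(rel_dir, d)) not in ignored
            and (max_depth < 0 or depth < max_depth)
        ]
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if rel_path not in ignored:
                found.append(rel_path)
    return sorted(found)


# Build a throwaway tree to try the function on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src", "deep"))
    os.makedirs(os.path.join(root, "tests"))
    for rel in ("README.md", "src/app.py", "src/deep/keys.py",
                "tests/fixtures.py"):
        with open(os.path.join(root, *rel.split("/")), "w") as handle:
            handle.write("placeholder\n")

    everything = list_files(root)                        # negative depth: no limit
    shallow = list_files(root, max_depth=1)              # stops above src/deep
    without_tests = list_files(root, ignore_list=["tests"])
```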
Credential Digger also provides a method to scan a pull request, i.e., all the new lines of code introduced in the commits that are part of a pull request. The scan of a pull request requires that the repo has never been scanned before (so as not to clash with the definition of "diff" used by scan and scan_snapshot).
After instantiating the client, this scan mode can be run as follows:
new_discoveries = c.scan_pull_request(repo_url=REPO_URL,
                                      pr_number=PR_NUMBER,
                                      api_endpoint=API_ENDPOINT,
                                      category=CATEGORY,
                                      models=MODELS,
                                      force=FORCE,
                                      similarity=SIMILARITY,
                                      git_token=GIT_TOKEN,
                                      debug=DEBUG)
- CATEGORY, MODELS, SIMILARITY, GIT_TOKEN, and DEBUG work the same as in the scan method.
- PR_NUMBER: the id of the pull request that will be scanned.
- API_ENDPOINT: the GitHub API endpoint of the server the PR has been opened on. The default value is https://api.github.com, i.e., the public github.com platform.
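When only the web URL of a pull request is at hand, REPO_URL and PR_NUMBER can be derived from it. The sketch below uses only the standard library. split_pull_request_url is a hypothetical helper, it assumes the usual GitHub URL layout of owner/repo/pull/number, and returning the number as an int is an assumption about what scan_pull_request expects.

```python
from urllib.parse import urlparse


def split_pull_request_url(pr_url):
    """Split a GitHub pull request URL into (repo_url, pr_number)."""
    parsed = urlparse(pr_url)
    parts = [part for part in parsed.path.split("/") if part]
    # Expected path layout: /<owner>/<repo>/pull/<number>
    if len(parts) < 4 or parts[2] != "pull" or not parts[3].isdigit():
        raise ValueError(f"not a pull request URL: {pr_url}")
    owner, repo = parts[0], parts[1]
    repo_url = f"{parsed.scheme}://{parsed.netloc}/{owner}/{repo}"
    return repo_url, int(parts[3])


repo_url, pr_number = split_pull_request_url(
    "https://github.com/user/repo/pull/12")
# With a real client:
# new_discoveries = c.scan_pull_request(repo_url=repo_url, pr_number=pr_number)
```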
Credential Digger also provides a method to scan all the repositories belonging to a user.
After instantiating the client, this scan can be run as follows:
new_repos_discoveries = c.scan_user(username=GITHUB_USERNAME,
                                    category=CATEGORY,
                                    models=MODELS,
                                    similarity=SIMILARITY,
                                    debug=DEBUG,
                                    git_token=GIT_TOKEN,
                                    api_endpoint=API_ENDPOINT,
                                    forks=FORKS)
- CATEGORY, MODELS, SIMILARITY, GIT_TOKEN, and DEBUG work the same as in the scan method.
- GITHUB_USERNAME: the username as it appears on GitHub. All the repositories in this account will be considered for the scan. Please note that this parameter is different from GIT_USERNAME in the scan method.
- FORKS: True if we also want to scan forked repositories, False otherwise (the default choice)
- API_ENDPOINT: the API endpoint of the git server. If not set, the github.com API endpoint, i.e., https://api.github.com, is used.
new_repos_discoveries is a dictionary, where keys are urls of the repositories scanned, and, for each repository, its value is a list of ids of discoveries that have automatically been inserted into the db as new (i.e., the return value from the scan function).
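Since the result maps each repository url to a list of discovery ids, a short summary can help decide which repositories to review first. The snippet below is a sketch on made-up sample data. rank_repositories is a hypothetical helper and the urls and ids are placeholders.

```python
def rank_repositories(new_repos_discoveries):
    """Return (repo_url, count) pairs, most new discoveries first."""
    counts = {url: len(ids) for url, ids in new_repos_discoveries.items()}
    # Sort by descending count, then by url to keep the order stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data shaped like the return value described above.
sample = {
    "https://github.com/user/repo-a": [101, 102, 103],
    "https://github.com/user/repo-b": [],
    "https://github.com/user/repo-c": [201],
}

ranking = rank_repositories(sample)
clean_repos = [url for url, count in ranking if count == 0]
total_new = sum(count for _, count in ranking)
```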