Machine Learning Models
Regex scanners tend to produce many false positive discoveries, so machine learning models can be used to reduce the number of discoveries that must be analysed manually. In particular, the models automatically classify discoveries as false_positive (i.e., spam).
Each model needs an implementation (in the credentialdigger/models folder). Any binaries a model requires are downloaded automatically on the fly: starting from Credential Digger v4.4, there is no need to download the model binaries manually anymore 🎉.
If you want to propose a new model to reduce false positive discoveries, please contact us (or open an issue in the project).
The Path Model uses regular expressions to match typical files that contain fake credentials.
After a pre-processing phase, the file path of a discovery is matched against a regular expression to guess whether the credentials it contains are real. Indeed, according to our observations, documentation (e.g., README and .md files in general), tutorials, tests, virtual environments, and dependencies pushed to the repository (e.g., node_modules) don't contain real secrets used in production.
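The path check described above can be sketched as follows. This is a minimal illustration, not the Path Model's actual code: the pattern, the pre-processing step, and the names FAKE_CREDENTIAL_PATH, preprocess and is_probably_false_positive are all assumptions made for this example.

```python
import re

# Hypothetical pattern for illustration only; the expression actually used
# by the Path Model is not reproduced here.
FAKE_CREDENTIAL_PATH = re.compile(
    # Directories that usually hold docs, tests, virtual envs or dependencies
    r"(^|/)(node_modules|venv|\.venv|tests?|docs?|examples?|tutorials?)(/|$)"
    # README files, with or without an extension
    r"|(^|/)readme(\.[a-z]+)?$"
    # Any Markdown file
    r"|\.md$"
)


def preprocess(file_path: str) -> str:
    """Assumed pre-processing: use forward slashes and lower-case the path."""
    return file_path.replace("\\", "/").lower()


def is_probably_false_positive(file_path: str) -> bool:
    """Return True if the path looks like a place where fake credentials live."""
    return FAKE_CREDENTIAL_PATH.search(preprocess(file_path)) is not None


if __name__ == "__main__":
    for path in ("README.md", "frontend/node_modules/pkg/index.js",
                 "src/config/settings.py"):
        print(path, is_probably_false_positive(path))
```

Anchoring directory names with `(^|/)` and `(/|$)` keeps the check from firing on paths that merely contain one of the words, such as `contest/main.py`.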
Up to v4.3 we used an ML approach based on fasttext, but we shifted to regular expressions in v4.4 since they proved to perform better without loss of precision. Please visit the OLD machine learning models page for further information regarding the old Path Model.
The approach of the OLD Snippet Model was overhauled in v4.4 in favour of a more efficient strategy. Indeed, fasttext and the double-model strategy (i.e., Snippet Extractor and Snippet Classifier) have been deprecated and replaced by a single, open source Password Model.
The new Password Model is based on NLP and provides higher precision than the old Snippet Model, but it only works with passwords and it is slower.
The similarity feature can be enabled before running a scan in order to reduce the manual workload of assessing the discoveries. If this feature is enabled, similarity scores are computed among the snippets of a repo after a scan. This way, every time the user changes the state of a discovery, either through the UI (e.g., marking a snippet as a False Positive) or by calling the update_similar_snippets method in the library, all the discoveries with similar snippets are classified accordingly.
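The idea of propagating a state change to similar snippets can be sketched as follows. This is an illustration under stated assumptions, not Credential Digger's implementation: the Discovery class, the propagate_state function and the 0.9 threshold are hypothetical, and difflib's character-level ratio stands in for whatever similarity score the tool actually computes.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Discovery:
    """Hypothetical, simplified stand-in for a stored discovery."""
    snippet: str
    state: str = "new"


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; the real feature may use a different measure."""
    return SequenceMatcher(None, a, b).ratio()


def propagate_state(discoveries, target_index, new_state, threshold=0.9):
    """Set new_state on the target and on every discovery similar to it.

    Returns the number of other discoveries that were updated.
    """
    target = discoveries[target_index]
    target.state = new_state
    updated = 0
    for i, other in enumerate(discoveries):
        if i == target_index:
            continue
        if similarity(other.snippet, target.snippet) >= threshold:
            other.state = new_state
            updated += 1
    return updated


if __name__ == "__main__":
    found = [
        Discovery('password = "changeme123"'),
        Discovery('password = "changeme124"'),
        Discovery('api_key = os.environ["KEY"]'),
    ]
    propagate_state(found, 0, "false_positive")
    for d in found:
        print(d.state, d.snippet)
```

The threshold controls the trade-off: a lower value saves more manual work but risks marking genuinely different snippets with the same state.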