Refactoring and ML retrain (#575)
* Refactoring&retrain

* fix PYLINT.USELESS_ELSE_ON_LOOP

* skip lambda usage

* Apply suggestions from code review

* test data fix

* optimisation

* style
babenek authored Jul 8, 2024
1 parent aa2888f commit 31dcd1d
Showing 98 changed files with 5,849 additions and 2,138 deletions.
18 changes: 8 additions & 10 deletions .github/workflows/benchmark.yml
@@ -39,11 +39,11 @@ jobs:
  path: data
  key: cred-data-${{ hashFiles('checksums.md5') }}

- - name: Set up Python 3.8
+ - name: Set up Python 3.10
  if: steps.cache-data.outputs.cache-hit != 'true'
  uses: actions/setup-python@v4
  with:
- python-version: "3.8"
+ python-version: "3.10"

  - name: Update PIP
  run: python -m pip install --upgrade pip
@@ -97,10 +97,10 @@ jobs:
  if: steps.cache-data.outputs.cache-hit == 'true'
  run: ls -al . && ls -al data

- - name: Set up Python 3.8
+ - name: Set up Python 3.10
  uses: actions/setup-python@v4
  with:
- python-version: "3.8"
+ python-version: "3.10"

  - name: Update PIP
  run: python -m pip install --upgrade pip
@@ -167,7 +167,7 @@ jobs:
  strategy:
  fail-fast: false
  matrix:
- python-version: [ "3.8", "3.9", "3.10", "3.11" ]
+ python-version: [ "3.10", "3.9", "3.10", "3.11" ]

  steps:

@@ -384,11 +384,11 @@ jobs:
  mv data ${{ github.workspace }}/CredData/
  mv meta ${{ github.workspace }}/CredData/
- - name: Set up Python 3.8
+ - name: Set up Python 3.10
  if: steps.cache-data.outputs.cache-hit != 'true'
  uses: actions/setup-python@v3
  with:
- python-version: "3.8"
+ python-version: "3.10"

  - name: Update PIP
  run: python -m pip install --upgrade pip
@@ -419,10 +419,8 @@ jobs:
  # check whether credsweeper is available as module
  python -m credsweeper --banner
  # use only 2 epochs for the test
- sed -i 's/epochs=42,/epochs=2,/' main.py
+ sed -i 's/max_epochs = .*/max_epochs = 2/' main.py
python main.py --data ${{ github.workspace }}/CredData -j $(( 2 * $(nproc) ))
ls -al results #dbg
python -m tf2onnx.convert --saved-model $(find results -mindepth 1 -maxdepth 1 -type d) --output ../credsweeper/ml_model/ml_model.onnx --verbose
# dbg
git diff
# crc32 should be changed
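
The updated `sed` command in this hunk rewrites the `max_epochs` assignment in `main.py` so that the CI run trains for only two epochs. A rough Python equivalent is sketched below. The helper name `limit_epochs` and the sample text are hypothetical and are not part of the repository.

```python
import re


def limit_epochs(source: str, epochs: int = 2) -> str:
    """Rewrite every 'max_epochs = ...' assignment to use the given value."""
    # '.' does not match a newline by default, so only the remainder of the
    # matching line is replaced, mirroring sed's line-by-line behaviour.
    return re.sub(r"max_epochs = .*", f"max_epochs = {epochs}", source)


# Hypothetical stand-in for the contents of main.py.
sample = "batch_size = 8\nmax_epochs = 100\nprint(max_epochs)\n"
print(limit_epochs(sample))
```

Matching on the assignment rather than on a literal value such as `epochs=42,` keeps the substitution working even if the default epoch count in `main.py` changes later.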
2 changes: 1 addition & 1 deletion .github/workflows/check.yml
@@ -58,7 +58,7 @@ jobs:
  - name: Check ml_model.onnx integrity
  if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
  run: |
- md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 1cbfbd7fb1e657d137c9eeec26a07ad4
+ md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 62d92ab2f91a18e861d846a7b8a0c3a7
  # # # Python setup
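
The integrity step above pipes `md5sum` into `grep`, so the step fails when the expected digest is absent from the output. The same check can be sketched in Python with `hashlib`. The function name `md5_matches` and the demo file are illustrative assumptions, not code from the repository.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_matches(path: Path, expected_hex: str) -> bool:
    """Return True when the MD5 digest of the file equals expected_hex."""
    # Read the file as bytes, like 'md5sum --binary' does.
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return digest == expected_hex.lower()


# Demo on a throwaway file; a real check would point at the model file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello")
    print(md5_matches(demo, "5d41402abc4b2a76b9719d911017c592"))
```

Because the expected digest is pinned in the workflow, any retrain that changes `ml_model.onnx` has to update the hash in the same commit, as this one does.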

97 changes: 49 additions & 48 deletions cicd/benchmark.txt

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions credsweeper/common/constants.py
@@ -10,16 +10,16 @@ class KeywordPattern:
  # there will be inserted a keyword
  key_right = r")" \
  r"[^:='\"`<>{?!&]*)[`'\"]*)" # <variable>
- # Authentication scheme ( oauth | basic | bearer | apikey ) precedes to credential
  separator = r"\s*\]?\s*" \
  r"(?P<separator>:( [a-z]{3,9}[?]? )?=" \
- r"|:( oauth | basic | bearer | apikey | accesskey )?" \
- r"|=>|!=|===|==|=)" \
+ r"|:|=>|!=|===|==|=)" \
  r"((?!\s*ENC(\(|\[))(\s|\w)*\((\s|\w|=|\()*|\s*)"
- value = r"(?P<value_leftquote>((b|r|br|rb|u|f|rf|fr|\\)?[`'\"])+)?" \
+ # Authentication scheme ( oauth | basic | bearer | apikey ) precedes to credential
+ value = r"(?P<value_leftquote>((b|r|br|rb|u|f|rf|fr|\\{0,8})?[`'\"]){1,4})?" \
+ r"( ?(oauth|bot|basic|bearer|apikey|accesskey) )?" \
  r"(?P<value>(?:\{[^}]{3,8000}\})|(?:<[^>]{3,8000}>)|" \
- r"(?(value_leftquote)(?:\\[tnrux0-7][0-9a-f]*|[^`'\"\\])|(?:\\n|\\r|\\?[^\s`'\"\\])){3,8000})" \
- r"(?P<value_rightquote>(\\?[`'\"])+)?"
+ r"(?(value_leftquote)(?:\\[tnrux0-7][0-9a-f]*|[^`'\"\\])|(?:\\n|\\r|\\?[^\s`'\"\\,;])){3,8000})" \
+ r"(?(value_leftquote)(?P<value_rightquote>(\\{0,8}[`'\"]){1,4})?)"

  @classmethod
  def get_keyword_pattern(cls, keyword: str) -> re.Pattern:
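
The new `value` pattern relies on regex conditional groups: the characters allowed inside the value, and whether a closing quote is consumed, both depend on whether `value_leftquote` matched. The sketch below shows the technique with a deliberately simplified pattern. `SKETCH`, the group names `lq` and `rq`, and `extract_value` are hypothetical, and the pattern is not the one from `constants.py`.

```python
import re
from typing import Optional

# Simplified stand-in for the pattern in the diff, NOT the real one.
SKETCH = re.compile(
    # Optional opening quote.
    r"""(?P<lq>['"])?"""
    # Quoted values may contain spaces, commas and semicolons;
    # unquoted values stop at whitespace, ',' or ';'.
    r"""(?P<value>(?:(?(lq)[^'"\\]|[^\s'"\\,;])){3,})"""
    # A closing quote is consumed only when an opening quote was seen.
    r"""(?(lq)(?P<rq>['"])?)"""
)


def extract_value(text: str) -> Optional[str]:
    """Return the captured value at the start of text, or None."""
    match = SKETCH.match(text)
    return match.group("value") if match else None


print(extract_value('"a b, c" tail'))
print(extract_value("abc123,rest"))
```

In this sketch an unquoted value ends at the first comma or semicolon, which corresponds to the `,;` added to the unquoted character class in the diff, while a quoted value keeps those characters.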
16 changes: 9 additions & 7 deletions credsweeper/common/keyword_checklist.py
@@ -1,5 +1,5 @@
  from functools import cached_property
- from typing import Set
+ from typing import Set, List

  from credsweeper.app import APP_PATH

@@ -13,21 +13,23 @@ class KeywordChecklist:

  def __init__(self) -> None:
  # used suggested text read style. split() is preferred because it strips 0x0A on end the file
+ self.__keyword_list = self.KEYWORD_PATH.read_text().split()
+ self.__keyword_list.sort(key=str.__len__, reverse=True)
  self.__keyword_set = set(self.KEYWORD_PATH.read_text().split())
  # The list of morphemes can be combined to form words.
  # The value is considered a variable if at least two exist.
  self.__morpheme_set = set(self.MORPHEME_PATH.read_text().split())

  @cached_property
  def keyword_set(self) -> Set[str]:
- """Get set with keywords.
- Return:
- Set of strings
- """
+ """Get set with keywords"""
  return self.__keyword_set

+ @cached_property
+ def keyword_list(self) -> List[str]:
+ """Get list with keywords in descended order of length"""
+ return self.__keyword_list

  @cached_property
  def keyword_len(self) -> int:
  """Length of keyword_set"""
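
The new `keyword_list` property returns the keywords sorted by length, longest first. The diff does not show how the list is consumed. One plausible use, sketched below purely as an assumption, is a substring scan in which the longest keyword wins over shorter ones it contains. The helper names and the sample keywords are hypothetical.

```python
from typing import List, Optional


def sort_longest_first(keywords: List[str]) -> List[str]:
    """Return a copy of keywords ordered by descending length."""
    # sorted() is stable, so keywords of equal length keep their input order.
    return sorted(keywords, key=len, reverse=True)


def first_keyword(text: str, ordered: List[str]) -> Optional[str]:
    """Return the first keyword from ordered that occurs in text."""
    lowered = text.lower()
    for keyword in ordered:
        if keyword in lowered:
            return keyword
    return None


ordered = sort_longest_first(["api", "environment", "sha256", "sha2"])
print(ordered)
print(first_keyword("SHA256_digest", ordered))
```

With this ordering, `sha256` is tried before `sha2`, so the scan reports the more specific keyword.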
22 changes: 20 additions & 2 deletions credsweeper/common/keyword_checklist.txt
@@ -39,6 +39,7 @@ animation
  another
  anony
  apache
+ api
  appearance
  apple
  application
@@ -102,6 +103,7 @@ border
  bottle
  bottom
  bound
+ brain
  branch
  brand
  break
@@ -200,6 +202,7 @@ continue
  control
  convenience
  convert
+ copy
  cookie
  coordinator
  corner
@@ -285,6 +288,7 @@ editing
  editor
  effect
  either
+ elastic
  element
  email
  empty
@@ -297,7 +301,7 @@ ensure
  entity
  entries
  entry
- environ
+ environment
  equal
  equals
  erase
@@ -331,6 +335,7 @@ feedback
  fetch
  field
  figure
+ file
  files
  filename
  filter
@@ -533,6 +538,7 @@ notice
  notification
  null
  number
+ oauth
  object
  oblique
  observe
@@ -581,6 +587,7 @@ patch
  paths
  pattern
  pause
+ peer
  payload
  payment
  pending
@@ -602,6 +609,7 @@ plain
  platform
  player
  point
+ pool
  policy
  portal
  portfolio
@@ -754,6 +762,11 @@ session
  setting
  setter
  setup
+ sha256
+ sha1
+ sha2
+ sha224
+ sha512
  shadow
  shallow
  shape
@@ -765,6 +778,7 @@ showing
  shown
  shutdown
  sidebar
+ signature
  sign
  similar
  simple
@@ -786,6 +800,7 @@ solid
  sorted
  source
  space
+ spaces
  spacing
  spark
  speak
@@ -845,7 +860,7 @@ tablet
  target
  tasks
  teacher
- teams
+ team
  temp
  terms
  test
@@ -932,13 +947,15 @@ warning
  watch
  waves
  weight
+ whatever
  where
  whether
  which
  while
  white
  width
  window
+ with
  within
  without
  world
@@ -949,6 +966,7 @@ written
  xxxxx
  yellow
  yield
+ your
  zeros
  .json
  .xml