Merge pull request #87 from CCBR/issue86
add `--geezers` option
kopardev authored Feb 16, 2024
2 parents d8412be + 77f82c9 commit b222850
Showing 7 changed files with 132 additions and 24 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -4,11 +4,21 @@

### Bug fixes

## spacesavers2 0.11.4

### New features

- `--geezers`, `--geezerage` and `--geezersize` arguments are added to `spacesavers2_catalog` to report really old files per user.
- documents updated

### Bug fixes

## spacesavers2 0.11.3

### New features

- brokenlinks are reported on a per user basis
- progress bar added to `spacesavers2_catalog` and `spacesavers2_mimeo`

## spacesavers2 0.11.2

Binary file modified docs/assets/images/spacesavers2.png
48 changes: 39 additions & 9 deletions docs/catalog.md
@@ -22,12 +22,17 @@ For each file it also computes a unique hash (using xxHash library) for:
- `--se`: Comma-separated special extensions for which `spacesavers2` ignores the headers before extracting the top chunk for xxHash calculation. The default is "bam,bai,bigwig,bw,csi".
- `--bottomhash`: Default False. Use the bottom chunk along with the top chunk of the file for xxHash calculation.
- `--outfile`: If not supplied, the output is written to the screen.
- `--brokenlink`: Default False. Create a file listing broken symlinks per-user.
- `--geezers`: Default False. Create a file listing really old files per-user. Output files have 3 columns: age, size, path.
- `--geezerage`: Default 5 * 365. Age in days beyond which a file is considered a geezer file.
- `--geezersize`: Default 10 MiB (10 * 1024 * 1024). Minimum size in bytes for a geezer file to be reported.

> NOTE: `spacesavers2_catalog` reports errors (e.g., cannot read file) to STDERR.
```bash
% spacesavers2_catalog --help
usage: spacesavers2_catalog [-h] -f FOLDER [-p THREADS] [-b BUFFERSIZE] [-i IGNOREHEADERSIZE] [-s SE] [-o OUTFILE] [-e | --bottomhash | --no-bottomhash]
usage: spacesavers2_catalog [-h] -f FOLDER [-p THREADS] [-b BUFFERSIZE] [-i IGNOREHEADERSIZE] [-x SE] [-s ST_BLOCK_BYTE_SIZE] [-o OUTFILE]
[-e | --bottomhash | --no-bottomhash] [-l | --brokenlink | --no-brokenlink] [-g | --geezers | --no-geezers]
[-a GEEZERAGE] [-z GEEZERSIZE] [-v]

spacesavers2_catalog: get per file info.

@@ -36,25 +41,39 @@ options:
-f FOLDER, --folder FOLDER
spacesavers2_catalog will be run on all files in this folder and its subfolders
-p THREADS, --threads THREADS
number of threads to be used
number of threads to be used (default 4)
-b BUFFERSIZE, --buffersize BUFFERSIZE
buffersize for xhash creation
buffersize for xhash creation (default=128 * 1028 bytes)
-i IGNOREHEADERSIZE, --ignoreheadersize IGNOREHEADERSIZE
this sized header of the file is ignored before extracting buffer of buffersize for xhash creation (only for special extension files)
-s SE, --se SE comma separated list of special extentions
this sized header of the file is ignored before extracting buffer of buffersize for xhash creation (only for special
extensions files) default = 1024 * 1024 * 1024 bytes
-x SE, --se SE comma separated list of special extensions (default=bam,bai,bigwig,bw,csi)
-s ST_BLOCK_BYTE_SIZE, --st_block_byte_size ST_BLOCK_BYTE_SIZE
st_block_byte_size on current filesystem (default 512)
-o OUTFILE, --outfile OUTFILE
outfile ... catalog file .. by default output is printed to screen
-e, --bottomhash, --no-bottomhash
separately calculated second hash for the bottom/end of the file.
-l, --brokenlink, --no-brokenlink
output per-user broken links list.
-g, --geezers, --no-geezers
output per-user geezer files list.
-a GEEZERAGE, --geezerage GEEZERAGE
age in days to be considered a geezer file (default 5yrs ... 5 * 365).
-z GEEZERSIZE, --geezersize GEEZERSIZE
minimum size in bytes of geezer file to be reported (default 10MiB ... 10 * 1024 * 1024).
-v, --version show program's version number and exit
Version:
v0.5
v0.11.4
Example:
> spacesavers2_catalog -f /path/to/folder -p 56 -e
> spacesavers2_catalog -f /path/to/folder -p 56 -e -l -g
```
### Output
## catalog file
`spacesavers2_catalog` creates one semicolon-separated output line per input file. Here is an example line:
```bash
@@ -80,4 +99,15 @@ The 13 items in the line are as follows:
| 12 | top chunk xxHash | 4707e661a1f3beca1861b9e0e0177461 |
| 13 | bottom chunk xxHash | 52e5038016c3dce5b6cdab635765cc79 |
> NOTE: Some files may have ";" or spaces in their name, hence adding quotes around the absolute file path.
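Because the path is quoted, a catalog line should not be split naively on ";". The sketch below shows one way a downstream script might read such a line with Python's `csv` module. It assumes the quotes are plain double quotes, and the sample line is a shortened, hypothetical four-field line rather than the real 13-item layout.

```python
import csv
import io


def parse_catalog_line(line: str) -> list[str]:
    """Split one semicolon-separated line, keeping quoted fields intact."""
    # Assumption: fields are separated by ';' and the path is wrapped in double quotes.
    reader = csv.reader(io.StringIO(line), delimiter=";", quotechar='"')
    return next(reader)


# Hypothetical sample: the quoted path contains both a space and a ';'.
sample = '"/data/my dir/a;b.txt";f;1024;4096'
fields = parse_catalog_line(sample)
print(fields[0])
```

A plain `sample.split(";")` would cut the path in two here, which is the situation the note above guards against.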
## broken links file
- lists the paths of symbolic links whose destination files no longer exist.
- one file per username.
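For readers who want to cross-check a broken links file, here is a minimal sketch of how dangling symlinks could be found and grouped by owner. It is not the detection logic used inside `spacesavers2_catalog` (that code is not shown in this commit); `is_broken_symlink` and `broken_links_by_uid` are hypothetical helper names.

```python
import os
from collections import defaultdict
from pathlib import Path


def is_broken_symlink(path: Path) -> bool:
    # is_symlink() inspects the link itself; exists() follows it to the target.
    return path.is_symlink() and not path.exists()


def broken_links_by_uid(root: str) -> dict[int, list[str]]:
    """Walk root and collect dangling symlinks, keyed by the link owner's uid."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            if is_broken_symlink(p):
                # lstat() reports the owner of the link, not of the missing target.
                found[p.lstat().st_uid].append(str(p))
    return dict(found)
```

Grouping by uid mirrors the "one file per username" layout; mapping a uid to a username is left out of the sketch.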
## geezer file
- lists really old files, i.e. files older than `--geezerage` days and larger than `--geezersize` bytes.
- list has 3 columns: age, size and path.
- one file per username.
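The geezer criteria and the 3-column row can be sketched as follows. The thresholds and the strict `>` comparisons follow the `spacesavers2_catalog` changes in this commit; `human_readable` is a hypothetical stand-in for the project's own size formatter, whose exact output format is not shown here.

```python
GEEZER_AGE_DAYS = 5 * 365              # default of --geezerage
GEEZER_SIZE_BYTES = 10 * 1024 * 1024   # default of --geezersize


def human_readable(nbytes: int) -> str:
    """Hypothetical formatter; the real helper may format sizes differently."""
    size = float(nbytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TiB"


def geezer_row(age_days, size_bytes, path,
               min_age=GEEZER_AGE_DAYS, min_size=GEEZER_SIZE_BYTES):
    """Return a tab-separated 'age, size, path' row, or None if not a geezer."""
    if age_days > min_age and size_bytes > min_size:
        age = "{0:.2f} yrs".format(age_days / 365)
        return "\t".join([age, human_readable(size_bytes), path])
    return None


row = geezer_row(3650, 20 * 1024 * 1024, "/data/old.bam")
print(row)
```

Note that both comparisons are strict, so a file aged exactly `--geezerage` days is not reported.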
79 changes: 67 additions & 12 deletions spacesavers2_catalog
@@ -36,7 +36,7 @@ def main():
Version:
{}
Example:
> spacesavers2_catalog -f /path/to/folder -p 56 -e
> spacesavers2_catalog -f /path/to/folder -p 56 -e -l -g
""".format(
__version__
)
@@ -115,6 +115,40 @@ def main():
action=argparse.BooleanOptionalAction,
help="separately calculated second hash for the bottom/end of the file.",
)
parser.add_argument(
"-l",
"--brokenlink",
dest="brokenlink",
required=False,
action=argparse.BooleanOptionalAction,
help="output per-user broken links list.",
)
parser.add_argument(
"-g",
"--geezers",
dest="geezers",
required=False,
action=argparse.BooleanOptionalAction,
help="output per-user geezer files list.",
)
parser.add_argument(
"-a",
"--geezerage",
dest="geezerage",
required=False,
default= 5 * 365,
type=int,
help="age in days to be considered a geezer file (default 5yrs ... 5 * 365).",
)
parser.add_argument(
"-z",
"--geezersize",
dest="geezersize",
required=False,
default= 10 * 1024 * 1024,
type=int,
help="minimum size in bytes of geezer file to be reported (default 10MiB ... 10 * 1024 * 1024).",
)
parser.add_argument("-v", "--version", action="version", version=__version__)

global args
@@ -132,6 +166,7 @@ def main():
files.extend(files2)

broken_links = dict()
geezers = dict()

if args.outfile:
outfh = open(args.outfile, 'w')
@@ -140,26 +175,46 @@ def main():

with Pool(processes=args.threads) as pool:
for fd in tqdm.tqdm(pool.imap_unordered(task, files),total=len(files)):
uid = fd.get_userid()
if fd.get_type() == "L": # broken link
uid = fd.get_userid()
if not uid in broken_links: broken_links[uid] = list()
broken_links[uid].append(fd.get_filepath())
else:
result = "%s" % (fd)
if not result == "":
outfh.write(f"{result}\n")
if fd.get_type() == "f":
age = fd.get_age()
size = fd.get_size()
if age > args.geezerage and size > args.geezersize:
x = list()
x.append("{0:.2f} yrs".format(age/365))
x.append(fd.get_size_human_readable())
x.append(fd.get_filepath())
if not uid in geezers: geezers[uid] = list()
geezers[uid].append("\t".join(x))

if args.outfile:
outfh.close()
for uid in broken_links:
username = get_username_groupname(uid)
outfilename = args.outfile
if outfilename.endswith(".txt"):
outfilename = outfilename[:-4]
outfh_broken_links = open(outfilename + "." + username + ".brokenlinks.txt", 'w')
for l in broken_links[uid]:
outfh_broken_links.write(f"{l}\n")
outfh_broken_links.close()

if args.brokenlink: # spit out broken links
for uid in broken_links:
username = get_username_groupname(uid)
outfilename = args.outfile
if outfilename.endswith(".txt"):
outfilename = outfilename[:-4]
outfh_broken_links = open(outfilename + "." + username + ".brokenlinks.txt", 'w')
for l in broken_links[uid]:
outfh_broken_links.write(f"{l}\n")
outfh_broken_links.close()
if args.geezers:
for uid in geezers:
username = get_username_groupname(uid)
outfilename = args.outfile
if outfilename.endswith(".txt"):
outfilename = outfilename[:-4]
outfh_geezers = open(outfilename + "." + username + ".geezers.txt", 'w')
for l in geezers[uid]:
outfh_geezers.write(f"{l}\n")
outfh_geezers.close()
if __name__ == "__main__":
main()
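The per-user output names built in the two loops above follow one pattern: strip a trailing `.txt` from `--outfile`, then append the username and a suffix. A minimal sketch of that pattern as a single helper, where `per_user_filename` is a hypothetical name that does not exist in the script:

```python
def per_user_filename(outfile: str, username: str, kind: str) -> str:
    """Build '<outfile stem>.<username>.<kind>.txt' as the loops above do."""
    stem = outfile[:-4] if outfile.endswith(".txt") else outfile
    return f"{stem}.{username}.{kind}.txt"


print(per_user_filename("catalog.txt", "alice", "geezers"))
print(per_user_filename("catalog", "bob", "brokenlinks"))
```

The helper expects a string, so a caller would need to handle the case where `--outfile` is not supplied.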
2 changes: 2 additions & 0 deletions spacesavers2_e2e
@@ -55,6 +55,8 @@ spacesavers2_catalog \
--threads $THREADS \
--outfile ${outfile_catalog} \
--bottomhash \
--brokenlink \
--geezers \
> ${outfile_catalog_log} 2> ${outfile_catalog_err}
EOF
)
15 changes: 13 additions & 2 deletions src/FileDetails.py
@@ -65,6 +65,7 @@ def __init__(self):
self.fdl = "u" # is it file or directory or link or unknown or absent ... values are f d l u a
self.size = -1
self.calculated_size = -1
self.calculated_size_human_readable = ""
self.dev = -1
self.inode = -1
self.nlink = -1
@@ -83,12 +84,13 @@ def initialize(self,f,thresholdsize=THRESHOLDSIZE, buffersize=BUFFERSIZE, tb=TB,
st = self.apath.stat(follow_symlinks=False) # gather stat results
self.size = st.st_size # size in bytes
self.calculated_size = st.st_blocks * st_block_byte_size # st_blocks gives number of 512 bytes blocks used
self.calculated_size_human_readable = get_human_readable_size(self.calculated_size)
self.dev = st.st_dev # Device id
self.inode = st.st_ino # Inode
self.nlink = st.st_nlink # number of hardlinks
self.atime = convert_time_to_age(st.st_atime) # access time
self.mtime = convert_time_to_age(st.st_mtime) # modification time
self.ctime = convert_time_to_age(st.st_ctime) # creation time
self.ctime = convert_time_to_age(st.st_ctime) # change time
self.uid = st.st_uid # user id
self.gid = st.st_gid # group id
if self.fld == "f":
@@ -223,4 +225,13 @@ def get_type(self):
return self.fld

def get_userid(self):
return self.uid
return self.uid

def get_age(self):
return self.mtime

def get_size(self):
return self.calculated_size

def get_size_human_readable(self):
return self.calculated_size_human_readable
2 changes: 1 addition & 1 deletion src/VERSION
@@ -1 +1 @@
0.11.3
0.11.4
