Notes here for multi-volume version of FSAudit
Past multi-volume work has been on MGI here: /gscuser/mwyczalk/projects/FSAudit/FSAudit.dev/FSAudit/multi-run
Processing proceeds in these steps:
- Evaluate volume. Traverse the entire volume (essentially find | stat) to obtain information about all files in a specified filesystem. Writes a rawstat file.
  - May be run with sudo to provide complete information for all files regardless of permissions
- Process stats. Secondary analysis of the above data; writes a filestat file
- Summarize stats. Merge the above data according to owner and extension; writes a summary file
- Plot stats. Generate visualization figures
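As a rough illustration of the "Process stats" step, the sketch below turns one rawstat row into a filestat row by splitting the path into dirname, filename, and ext. It is not the package's actual code. It assumes rawstat rows are tab-separated in the column order given in the file headers, and that the extension is whatever os.path.splitext reports (leading dot included). The function name rawstat_to_filestat and the sample row are hypothetical.

```python
import os


def rawstat_to_filestat(line):
    """Convert one tab-separated rawstat row into a list of filestat fields."""
    # Assumed rawstat column order:
    # file_name file_type file_size owner_name time_mod hard_links
    file_name, file_type, file_size, owner_name, time_mod, hard_links = (
        line.rstrip("\n").split("\t")
    )
    dirname, filename = os.path.split(file_name)
    # Assumption: ext is the last dot-suffix of the filename, dot included.
    ext = os.path.splitext(filename)[1]
    return [dirname, filename, ext, file_type, file_size,
            owner_name, time_mod, hard_links]


# Made-up example row, for illustration only.
row = rawstat_to_filestat(
    "/data/run1/sample.chr20\tregular file\t1024\talice\t2019-06-12 10:00:00\t1\n"
)
print("\t".join(row))
```

Files without a dot suffix (and dotfiles such as .bashrc) get an empty ext under this assumption.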
This package requires Python 3, GNU parallel, and the R packages plyr and ggplot2.
These dependencies can be managed with Conda.
The VolumeList file (default config/VolumeList.dat) contains the following two tab-separated fields for every volume to be audited:
- VOLUME_NAME: Short name of system and volume, used for filenames
- VOLUME: The base path to analyze
Example VolumeList.dat file:
MGI.gc2500 /gscmnt/gc2500/dinglab
MGI.gc2508 /gscmnt/gc2508/dinglab
MGI.gc2509 /gscmnt/gc2509/dinglab
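A minimal sketch of reading such a file in Python, assuming exactly two tab-separated fields per line. Skipping blank lines and '#' comments is an assumption, and read_volume_list is a hypothetical helper, not part of the package.

```python
import io


def read_volume_list(handle):
    """Return a list of (VOLUME_NAME, VOLUME) tuples from an open text handle."""
    volumes = []
    for line in handle:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        volume_name, volume = line.split("\t")
        volumes.append((volume_name, volume))
    return volumes


# In-memory stand-in for config/VolumeList.dat
example = io.StringIO(
    "MGI.gc2500\t/gscmnt/gc2500/dinglab\n"
    "MGI.gc2508\t/gscmnt/gc2508/dinglab\n"
)
volume_list = read_volume_list(example)
for volume_name, volume in volume_list:
    print(volume_name, volume)
```

A line with spaces instead of a tab raises ValueError here, which is a quick way to catch a badly formatted VolumeList.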
*.rawstat.gz
# file_name file_type file_size owner_name time_mod hard_links
*.filestat.gz
# dirname filename ext file_type file_size owner_name time_mod hard_links
*.summary.dat
ext owner_name count cumulative_size
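The relation between the filestat and summary columns can be sketched as below: rows are grouped by (ext, owner_name), counting files and adding up file_size. This is an illustration under assumptions, not the package's implementation. It assumes tab-separated filestat rows, and it counts every row (the real summary step may treat directories or hard links differently). The function name summarize and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarize(filestat_rows):
    """Return {(ext, owner_name): [count, cumulative_size]} for filestat rows."""
    totals = defaultdict(lambda: [0, 0])
    for row in filestat_rows:
        if row.startswith("#"):
            continue  # skip the header line
        fields = row.rstrip("\n").split("\t")
        # Columns: dirname filename ext file_type file_size owner_name time_mod hard_links
        ext, file_size, owner_name = fields[2], int(fields[4]), fields[5]
        entry = totals[(ext, owner_name)]
        entry[0] += 1
        entry[1] += file_size
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    "# dirname\tfilename\text\tfile_type\tfile_size\towner_name\ttime_mod\thard_links",
    "/d\ta.bam\t.bam\tregular file\t100\talice\t2019-06-12\t1",
    "/d\tb.bam\t.bam\tregular file\t250\talice\t2019-06-12\t1",
    "/d\tc.txt\t.txt\tregular file\t10\tbob\t2019-06-12\t1",
]
summary = summarize(rows)
for (ext, owner_name), (count, cumulative_size) in sorted(summary.items()):
    print(ext, owner_name, count, cumulative_size, sep="\t")
```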
- Create config/VolumeList.dat
- Optional: start tmux with tmux new -s FSAudit. This is useful because a run is time consuming
- If on MGI, run 0_start_MGI_docker.sh
To debug and test processing, run in dry-run mode on only the first volume.
bash 1_start_runs.sh -d1
shows the call to process_FS.sh, and
bash 1_start_runs.sh -dd1
shows the processing of individual steps.
To run all volumes with four jobs at a time:
bash 1_start_runs.sh -J 4
On MGI, use the conda environment p3R (see the Conda cheat sheet):
conda activate p3R
Confirm this...
The following plots are generated
All output is written to ./dat, ./logs, and ./img
Get details for given extension and user:
zcat /gscmnt/gc3020/dinglab/mwyczalk/gc2737.20190612.filestat.gz | awk -v FS="\t" '{if ($3 == ".chr20" && $6 == "rmashl") print}'
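The same selection can be sketched in Python 3, which may be handier when the result feeds further analysis. This assumes the filestat file is gzip-compressed, tab-separated text, with ext in column 3 and owner_name in column 6 as in the awk call. select_rows is a hypothetical helper, and the demo writes a throwaway file rather than reading a real volume.

```python
import gzip
import os
import tempfile


def select_rows(path, ext, owner_name):
    """Yield filestat rows whose ext and owner_name match the given values."""
    # Text mode with an explicit encoding; undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 6 and fields[2] == ext and fields[5] == owner_name:
                yield line.rstrip("\n")


# Demo on a temporary file with made-up rows.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.filestat.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("/d\tx.chr20\t.chr20\tregular file\t5\trmashl\t2019-06-12\t1\n")
        out.write("/d\ty.chr20\t.chr20\tregular file\t7\tsomeone\t2019-06-12\t1\n")
        out.write("/d\tz.bam\t.bam\tregular file\t9\trmashl\t2019-06-12\t1\n")
    matches = list(select_rows(demo_path, ".chr20", "rmashl"))

for match in matches:
    print(match)
```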
From man stat
--printf=FORMAT
like --format, but interpret backslash escapes, and do not output a mandatory trailing newline; if you want a newline, include \n in FORMAT
What I want in order
%n file name
%F file type
%s total size, in bytes
%U user name of owner
%y time of last modification, human-readable
%h number of hard links
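For comparison, the same six fields can be collected from Python with os.lstat. This is only a sketch of the idea, not how FSAudit gathers its data. The file type strings and the timestamp format are simplified relative to what stat prints for %F and %y, and the names describe and walk_volume are hypothetical. It is Unix-only because of the pwd module.

```python
import os
import pwd
import stat
import tempfile
import time


def describe(path):
    """Return (file_name, file_type, file_size, owner_name, time_mod, hard_links)."""
    info = os.lstat(path)  # like stat without -L: symlinks are not followed
    mode = info.st_mode
    if stat.S_ISDIR(mode):
        file_type = "directory"
    elif stat.S_ISLNK(mode):
        file_type = "symbolic link"
    elif stat.S_ISREG(mode):
        file_type = "regular file"
    else:
        file_type = "other"
    try:
        owner_name = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner_name = str(info.st_uid)  # unknown UID: fall back to the number
    # Simplified timestamp; stat's %y also prints nanoseconds and the UTC offset.
    time_mod = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime))
    return (path, file_type, info.st_size, owner_name, time_mod, info.st_nlink)


def walk_volume(root):
    """Yield one record per directory entry below root (root itself excluded)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield describe(os.path.join(dirpath, name))


# Demo on a temporary directory holding a single 5-byte file.
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, "example.txt"), "w", encoding="utf-8") as out:
        out.write("hello")
    records = list(walk_volume(tmpdir))

for record in records:
    print("\t".join(str(field) for field in record))
```

Without elevated permissions, os.walk silently skips directories it cannot read, which is the same gap the sudo option is meant to close.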
This package requires Python 3. R packages which need to be installed: plyr, ggplot2. GNU parallel is also required.
This requires python 3. Python 2 yields errors like this:
TypeError: open() got an unexpected keyword argument 'encoding'